Yolos-small Open-source Object Detection Model - Free Deployment for Accurate Identification of Objects in COCO Dataset

Yolos Small

Developed by hustvl

A Vision Transformer (ViT)-based object detection model trained with the DETR loss function, reaching 36.1 AP on COCO 2017 validation.

Object Detection

Transformers

Open Source License: Apache-2.0 #Object Detection #Vision Transformer #COCO Dataset

Downloads 154.46k

Release Time: 4/26/2022

Model Overview

YOLOS is a concise and efficient vision Transformer model specifically designed for object detection tasks. It employs DETR-style bipartite matching loss for training and achieves detection accuracy comparable to DETR and Faster R-CNN on the COCO dataset.

Model Features

Transformer Architecture

Utilizes a pure vision Transformer structure, enabling efficient object detection without traditional CNN components.

Bipartite Matching Loss

Employs the Hungarian algorithm for optimal matching between predictions and annotations, combining cross-entropy and bounding box loss for end-to-end training.

Concise Design

Simple yet powerful structure: the base-sized model reaches 42 AP on COCO, while this small-sized model reaches 36.1 AP.

Model Capabilities

Multi-object detection in images

Bounding box prediction

Object classification

Use Cases

Scene Understanding

Surveillance Video Analysis

Real-time detection of targets such as pedestrians and vehicles in surveillance footage.

Autonomous Driving Perception

Identifying traffic participants and obstacles in road environments.

Content Analysis

Image Content Moderation

Detecting specific objects or sensitive content in images.

🚀 YOLOS (small-sized) model

A YOLOS model fine-tuned on COCO 2017 object detection.

🚀 Quick Start

The YOLOS model is fine-tuned on COCO 2017 object detection, whose training split contains 118k annotated images. It was introduced in the paper "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection" by Fang et al. and first released in this repository.

Disclaimer: The team that released YOLOS didn't write a model card for this model. This model card is written by the Hugging Face team.

✨ Features

High Performance: A base-sized YOLOS model can achieve 42 AP on COCO validation 2017, comparable to DETR and more complex frameworks like Faster R-CNN.
Unique Training Loss: Trained using a "bipartite matching loss" with the Hungarian matching algorithm for optimal one-to-one mapping.

📦 Installation

This model is used through the transformers library. The usage example below also needs PyTorch, Pillow, and requests, so you can install everything with the following command:

pip install transformers torch pillow requests

💻 Usage Examples

Basic Usage

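import requests
import torch
from PIL import Image
from transformers import YolosImageProcessor, YolosForObjectDetection

# Download a sample image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# YolosImageProcessor replaces the deprecated YolosFeatureExtractor
image_processor = YolosImageProcessor.from_pretrained('hustvl/yolos-small')
model = YolosForObjectDetection.from_pretrained('hustvl/yolos-small')

# Resize and normalize the image, then run inference without tracking gradients
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model predicts one class distribution and one bounding box per object query
logits = outputs.logits      # shape: (batch_size, num_queries, num_classes + 1)
bboxes = outputs.pred_boxes  # shape: (batch_size, num_queries, 4), normalized (cx, cy, w, h)

Currently, both the image processor and model support PyTorch.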
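The raw logits and pred_boxes are not yet usable detections: each query still carries a score for a trailing "no object" class, and the boxes are normalized (center x, center y, width, height). The following is a minimal pure-Python sketch of the usual decoding step, written on plain lists so it runs without a model download. The function names, the 0.5 threshold, and the assumption that "no object" is the last logit are illustrative choices, not part of the original card. In practice the transformers image processors provide a post_process_object_detection method that serves this purpose.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_detections(logits, boxes, image_size, threshold=0.5):
    """Convert per-query logits and normalized boxes into pixel-space detections.

    logits: one list of scores per query; the LAST entry is assumed to be "no object".
    boxes: one (cx, cy, w, h) tuple per query, each value in [0, 1].
    image_size: (width, height) of the original image in pixels.
    """
    width, height = image_size
    detections = []
    for query_logits, (cx, cy, w, h) in zip(logits, boxes):
        probs = softmax(query_logits)[:-1]  # drop the "no object" probability
        score = max(probs)
        if score < threshold:
            continue  # this query most likely detected nothing
        box = (
            (cx - w / 2) * width,   # x_min
            (cy - h / 2) * height,  # y_min
            (cx + w / 2) * width,   # x_max
            (cy + h / 2) * height,  # y_max
        )
        detections.append({"label": probs.index(score), "score": score, "box": box})
    return detections


# Toy example: two queries, three real classes plus "no object".
toy_logits = [[8.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 8.0]]
toy_boxes = [(0.5, 0.5, 0.5, 0.5), (0.2, 0.2, 0.1, 0.1)]
for det in decode_detections(toy_logits, toy_boxes, image_size=(640, 480)):
    print(det)
```

Only the first toy query survives the threshold; the second one puts almost all of its probability on "no object" and is discarded.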

📚 Documentation

Intended uses & limitations

You can use the raw model for object detection. Check the model hub to find all available YOLOS models.

Model description

YOLOS is a Vision Transformer (ViT) trained with the DETR loss. Despite its simplicity, it performs well on COCO validation 2017.
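To make the "one sequence" idea concrete, the sketch below estimates how many tokens the Transformer processes for a given input size. It assumes a patch size of 16 (an assumption, not stated in this card) and 100 detection tokens appended to the patch tokens, matching the N = 100 object queries used in the loss. The function name is hypothetical, and the count ignores any extra special tokens a particular implementation may keep.

```python
def yolos_sequence_length(height, width, patch_size=16, num_det_tokens=100):
    """Approximate token count: image patches plus detection tokens.

    patch_size=16 is an assumption for illustration; partial patches at the
    image border are simply dropped in this sketch.
    """
    num_patches = (height // patch_size) * (width // patch_size)
    return num_patches + num_det_tokens


# A 480 x 640 image yields 30 * 40 = 1200 patch tokens plus 100 detection tokens.
print(yolos_sequence_length(480, 640))
```

Because self-attention cost grows quadratically with sequence length, larger inputs become expensive quickly, which is one reason input resolution matters for this family of models.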

The model is trained using a "bipartite matching loss". The predicted classes and bounding boxes of each of the N = 100 object queries are compared to the ground-truth annotations, which are padded up to the same length N (so an image with fewer than N objects has the remaining slots filled with a "no object" class). The Hungarian matching algorithm then creates an optimal one-to-one mapping between the N queries and the N annotations. Finally, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU losses (for the bounding boxes) are used to optimize the model parameters.
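The matching step can be illustrated on a toy example. The sketch below builds a pairwise cost from the class probability, the L1 box distance, and the generalized IoU, then finds the cheapest one-to-one assignment. Two caveats: it uses a brute-force search over permutations rather than the Hungarian algorithm (both return an optimal assignment, but brute force is only feasible for a handful of boxes), and the cost weights 1, 5, and 2 are assumptions borrowed from commonly used DETR settings. All function names are hypothetical.

```python
from itertools import permutations


def to_corners(box):
    """Convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def generalized_iou(box_a, box_b):
    """Generalized IoU of two (cx, cy, w, h) boxes; ranges from -1 to 1."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter = max(0.0, min(ax1, bx1) - max(ax0, bx0)) * max(0.0, min(ay1, by1) - max(ay0, by0))
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Area of the smallest box enclosing both inputs
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def l1_distance(box_a, box_b):
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(probs, pred_box, gt_label, gt_box, w_class=1.0, w_l1=5.0, w_giou=2.0):
    """Lower is better: high class probability, close box, high GIoU."""
    return (-w_class * probs[gt_label]
            + w_l1 * l1_distance(pred_box, gt_box)
            - w_giou * generalized_iou(pred_box, gt_box))


def optimal_assignment(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Return a tuple where entry j is the prediction index matched to ground truth j."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(match_cost(pred_probs[i], pred_boxes[i], gt_labels[j], gt_boxes[j])
                   for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best


# Three predictions (last probability = "no object") and two ground-truth objects.
pred_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
pred_boxes = [(0.7, 0.7, 0.2, 0.2), (0.3, 0.3, 0.2, 0.2), (0.5, 0.5, 0.9, 0.9)]
gt_labels = [0, 1]
gt_boxes = [(0.3, 0.3, 0.2, 0.2), (0.7, 0.7, 0.2, 0.2)]
assignment = optimal_assignment(pred_probs, pred_boxes, gt_labels, gt_boxes)
print(assignment)
```

Predictions left unmatched (index 2 here) are the ones that would be trained toward the "no object" class. For realistic sizes such as N = 100, a polynomial-time solver like scipy.optimize.linear_sum_assignment is the usual choice.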

🔧 Technical Details

Training data

The YOLOS model was pre-trained on ImageNet-1k and fine-tuned on COCO 2017 object detection, a dataset with 118k/5k annotated images for training/validation respectively.

Training

The model was pre-trained for 200 epochs on ImageNet-1k and fine-tuned for 150 epochs on COCO.

Evaluation results

This model achieves an AP (average precision) of 36.1 on the COCO 2017 validation set. For more details, refer to Table 1 of the original paper.

📄 License

This model is licensed under the Apache-2.0 license.

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2106-00666,
  author    = {Yuxin Fang and
               Bencheng Liao and
               Xinggang Wang and
               Jiemin Fang and
               Jiyang Qi and
               Rui Wu and
               Jianwei Niu and
               Wenyu Liu},
  title     = {You Only Look at One Sequence: Rethinking Transformer in Vision through
               Object Detection},
  journal   = {CoRR},
  volume    = {abs/2106.00666},
  year      = {2021},
  url       = {https://arxiv.org/abs/2106.00666},
  eprinttype = {arXiv},
  eprint    = {2106.00666},
  timestamp = {Fri, 29 Apr 2022 19:49:16 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2106-00666.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Additional Information

Property	Details
Model Type	YOLOS (small-sized)
Training Data	Pre-trained on ImageNet-1k, fine-tuned on COCO 2017 object detection
