# MobileViTv2 (mobilevitv2-1.0-imagenet1k-256)

MobileViTv2 is the second-generation MobileViT model, designed for image classification tasks. It offers an efficient, mobile-friendly architecture for classifying images with strong performance.
## 🚀 Quick Start

MobileViTv2 is the second version of MobileViT. It was proposed in [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari, and first released in [this](https://github.com/apple/ml-cvnets) repository. The license used is the [Apple sample code license](https://github.com/apple/ml-cvnets/blob/main/LICENSE).

Disclaimer: The team releasing MobileViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
## ✨ Features

- MobileViTv2 is constructed by replacing the multi-headed self-attention in MobileViT with separable self-attention, which scales linearly rather than quadratically with the number of tokens (see the sketch below).
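To make the change concrete, here is a minimal PyTorch sketch of the separable self-attention idea from the paper. The class and layer names are illustrative assumptions; this is not the actual ml-cvnets or Transformers implementation:

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Toy separable self-attention: O(tokens) instead of O(tokens^2)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one context score per token
        self.to_key = nn.Linear(dim, dim)    # key branch
        self.to_value = nn.Linear(dim, dim)  # value branch
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        scores = self.to_scores(x).softmax(dim=1)       # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1)  # (B, D): global context vector
        # Broadcast the single context vector back to every token
        gated = torch.relu(self.to_value(x)) * context.unsqueeze(1)
        return self.to_out(gated)

# Example: 4 images, 256 tokens each, 64-dim embeddings
attn = SeparableSelfAttention(dim=64)
print(attn(torch.randn(4, 256, 64)).shape)  # torch.Size([4, 256, 64])
```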
## 📦 Installation

No specific installation steps are provided in the original document.
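The usage example below only requires Hugging Face Transformers with a PyTorch backend, plus Pillow and requests to fetch the test image. A typical pip setup (an assumption, not part of the original card) would be:

```bash
pip install transformers torch pillow requests
```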
## 💻 Usage Examples

### Basic Usage

Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
```python
from transformers import MobileViTImageProcessor, MobileViTV2ForImageClassification
from PIL import Image
import requests

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the pretrained classification model
image_processor = MobileViTImageProcessor.from_pretrained("shehan97/mobilevitv2-1.0-imagenet1k-256")
model = MobileViTV2ForImageClassification.from_pretrained("shehan97/mobilevitv2-1.0-imagenet1k-256")

# Preprocess the image and run a forward pass
inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
Currently, both the image processor and the model support PyTorch.
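If you need more than the single best label, a top-k readout over the same logits is straightforward; the sketch below uses plain PyTorch, and `k=5` is an arbitrary choice:

```python
import torch

# Inspect the five highest-scoring ImageNet classes
top5 = torch.topk(logits, k=5, dim=-1)
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.2f}")
```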
## 📚 Documentation

### Intended uses & limitations

You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
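For quick experiments, the checkpoint should also work with the high-level `pipeline` API (a minimal sketch; the pipeline handles preprocessing and label mapping for you):

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="shehan97/mobilevitv2-1.0-imagenet1k-256")
predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
print(predictions)  # a list of {"label": ..., "score": ...} dicts, best first
```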
### Technical Details

The MobileViTv2 model was pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset consisting of 1 million images and 1,000 classes.
### BibTeX entry and citation info

```bibtex
@inproceedings{vision-transformer,
  title  = {Separable Self-attention for Mobile Vision Transformers},
  author = {Sachin Mehta and Mohammad Rastegari},
  year   = {2022},
  url    = {https://arxiv.org/abs/2206.02680}
}
```
## 📄 License

The license used is the [Apple sample code license](https://github.com/apple/ml-cvnets/blob/main/LICENSE).
| Property | Details |
|----------|---------|
| Model Type | MobileViTv2 (mobilevitv2-1.0-imagenet1k-256) |
| Training Data | [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) |