# 🐕 Dog Breeds Multiclass Image Classification with Vision Transformer
This project uses the Vision Transformer (ViT) to classify dog images into 120 different breeds, offering a more flexible and scalable approach to this computer vision task than traditional CNNs.
## 🚀 Quick Start
To quickly start using this model for dog breed classification, you can follow the code example below:
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

# Fetch an example dog photo
url = "https://upload.wikimedia.org/wikipedia/commons/5/55/Beagle_600.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the fine-tuned ViT model
image_processor = AutoImageProcessor.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")
model = AutoModelForImageClassification.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")

# Preprocess the image and run a forward pass
inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Map the highest-scoring class index to its breed name
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
## ✨ Features
- **Advanced Architecture**: Uses the Vision Transformer, a state-of-the-art computer vision architecture that offers greater flexibility and scalability than traditional CNNs.
- **Large-scale Pre-training**: Builds on a Google Vision Transformer pre-trained on the ImageNet-21k dataset, which mitigates the data-limitation issue to some extent.
- **Multiclass Classification**: Classifies dog images into 120 different breeds.
## 📦 Installation
No specific installation steps are provided in the original README. To use the model, you need the `transformers`, `Pillow` (imported as `PIL`), and `requests` libraries, which you can install with `pip`:

```bash
pip install transformers pillow requests
```

You will also need PyTorch (`torch`) installed, since the examples use `return_tensors="pt"`.
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

# Fetch an example dog photo
url = "https://upload.wikimedia.org/wikipedia/commons/5/55/Beagle_600.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the fine-tuned ViT model
image_processor = AutoImageProcessor.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")
model = AutoModelForImageClassification.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")

# Preprocess the image and run a forward pass
inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Map the highest-scoring class index to its breed name
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
## 📚 Documentation

### Model Motivation
Image classifiers are increasingly asked to go beyond distinguishing cats from dogs and identify a dog's specific breed. To tackle this harder problem, the project uses the Vision Transformer, introduced in a 2020 Google paper.
The Vision Transformer treats an image as a sequence of patches processed with positional embeddings and self-attention, whereas a CNN relies on convolutions and pooling layers. This lets the Vision Transformer attend globally to any part of the image, making it more flexible and scalable; however, it has weaker inductive biases than CNNs and therefore typically needs more pre-training data.
### Model Description
This model is fine-tuned from the Google Vision Transformer (`vit-base-patch16-224-in21k`) on the Stanford Dogs dataset from Kaggle to classify dog images into 120 dog breeds.
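The full label set ships with the model configuration. As a quick sanity check (a sketch assuming the model loads as in the examples above), you can enumerate the breeds via `model.config.id2label`:

```python
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "wesleyacheng/dog-breeds-multiclass-image-classification-with-vit"
)

# id2label maps class indices to breed names; expect 120 entries
print(len(model.config.id2label))
print(sorted(model.config.id2label.values())[:5])  # a few breed names
```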
### Intended Uses & Limitations
This fine-tuned model can only classify images of dogs, and only into the breeds present in the training dataset.
## 🔧 Technical Details
The Vision Transformer differs from traditional CNNs in how it processes images. In Vision Transformers, an input image is divided into patches (e.g., 16x16), which are then fed into the Transformer as a sequence with positional embeddings and self-attention. In contrast, CNNs use convolutions and pooling layers as inductive biases.

The Vision Transformer's self-attention mechanism allows it to attend to any patch of the image globally, without the need for local centering, cropping, or bounding boxes as in CNNs. This makes it more flexible and scalable, enabling the creation of foundation models in computer vision.
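To make the patch mechanics concrete, here is a minimal, purely illustrative PyTorch sketch (not the model's internal implementation) of how a 224x224 image becomes the 196-token patch sequence a ViT-Base/16 consumes:

```python
import torch

image = torch.randn(1, 3, 224, 224)  # one RGB image, batch dimension first
patch_size = 16

# Carve the image into non-overlapping 16x16 patches along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# -> shape (1, 3, 14, 14, 16, 16): a 14x14 grid of patches

# Flatten the grid into a sequence and each patch into a vector
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)

print(patches.shape)  # torch.Size([1, 196, 768]): 196 patch tokens of 3*16*16 = 768 dims
```

In the real model, each 768-dimensional patch vector is then linearly projected and combined with a positional embedding before entering the Transformer encoder.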
## 📄 License
This project is released under the MIT license.
## 📊 Model Metrics
### Model Training Metrics

| Epoch | Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy | Macro F1 |
|-------|----------------|----------------|----------------|----------|
| 1     | 79.8%          | 95.1%          | 97.5%          | 77.2%    |
| 2     | 83.8%          | 96.7%          | 98.2%          | 81.9%    |
| 3     | 84.8%          | 96.7%          | 98.3%          | 83.4%    |
### Model Evaluation Metrics

| Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy | Macro F1 |
|----------------|----------------|----------------|----------|
| 84.0%          | 97.1%          | 98.7%          | 83.0%    |