# 🐕 Dog Breeds Multiclass Image Classification with Vision Transformer
This project uses the Vision Transformer (ViT) to classify dog images into 120 different breeds, offering a more flexible and scalable approach to this computer vision task than traditional CNNs.
## 🚀 Quick Start
To quickly start using this model for dog breed classification, you can follow the code example below:
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

# Fetch an example dog photo
url = "https://upload.wikimedia.org/wikipedia/commons/5/55/Beagle_600.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the fine-tuned ViT model
image_processor = AutoImageProcessor.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")
model = AutoModelForImageClassification.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")

# Preprocess the image and run a forward pass
inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Map the highest-scoring class index to its breed name
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
## ✨ Features
- **Advanced Architecture**: Uses the Vision Transformer, a state-of-the-art computer vision architecture that offers greater flexibility and scalability than traditional CNNs.
- **Large-scale Pre-training**: Builds on a Google Vision Transformer pre-trained on the ImageNet-21k dataset, which mitigates the data-limitation issue to some extent.
- **Multiclass Classification**: Classifies dog images into 120 different breeds.
## 📦 Installation
No specific installation steps are provided in the original README. To use the model, you need the `transformers`, `Pillow` (imported as `PIL`), and `requests` libraries, which you can install with `pip`:

```bash
pip install transformers pillow requests
```

You will also need PyTorch (`torch`) installed, since the examples use `return_tensors="pt"`.
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

# Fetch an example dog photo
url = "https://upload.wikimedia.org/wikipedia/commons/5/55/Beagle_600.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the fine-tuned ViT model
image_processor = AutoImageProcessor.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")
model = AutoModelForImageClassification.from_pretrained("wesleyacheng/dog-breeds-multiclass-image-classification-with-vit")

# Preprocess the image and run a forward pass
inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Map the highest-scoring class index to its breed name
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
## 📚 Documentation

### Model Motivation
Image classifiers are increasingly asked to go beyond distinguishing cats from dogs and identify a dog's specific breed. To tackle this harder problem, the project uses the Vision Transformer, introduced in a 2020 Google paper.
The Vision Transformer treats an image as a sequence of patches processed with positional embeddings and self-attention, whereas a CNN relies on convolutions and pooling layers. This lets the Vision Transformer attend globally to any part of the image, making it more flexible and scalable; however, it has weaker inductive biases than CNNs and therefore typically needs more pre-training data.
### Model Description
This model is fine-tuned from the Google Vision Transformer (`vit-base-patch16-224-in21k`) on the Stanford Dogs dataset from Kaggle to classify dog images into 120 dog breeds.
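The full label set ships with the model configuration. As a quick sanity check (a sketch assuming the model loads as in the examples above), you can enumerate the breeds via `model.config.id2label`:

```python
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "wesleyacheng/dog-breeds-multiclass-image-classification-with-vit"
)

# id2label maps class indices to breed names; expect 120 entries
print(len(model.config.id2label))
print(sorted(model.config.id2label.values())[:5])  # a few breed names
```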
### Intended Uses & Limitations
This fine-tuned model can only classify images of dogs, and only into the breeds present in the training dataset.
## 🔧 Technical Details
The Vision Transformer differs from traditional CNNs in how it processes images. In Vision Transformers, an input image is divided into patches (e.g., 16x16), which are then fed into the Transformer as a sequence with positional embeddings and self-attention. In contrast, CNNs use convolutions and pooling layers as inductive biases.

The Vision Transformer's self-attention mechanism allows it to attend to any patch of the image globally, without the need for local centering, cropping, or bounding boxes as in CNNs. This makes it more flexible and scalable, enabling the creation of foundation models in computer vision.
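To make the patch mechanics concrete, here is a minimal, purely illustrative PyTorch sketch (not the model's internal implementation) of how a 224x224 image becomes the 196-token patch sequence a ViT-Base/16 consumes:

```python
import torch

image = torch.randn(1, 3, 224, 224)  # one RGB image, batch dimension first
patch_size = 16

# Carve the image into non-overlapping 16x16 patches along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# -> shape (1, 3, 14, 14, 16, 16): a 14x14 grid of patches

# Flatten the grid into a sequence and each patch into a vector
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)

print(patches.shape)  # torch.Size([1, 196, 768]): 196 patch tokens of 3*16*16 = 768 dims
```

In the real model, each 768-dimensional patch vector is then linearly projected and combined with a positional embedding before entering the Transformer encoder.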
## 📄 License
This project is released under the MIT license.
## 📊 Model Metrics
### Model Training Metrics

| Epoch | Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy | Macro F1 |
|-------|----------------|----------------|----------------|----------|
| 1     | 79.8%          | 95.1%          | 97.5%          | 77.2%    |
| 2     | 83.8%          | 96.7%          | 98.2%          | 81.9%    |
| 3     | 84.8%          | 96.7%          | 98.3%          | 83.4%    |
### Model Evaluation Metrics

| Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy | Macro F1 |
|----------------|----------------|----------------|----------|
| 84.0%          | 97.1%          | 98.7%          | 83.0%    |