Open-source Swin Transformer model - usable for image classification and dense recognition tasks

Swin Large Patch4 Window7 224

Developed by Microsoft

Swin Transformer is a hierarchical vision Transformer that achieves linear computational complexity by computing self-attention within local windows, making it suitable for image classification and dense recognition tasks.

Image Classification

Transformers

Open Source License: Apache-2.0

#Hierarchical Vision Transformer #Local Window Attention #Image Classification Backbone

Downloads 2,079

Release Time: 3/2/2022

Model Overview

This model is a large-scale vision model based on the Swin Transformer architecture, trained on the ImageNet-1k dataset at 224x224 resolution, and can be used for image classification tasks.

Model Features

Hierarchical Feature Maps

Constructs hierarchical feature maps by merging image patches, suitable for processing visual information at different scales.
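
As a rough illustration of the hierarchy, the sketch below computes the feature-map size at each stage for a 224x224 input. The function name `swin_stage_shapes` is hypothetical. Its defaults (four stages, a base channel width of 192 for the large variant, and 2x2 patch merging that halves the resolution while doubling the channels) are assumptions taken from the Swin Transformer paper, not from this page.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=192, num_stages=4):
    """Return (height, width, channels) of the feature map at each stage.

    Hypothetical helper; the defaults are assumed from the Swin-L
    configuration described in the paper.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size  # tokens per side after patch embedding
    channels = embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: each 2x2 group of neighbouring patches is merged,
            # halving the resolution and doubling the channel count.
            side //= 2
            channels *= 2
        shapes.append((side, side, channels))
    return shapes


for index, shape in enumerate(swin_stage_shapes(), start=1):
    print(f"stage {index}: {shape}")
```

Each later stage covers a larger image region per token, which is what lets dense recognition heads read features at several scales.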

Local Window Attention

Computes self-attention only within local windows, making computational complexity linear with respect to input image size.
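
To make the window idea concrete, here is a minimal pure-Python sketch that groups the token grid into non-overlapping windows and counts how many query-key pairs attention has to score. The helper names are hypothetical, and the setting (a 56x56 token grid from 4x4 patches on a 224x224 image, with 7x7 windows) is assumed from the model name rather than read from the model's configuration.

```python
def partition_into_windows(height, width, window_size=7):
    """Group the (row, col) positions of a token grid into non-overlapping windows."""
    if height % window_size or width % window_size:
        raise ValueError("grid size must be divisible by window_size")
    windows = {}
    for row in range(height):
        for col in range(width):
            # Tokens sharing the same integer quotient fall into the same window.
            key = (row // window_size, col // window_size)
            windows.setdefault(key, []).append((row, col))
    return list(windows.values())


def attention_pairs(windows):
    """Count query-key pairs when each token attends only within its own window."""
    return sum(len(tokens) ** 2 for tokens in windows)


# Assumed setting: 224x224 image, 4x4 patches -> 56x56 tokens, 7x7 windows.
windows = partition_into_windows(56, 56, window_size=7)
print("windows:", len(windows))
print("tokens per window:", len(windows[0]))
print("windowed pairs:", attention_pairs(windows))
print("global pairs:", (56 * 56) ** 2)
```

Because every window holds a fixed number of tokens, doubling the image side quadruples the number of windows and therefore the windowed pair count, while the global pair count grows sixteenfold.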

Efficient Architecture

Because window attention keeps the cost linear in image size, it is more computationally efficient than earlier vision Transformers that use global self-attention, which makes it suitable as a general-purpose backbone network.

Model Capabilities

Image Classification

Visual Feature Extraction

Use Cases

Computer Vision

Image Classification

Classifies input images into one of the 1,000 categories in ImageNet.

Performs excellently on the ImageNet-1k dataset.

Visual Feature Extraction

Serves as a backbone network to extract image features for downstream vision tasks.

🚀 Swin Transformer (large-sized model)

The Swin Transformer model is trained on ImageNet-1k at a resolution of 224x224. It offers a novel approach for image classification and dense recognition tasks.

🚀 Quick Start

The Swin Transformer model is pre-trained on the ImageNet-1k dataset at a resolution of 224x224. It was first introduced in the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" by Liu et al. and initially released in [this repository](https://github.com/microsoft/Swin-Transformer).

Disclaimer: The team releasing Swin Transformer did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

The Swin Transformer is a type of Vision Transformer. It constructs hierarchical feature maps by merging image patches (shown in gray) in deeper layers. Because self-attention is computed only within each local window (shown in red), its computational complexity is linear in the input image size. This allows it to serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers generate feature maps of a single low resolution and have computational complexity that is quadratic in the input image size because they compute self-attention globally.
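
The linear-versus-quadratic claim can be checked with the cost estimates given in the Swin Transformer paper for one attention layer on an h x w token grid with C channels and window size M: 4hwC^2 + 2(hw)^2 C for global self-attention and 4hwC^2 + 2 M^2 hwC for window attention. The sketch below simply evaluates those two expressions. The function names are hypothetical, and the example values (a 56x56 grid, C = 192, M = 7) are assumed for the first stage of the large model.

```python
def global_attention_cost(h, w, c):
    """Cost estimate for global multi-head self-attention: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_attention_cost(h, w, c, m=7):
    """Cost estimate for window-based self-attention: 4hwC^2 + 2 M^2 hwC."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Assumed first-stage setting for the large model: 56x56 tokens, 192 channels.
for side in (56, 112):
    g = global_attention_cost(side, side, 192)
    l = window_attention_cost(side, side, 192, m=7)
    print(f"{side}x{side} tokens: global={g:,} window={l:,} ratio={g / l:.1f}")
```

Both terms of the windowed estimate scale with hw, so the cost grows in proportion to the number of tokens. The global estimate contains an (hw)^2 term, which dominates as the image gets larger.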

[Figure: model image; the image and its source are linked in the original model card]

📚 Documentation

Intended uses & limitations

You can use the raw model for image classification. Check out the model hub to find fine-tuned versions for tasks that interest you.

How to use

Here is a basic example of using this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:

from transformers import AutoFeatureExtractor, SwinForImageClassification
from PIL import Image
import requests

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing pipeline and the classification model from the Hub
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-large-patch4-window7-224")
model = SwinForImageClassification.from_pretrained("microsoft/swin-large-patch4-window7-224")

# Preprocess the image into PyTorch tensors and run a forward pass
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
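
The example above reports only the single most likely class. If ranked probabilities are more useful, a small helper along the following lines could turn the raw logits into the top-k labels. This is a sketch using only the standard library. `top_k_predictions` is a hypothetical name, and the three-class logits and labels are made-up toy values, not model output.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Softmax, subtracting the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values for illustration only. With the real model, one would pass
# outputs.logits[0].detach().tolist() and model.config.id2label instead.
toy_logits = [2.0, 0.5, -1.0]
toy_labels = {0: "tabby", 1: "teapot", 2: "palace"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

In a PyTorch workflow the same result is usually obtained directly on the tensor with a softmax followed by a top-k selection. The pure-Python version is shown here only to keep the steps explicit.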

For more code examples, refer to the documentation.

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2103-14030,
  author    = {Ze Liu and
               Yutong Lin and
               Yue Cao and
               Han Hu and
               Yixuan Wei and
               Zheng Zhang and
               Stephen Lin and
               Baining Guo},
  title     = {Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  journal   = {CoRR},
  volume    = {abs/2103.14030},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.14030},
  eprinttype = {arXiv},
  eprint    = {2103.14030},
  timestamp = {Thu, 08 Apr 2021 07:53:26 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-14030.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

📄 License

This model is released under the Apache-2.0 license.

Property	Details
Model Type	Swin Transformer (large-sized model)
Training Data	ImageNet-1k
Tags	vision, image-classification
Widget Examples	Tiger, Teapot, Palace
