MIT-B5: SegFormer Encoder Pretrained on ImageNet-1k for Semantic Segmentation

Mit B5

Developed by NVIDIA

SegFormer is a Transformer-based semantic segmentation model. This version only includes the encoder pretrained on ImageNet-1k.

Image Segmentation

Transformers

Open Source License: Other #Semantic Segmentation #Transformer Architecture #Image Classification

Release Time: 3/2/2022

Model Overview

SegFormer consists of a hierarchical Transformer encoder and a lightweight all-MLP decoder head. This model only includes the pretrained hierarchical Transformer encoder, which can be fine-tuned for semantic segmentation tasks.
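To make "hierarchical" concrete: each encoder stage downsamples its input with an overlapping patch embedding, so the four stages produce feature maps at roughly 1/4, 1/8, 1/16 and 1/32 of the input resolution. The sketch below computes those shapes in plain Python. The kernel and stride settings and the MiT-B5 channel widths (64, 128, 320, 512) are taken from the SegFormer paper and are assumptions here; check them against the checkpoint's `config.json`. The helper names `conv_out` and `stage_shapes` are hypothetical.

```python
# Sketch: feature-map shape after each MiT-B5 encoder stage.
# Assumptions (check against the checkpoint's config.json):
#   patch-embedding kernels 7/3/3/3, strides 4/2/2/2, padding = kernel // 2,
#   and MiT-B5 channel widths 64/128/320/512.
PATCH_SIZES = [7, 3, 3, 3]
STRIDES = [4, 2, 2, 2]
HIDDEN_SIZES = [64, 128, 320, 512]


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(height: int, width: int) -> list[tuple[int, int, int]]:
    """Return (channels, height, width) of the feature map after each stage."""
    shapes = []
    h, w = height, width
    for kernel, stride, channels in zip(PATCH_SIZES, STRIDES, HIDDEN_SIZES):
        pad = kernel // 2
        h = conv_out(h, kernel, stride, pad)
        w = conv_out(w, kernel, stride, pad)
        shapes.append((channels, h, w))
    return shapes


if __name__ == "__main__":
    for i, (c, h, w) in enumerate(stage_shapes(512, 512), start=1):
        print(f"stage {i}: {c} channels, {h}x{w}")
```

For a 512x512 input this gives 128x128, 64x64, 32x32 and 16x16 maps, which are the multi-scale features a decode head later fuses.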

Model Features

Hierarchical Transformer Architecture

Adopts a hierarchical Transformer design to efficiently process image features at different scales

Lightweight Design

The encoder uses sequence-reduced self-attention and the full model adds only a lightweight all-MLP decode head, which keeps computational cost low while maintaining accuracy

Pretrained Encoder

Provides an encoder pretrained on ImageNet-1k for easy fine-tuning on downstream tasks
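One source of the efficiency described above is the encoder's sequence-reduced self-attention: the SegFormer paper shortens the key/value sequence before computing attention. The sketch below counts attention scores per head, per image, with and without that reduction. It assumes a per-axis reduction factor r of 8, 4, 2, 1 for the four stages and a 512x512 input; both are illustrative assumptions, and `attention_entries` is a hypothetical helper.

```python
# Sketch: attention-score count per head, with and without sequence reduction.
# Assumptions: keys/values are downsampled by r along each spatial axis
# (r = 8, 4, 2, 1 for stages 1-4), so their length shrinks by r * r.
SR_RATIOS = [8, 4, 2, 1]
# Stage resolutions for an assumed 512x512 input (strides 4, 8, 16, 32).
STAGE_SIZES = [(128, 128), (64, 64), (32, 32), (16, 16)]


def attention_entries(h: int, w: int, sr_ratio: int) -> tuple[int, int]:
    """Return (full, reduced) sizes of the query x key score matrix."""
    n = h * w                                  # query tokens
    n_kv = (h // sr_ratio) * (w // sr_ratio)   # key/value tokens after reduction
    return n * n, n * n_kv


if __name__ == "__main__":
    for i, ((h, w), r) in enumerate(zip(STAGE_SIZES, SR_RATIOS), start=1):
        full, reduced = attention_entries(h, w, r)
        print(f"stage {i}: {full:,} -> {reduced:,} scores ({full // reduced}x fewer)")
```

At stage 1 this is a 64x reduction (268,435,456 scores down to 4,194,304); at stage 4, where r = 1, attention is computed in full.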

Model Capabilities

Image Classification

Semantic Segmentation (requires fine-tuning)

Feature Extraction

Use Cases

Computer Vision

Semantic Segmentation

Can be used for scene understanding, autonomous driving, and other tasks requiring pixel-level classification

The full SegFormer model (this encoder plus a decode head) achieves strong results on benchmarks such as ADE20K and Cityscapes

Image Classification

Can be used directly to classify images into the 1,000 ImageNet classes

🚀 SegFormer (b5-sized) encoder pre-trained-only

A SegFormer encoder pre-trained on ImageNet-1k, intended as a starting point for semantic segmentation.

🚀 Quick Start

The SegFormer encoder was pre-trained on ImageNet-1k. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.

Disclaimer: The team releasing SegFormer did not write a model card for this model so this model card has been written by the Hugging Face team.

✨ Features

SegFormer consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head, and achieves strong results on semantic segmentation benchmarks such as ADE20K and Cityscapes. The hierarchical Transformer is first pre-trained on ImageNet-1k; a decode head is then added and the whole model is fine-tuned on a downstream dataset.

This repository contains only the pre-trained hierarchical Transformer (no segmentation decode head), so it is intended as a starting point for fine-tuning.
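The decode head described above fuses all four encoder stages at 1/4 of the input resolution. The following sketch only traces tensor shapes (channels, height, width) through such an all-MLP head; it does not implement the layers. The stage widths, the decoder width of 768 and the 150-class example (ADE20K's label count) are assumptions for illustration, and `decode_head_shapes` is a hypothetical helper.

```python
# Sketch: shapes (channels, height, width) through an all-MLP decode head.
# This traces shapes only; no layers are implemented.
# Assumptions: stage strides 4/8/16/32, MiT-B5 widths 64/128/320/512,
# and a decoder width of 768.
STAGE_STRIDES = [4, 8, 16, 32]
STAGE_CHANNELS = [64, 128, 320, 512]
DECODER_DIM = 768


def decode_head_shapes(height: int, width: int, num_classes: int,
                       dim: int = DECODER_DIM) -> dict[str, tuple[int, int, int]]:
    """Map each step of the decode head to the shape it produces."""
    out_h, out_w = height // 4, width // 4  # all stages are fused at 1/4 resolution
    shapes = {}
    for i, (stride, channels) in enumerate(zip(STAGE_STRIDES, STAGE_CHANNELS), start=1):
        shapes[f"stage {i} features"] = (channels, height // stride, width // stride)
        # A per-stage linear layer maps channels -> dim, then upsampling to 1/4.
        shapes[f"stage {i} projected + upsampled"] = (dim, out_h, out_w)
    shapes["concatenated"] = (4 * dim, out_h, out_w)
    shapes["fused"] = (dim, out_h, out_w)
    shapes["class logits"] = (num_classes, out_h, out_w)
    return shapes


if __name__ == "__main__":
    # 150 classes, as in ADE20K
    for step, shape in decode_head_shapes(512, 512, num_classes=150).items():
        print(f"{step}: {shape}")
```

Because the logits come out at 1/4 of the input resolution, they are usually upsampled to the image size (for example with `torch.nn.functional.interpolate`) before taking the per-pixel argmax.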

💻 Usage Examples

Basic Usage

Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes:

from transformers import SegformerImageProcessor, SegformerForImageClassification
from PIL import Image
import requests
import torch

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# SegformerImageProcessor supersedes the deprecated SegformerFeatureExtractor
image_processor = SegformerImageProcessor.from_pretrained("nvidia/mit-b5")
model = SegformerForImageClassification.from_pretrained("nvidia/mit-b5")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

For more code examples, refer to the documentation.
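The classification snippet prints only the single most likely class. If you also want probabilities or the top few candidates, a softmax followed by a top-k selection is enough. This is a dependency-free sketch that operates on a plain Python list of logits; `top_k` is a hypothetical helper, not part of `transformers`.

```python
import math


def top_k(logits: list[float], k: int = 5) -> list[tuple[int, float]]:
    """Return the k most probable (class_index, probability) pairs."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Toy logits for four hypothetical classes
    print(top_k([1.0, 3.0, 0.5, 2.0], k=2))
```

With the model loaded as in the classification example, you could call `top_k(outputs.logits[0].tolist())` and map each index through `model.config.id2label`; in PyTorch the equivalent is `logits.softmax(-1).topk(5)`.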

📚 Documentation

You can fine-tune the model for semantic segmentation. See the model hub for fine-tuned versions on a task that interests you.
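When fine-tuning for segmentation, the decode head needs to know your label set. Hugging Face configs carry this as `id2label` and `label2id` mappings. The helper below is a hypothetical sketch that builds both dicts from a list of class names; the three class names are made up for illustration.

```python
def build_label_maps(class_names: list[str]) -> tuple[dict[int, str], dict[str, int]]:
    """Return (id2label, label2id) for a list of unique class names."""
    if len(set(class_names)) != len(class_names):
        raise ValueError("class names must be unique")
    id2label = {i: name for i, name in enumerate(class_names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


# Hypothetical label set, for illustration only
id2label, label2id = build_label_maps(["background", "road", "vehicle"])
print(id2label)   # index -> class name
print(label2id)   # class name -> index
```

The dicts can then be passed to `SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b5", num_labels=len(id2label), id2label=id2label, label2id=label2id)`. The encoder weights come from this checkpoint, while the decode head is newly initialized and must be trained.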

📄 License

The license for this model can be found here.

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2105-15203,
  author    = {Enze Xie and
               Wenhai Wang and
               Zhiding Yu and
               Anima Anandkumar and
               Jose M. Alvarez and
               Ping Luo},
  title     = {SegFormer: Simple and Efficient Design for Semantic Segmentation with
               Transformers},
  journal   = {CoRR},
  volume    = {abs/2105.15203},
  year      = {2021},
  url       = {https://arxiv.org/abs/2105.15203},
  eprinttype = {arXiv},
  eprint    = {2105.15203},
  timestamp = {Wed, 02 Jun 2021 11:46:42 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2105-15203.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Additional Information

Property	Details
Tags	vision
Datasets	imagenet_1k
