MobileViTv2-1.0-VOC-DeepLabV3 Open-Source Semantic Segmentation Model


Developed by Apple

A semantic segmentation model based on the MobileViTv2 architecture with a DeepLabV3 head, pretrained on the PASCAL VOC dataset at 512x512 resolution.

Image Segmentation

Transformers

Open Source License: Other

#Lightweight Image Segmentation #Separable Attention #Mobile Optimization

Downloads 29

Release Time: 6/6/2023

Model Overview

This model combines MobileViTv2, an efficient vision Transformer backbone, with a DeepLabV3 segmentation head, making it suitable for semantic segmentation tasks.

Model Features

Efficient Vision Transformer

Uses a separable self-attention mechanism in place of traditional multi-head self-attention, which improves computational efficiency on mobile devices

DeepLabV3 Head

Incorporates DeepLabV3 segmentation head to enhance the model's ability to capture multi-scale features

Lightweight Design

Optimized for mobile and edge devices, balancing performance and computational resource requirements
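
To illustrate how a DeepLabV3-style head can capture multi-scale features, the minimal sketch below applies atrous (dilated) convolution in one dimension. It is not the model's implementation: the function names are hypothetical, and the rates 6, 12 and 18 are the values reported in the DeepLabV3 paper, assumed here purely for illustration.

```python
def dilated_conv1d(signal, kernel, rate):
    """Zero-padded 1D cross-correlation whose kernel taps sit `rate` samples apart."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, weight in enumerate(kernel):
            j = i + (t - half) * rate  # index of the input sample under this tap
            if 0 <= j < len(signal):   # samples outside the signal count as zero
                acc += weight * signal[j]
        out.append(acc)
    return out


def receptive_field(kernel_size, rate):
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_size - 1) * rate + 1


# A single impulse shows where each tap lands: with rate 2 the taps are 2 apart.
impulse = [0, 0, 0, 1, 0, 0, 0]
spread = dilated_conv1d(impulse, [1, 1, 1], rate=2)
print(spread)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

# The same 3-tap kernel covers a wider context as the rate grows.
for rate in (1, 6, 12, 18):
    print(rate, receptive_field(3, rate))  # 1 3, 6 13, 12 25, 18 37
```

Running several such branches in parallel with different rates and merging their outputs is what lets one layer see both fine detail and wide context without adding parameters per branch.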

Model Capabilities

Image Segmentation

Semantic Segmentation

Pixel-level Classification

Use Cases

Computer Vision

Scene Understanding

Identify and segment different objects and regions in images

Performs well on the PASCAL VOC dataset

Autonomous Driving

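Road scene segmentation, identifying vehicles, pedestrians, and similar objects. The PASCAL VOC label set has no road class, so segmenting road surfaces would require fine-tuning on a driving dataset.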

🚀 MobileViTv2 + DeepLabv3 (shehan97/mobilevitv2-1.0-voc-deeplabv3)

This is a MobileViTv2 model pre-trained on PASCAL VOC at a resolution of 512x512, which can be used for semantic segmentation.

🚀 Quick Start

MobileViTv2 was introduced in Separable Self-attention for Mobile Vision Transformers by Sachin Mehta and Mohammad Rastegari, and first released in this repository. The checkpoint here is pre-trained on PASCAL VOC at a resolution of 512x512 and is released under the Apple sample code license.

Disclaimer: The team releasing MobileViT did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

MobileViTv2 is constructed by replacing the multi-headed self-attention in MobileViT with separable self-attention.
The model in this repo adds a DeepLabV3 head to the MobileViTv2 backbone for semantic segmentation.
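
As a rough illustration of the idea, the sketch below follows the description of separable self-attention in the cited paper: each token gets a single scalar score, the scores weight the key projections into one context vector, and that vector rescales every value token. It is a single-head, pure-Python toy under those assumptions, not the code used by the model; all function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """x holds k tokens with d channels; w_i is d x 1; w_k, w_v and w_o are d x d."""
    # One scalar score per token, normalised across the k tokens.
    scores = softmax([row[0] for row in matmul(x, w_i)])
    # Context vector: score-weighted sum of the key projections (length d).
    keys = matmul(x, w_k)
    context = [sum(s * key[j] for s, key in zip(scores, keys))
               for j in range(len(keys[0]))]
    # Values go through ReLU, then are scaled channel-wise by the context vector.
    values = [[max(0.0, v) for v in row] for row in matmul(x, w_v)]
    mixed = [[c * v for c, v in zip(context, row)] for row in values]
    return matmul(mixed, w_o), scores


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
k, d = 6, 4  # 6 tokens with 4 channels each
x = random_matrix(k, d)
out, scores = separable_self_attention(
    x, random_matrix(d, 1), random_matrix(d, d), random_matrix(d, d), random_matrix(d, d)
)
print(len(out), len(out[0]))  # 6 4
```

No token-by-token (k x k) attention matrix is ever built, so the cost grows linearly with the number of tokens rather than quadratically.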

💻 Usage Examples

Basic Usage

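import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, MobileViTV2ForSemanticSegmentation

# Download a sample image from COCO val2017.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing settings and the model weights from the Hub.
# Note the capital "V" in the Transformers class name (MobileViTV2...).
checkpoint = "shehan97/mobilevitv2-1.0-voc-deeplabv3"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = MobileViTV2ForSemanticSegmentation.from_pretrained(checkpoint)

inputs = image_processor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (batch_size, num_labels, height, width).
logits = outputs.logits

# Pick the highest-scoring class for every pixel.
predicted_mask = logits.argmax(1).squeeze(0)

Currently, both the image processor and the model support PyTorch.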
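
Once a mask of class indices is available, it is often useful to see which classes it contains. The sketch below is one possible way to do that in plain Python; `summarize_mask` and `toy_mask` are hypothetical names, and the label list is the standard 21-class PASCAL VOC ordering, assumed to match this checkpoint. In practice the mapping can be read from `model.config.id2label`, and a real mask can be passed as `predicted_mask.tolist()`.

```python
from collections import Counter

# Standard PASCAL VOC classes, index 0 being the background (assumed ordering).
VOC_LABELS = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def summarize_mask(mask, labels=VOC_LABELS):
    """Return the fraction of pixels assigned to each class present in the mask."""
    counts = Counter(class_id for row in mask for class_id in row)
    total = sum(counts.values())
    return {labels[class_id]: n / total for class_id, n in counts.most_common()}


# Toy 4x4 mask standing in for predicted_mask.tolist(): 9 background pixels, 7 cat pixels.
toy_mask = [
    [0, 0, 8, 8],
    [0, 8, 8, 8],
    [0, 0, 8, 8],
    [0, 0, 0, 0],
]
summary = summarize_mask(toy_mask)
print(summary)  # {'background': 0.5625, 'cat': 0.4375}
```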

📚 Documentation

Intended uses & limitations

You can use the raw model for semantic segmentation. See the model hub to look for fine-tuned versions on a task that interests you.

🔧 Technical Details

The MobileViTv2 + DeepLabV3 model was pretrained on ImageNet-1k, a dataset of roughly 1.28 million training images across 1,000 classes, and then fine-tuned on the PASCAL VOC2012 dataset.

BibTeX entry and citation info

@inproceedings{vision-transformer,
title = {Separable Self-attention for Mobile Vision Transformers},
author = {Sachin Mehta and Mohammad Rastegari},
year = {2022},
URL = {https://arxiv.org/abs/2206.02680}
}

📄 License

The license used for this model is Apple sample code license.

Property	Details
Model Type	MobileViTv2 + DeepLabV3
Training Data	Pretrained on ImageNet-1k, fine-tuned on PASCAL VOC2012
