Deeplabv3-MobileVit-Small Open-Source Model - Enabling Lightweight Semantic Segmentation Tasks on Mobile Devices

Deeplabv3 Mobilevit Small

Developed by Apple

A lightweight vision transformer that combines MobileNetV2-style layers with transformer blocks, suited to semantic segmentation on mobile devices

Image Segmentation

Transformers

#Lightweight Semantic Segmentation #Mobile Optimization #Transformer-CNN Hybrid

Downloads 817

Release Time: 5/30/2022

Model Overview

This model adds a DeepLabV3 head to the MobileViT backbone for semantic segmentation; it is pretrained on ImageNet-1k and fine-tuned on PASCAL VOC

Model Features

Lightweight Design

Combines the lightweight characteristics of MobileNetV2 with the global processing capabilities of Transformers, ideal for mobile deployment

Efficient Segmentation

Uses a DeepLabV3 head for semantic segmentation while keeping the model small (6.4 M parameters)

Multi-scale Training

Employs a multi-scale sampling strategy from 160x160 to 320x320 during pre-training to enhance model adaptability

Model Capabilities

Image Semantic Segmentation

Mobile Image Processing

Real-time Scene Understanding

Use Cases

Computer Vision

Autonomous Driving Scene Understanding

Identifies different object categories in road scenes

Achieves 79.1 mIOU on PASCAL VOC

Mobile Image Editing

Enables real-time background replacement/object segmentation on mobile devices

🚀 MobileViT + DeepLabV3 (small-sized model)

A small-sized model combining MobileViT and DeepLabV3 for semantic segmentation, pre-trained on PASCAL VOC.

🚀 Quick Start

You can use the raw model for semantic segmentation. See the model hub to look for fine-tuned versions on a task that interests you.

Here is how to use this model:

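import requests
import torch
from PIL import Image
from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessor and the segmentation model from the Hub.
# (Newer transformers releases also provide MobileViTImageProcessor.)
feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-small")
model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")

# Preprocess the image into a batch of PyTorch tensors.
inputs = feature_extractor(images=image, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (batch, num_classes, height, width); pick the top class per pixel.
logits = outputs.logits
predicted_mask = logits.argmax(1).squeeze(0)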

Currently, both the feature extractor and model support PyTorch.
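The predicted_mask produced by the example holds one class ID per pixel. As a rough illustration of how such a mask might be inspected, the sketch below counts pixels per class for a mask given as nested lists (for instance the result of predicted_mask.tolist()). The label list is the standard 21-class PASCAL VOC order and is an assumption here; the checkpoint's own mapping in model.config.id2label should take precedence. The helper name summarize_mask is hypothetical.

```python
from collections import Counter

# Standard PASCAL VOC class order (assumed; prefer model.config.id2label
# from the loaded checkpoint when it is available).
VOC_LABELS = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse",
    "motorbike", "person", "pottedplant", "sheep", "sofa", "train",
    "tvmonitor",
]


def summarize_mask(mask, labels=VOC_LABELS):
    """Return {label: pixel_count} for a 2-D grid of integer class IDs."""
    counts = Counter(class_id for row in mask for class_id in row)
    return {labels[class_id]: n for class_id, n in sorted(counts.items())}


# Tiny hand-made mask standing in for predicted_mask.tolist().
example_mask = [
    [0, 0, 8, 8],
    [0, 8, 8, 8],
    [0, 0, 0, 12],
]
summary = summarize_mask(example_mask)
print(summary)
```

Because the function only needs nested lists of integers, it works the same way on a mask that has been resized to the original image resolution.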

✨ Features

MobileViT is a lightweight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers.
The model in this repo adds a DeepLabV3 head to the MobileViT backbone for semantic segmentation.

📚 Documentation

Model description

MobileViT is a lightweight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, the patches are "unflattened" back into feature maps. This allows the MobileViT block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings.
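The flatten and "unflatten" steps can be pictured with a small, single-channel sketch in plain Python. It is a simplification under stated assumptions, not the library's implementation: the function names unfold and fold are hypothetical, the feature map has one channel, and no transformer is applied between the two steps. Each output sequence collects the pixel at the same position inside every patch, which is why attention over one sequence can reach the whole map.

```python
def unfold(feature_map, patch_h, patch_w):
    """Split an H x W grid into non-overlapping patches.

    Returns patch_h * patch_w sequences; sequence p holds the pixel at
    in-patch position p taken from every patch, in row-major patch order.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_h or width % patch_w:
        raise ValueError("map size must be divisible by the patch size")
    sequences = [[] for _ in range(patch_h * patch_w)]
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            for dy in range(patch_h):
                for dx in range(patch_w):
                    value = feature_map[top + dy][left + dx]
                    sequences[dy * patch_w + dx].append(value)
    return sequences


def fold(sequences, height, width, patch_h, patch_w):
    """Inverse of unfold: rebuild the H x W grid from the sequences."""
    grid = [[None] * width for _ in range(height)]
    patches_per_row = width // patch_w
    for p, sequence in enumerate(sequences):
        dy, dx = divmod(p, patch_w)
        for n, value in enumerate(sequence):
            top = (n // patches_per_row) * patch_h
            left = (n % patches_per_row) * patch_w
            grid[top + dy][left + dx] = value
    return grid


# 4 x 4 map with values 0..15, split into 2 x 2 patches.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
sequences = unfold(feature_map, 2, 2)
restored = fold(sequences, 4, 4, 2, 2)
```

The round trip leaves the map unchanged, which is the property that lets the block sit between ordinary convolutional layers.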

The model in this repo adds a DeepLabV3 head to the MobileViT backbone for semantic segmentation.

Intended uses & limitations

You can use the raw model for semantic segmentation. See the model hub to look for fine-tuned versions on a task that interests you.

Training data

The MobileViT + DeepLabV3 model was pretrained on ImageNet-1k, a dataset consisting of 1 million images and 1,000 classes, and then fine-tuned on the PASCAL VOC2012 dataset.

Training procedure

Preprocessing

At inference time, images are center-cropped at 512x512. Pixels are normalized to the range [0, 1]. Images are expected to be in BGR pixel order, not RGB.
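The three steps named here (center crop, scaling to [0, 1], and BGR channel order) can be illustrated with a dependency-free sketch. This is only an approximation of what the feature extractor does: the real pipeline may include steps not listed in this section, such as resizing before the crop. The function names are hypothetical, and the image is represented as nested lists of 8-bit (R, G, B) tuples for simplicity.

```python
def center_crop(pixels, size):
    """Return the central size x size window of an H x W grid of pixels."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in pixels[top:top + size]]


def rgb_to_scaled_bgr(pixels):
    """Scale 8-bit RGB values to [0, 1] and reverse channels to BGR."""
    return [
        [(b / 255.0, g / 255.0, r / 255.0) for (r, g, b) in row]
        for row in pixels
    ]


def preprocess(pixels, size=512):
    """Center-crop, then rescale and reorder channels."""
    return rgb_to_scaled_bgr(center_crop(pixels, size))


# Tiny 4 x 6 stand-in image: R encodes the column, G the row, B is 255.
tiny_image = [[(x, y, 255) for x in range(6)] for y in range(4)]
processed = preprocess(tiny_image, size=2)
```

In practice the feature extractor shown in the Quick Start performs the preprocessing, so manual code like this is mainly useful when porting the model to another runtime.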

Pretraining

The MobileViT networks are trained from scratch for 300 epochs on ImageNet-1k on 8 NVIDIA GPUs with an effective batch size of 1024 and learning rate warmup for 3k steps, followed by cosine annealing. Training also used label-smoothing cross-entropy loss and L2 weight decay. Training resolution varies from 160x160 to 320x320, using multi-scale sampling.

To obtain the DeepLabV3 model, MobileViT was fine-tuned on the PASCAL VOC dataset using 4 NVIDIA A100 GPUs.
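The warmup-then-cosine schedule mentioned for ImageNet pretraining can be sketched as a plain function of the step index. Only the 3k warmup steps come from the text above. The linear shape of the warmup, the peak learning rate, the minimum learning rate, and the total step count are all assumptions chosen for illustration, and the function name is hypothetical.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps=3000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Assumed linear ramp; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical values, not taken from the model card.
PEAK_LR = 0.002
TOTAL_STEPS = 300_000

for step in (0, 2_999, 3_000, 151_500, 300_000):
    print(step, learning_rate(step, TOTAL_STEPS, PEAK_LR))
```

The rate climbs during the first 3k steps, peaks, and then decays smoothly, reaching the halfway value midway through the annealing phase.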

Evaluation results

Model	PASCAL VOC mIOU	# params	URL
MobileViT-XXS	73.6	1.9 M	https://huggingface.co/apple/deeplabv3-mobilevit-xx-small
MobileViT-XS	77.1	2.9 M	https://huggingface.co/apple/deeplabv3-mobilevit-x-small
MobileViT-S	79.1	6.4 M	https://huggingface.co/apple/deeplabv3-mobilevit-small
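For readers unfamiliar with the mIOU column, the sketch below computes mean intersection over union for a single predicted mask against a reference mask. It is a simplified illustration, not the evaluation code behind the numbers in the table: benchmark scores are typically accumulated over the whole dataset rather than per image, and treating label 255 as "ignore" is a common PASCAL VOC convention assumed here. The function name mean_iou is hypothetical.

```python
def mean_iou(prediction, target, num_classes, ignore_index=255):
    """Mean per-class IoU over classes that appear in either mask."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(prediction, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue  # Skip pixels without a valid reference label.
            if p == t:
                intersection[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


target = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
prediction = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
score = mean_iou(prediction, target, num_classes=3)
print(score)
```

In this toy case class 0 scores 4/5 and class 1 scores 3/4, while class 2 is absent from both masks and is left out of the average. The table reports the same kind of quantity as a percentage.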

BibTeX entry and citation info

@inproceedings{vision-transformer,
title = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
author = {Sachin Mehta and Mohammad Rastegari},
year = {2022},
URL = {https://arxiv.org/abs/2110.02178}
}

📄 License

This model is released under the Apple sample code license.

Property	Details
Model Type	MobileViT + DeepLabV3 (small-sized model)
Training Data	Pretrained on ImageNet-1k, fine-tuned on PASCAL VOC2012
