Open-source Image-Text Model MobileCLIP-S0: Multi-Modal Reinforced Training for Faster, Smaller Models

Mobileclip S0 Timm

Developed by Apple

MobileCLIP-S0 is an efficient image-text model trained with multi-modal reinforced training. It is substantially faster and smaller than comparable models while maintaining high zero-shot performance.

Text-to-Image #Zero-shot image-text matching #Low-latency inference #Multimodal training

Downloads 532

Release Time: 6/6/2024

Model Overview

MobileCLIP is a family of fast image-text models designed for multimodal tasks. MobileCLIP-S0, the smallest variant, achieves high performance on tasks such as zero-shot image classification.

Model Features

Efficient Performance

Delivers zero-shot performance comparable to OpenAI's ViT-B/16 CLIP model while being 4.8x faster and 2.8x smaller.

Multi-Modal Reinforced Training

Uses multi-modal reinforced training, a specialized training method, to strengthen image-text matching.

Lightweight Design

Model architecture optimized for mobile and edge devices

Model Capabilities

Zero-shot image classification

Image-text matching

Multimodal understanding

Use Cases

Computer Vision

Image Classification

Classifies images into arbitrary categories without task-specific training.

Achieves 67.8% zero-shot top-1 accuracy on ImageNet-1k.
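To make the zero-shot mechanism concrete, here is a minimal sketch of the scoring step that CLIP-style models use: the image embedding is compared with one text embedding per candidate label, and a softmax turns the cosine similarities into probabilities. The embeddings below are hand-written toy vectors, not MobileCLIP outputs, and the function names and the `logit_scale` value of 100 are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, labels, logit_scale=100.0):
    """Return the best-matching label and the probability for every label.

    logit_scale is an assumed temperature; a trained model supplies its own.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy stand-ins for encoder outputs (hypothetical values, 3 dimensions only).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.2, 0.1]

predicted, probs = zero_shot_classify(image_embedding, text_embeddings, labels)
print(predicted)
```

In real use, the image and text encoders of the model would produce the embeddings; only the candidate label prompts change between tasks, which is why no task-specific training is needed.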

Multimodal Applications

Image-Text Retrieval

Enables cross-modal retrieval between images and text.
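Cross-modal retrieval can be sketched as ranking by cosine similarity in the shared embedding space: a text query is scored against image embeddings, or an image query against caption embeddings. The example below is an illustration under stated assumptions, with hypothetical file names, captions, and toy three-dimensional vectors standing in for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, candidates, top_k=2):
    """Rank candidates (name -> embedding) by similarity to the query."""
    scored = [(name, cosine(query_embedding, emb)) for name, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical precomputed embeddings; a real system would store encoder outputs.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.1, 0.2, 0.9],
}
caption_index = {
    "waves on a sandy shore": [0.8, 0.2, 0.0],
    "tall pine trees": [0.0, 0.9, 0.1],
    "skyscrapers at night": [0.1, 0.1, 0.9],
}

# Text-to-image: pretend this is the embedding of "a path through the woods".
text_query = [0.1, 0.8, 0.3]
top_images = retrieve(text_query, image_index)

# Image-to-text: use a stored image embedding as the query.
top_captions = retrieve(image_index["city.jpg"], caption_index, top_k=1)

print(top_images[0][0], top_captions[0][0])
```

Because both directions use the same scoring function, one index of precomputed embeddings per modality is enough to serve both text-to-image and image-to-text queries.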

Model	# Seen Samples (B)	# Params (M) (img + txt)	Latency (ms) (img + txt)	IN-1k Zero-Shot Top-1 Acc. (%)	Avg. Perf. (%) on 38 datasets
MobileCLIP-S0	13	11.4 + 42.4	1.5 + 1.6	67.8	58.1
MobileCLIP-S1	13	21.5 + 63.4	2.5 + 3.3	72.6	61.3
MobileCLIP-S2	13	35.7 + 63.4	3.6 + 3.3	74.4	63.7
MobileCLIP-B	13	86.3 + 63.4	10.4 + 3.3	76.8	65.2
MobileCLIP-B (LT)	36	86.3 + 63.4	10.4 + 3.3	77.2	65.8
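The table reports image and text encoder figures separately, so comparing variants takes a little arithmetic. The sketch below copies the parameter counts, latencies, and ImageNet-1k accuracies from the table and sums the two encoders per variant; the dictionary layout and the `summarize` helper are hypothetical names chosen for this example.

```python
# Figures copied from the table above: (image, text) pairs and top-1 accuracy.
MODELS = {
    "MobileCLIP-S0": {"params_m": (11.4, 42.4), "latency_ms": (1.5, 1.6), "in1k_top1": 67.8},
    "MobileCLIP-S1": {"params_m": (21.5, 63.4), "latency_ms": (2.5, 3.3), "in1k_top1": 72.6},
    "MobileCLIP-S2": {"params_m": (35.7, 63.4), "latency_ms": (3.6, 3.3), "in1k_top1": 74.4},
    "MobileCLIP-B": {"params_m": (86.3, 63.4), "latency_ms": (10.4, 3.3), "in1k_top1": 76.8},
    "MobileCLIP-B (LT)": {"params_m": (86.3, 63.4), "latency_ms": (10.4, 3.3), "in1k_top1": 77.2},
}

def summarize(models):
    """Sum image and text encoder figures for each variant."""
    summary = {}
    for name, stats in models.items():
        summary[name] = {
            "total_params_m": round(sum(stats["params_m"]), 1),
            "total_latency_ms": round(sum(stats["latency_ms"]), 1),
            "in1k_top1": stats["in1k_top1"],
        }
    return summary

summary = summarize(MODELS)

for name, row in summary.items():
    print(f"{name}: {row['total_params_m']} M params, "
          f"{row['total_latency_ms']} ms, {row['in1k_top1']}% top-1")
```

Summed this way, the trade-off within the family is easier to read: each step up in size buys a few points of accuracy at the cost of higher combined latency.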


🚀 MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
