🚀 SmolVLM2 2.2B
SmolVLM2-2.2B is a lightweight multimodal model designed for video content analysis. It processes video, image, and text inputs to generate text outputs, such as answering questions about media, comparing visual content, or transcribing text from images. Despite its small size, requiring only 5.2GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it well suited for on-device applications with limited computational resources.

✨ Features
- Multimodal Processing: Handles image, multi-image, video, and text inputs to generate text outputs.
- Lightweight: Needs only 5.2GB of GPU RAM for video inference, making it suitable for on-device applications.
- Robust Performance: Delivers strong results on complex multimodal tasks.
📦 Installation
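A minimal environment sketch, assuming the usage examples below: they rely only on transformers and torch, while flash-attn is needed just for the flash_attention_2 attention path (exact version pins are not given in the original card).

```bash
pip install torch transformers
# Optional: only needed for _attn_implementation="flash_attention_2"
pip install flash-attn --no-build-isolation
```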
💻 Usage Examples
Basic Usage
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)

# Load the model in bfloat16 with FlashAttention 2 and move it to the GPU
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")
```
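If flash-attn is not available, a hedged alternative (an assumption, not shown in the original card) is to fall back to PyTorch's built-in scaled-dot-product attention through the public attn_implementation argument of from_pretrained:

```python
# Fallback sketch: same checkpoint, but with the "sdpa" attention backend instead of FlashAttention 2
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).to("cuda")
```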
Advanced Usage
Simple Inference
```python
# Build a chat with one image and a text question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```
Video Inference
```python
# Build a chat with a local video file and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```
Multi-image Interleaved Inference
```python
import torch

# Build a chat that interleaves a text question with two images
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the similarity between these two images?"},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```
📚 Documentation
Model Summary
| Property | Details |
|---|---|
| Developed by | Hugging Face 🤗 |
| Model Type | Multi-modal model (image/multi-image/video/text) |
| Language(s) (NLP) | English |
| License | Apache 2.0 |
| Architecture | Based on Idefics3 (see technical summary) |
Uses
SmolVLM2 can be used for inference on multimodal (video/image/text) tasks where the input consists of text queries along with a video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
To fine-tune SmolVLM2 on a specific task, you can follow the fine-tuning tutorial.
Evaluation
Vision Evaluation
| Model | Mathvista | MMMU | OCRBench | MMStar | AI2D | ChartQA_Test | Science_QA | TextVQA Val | DocVQA Val |
|---|---|---|---|---|---|---|---|---|---|
| SmolVLM2 2.2B | 51.5 | 42 | 72.9 | 46 | 70 | 68.84 | 90 | 73.21 | 79.98 |
| SmolVLM 2.2B | 43.9 | 38.3 | 65.5 | 41.8 | 84.5 | 71.6 | 84.5 | 72.1 | 79.7 |
Video Evaluation
| Size | Video-MME | MLVU | MVBench |
|---|---|---|---|
| 2.2B | 52.1 | 55.2 | 46.27 |
| 500M | 42.2 | 47.3 | 39.73 |
| 256M | 33.7 | 40.6 | 32.7 |
Model optimizations
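A hedged sketch of one common memory optimization, assuming the bitsandbytes package (not mentioned in the original card): load the checkpoint in 4-bit instead of bfloat16 to reduce GPU memory further, at some cost in accuracy.

```python
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
import torch

# Assumption: bitsandbytes is installed; linear layers are quantized to 4-bit at load time
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```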
Misuse and Out-of-scope Use
⚠️ Important Note
SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:
- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance
License
SmolVLM2 is built upon the shape-optimized SigLIP as its image encoder and SmolLM2 as its text decoder.
We release the SmolVLM2 checkpoints under the Apache 2.0 license.
Citation information
You can cite us in the following way:
```bibtex
@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models},
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}
```
Training Data
SmolVLM2 used 3.3M samples for training, originally from ten different datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat, and ShareGPT4Video.
Data Split per modality
| Data Type | Percentage |
|---|---|
| Image | 34.4% |
| Text | 20.2% |
| Video | 33.0% |
| Multi-image | 12.3% |
Granular dataset slices per modality
Text Datasets
| Dataset | Percentage |
|---|---|
| llava-onevision/magpie_pro_ft3_80b_mt | 6.8% |
| llava-onevision/magpie_pro_ft3_80b_tt | 6.8% |
| llava-onevision/magpie_pro_qwen2_72b_tt | 5.8% |
| llava-onevision/mathqa | 0.9% |
Multi-image Datasets
| Dataset | Percentage |
|---|---|
| m4-instruct-data/m4_instruct_multiimage | 10.4% |
| mammoth/multiimage-cap6 | 1.9% |
Image Datasets
| Dataset | Percentage |
|---|---|
| llava-onevision/other | 17.4% |
| llava-onevision/vision_flan | 3.9% |
| llava-onevision/mavis_math_metagen | 2.6% |
| llava-onevision/mavis_math_rule_geo | 2.5% |
| llava-onevision/sharegpt4o | 1.7% |
| llava-onevision/sharegpt4v_coco | 1.5% |
| llava-onevision/image_textualization | 1.3% |
| llava-onevision/sharegpt4v_llava | 0.9% |
| llava-onevision/mapqa | 0.9% |
| llava-onevision/qa | 0.8% |
| llava-onevision/textocr | 0.8% |
Video Datasets
| Dataset | Percentage |
|---|---|
| llava-video-178k/1-2m | 7.3% |
| llava-video-178k/2-3m | 7.0% |
| other-video/combined | 5.7% |
| llava-video-178k/hound | 4.4% |
| llava-video-178k/0-30s | 2.4% |
| video-star/starb | 2.2% |
| vista-400k/combined | 2.2% |
| vript/long | 1.0% |
| ShareGPT4Video/all | 0.8% |