# 🚀 SmolVLM2-500M-Video
SmolVLM2-500M-Video is a lightweight multimodal model designed for video content analysis. It processes video, image, and text inputs and generates text outputs, such as answering media-related questions, comparing visual content, or transcribing text from images. Despite its small size, requiring only 1.8GB of GPU RAM for video inference, it delivers strong performance on complex multimodal tasks, which makes it well suited for on-device applications with limited computational resources.
## ✨ Features
- Analyze video, image, and text inputs to generate text outputs.
- Deliver robust performance on complex multimodal tasks with low GPU RAM requirements.
- Well-suited for on-device applications.
## 📦 Installation
To use SmolVLM2 for inference and fine-tuning, ensure you have `num2words`, `flash-attn`, and the latest `transformers` installed. You can load the model as follows:
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)

# Load the model in bfloat16 with FlashAttention-2 and move it to the GPU.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")
```
## 💻 Usage Examples
### Basic Usage
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```
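The example above uses greedy decoding. As an illustrative variation (the parameter values below are our own assumptions, not from the model card), you can enable sampling to get more varied descriptions:

```python
# Sampled decoding: trades determinism for variety; values are illustrative.
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=64,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```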
### Advanced Usage - Video Inference
To use SmolVLM2 for video inference, make sure you have `decord` installed.
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```
### Advanced Usage - Multi-image Interleaved Inference
You can interleave multiple media with text using chat templates.
```python
import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the similarity between these two images?"},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```
## 📚 Documentation
### Model Summary

| Property | Details |
|------|------|
| Developed by | Hugging Face 🤗 |
| Model Type | Multi-modal model (image/multi-image/video/text) |
| Language(s) (NLP) | English |
| License | Apache 2.0 |
| Architecture | Based on Idefics3 (see technical summary) |
### Resources
### Uses
SmolVLM2 can be used for inference on multimodal (video/image/text) tasks where the input consists of text queries along with a video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.

To fine-tune SmolVLM2 on a specific task, you can follow the fine-tuning tutorial.
### Evaluation
Video benchmark scores across SmolVLM2 model sizes:

| Size | Video-MME | MLVU | MVBench |
|------|------|------|------|
| 2.2B | 52.1 | 55.2 | 46.27 |
| 500M | 42.2 | 47.3 | 39.73 |
| 256M | 33.7 | 40.6 | 32.7 |
### Misuse and Out-of-scope Use
SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit).
  - Critical automated decision-making.
  - Generating unreliable factual content.
- Malicious Activities:
  - Spam generation.
  - Disinformation campaigns.
  - Harassment or abuse.
  - Unauthorized surveillance.
### Training Data
SmolVLM2 was trained on 3.3M samples drawn from ten datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat, and ShareGPT4Video.
### Data Split per modality

| Data Type | Percentage |
|------|------|
| Image | 34.4% |
| Text | 20.2% |
| Video | 33.0% |
| Multi-image | 12.3% |
### Granular dataset slices per modality
#### Text Datasets

| Dataset | Percentage |
|------|------|
| llava-onevision/magpie_pro_ft3_80b_mt | 6.8% |
| llava-onevision/magpie_pro_ft3_80b_tt | 6.8% |
| llava-onevision/magpie_pro_qwen2_72b_tt | 5.8% |
| llava-onevision/mathqa | 0.9% |
#### Multi-image Datasets

| Dataset | Percentage |
|------|------|
| m4-instruct-data/m4_instruct_multiimage | 10.4% |
| mammoth/multiimage-cap6 | 1.9% |
#### Image Datasets

| Dataset | Percentage |
|------|------|
| llava-onevision/other | 17.4% |
| llava-onevision/vision_flan | 3.9% |
| llava-onevision/mavis_math_metagen | 2.6% |
| llava-onevision/mavis_math_rule_geo | 2.5% |
| llava-onevision/sharegpt4o | 1.7% |
| llava-onevision/sharegpt4v_coco | 1.5% |
| llava-onevision/image_textualization | 1.3% |
| llava-onevision/sharegpt4v_llava | 0.9% |
| llava-onevision/mapqa | 0.9% |
| llava-onevision/qa | 0.8% |
| llava-onevision/textocr | 0.8% |
#### Video Datasets

| Dataset | Percentage |
|------|------|
| llava-video-178k/1-2m | 7.3% |
| llava-video-178k/2-3m | 7.0% |
| other-video/combined | 5.7% |
| llava-video-178k/hound | 4.4% |
| llava-video-178k/0-30s | 2.4% |
| video-star/starb | 2.2% |
| vista-400k/combined | 2.2% |
| vript/long | 1.0% |
| ShareGPT4Video/all | 0.8% |
## 🔧 Technical Details
SmolVLM2 uses SigLIP as its image encoder and SmolLM2 as its text decoder.
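To see how these components fit together, here is a minimal inspection sketch (our own addition, assuming the `model` object loaded in the Installation section above):

```python
# Print the module tree; the SigLIP vision encoder and the SmolLM2 text
# decoder appear as sub-modules of the loaded model.
print(model)

# Confirm the roughly 500M-parameter scale of this checkpoint.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```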
## 📄 License
We release the SmolVLM2 checkpoints under the Apache 2.0 license.
## 📚 Citation information
You can cite us in the following way:
```bibtex
@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models},
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}
```