SmolVLM2-256M-Video-Instruct open-source model - analyze video content with multimodal input and text output

Smolvlm2 256M Video Instruct

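Developed by HuggingFaceTB

SmolVLM2-256M-Video is a lightweight multimodal model designed for video content analysis. It accepts video, image, and text inputs and generates text output.

Image-to-text

Transformers

Language: English  License: Apache-2.0  #lightweight-multimodal #video-content-analysis #low-VRAM-inference

Downloads: 22.16k

Released: 2/11/2025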

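Model overview

The model accepts video, image, and text inputs and generates text output. It is suited to tasks such as answering questions about media files, comparing visual content, and transcribing text from images. It is also compact: video inference needs only 1.38 GB of GPU memory, which makes it a fit for on-device applications.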

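Model features

Lightweight and efficient

The model is small, and video inference needs only 1.38 GB of GPU memory, so it suits on-device applications with limited compute.

Multimodal processing

Handles video, image, and text inputs together and generates text output.

On-device use

Particularly suited to on-device applications that need domain-specific fine-tuning and may have limited compute.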

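Model capabilities

Video content analysis

Image content analysis

Text generation

Visual question answering

Captioning

Storytelling grounded in visual content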

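Use cases

Media analysis

Video description generation

Analyzes video content and generates a detailed text description.

Image question answering

Answers specific questions about the content of an image.

Content creation

Visual storytelling

Generates a coherent story from the provided images or video.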

🚀 SmolVLM2-256M-Video

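SmolVLM2-256M-Video is a lightweight multimodal model designed to analyze video content. It processes video, image, and text inputs and generates text output, whether that means answering questions about media files, comparing visual content, or transcribing text from images. The model is compact, requiring only 1.38 GB of GPU memory for video inference. This efficiency makes it particularly well suited to on-device applications that need domain-specific fine-tuning and where compute may be limited.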
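As a rough sanity check on the memory figure, the arithmetic below estimates the footprint of the weights alone. It assumes roughly 256 million parameters (read off the model name, not an exact count) stored in bfloat16 at 2 bytes each. The 1.38 GB figure is the one quoted in this card; how the remainder divides between activations, caches, and framework overhead is not stated in the source.

```python
# Back-of-the-envelope estimate; NOMINAL_PARAMS is an assumption taken from the model name.
NOMINAL_PARAMS = 256_000_000
BYTES_PER_PARAM_BF16 = 2             # bfloat16 is a 16-bit format
REPORTED_VIDEO_INFERENCE_GB = 1.38   # figure quoted in this model card


def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed to hold the weights, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


weights = weights_gb(NOMINAL_PARAMS, BYTES_PER_PARAM_BF16)
other = REPORTED_VIDEO_INFERENCE_GB - weights
print(f"weights: ~{weights:.2f} GB, everything else: ~{other:.2f} GB")
```

Under these assumptions the weights account for well under half of the reported figure, so the input length (number of frames and image tokens) is likely to matter more than the parameter count when memory is tight.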

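🚀 Quick Start

You can use the transformers library to load, run inference with, and fine-tune SmolVLM2. Make sure you have num2words, flash-attn, and the latest version of transformers installed. The following code loads the model: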

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
# Load the weights in bfloat16 and use FlashAttention-2 (needs flash-attn and a CUDA GPU).
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
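flash-attn is not available on every machine, since it needs a CUDA GPU and a separate install. The sketch below picks a value for the attention-implementation argument depending on whether the `flash_attn` package can be found. `pick_attn_implementation` is a hypothetical helper, not part of transformers, and the fallback to `"sdpa"` or `"eager"` is an assumption about what suits your setup.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Choose an attention backend name for from_pretrained (hypothetical helper)."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_installed:
        return "flash_attention_2"
    # Assumed fallbacks: PyTorch's scaled-dot-product attention on GPU,
    # and the plain "eager" implementation as the conservative choice on CPU.
    return "sdpa" if has_cuda else "eager"


print(pick_attn_implementation(has_cuda=False))
```

In practice you would pass `torch.cuda.is_available()` as `has_cuda` and hand the result to `from_pretrained`, moving the model to `"cuda"` only when a GPU is present.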

💻 Usage Examples

Basic usage

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},            
        ]
    },
]

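# Apply the chat template and tokenize; the result is a dict of PyTorch tensors
# moved to the model's device, with floating-point tensors cast to bfloat16.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Greedy decoding (do_sample=False), capped at 64 new tokens.
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])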
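`generate` returns the prompt tokens followed by the new tokens, so the decoded string usually contains the prompt as well as the answer. The helper below keeps only the text after the last `Assistant:` marker. Both the helper name `extract_reply` and the marker are assumptions about how the chat template renders turns, so check one real decoded output before relying on it.

```python
def extract_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last marker, or the whole string if it is absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Illustrative string only, not real model output.
sample = "User: Can you describe this image?\nAssistant: example reply text"
print(extract_reply(sample))
```

An alternative that does not depend on the marker is to slice `generated_ids` past the prompt length before decoding.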

Advanced usage - video inference

To run video inference with SmolVLM2, make sure decord is installed.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
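The video example assumes that decord is installed and that the path points at a real file. A small preflight check along the following lines can report both problems before the model is involved. `video_preflight` is a hypothetical helper, and it only checks that the file exists, not that it can be decoded.

```python
import importlib.util
from pathlib import Path


def video_preflight(path: str) -> list[str]:
    """Return a list of problems that would block video inference (empty if none)."""
    problems = []
    if importlib.util.find_spec("decord") is None:
        problems.append("decord is not installed")
    if not Path(path).is_file():
        problems.append(f"video file not found: {path}")
    return problems


for problem in video_preflight("path_to_video.mp4"):
    print(problem)
```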

Advanced usage - interleaved multi-image inference

The chat template lets you interleave multiple media items with text.

import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the similarity between these two images?"},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
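When prompts are assembled programmatically, it can help to build the interleaved message list from an ordered sequence of parts. The sketch below does that; `build_user_message` is a hypothetical helper, and the mapping of images to `"url"` and videos to `"path"` simply mirrors the keys used in the examples in this section.

```python
def build_user_message(parts: list[tuple[str, str]]) -> list[dict]:
    """Turn (kind, value) pairs into a one-turn message list, preserving order.

    kind is "text", "image" (value is a URL) or "video" (value is a local path).
    """
    key_for_kind = {"text": "text", "image": "url", "video": "path"}
    content = []
    for kind, value in parts:
        if kind not in key_for_kind:
            raise ValueError(f"unsupported content type: {kind}")
        content.append({"type": kind, key_for_kind[kind]: value})
    return [{"role": "user", "content": content}]


messages = build_user_message([
    ("text", "What is the similarity between these two images?"),
    ("image", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
    ("image", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"),
])
```

The resulting `messages` can be passed to `processor.apply_chat_template` exactly as in the examples in this section.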

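✨ Key Features

Multimodal processing: handles video, image, and text inputs and generates text output.
Lightweight and efficient: video inference needs only 1.38 GB of GPU memory, suitable for on-device applications.
Broad applicability: answers questions about media files, compares visual content, transcribes text from images, and more.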

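📚 Documentation

Model summary

Developed by: Hugging Face 🤗
Model type: multimodal model (image/multi-image/video/text)
Language(s) (NLP): English
License: Apache 2.0
Architecture: based on Idefics3 (see the technical summary)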

Resources

Demo: Video Highlight Generator
Blog: blog post

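Uses

SmolVLM2 can be used for inference on multimodal (video/image/text) tasks where the input consists of a text query together with a video or one or more images. Text and media files can be interleaved arbitrarily, which supports tasks such as captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.

To fine-tune SmolVLM2 on a specific task, see the fine-tuning tutorial.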

Evaluation

We evaluated the SmolVLM2 family on the following video understanding benchmarks:

Model size	Video-MME	MLVU	MVBench
2.2B	52.1	55.2	46.27
500M	42.2	47.3	39.73
256M	33.7	40.6	32.7
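To put the size trade-off in perspective, the snippet below computes what fraction of the 2.2B model's score each smaller model reaches on every benchmark. The scores are copied from the table above; the `retained` helper is only an illustration and is not part of any evaluation toolkit.

```python
# Scores copied from the evaluation table in this card.
scores = {
    "2.2B": {"Video-MME": 52.1, "MLVU": 55.2, "MVBench": 46.27},
    "500M": {"Video-MME": 42.2, "MLVU": 47.3, "MVBench": 39.73},
    "256M": {"Video-MME": 33.7, "MLVU": 40.6, "MVBench": 32.7},
}


def retained(small: str, large: str = "2.2B") -> dict[str, float]:
    """Fraction of the larger model's score that the smaller model reaches."""
    return {bench: scores[small][bench] / scores[large][bench] for bench in scores[large]}


for bench, frac in retained("256M").items():
    print(f"{bench}: {frac:.0%} of the 2.2B score")
```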

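Misuse and out-of-scope use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

Prohibited uses:
- Evaluating or scoring individuals (e.g., in employment, education, or credit)
- Critical automated decision-making
- Generating unreliable factual content
Malicious activities:
- Spam generation
- Disinformation campaigns
- Harassment or abuse
- Unauthorized surveillance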

License

SmolVLM2 uses SigLIP as its image encoder and SmolLM2 as its text decoder. We release the SmolVLM2 checkpoints under the Apache 2.0 license.

Citation

You can cite our work as follows:

@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models}, 
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}

Training data

SmolVLM2 was initially trained on 3.3 million samples drawn from ten datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat, and ShareGPT4Video.

Data split per modality

Data type	Share
Image	34.4%
Text	20.2%
Video	33.0%
Multi-image	12.3%

Detailed dataset shares per modality

Text datasets

Dataset	Share
llava-onevision/magpie_pro_ft3_80b_mt	6.8%
llava-onevision/magpie_pro_ft3_80b_tt	6.8%
llava-onevision/magpie_pro_qwen2_72b_tt	5.8%
llava-onevision/mathqa	0.9%

Multi-image datasets

Dataset	Share
m4-instruct-data/m4_instruct_multiimage	10.4%
mammoth/multiimage-cap6	1.9%

Image datasets

Dataset	Share
llava-onevision/other	17.4%
llava-onevision/vision_flan	3.9%
llava-onevision/mavis_math_metagen	2.6%
llava-onevision/mavis_math_rule_geo	2.5%
llava-onevision/sharegpt4o	1.7%
llava-onevision/sharegpt4v_coco	1.5%
llava-onevision/image_textualization	1.3%
llava-onevision/sharegpt4v_llava	0.9%
llava-onevision/mapqa	0.9%
llava-onevision/qa	0.8%
llava-onevision/textocr	0.8%

Video datasets

Dataset	Share
llava-video-178k/1-2m	7.3%
llava-video-178k/2-3m	7.0%
other-video/combined	5.7%
llava-video-178k/hound	4.4%
llava-video-178k/0-30s	2.4%
video-star/starb	2.2%
vista-400k/combined	2.2%
vript/long	1.0%
ShareGPT4Video/all	0.8%
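The per-dataset shares can be cross-checked against the per-modality split. The snippet below sums the datasets listed for each modality and compares the total with the stated share. All numbers are copied from the tables in this section; the 0.2-point tolerance is an assumed allowance for rounding, not something the source specifies.

```python
# Percentages copied from the training-data tables in this card.
modality_share = {"image": 34.4, "text": 20.2, "video": 33.0, "multi-image": 12.3}
dataset_shares = {
    "text": [6.8, 6.8, 5.8, 0.9],
    "multi-image": [10.4, 1.9],
    "image": [17.4, 3.9, 2.6, 2.5, 1.7, 1.5, 1.3, 0.9, 0.9, 0.8, 0.8],
    "video": [7.3, 7.0, 5.7, 4.4, 2.4, 2.2, 2.2, 1.0, 0.8],
}

TOLERANCE = 0.2  # percentage points; an assumed allowance for rounding


def check_mix() -> dict[str, float]:
    """Return, per modality, the listed-dataset total minus the stated share."""
    return {m: round(sum(dataset_shares[m]) - modality_share[m], 1) for m in modality_share}


gaps = check_mix()
assert all(abs(g) <= TOLERANCE for g in gaps.values())
print(gaps, "overall:", round(sum(modality_share.values()), 1))
```

The image and text totals each differ from the stated share by 0.1 points, and the four modality shares sum to 99.9%, which is consistent with each figure having been rounded to one decimal place.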