HyperCLOVAX-SEED-Vision-Instruct-3B: open-source multimodal model for image-text understanding, text generation, and Korean-language processing

Hyperclovax SEED Vision Instruct 3B

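Developed by naver-hyperclovax

HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight multimodal model developed by NAVER. It understands images and text, generates text, and is specifically optimized for Korean.

Image-text-to-text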

Transformers

License: Other #Korean visual question answering #lightweight multimodal #video understanding optimization

Downloads: 160.75k

Released: 4/22/2025

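Model Overview

The model is based on the LLaVA architecture and combines a vision encoder with a language module. It supports tasks such as image question answering, chart interpretation, and video understanding, and is described as Korea's first open-source vision-language model.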

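Model Features

Lightweight design

Optimized for computational efficiency; achieves competitive performance with fewer visual tokens than models of similar size.

Korean optimization

A Pareto-optimal model tuned for Korean that outperforms open-source models of similar size on Korean benchmarks.

Efficient video processing

Dynamic frame sampling keeps token usage low for video understanding, with a maximum of 1856 tokens / 108 frames per video.

Multimodal capability

Accepts text, image, and video input, and generates text output.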

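Model Capabilities

Visual question answering

Chart interpretation

Video content understanding

Korean text generation

Multimodal reasoning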

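Use Cases

Content understanding

Image question answering

Answers questions about an input image.

Scores 79.2 on the TextVQA-Val benchmark.

Video content analysis

Understands video content and answers questions about it.

Scores 48.2 on the VideoMME benchmark.

Commercial applications

Product recognition

Identifies products in an image and provides related information.

Accepts OCR and entity recognition results as auxiliary input.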

🚀 HyperCLOVAX-SEED-Vision-Instruct-3B

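HyperCLOVAX-SEED-Vision-Instruct-3B is a model developed by NAVER, built on its proprietary backbone model and fine-tuned through post-training. It understands text and images and generates text. Its lightweight architecture is optimized for computational efficiency, and it performs well on visual understanding tasks such as visual question answering and chart interpretation. It is particularly strong on Korean input and is expected to contribute to strengthening Korea's sovereign AI capabilities.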

🚀 Quick Start

Before using the model, make sure the required dependencies are installed.

Example usage:

from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LLM Example
# It is recommended to use the chat template with HyperCLOVAX models.
# Using the chat template allows you to easily format your input in ChatML style.
chat = [
        {"role": "system", "content": "you are helpful assistant!"},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
        {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
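# add_generation_prompt=True opens an assistant turn, since the chat ends on a user message.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt", tokenize=True, add_generation_prompt=True)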
input_ids = input_ids.to(device="cuda")

# Please adjust parameters like top_p appropriately for your use case.
output_ids = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.6,
        temperature=0.5,
        repetition_penalty=1.0,
)
print("=" * 80)
print("LLM EXAMPLE")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)

# VLM Example
# For image and video inputs, you can use url, local_path, base64, or bytes.
vlm_chat = [
        {"role": "system", "content": {"type": "text", "text": "System Prompt"}},
        {"role": "user", "content": {"type": "text", "text": "User Text 1"}},
        {
                "role": "user",
                "content": {
                        "type": "image",
                        "filename": "tradeoff_sota.png",
                        "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
                        "ocr": "List the words in the image in raster order. Even if the word order feels unnatural for reading, the model will handle it as long as it follows raster order.",
                        "lens_keywords": "Gucci Ophidia, cross bag, Ophidia small, GG, Supreme shoulder bag",
                        "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
                }
        },
        {
                "role": "user",
                "content": {
                        "type": "image",
                        "filename": "tradeoff.png",
                        "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
                }
        },
        {"role": "assistant", "content": {"type": "text", "text": "Assistant Text 1"}},
        {"role": "user", "content": {"type": "text", "text": "User Text 2"}},
        {
                "role": "user",
                "content": {
                        "type": "video",
                        "filename": "rolling-mist-clouds.mp4",
                        "video": "freenaturestock-rolling-mist-clouds.mp4",
                }
        },
        {"role": "user", "content": {"type": "text", "text": "User Text 3"}},
]

new_vlm_chat, all_images, is_video_list = preprocessor.load_images_videos(vlm_chat)
preprocessed = preprocessor(all_images, is_video_list=is_video_list)
input_ids = tokenizer.apply_chat_template(
        new_vlm_chat, return_tensors="pt", tokenize=True, add_generation_prompt=True,
)

output_ids = model.generate(
        input_ids=input_ids.to(device="cuda"),
        max_new_tokens=8192,
        do_sample=True,
        top_p=0.6,
        temperature=0.5,
        repetition_penalty=1.0,
        **preprocessed,
)
print(tokenizer.batch_decode(output_ids)[0])
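
The comments in the LLM example mention that the chat template formats input in ChatML style. As a rough mental model of what that means, the sketch below renders a message list with the generic ChatML delimiters (`<|im_start|>` and `<|im_end|>`). The function name `render_chatml` is hypothetical, and the template actually applied by `tokenizer.apply_chat_template` for this model may differ in its details, so this is an illustration rather than a replacement for the tokenizer call.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in generic ChatML form."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "you are helpful assistant!"},
    {"role": "user", "content": "Hello, how are you?"},
]
prompt = render_chatml(chat)
print(prompt)
```

In practice the tokenizer's own template should be preferred, because it is the format the model was trained on.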

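⚠️ Important Note

For the best image understanding performance, it is recommended to include additional information such as optical character recognition (OCR) results and entity recognition (Lens) results. The usage example above was written on the assumption that OCR and Lens results are available. Supplying input in this format can be expected to improve output quality significantly.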
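
One way to attach OCR and Lens results only when they are available is a small helper that builds the image entry in the same shape as the VLM example. This is a minimal sketch: the helper name `build_image_content` is hypothetical, the OCR string is placeholder text, and the field names (`ocr`, `lens_keywords`, `lens_local_keywords`) are taken from the example above, where the second image entry also omits them.

```python
def build_image_content(image, filename, ocr=None,
                        lens_keywords=None, lens_local_keywords=None):
    """Build an image "content" dict in the shape used by the VLM example.

    Optional OCR / Lens fields are added only when a value is supplied.
    """
    content = {"type": "image", "filename": filename, "image": image}
    optional = {
        "ocr": ocr,
        "lens_keywords": lens_keywords,
        "lens_local_keywords": lens_local_keywords,
    }
    content.update({key: value for key, value in optional.items() if value})
    return content


turn = {
    "role": "user",
    "content": build_image_content(
        image="https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
        filename="tradeoff_sota.png",
        ocr="words from the image, listed in raster order",  # placeholder text
    ),
}
print(turn)
```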

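✨ Key Features

Multimodal understanding: understands text, images, and video, and generates text responses.
Lightweight architecture: optimized for computational efficiency and suited to resource-constrained environments.
Korean strength: performs well on Korean input and outperforms open-source models of similar size on the relevant benchmarks.
Visual understanding: handles tasks such as visual question answering and chart interpretation, and can also run without OCR input.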

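📚 Detailed Documentation

Basic Information

Property	Details
Model architecture	LLaVA-based vision-language model, consisting of a Transformer-based LLM module, a SigLIP-based vision encoder, and a C-Abstractor-based vision-language connector
Parameters	3.2B (LLM module) + 0.43B (vision module)
Input/output format	Text + image + video / text
Context length	16k
Knowledge cutoff	Trained on data collected before August 2024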
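
As a rough worked example of what these parameter counts imply, the sketch below adds the two modules and estimates the memory needed to hold the weights alone. The byte sizes per dtype are standard; activations, the KV cache, and framework overhead are not included, so real memory use will be higher. The helper name `weight_memory_gib` is introduced here for illustration.

```python
LLM_PARAMS = 3.2e9      # LLM module, from the table above
VISION_PARAMS = 0.43e9  # vision module, from the table above


def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


total_params = LLM_PARAMS + VISION_PARAMS
print(f"total parameters: {total_params / 1e9:.2f}B")
for dtype, nbytes in {"float32": 4, "bfloat16": 2}.items():
    print(f"{dtype}: ~{weight_memory_gib(total_params, nbytes):.1f} GiB (weights only)")
```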

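Training

Text training

In post-training, securing high-quality data is essential. Creating or revising large datasets by hand is costly and resource-intensive, and tasks that require domain expertise are difficult and prone to human error. To overcome these limits, an automated validation system powered by HyperCLOVA X was used. It improved data quality and training efficiency, which in turn raised the model's performance in domains with definite answers, such as mathematics and coding.

HyperCLOVAX-SEED-Vision-Instruct-3B was developed from HyperCLOVAX-SEED-Text-Base-3B, applying supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) based on GRPO, an online reinforcement learning algorithm.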
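
For readers unfamiliar with GRPO (Group Relative Policy Optimization), its central idea is to sample several responses per prompt and score each one relative to the others in the same group, instead of using a learned value baseline. The sketch below shows only that normalization step. It is an illustration of the general technique, not NAVER's training code; the reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one.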

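Vision training

Visual understanding was not part of the initial HyperCLOVA X design. The model architecture was therefore designed to add vision capabilities, such as image-based question answering and chart interpretation, without compromising the performance of the existing LLM.

A key focus of this 3B model is token efficiency for video input. The number of tokens extracted per frame was carefully tuned so that video can be understood with as few tokens as possible. In addition, vision-specific V-RLHF data was used during the RLHF stage to strengthen the model's learning.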
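
To make the video budget concrete, the sketch below picks frames from a long clip under the reported cap of 108 frames and divides the reported 1856-token maximum across them. Uniform spacing is an assumption made here for illustration; the source does not describe the model's actual sampling scheme or how tokens are allocated per frame, and `sample_frame_indices` is a hypothetical helper.

```python
MAX_VIDEO_TOKENS = 1856  # per-video token maximum reported for this model
MAX_FRAMES = 108         # per-video frame maximum reported for this model


def sample_frame_indices(total_frames, max_frames=MAX_FRAMES):
    """Pick up to max_frames evenly spaced frame indices from a video."""
    if total_frames <= 0:
        return []
    count = min(total_frames, max_frames)
    step = total_frames / count
    return [int(i * step) for i in range(count)]


indices = sample_frame_indices(total_frames=3000)
tokens_per_frame = MAX_VIDEO_TOKENS / MAX_FRAMES  # average at the maximum
print(len(indices), indices[:3], round(tokens_per_frame, 1))
```

At the maximum, the budget averages out to roughly 17 tokens per frame, which shows how small the per-frame footprint has to be for a 108-frame clip to fit.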

Benchmarks

Text benchmarks

Model	KMMLU (5-shot, acc)	HAE-RAE (5-shot, acc)	CLiCK (5-shot, acc)	KoBEST (5-shot, acc)
HyperCLOVAX-SEED-Text-Base-3B	0.4847	0.7635	0.6386	0.7792
HyperCLOVAX-SEED-Vision-Instruct-3B	0.4422	0.6499	0.5599	0.7180
Qwen2.5-3B-instruct	0.4451	0.6031	0.5649	0.7053
gemma-3-4b-it	0.3895	0.6059	0.5303	0.7262
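
One way to read the text benchmark table is to compare the models benchmark by benchmark. The sketch below copies the scores from the table and reports the leader on each benchmark among the three instruction-tuned models, plus an unweighted mean per model. The unweighted mean is a summary chosen here for illustration and is not a figure reported by the source.

```python
# Scores copied from the text benchmark table (5-shot accuracy).
BENCHMARKS = ["KMMLU", "HAE-RAE", "CLiCK", "KoBEST"]
SCORES = {
    "HyperCLOVAX-SEED-Text-Base-3B": [0.4847, 0.7635, 0.6386, 0.7792],
    "HyperCLOVAX-SEED-Vision-Instruct-3B": [0.4422, 0.6499, 0.5599, 0.7180],
    "Qwen2.5-3B-instruct": [0.4451, 0.6031, 0.5649, 0.7053],
    "gemma-3-4b-it": [0.3895, 0.6059, 0.5303, 0.7262],
}


def mean_score(model):
    """Unweighted mean of a model's four benchmark scores."""
    return sum(SCORES[model]) / len(SCORES[model])


def best_model(benchmark, models):
    """Return the model with the highest score on one benchmark."""
    column = BENCHMARKS.index(benchmark)
    return max(models, key=lambda model: SCORES[model][column])


# Exclude the base (non-instruct) model from the head-to-head comparison.
instruct_models = [m for m in SCORES if m != "HyperCLOVAX-SEED-Text-Base-3B"]
for benchmark in BENCHMARKS:
    print(benchmark, "->", best_model(benchmark, instruct_models))
for model in SCORES:
    print(model, round(mean_score(model), 4))
```

On these numbers, HyperCLOVAX-SEED-Vision-Instruct-3B has the highest unweighted mean of the three instruction-tuned models, although it does not lead on every individual benchmark.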

Vision benchmarks

Model	Max tokens per video	VideoMME (Ko)	NAVER-TV-CLIP (Ko)	VideoChatGPT (Ko)	PerceptionTest (En)	ActivityNet-QA (En)	KoNet (Ko)	MMBench-Val (En)	TextVQA-Val (En)	Korean VisIT-Bench (Ko)	Image (4 benchmarks)	Video (5 benchmarks)	All (9 benchmarks)
HyperCLOVAX-SEED-Vision-Instruct-3B	1856 tokens, 108 frames	48.2	61.0	53.6	55.2	50.6	69.2	81.8	79.2	37.0	46.68	53.70	59.54
HyperCLOVAX-SEED-Vision-Instruct-3B (without OCR)	1856 tokens, 108 frames	48.2	61.0	53.6	55.2	50.6	36.6	80.7	76.0	43.5	56.74	53.70	55.05
Qwen-2.5-VL-3B	24576 tokens, 768 frames	55.1	48.3	45.6	66.9	55.7	58.3	84.3	79.6	81.5	59.35	54.31	56.55
Qwen-2.5-VL-3B (2000 tokens)	2000 tokens, 128 frames	50.3	43.9	44.3	58.3	54.2	58.5	84.3	79.3	15.7	59.50	50.18	54.33
Qwen-2.5-VL-7B	24576 tokens, 768 frames	60.6	66.7	51.8	70.5	56.6	68.4	88.3	84.9	85.6	69.34	61.23	64.84
Gemma-3-4B	4096 tokens, 16 frames	45.4	36.8	57.1	50.6	46.3	25.0	79.2	58.9	32.3	48.91	47.24	47.98
GPT4V (gpt-4-turbo-2024-04-09)	Unknown, original image, 8 frames	49.1	75.0	55.5	57.4	45.7	38.7	84.2	60.4	52.0	58.88	51.59	54.83
GPT4o (gpt-4o-2024-08-06)	Unknown, 512 resize, 128 frames	61.6	66.6	61.8	50.2	41.7	60.6	84.2	73.2	50.5	67.15	56.42	61.19
InternV-2-2B	4096 tokens, 16 frames	28.9	21.1	40.2	50.5	50.3	3.3	79.3	75.1	51.1	39.74	38.19	38.88
InternV-2-4B	4096 tokens, 16 frames	33.8	36.0	22.8	54.2	52.0	22.7	83.0	76.9	51.6	46.11	39.75	42.58
InternV-2-8B	4096 tokens, 16 frames	43.7	41.2	32.4	58.5	53.2	28.5	86.6	79.0	97.0	50.32	45.79	47.81