Kimi-VL-A3B-Thinking-2506 Open-Source Model: Upgraded Multimodal Reasoning for Video and Image Understanding

Kimi VL A3B Thinking 2506

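Developed by moonshotai

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking with significant improvements in multimodal reasoning, visual perception and understanding, and video scenarios. It supports higher-resolution images and thinks smarter while consuming fewer tokens.

Image-to-Text

Transformers

License: MIT #Multimodal reasoning #High-resolution image processing #Video scene understanding

Downloads: 515

Release date: 6/21/2025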

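Model Overview

This is a multimodal vision-language model focused on image-text-to-text tasks, with strong visual understanding and reasoning capabilities.

Model Features

Smarter thinking, fewer tokens

Achieves better accuracy on multimodal reasoning benchmarks while reducing the average thinking length by 20%.

Improved visual perception and understanding

Matches or exceeds the non-thinking model (Kimi-VL-A3B-Instruct) on general visual perception and understanding.

Video scenarios

Improves on video reasoning and understanding benchmarks, setting a new state of the art for open-source models on VideoMMMU (65.2).

Higher-resolution support

Supports 3.2 million total pixels per image, 4x the previous version, which brings notable improvements on high-resolution perception and OS-agent grounding benchmarks.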

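Model Capabilities

Multimodal reasoning

Visual perception

Image understanding

Video understanding

High-resolution image processing

Long-text processing

Mathematical reasoning

Document processing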

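Use Cases

Visual question answering

Image content recognition

Identify objects or scenes in an image

For example, accurately identifying a cat's breed

Video understanding

Video content analysis

Understand scenes and actions in videos

Scores 65.2 (Pass@1) on the VideoMMMU benchmark

Mathematical reasoning

Visual math problem solving

Solve math problems that include visual elements

Scores 80.1 (Pass@1) on the MathVista_MINI benchmark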

🚀 Kimi-VL-A3B-Thinking-2506

Basic Information

Property	Details
Base model	moonshotai/Kimi-VL-A3B-Instruct
License	MIT
Task type	Image-Text-to-Text
Library	transformers

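⚠️ Important

This is an improved version of Kimi-VL-A3B-Thinking. Please consider using this updated model instead of the previous version.

💡 Usage Tip

For the recommended inference recipes for this model, see our technical blog: Kimi-VL-A3B-Thinking-2506: A Quick Navigation

📄 Tech Report | 📄 GitHub | 💬 Chat Web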

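✨ Key Features

This is an updated version of Kimi-VL-A3B-Thinking with the following improvements:

Smarter thinking, fewer tokens: The 2506 version achieves better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.3), and 64.0 on MMMU (+2.1), while reducing the average thinking length by 20%.
Clearer vision with thinking: Unlike the previous version, which specialized in thinking tasks, the 2506 version also matches or exceeds our non-thinking model (Kimi-VL-A3B-Instruct) on general visual perception and understanding, for example MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), and MMVet (78.4).
Extended to video scenarios: The 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2) while retaining good general video understanding (71.9 on Video-MME, on par with Kimi-VL-A3B-Instruct).
Higher-resolution support: The 2506 version supports 3.2 million total pixels in a single image, 4x the previous version. This brings notable improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, and 52.5 on OSWorld-G (full set, with refusal).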
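
To make the 3.2-million-pixel figure concrete, here is a minimal sketch of checking an image against that budget and computing a downscaled size that fits. The function name `fit_to_pixel_budget` is hypothetical and not part of any Kimi-VL API, the model's own processor presumably handles resizing itself, and the 0.8-million-pixel figure for the previous version is only inferred from the "4x" statement above.

```python
import math

MAX_PIXELS_2506 = 3_200_000                 # per-image budget quoted above
MAX_PIXELS_PREVIOUS = MAX_PIXELS_2506 // 4  # inferred from "4x the previous version"


def fit_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS_2506) -> tuple[int, int]:
    """Return (width, height) scaled down, aspect ratio preserved, to fit max_pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height  # already within budget, leave untouched
    scale = math.sqrt(max_pixels / (width * height))
    # Floor so the result never exceeds the budget; keep at least 1 pixel per side.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_to_pixel_budget(1920, 1080))  # 2,073,600 pixels: fits, returned unchanged
print(fit_to_pixel_budget(3840, 2160))  # 8,294,400 pixels: scaled down to fit
print(fit_to_pixel_budget(1920, 1080, MAX_PIXELS_PREVIOUS))  # would not fit the inferred older budget
```

Under these assumptions a 1080p screenshot fits the new budget without resizing, which is one plausible reason the higher limit matters for screen-grounding benchmarks.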

📈 Performance

Comparison with efficient models and the two previous versions of Kimi-VL

Benchmark (Metric)	GPT-4o	Qwen2.5-VL-7B	Gemma3-12B-IT	Kimi-VL-A3B-Instruct	Kimi-VL-A3B-Thinking	Kimi-VL-A3B-Thinking-2506
General Multimodal
MMBench-EN-v1.1 (Acc)	83.1	83.2	74.6	82.9	76.0	84.4
RealWorldQA (Acc)	75.4	68.5	59.1	68.1	64.0	70.0
OCRBench (Acc)	815	864	702	864	864	869
MMStar (Acc)	64.7	63.0	56.1	61.7	64.2	70.4
MMVet (Acc)	69.1	67.1	64.9	66.7	69.5	78.1
Reasoning
MMMU (val, Pass@1)	69.1	58.6	59.6	57.0	61.7	64.0
MMMU-Pro (Pass@1)	51.7	38.1	32.1	36.0	43.2	46.3
Math
MATH-Vision (Pass@1)	30.4	25.0	32.1	21.7	36.8	56.9
MathVista_MINI (Pass@1)	63.8	68.0	56.1	68.6	71.7	80.1
Video
VideoMMMU (Pass@1)	61.2	47.4	57.0	52.1	55.5	65.2
MMVU (Pass@1)	67.4	50.1	57.0	52.7	53.0	57.5
Video-MME (w/ sub.)	77.2	71.6	62.1	72.7	66.0	71.9
Agent Grounding
ScreenSpot-Pro (Acc)	0.8	29.0	-	35.4	-	52.8
ScreenSpot-V2 (Acc)	18.1	84.2	-	92.8	-	91.4
OSWorld-G (Acc)	-	31.5	-	41.6	-	52.5
Long Document
MMLongBench-DOC (Acc)	42.8	29.6	21.3	35.1	32.5	42.1
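
As a quick cross-check of the gains over the previous thinking model, the sketch below recomputes them from the Kimi-VL-A3B-Thinking and Kimi-VL-A3B-Thinking-2506 columns of the table above. The dictionary layout and the helper name `improvement` are illustrative choices, not part of the model card.

```python
# Scores copied from the table above:
# (Kimi-VL-A3B-Thinking, Kimi-VL-A3B-Thinking-2506)
SCORES = {
    "MMMU (val)": (61.7, 64.0),
    "MMMU-Pro": (43.2, 46.3),
    "MATH-Vision": (36.8, 56.9),
    "MathVista_MINI": (71.7, 80.1),
    "VideoMMMU": (55.5, 65.2),
}


def improvement(benchmark: str) -> float:
    """Point gain of the 2506 version over the previous thinking model."""
    previous, current = SCORES[benchmark]
    # Round to one decimal to hide floating-point noise such as 20.099999...
    return round(current - previous, 1)


for name in SCORES:
    print(f"{name}: {improvement(name):+.1f}")
```

Computed this way, MATH-Vision and MathVista_MINI reproduce the +20.1 and +8.4 quoted in the feature list. MMMU-Pro and MMMU come out at +3.1 and +2.3, slightly different from the quoted +3.3 and +2.1, so the quoted figures may rest on a different baseline run.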

Comparison with 30B-70B open-source models

Benchmark (Metric)	Kimi-VL-A3B-Thinking-2506	Qwen2.5-VL-32B	Qwen2.5-VL-72B	Gemma3-27B-IT
General Multimodal
MMBench-EN-v1.1 (Acc)	84.4	-	88.3	78.9
RealWorldQA (Acc)	70.0	-	75.7	62.5
OCRBench (Acc)	869	-	885	753
MMStar (Acc)	70.4	69.5	70.8	63.1
MMVet (Acc)	78.1	-	74.0	71.0
Reasoning
MMMU (val, Pass@1)	64.0	70.0	70.2	64.9
MMMU-Pro (Pass@1)	46.3	49.5	51.1	-
Math
MATH-Vision (Pass@1)	56.9	38.4	38.1	35.4
MathVista_MINI (Pass@1)	80.1	74.7	74.8	59.8
Video
VideoMMMU (Pass@1)	65.2	-	60.2	61.8
MMVU (Pass@1)	57.5	-	62.9	61.3
Video-MME (w/ sub.)	71.9	70.5/77.9	73.3/79.1	-
Agent Grounding
ScreenSpot-Pro (Acc)	52.8	39.4	43.6	-
ScreenSpot-V2 (Acc)	91.4	-	-	-
OSWorld-G (Acc)	52.5	46.5	-	-
Long Document
MMLongBench-DOC (Acc)	42.1	-	38.8	-

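💻 Usage Examples

Basic Usage

Inference with vLLM (recommended)

Because this is a long-decoding model that can generate up to 32K tokens, we recommend running inference with vLLM, which already supports the Kimi-VL series.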
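
The vLLM example in this section sets `max_model_len=131072` and `max_tokens=32768`. Since vLLM's context length covers prompt and output together, the sketch below works out how many tokens remain for the prompt (text plus image tokens) if generation uses its full allowance. The helper name `prompt_token_budget` is hypothetical.

```python
MAX_MODEL_LEN = 131_072   # max_model_len passed to LLM(...) in this section
MAX_NEW_TOKENS = 32_768   # max_tokens passed to SamplingParams(...) in this section


def prompt_token_budget(max_model_len: int = MAX_MODEL_LEN,
                        max_new_tokens: int = MAX_NEW_TOKENS) -> int:
    """Tokens left for the prompt when the output may use its full allowance."""
    if max_new_tokens >= max_model_len:
        raise ValueError("max_new_tokens must be smaller than max_model_len")
    return max_model_len - max_new_tokens


print(prompt_token_budget())  # 98304
```

Lowering `max_model_len` to save GPU memory shrinks this prompt budget by the same amount, which matters for multi-image or long-document inputs.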

MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation

⚠️ Important

Be sure to install flash-attn explicitly to avoid CUDA out-of-memory errors.

from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256}
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


import requests
from PIL import Image

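def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Thinking was opened but never closed (e.g. truncated output): return empty parts.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by starting at the beginning of the text.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text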

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url,stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
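
To see how the `◁think▷` markers are split without loading the model, here is a standalone sketch that runs a copy of the helper on hand-written strings. The sample completions are invented for illustration and are not real model output. This copy returns a two-element tuple in every branch so that unpacking always works.

```python
BOT, EOT = "◁think▷", "◁/think▷"


def extract_thinking_and_summary(text: str, bot: str = BOT, eot: str = EOT) -> tuple[str, str]:
    """Split a completion into (thinking, summary) around the think markers."""
    if bot in text and eot not in text:
        return "", ""  # thinking never closed, e.g. output was truncated
    if eot in text:
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text  # no markers at all: treat everything as the summary


# Invented sample completions covering the three branches.
complete = "◁think▷Short legs, long body.◁/think▷Munchkin"
truncated = "◁think▷Still reasoning about the ears"
plain = "Munchkin"

print(extract_thinking_and_summary(complete))   # ('Short legs, long body.', 'Munchkin')
print(extract_thinking_and_summary(truncated))  # ('', '')
print(extract_thinking_and_summary(plain))      # ('', 'Munchkin')
```

An empty summary alongside an empty thinking string is a hint that generation hit the token limit before the closing marker, in which case raising `max_tokens` may help.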

Inference with 🤗 Hugging Face Transformers

This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Thinking was opened but never closed (e.g. truncated output): return empty parts.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by starting at the beginning of the text.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Download each image first; Image.open cannot read a URL string directly.
image_paths = [url]
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
# temperature only takes effect when sampling is enabled
generated_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
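
The `generated_ids_trimmed` step above exists because `generate` on a causal language model returns the prompt tokens followed by the newly generated ones. The sketch below shows the same slicing on plain Python lists. The token ids are made up and the helper name `trim_prompt_tokens` is hypothetical.

```python
def trim_prompt_tokens(input_ids: list[list[int]],
                       generated_ids: list[list[int]]) -> list[list[int]]:
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


# Made-up token ids for a batch of two prompts padded to the same length.
prompt_batch = [[101, 7, 8, 9], [101, 0, 5, 6]]
output_batch = [[101, 7, 8, 9, 42, 43], [101, 0, 5, 6, 44, 45]]

print(trim_prompt_tokens(prompt_batch, output_batch))  # [[42, 43], [44, 45]]
```

Without this step, decoding would repeat the full prompt at the start of every response.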

📚 Citation

@misc{kimiteam2025kimivltechnicalreport,
      title={{Kimi-VL} Technical Report}, 
      author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
      year={2025},
      eprint={2504.07491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07491}, 
}