Kimi-VL-A3B-Thinking-2506 open-source model: upgraded multimodal reasoning with smarter video and image processing

Kimi VL A3B Thinking 2506

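Developed by moonshotai

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking. It improves markedly on multimodal reasoning, visual perception and understanding, and video scenarios, supports higher-resolution images, and thinks smarter while consuming fewer tokens.

Image-to-Text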

Transformers

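Open-source license: MIT #Multimodal reasoning #High-resolution image processing #Video scene understanding

Downloads: 515

Release date: 6/21/2025

Model overview

This is a multimodal vision-language model specialized for image-text-to-text tasks, with strong visual understanding and reasoning capabilities.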

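Model features

Smarter thinking with fewer tokens

Achieves higher accuracy on multimodal reasoning benchmarks while reducing the average required thinking length by 20%.

Improved visual perception and understanding

Matches or surpasses non-thinking models on general visual perception and understanding.

Video scenario handling

Improves on video reasoning and understanding benchmarks and sets a new state of the art among open-source models.

High-resolution support

Supports a total of 3.2 million pixels per image, four times that of the previous version, which brings significant improvements on high-resolution perception and OS-agent grounding benchmarks.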

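Model capabilities

Multimodal reasoning

Visual perception

Image understanding

Video understanding

High-resolution image processing

Long-context processing

Mathematical reasoning

Document processing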

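Use cases

Visual question answering

Image content recognition

Identify objects and scenes in an image

For example, accurately identify the breed of a cat

Video understanding

Video content analysis

Understand scenes and actions in a video

Scores 65.2 on the VideoMMMU benchmark

Mathematical reasoning

Visual math problem solving

Solve math problems that contain visual elements

Scores 80.1 on the MathVista_MINI benchmark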

🚀 Kimi-VL-A3B-Thinking-2506

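Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking, with marked improvements in multimodal reasoning, visual perception and understanding, and video scenarios. It also supports higher-resolution images and thinks more effectively while consuming fewer tokens.

Basic information

Attribute	Details
Base model	moonshotai/Kimi-VL-A3B-Instruct
License	MIT
Task type	Image-text-to-text
Library	transformers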

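⚠️ Important note

This is an improved version of Kimi-VL-A3B-Thinking. Please consider using this updated model.

💡 Usage tip

For the recommended inference recipes for this model, see our technical blog: Kimi-VL-A3B-Thinking-2506: A Quick Navigation

📄 Tech Report | 📄 Github | 💬 Chat Web

✨ Key features

This is an updated version of Kimi-VL-A3B-Thinking with the following improvements: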

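Smarter thinking with fewer tokens: the 2506 version reaches higher accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.3), and 64.0 on MMMU (+2.1), while reducing the average thinking length by 20%.
Clearer vision with thinking: unlike the previous version, which specialized in thinking tasks, the 2506 version also achieves the same or better ability on general visual perception and understanding. On MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), and MMVet (78.4), it matches or surpasses the non-thinking model (Kimi-VL-A3B-Instruct).
Extension to video scenarios: the 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2) while retaining good general video understanding (71.9 on Video-MME, on par with Kimi-VL-A3B-Instruct).
Higher-resolution support: the 2506 version supports a total of 3.2 million pixels in a single image, four times that of the previous version. This yields substantial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, and 52.5 on OSWorld-G (full set with refusal).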
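As a rough illustration of the 3.2-million-pixel figure, the sketch below checks whether an image fits within that budget and, if not, computes a downscaled size that preserves the aspect ratio. The helper name `fit_to_pixel_budget` and the idea of treating the figure as a hard limit are assumptions made for illustration; the text does not describe how the model's own processor resizes images.

```python
import math

# Assumption: the 3.2-million-pixel figure from the text, treated as a hard budget.
MAX_PIXELS = 3_200_000


def fit_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down, keeping the aspect ratio, so that
    width * height <= max_pixels. Images already within budget are unchanged."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    # Scaling both sides by sqrt(ratio) scales the area by ratio.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_to_pixel_budget(1920, 1080))  # about 2.07 MP, already within budget
print(fit_to_pixel_budget(4000, 3000))  # 12 MP, scaled down to fit
```

Rounding down on both sides keeps the result at or below the budget, at the cost of a very small change in aspect ratio.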

📈 Performance evaluation

Comparison with efficient models and the two previous versions of Kimi-VL

Benchmark (metric)	GPT-4o	Qwen2.5-VL-7B	Gemma3-12B-IT	Kimi-VL-A3B-Instruct	Kimi-VL-A3B-Thinking	Kimi-VL-A3B-Thinking-2506
General multimodal
MMBench-EN-v1.1 (Acc)	83.1	83.2	74.6	82.9	76.0	84.4
RealWorldQA (Acc)	75.4	68.5	59.1	68.1	64.0	70.0
OCRBench (Acc)	815	864	702	864	864	869
MMStar (Acc)	64.7	63.0	56.1	61.7	64.2	70.4
MMVet (Acc)	69.1	67.1	64.9	66.7	69.5	78.1
Reasoning
MMMU (val, Pass@1)	69.1	58.6	59.6	57.0	61.7	64.0
MMMU-Pro (Pass@1)	51.7	38.1	32.1	36.0	43.2	46.3
Math
MATH-Vision (Pass@1)	30.4	25.0	32.1	21.7	36.8	56.9
MathVista_MINI (Pass@1)	63.8	68.0	56.1	68.6	71.7	80.1
Video
VideoMMMU (Pass@1)	61.2	47.4	57.0	52.1	55.5	65.2
MMVU (Pass@1)	67.4	50.1	57.0	52.7	53.0	57.5
Video-MME (w/ sub.)	77.2	71.6	62.1	72.7	66.0	71.9
Agent grounding
ScreenSpot-Pro (Acc)	0.8	29.0	-	35.4	-	52.8
ScreenSpot-V2 (Acc)	18.1	84.2	-	92.8	-	91.4
OSWorld-G (Acc)	-	31.5	-	41.6	-	52.5
Long document
MMLongBench-DOC (Acc)	42.8	29.6	21.3	35.1	32.5	42.1
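To read the table as version-over-version gains, one can subtract the Kimi-VL-A3B-Thinking column from the Kimi-VL-A3B-Thinking-2506 column. The sketch below does this for four rows copied from the table; the `gains` helper is a hypothetical name used only for this illustration.

```python
# Scores copied from the table above as
# (Kimi-VL-A3B-Thinking, Kimi-VL-A3B-Thinking-2506).
scores = {
    "MATH-Vision": (36.8, 56.9),
    "MathVista_MINI": (71.7, 80.1),
    "VideoMMMU": (55.5, 65.2),
    "MMStar": (64.2, 70.4),
}


def gains(table: dict) -> dict:
    """Per-benchmark gain of the newer model, rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (old, new) in table.items()}


for name, delta in gains(scores).items():
    print(f"{name}: +{delta}")
```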

Comparison with 30B-70B open-source models

Benchmark (metric)	Kimi-VL-A3B-Thinking-2506	Qwen2.5-VL-32B	Qwen2.5-VL-72B	Gemma3-27B-IT
General multimodal
MMBench-EN-v1.1 (Acc)	84.4	-	88.3	78.9
RealWorldQA (Acc)	70.0	-	75.7	62.5
OCRBench (Acc)	869	-	885	753
MMStar (Acc)	70.4	69.5	70.8	63.1
MMVet (Acc)	78.1	-	74.0	71.0
Reasoning
MMMU (val, Pass@1)	64.0	70.0	70.2	64.9
MMMU-Pro (Pass@1)	46.3	49.5	51.1	-
Math
MATH-Vision (Pass@1)	56.9	38.4	38.1	35.4
MathVista_MINI (Pass@1)	80.1	74.7	74.8	59.8
Video
VideoMMMU (Pass@1)	65.2	-	60.2	61.8
MMVU (Pass@1)	57.5	-	62.9	61.3
Video-MME (w/ sub.)	71.9	70.5/77.9	73.3/79.1	-
Agent grounding
ScreenSpot-Pro (Acc)	52.8	39.4	43.6	-
ScreenSpot-V2 (Acc)	91.4	-	-	-
OSWorld-G (Acc)	52.5	46.5	-	-
Long document
MMLongBench-DOC (Acc)	42.1	-	38.8	-

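💻 Usage examples

Basic usage

Inference with VLLM (recommended)

Because this is a long-decode model that can generate up to 32K tokens, we recommend running inference with VLLM, which supports the Kimi-VL series.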

MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation

⚠️ Important note

Install flash-attn explicitly to avoid CUDA out-of-memory errors.
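Before loading the model, it can be useful to confirm that flash-attn is actually importable in the active environment. The following is a minimal sketch using only the standard library; `has_module` is a hypothetical helper name, and `flash_attn` is the import name of the flash-attn package.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if not has_module("flash_attn"):
    print("flash_attn not found; install flash-attn before loading the model.")
```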

from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256}
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


import requests
from PIL import Image

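def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Thinking was opened but never closed (e.g. the token limit was hit): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # If the opening marker is missing, treat everything before eot as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    # No markers at all: the whole text is the answer.
    return "", text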

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url,stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
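The parsing helper above separates the reasoning trace from the final answer using the ◁think▷ and ◁/think▷ markers. The sketch below is a self-contained variant, under the assumption that a caller always wants a (thinking, summary) pair back, run on three made-up outputs: a complete trace, a trace cut off before the closing marker, and an answer with no markers. The name `split_thinking` and the sample strings are illustrative only.

```python
BOT, EOT = "◁think▷", "◁/think▷"


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, summary). Always a 2-tuple, so unpacking never fails."""
    if EOT in text:
        before, after = text.split(EOT, 1)
        # Keep what follows the opening marker; if it is absent, keep all of `before`.
        thinking = before.split(BOT, 1)[-1]
        return thinking.strip(), after.strip()
    if BOT in text:
        # Opening marker without a closing one: the trace was cut off.
        return "", ""
    return "", text.strip()


samples = [
    "◁think▷Short fur, round face.◁/think▷Persian",
    "◁think▷Still reasoning when the token limit was hit",
    "Persian",
]
for sample in samples:
    print(split_thinking(sample))
```

Returning a pair in every branch means `thinking, summary = ...` works even when generation stops in the middle of the reasoning trace.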

🤗 Inference with Hugging Face Transformers

This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
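The recommended versions can be checked programmatically before running the example. This is a sketch under stated assumptions: the helper names are hypothetical, only the first three numeric fields of a version string are compared, and the transformers pin is treated as an exact match because the text writes it with `=`.

```python
import re
import sys
from importlib import metadata

# (package, wanted version, mode) taken from the recommendation in the text.
REQUIREMENTS = [("torch", "2.1.0", "min"), ("transformers", "4.48.2", "exact")]


def parse_version(version: str) -> tuple:
    """'2.1.0+cu121' -> (2, 1, 0): keep the first three numeric fields only."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def satisfies(found: str, wanted: str, mode: str) -> bool:
    a, b = parse_version(found), parse_version(wanted)
    return a == b if mode == "exact" else a >= b


def installed_version(package: str):
    """Installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("python 3.10:", sys.version_info[:2] == (3, 10))
for package, wanted, mode in REQUIREMENTS:
    found = installed_version(package)
    ok = found is not None and satisfies(found, wanted, mode)
    print(package, found, "ok" if ok else "check")
```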

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Thinking was opened but never closed (e.g. the token limit was hit): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # If the opening marker is missing, treat everything before eot as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    # No markers at all: the whole text is the answer.
    return "", text

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Use the url variable (not the string "url") and download each image,
# since Image.open cannot read a URL string directly.
image_paths = [url]
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
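# do_sample=True is needed for temperature to take effect; otherwise generate() decodes greedily.
generated_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8)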
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
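The list comprehension that builds `generated_ids_trimmed` is needed because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. The sketch below shows the same slicing on plain Python lists; the token ids are made up and `trim_prompt` is a hypothetical name.

```python
def trim_prompt(input_ids, generated_ids):
    """For each sequence, drop the leading prompt tokens from the generated ids."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token ids (made up): a 3-token prompt followed by 2 newly generated tokens.
prompt = [[101, 7, 8]]
output = [[101, 7, 8, 42, 43]]
print(trim_prompt(prompt, output))  # [[42, 43]]
```

Decoding only the trimmed ids keeps the prompt text out of the printed response.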

📚 Citation

@misc{kimiteam2025kimivltechnicalreport,
      title={{Kimi-VL} Technical Report}, 
      author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
      year={2025},
      eprint={2504.07491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07491}, 
}