CogVLM2-Llama3-Chat-19B: an open-source multimodal large model for image understanding and dialogue


Cogvlm2 Llama3 Chat 19B

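Developed by THUDM

CogVLM2 is a multimodal large model built on Meta-Llama-3-8B-Instruct. It supports image understanding and dialogue tasks, with an 8K context length and image resolutions up to 1344x1344.

Image-to-text

Transformers

Language: English · License: Other · #Multimodal dialogue #High-resolution image understanding #8K context

Downloads: 7,805

Release date: May 16, 2024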

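Model overview

A new-generation vision-language model that performs strongly on many benchmarks and supports multimodal interaction in Chinese and English.

Model features

High-performance multimodal understanding

Substantially outperforms the previous-generation model on benchmarks such as TextVQA and DocVQA

Long-context support

Supports an 8K context length

High-resolution image processing

Supports image inputs of up to 1344x1344 pixels

Bilingual support

A Chinese-English bilingual version is available (cogvlm2-llama3-chinese-chat-19B)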

Model capabilities

Image content understanding

Document question answering

Chart analysis

Multi-turn dialogue

Cross-modal reasoning

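Use cases

Document processing

Document question answering

Parses PDF/image documents and answers questions about them

Scores 92.3 on the DocVQA benchmark

Visual question answering

Image question answering

Answers complex questions about image content

Scores 84.2 on the TextVQA benchmark

Educational support

Chart analysis

Explains and analyzes a variety of data charts

Scores 81.0 on the ChartQA benchmark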

🚀 CogVLM2

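👋 This project releases the new-generation CogVLM2 series of models. They perform well as image understanding and dialogue models and achieve high scores on many benchmarks.

👋 WeChat · 💡 Online demo · 🎈 GitHub page · 📑 Paper

📍 You can try larger-scale CogVLM models on the ZhipuAI open platform.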

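✨ Key features

We are releasing the new-generation CogVLM2 series and open-sourcing two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of CogVLM open-source models, the CogVLM2 open-source models bring the following improvements:

Significant improvements on many benchmarks such as TextVQA and DocVQA.
Support for an 8K context length.
Support for image resolutions up to 1344 * 1344.
An open-source model version that supports both Chinese and English.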
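
To make the resolution limit concrete, here is a minimal sketch that computes the size an image would have if it were scaled down, preserving aspect ratio, to fit within 1344 * 1344. The helper name `fit_within` is hypothetical and not part of the model's API. The model's own preprocessing may already resize inputs, so treat this as an illustration of the limit rather than a required step.

```python
from typing import Tuple

MAX_SIDE = 1344  # maximum width/height stated in the feature list


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Hypothetical helper: scale (width, height) down so neither side exceeds max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / max(width, height))
    # Keep each side at least 1 pixel for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2688, 1344))  # wide image: halved to fit
print(fit_within(800, 600))    # already within the limit: unchanged
```

The resulting tuple could be passed to an image library's resize call before handing the image to the model, if you prefer to control downscaling yourself.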

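Details of the open-source models in the CogVLM2 family are given in the table below.

Model name	cogvlm2-llama3-chat-19B	cogvlm2-llama3-chinese-chat-19B
Base model	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct
Language	English	Chinese, English
Model size	19B	19B
Task	Image understanding, dialogue model	Image understanding, dialogue model
Text length	8K	8K
Image resolution	1344 * 1344	1344 * 1344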
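
As a rough, back-of-the-envelope sketch of what the 19B model size implies for memory, the snippet below estimates the space needed for the weights alone. It assumes exactly 19e9 parameters (the nominal figure from the table, not an exact count) and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Hypothetical helper: memory for the weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


NOMINAL_PARAMS = 19e9  # assumption: "19B" taken as exactly 19e9 parameters

for dtype_name, nbytes in [("bfloat16 / float16", 2), ("float32", 4)]:
    print(f"{dtype_name}: ~{weight_memory_gib(NOMINAL_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions the 16-bit weights alone come to roughly 35 GiB, which is why a half-precision dtype is a natural default when loading a model of this size.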

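📊 Benchmarks

Compared with the previous generation of CogVLM open-source models, our open-source models achieve good results on many leaderboards. Their performance is competitive with some closed-source models, as shown in the table below.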

Model	Open source	LLM size	TextVQA	DocVQA	ChartQA	OCRbench	VCR_EASY	VCR_HARD	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	73.9	34.6	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	-	-	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	-	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	-	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	14.7	2.0	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	-	-	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	63.85	37.8	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	62.73	28.1	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	52.04	25.8	56.8	67.7	75.0
CogVLM2-LLaMA3	✅	8B	84.2	92.3	81.0	756	83.3	38.0	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese	✅	8B	85.0	88.4	74.7	780	79.9	25.1	42.8	60.5	78.9

All results were obtained without using any external OCR tools ("pixel only").
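
The table is easier to compare programmatically. The sketch below copies a subset of rows and four columns from it (a "-" entry becomes `None`) and reports the highest-scoring model per benchmark. The dictionary layout and the `leader` helper are illustrative choices, not something published with the model, and the result only reflects the rows included here.

```python
from typing import Dict, Optional

# Scores copied from the benchmark table; None marks a "-" (not reported) entry.
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "CogVLM1.1": {"TextVQA": 69.7, "DocVQA": None, "ChartQA": 68.3, "OCRbench": 590},
    "InternVL-1.5": {"TextVQA": 80.6, "DocVQA": 90.9, "ChartQA": 83.8, "OCRbench": 720},
    "QwenVL-Plus": {"TextVQA": 78.9, "DocVQA": 91.4, "ChartQA": 78.1, "OCRbench": 726},
    "GPT-4V": {"TextVQA": 78.0, "DocVQA": 88.4, "ChartQA": 78.5, "OCRbench": 656},
    "CogVLM2-LLaMA3": {"TextVQA": 84.2, "DocVQA": 92.3, "ChartQA": 81.0, "OCRbench": 756},
    "CogVLM2-LLaMA3-Chinese": {"TextVQA": 85.0, "DocVQA": 88.4, "ChartQA": 74.7, "OCRbench": 780},
}


def leader(benchmark: str) -> str:
    """Return the model with the highest reported score on one benchmark."""
    reported = {
        model: scores[benchmark]
        for model, scores in SCORES.items()
        if scores[benchmark] is not None
    }
    return max(reported, key=lambda model: reported[model])


for name in ["TextVQA", "DocVQA", "ChartQA", "OCRbench"]:
    print(f"{name}: {leader(name)}")
```

Among these rows, the CogVLM2 models lead on TextVQA, DocVQA, and OCRbench, while InternVL-1.5 leads on ChartQA.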

🚀 Quick start

A simple example of how to chat with the CogVLM2 model is shown below. More usage examples can be found on GitHub.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                # Rebuild the running transcript from the earlier turns.
                old_prompt = ''
                for old_query, response in history:
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
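
The text-only branch of the loop above assembles its prompt by string concatenation. As a minimal sketch, the same idea can be isolated in a pure function that is easy to test without loading the model. `build_text_only_prompt` and `SYSTEM_PREAMBLE` are hypothetical names introduced here. Unlike the loop, which appends the already expanded prompt to `history`, this sketch assumes `history` holds the raw (user query, response) pairs, so earlier turns appear only once in the prompt.

```python
from typing import List, Tuple

# System preamble taken from text_only_template in the quickstart code.
SYSTEM_PREAMBLE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_text_only_prompt(history: List[Tuple[str, str]], query: str) -> str:
    """Hypothetical helper: build a text-only prompt from raw past turns plus a new query."""
    past_turns = "".join(
        f"USER: {old_query} ASSISTANT: {response}\n" for old_query, response in history
    )
    return f"{SYSTEM_PREAMBLE} {past_turns}USER: {query} ASSISTANT:"


history: List[Tuple[str, str]] = []
print(build_text_only_prompt(history, "What is a vision-language model?"))

# Store the raw turn, not the expanded prompt.
history.append(("What is a vision-language model?", "A model that takes images and text as input."))
print(build_text_only_prompt(history, "Name one benchmark for it."))
```

For the first turn this produces the same string as `text_only_template.format(query)`. Whether the model responds best to this exact multi-turn layout is not confirmed by the source.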

📄 License

This model is released under the CogVLM2 LICENSE. For models built with Meta Llama 3, please also comply with the LLAMA3_LICENSE.

📚 Citation

If you find our work helpful, please consider citing the following papers.

@misc{hong2024cogvlm2,
  title={CogVLM2: Visual Language Models for Image and Video Understanding},
  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  year={2024},
  eprint={2408.16500},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}