CogVLM2-llama3-chinese-chat-19B: an open-source multimodal large model that is free to deploy and excels at Chinese-English conversation and image understanding

Cogvlm2 Llama3 Chinese Chat 19B

Developed by THUDM

CogVLM2 is a multimodal large model built on Meta-Llama-3-8B-Instruct. It supports both Chinese and English and offers strong image understanding and dialogue capabilities.

Image-to-Text

Transformers

English

Open Source License: Other

#Multimodal dialogue #High-resolution image understanding #Chinese-English bilingual support

Downloads 118

Release Time : 5/16/2024

Model Overview

The new-generation CogVLM2 series supports an 8K context length and image inputs at up to 1344 * 1344 resolution, and performs strongly on many benchmarks.

Model Features

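Multimodal capability

Supports joint understanding of images and text and generates text responses

High-resolution support

Supports image inputs at up to 1344 * 1344 resolution

Long-context processing

Supports context lengths of up to 8K

Bilingual support

Supports dialogue and understanding in both Chinese and English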
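
The resolution limit above can be made concrete with a small helper. The following is a minimal sketch, not part of the CogVLM2 code: `fit_within` and `MAX_SIDE` are hypothetical names, and whether the model's own preprocessing already rescales images is not stated here. It only computes the dimensions an oversized image would need so that neither side exceeds 1344 pixels while the aspect ratio is preserved.

```python
from typing import Tuple

MAX_SIDE = 1344  # maximum input resolution stated in the feature list


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Return (width, height) scaled down, if needed, so both sides are <= max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    # Guard against a side collapsing to zero on extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2688, 1344))
print(fit_within(800, 600))
```

With Pillow, the result could be passed to `image.resize(fit_within(*image.size))`, since `Image.size` is a `(width, height)` tuple.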

Model Capabilities

Image understanding

Text generation

Multimodal dialogue

Document analysis

Chart understanding

Use Cases

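Visual question answering

Image content question answering

Answers a wide range of questions about image content

Scores 85.0 on the TextVQA benchmark

Document processing

Document understanding and question answering

Parses document content and answers related questions

Scores 88.4 on the DocVQA benchmark

Chart analysis

Chart data interpretation

Understands chart content and extracts key information

Scores 74.7 on the ChartQA benchmark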

🚀 CogVLM2

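We have released the new-generation CogVLM2 series and open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 models improve on many benchmarks, handle longer text and higher-resolution images, and include an open-source version that supports both Chinese and English.

👋 WeChat · 💡 Online Demo · 🎈 GitHub Page · 📑 Paper

📍 Try the larger-scale CogVLM models on the ZhipuAI Open Platform.

🚀 Quick Start

Below is a simple example of how to chat with the CogVLM2 model. For more usage examples, see the GitHub repository.

Basic usage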

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chinese-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for old_query, response in history:
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,  
        }
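        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            # Keep only the newly generated tokens by dropping the prompt tokens.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            # Discard the end-of-text marker and anything after it.
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)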
        history.append((query, response))
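
In the text-only branch of the loop above, each query stored in `history` already contains the earlier turns, so from the third turn onward those earlier turns appear more than once in the rebuilt prompt. The sketch below shows one possible alternative that stores raw turns and builds the prompt once per request. `build_text_only_prompt` and `trim_response` are hypothetical helpers, not part of the CogVLM2 API, and whether the model responds better to this layout has not been verified here. For zero or one previous turn, the output matches what the loop above produces.

```python
from typing import List, Tuple

# Same system text as text_only_template in the quick-start example.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_text_only_prompt(turns: List[Tuple[str, str]], new_query: str) -> str:
    """Build a text-only prompt from raw (user_text, assistant_text) turns.

    Each earlier turn appears exactly once, followed by the new query.
    """
    prompt = SYSTEM_PROMPT + " "
    for user_text, assistant_text in turns:
        prompt += f"USER: {user_text} ASSISTANT: {assistant_text}\n"
    return prompt + f"USER: {new_query} ASSISTANT:"


def trim_response(decoded: str, stop_marker: str = "<|end_of_text|>") -> str:
    """Cut the decoded text at the stop marker and strip surrounding whitespace."""
    return decoded.split(stop_marker)[0].strip()


turns = [("Hi", "Hello!")]
print(build_text_only_prompt(turns, "How are you?"))
```

To use it, the caller would append `(raw_query, response)` to its own list of turns after each reply, rather than appending the fully formatted prompt.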

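✨ Key Features

The two open-source CogVLM2 models, built on Meta-Llama-3-8B-Instruct, improve on the previous generation of open-source CogVLM models in the following ways:

Significant improvements on many benchmarks, such as TextVQA and DocVQA.
Support for an 8K context length.
Support for image resolutions up to 1344 * 1344.
An open-source model version that supports both Chinese and English.

Details of the open-source models in the CogVLM2 family are shown in the table below.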

Model name	cogvlm2-llama3-chat-19B	cogvlm2-llama3-chinese-chat-19B
Base model	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct
Language	English	Chinese, English
Model size	19B	19B
Task	Image understanding, dialogue model	Image understanding, dialogue model
Text length	8K	8K
Image resolution	1344 * 1344	1344 * 1344
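
As a rough guide to what the 19B model size means for deployment, the weight memory can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes "19B" means about 19 billion parameters and that the weights are loaded in a 16-bit type (bfloat16 or float16, 2 bytes per parameter), as in the quick-start example. `weight_memory_gib` is a hypothetical helper, and the estimate ignores activations, the key-value cache, and image features.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# 19e9 parameters * 2 bytes = 38e9 bytes, i.e. roughly 35.4 GiB of weights.
print(f"{weight_memory_gib(19e9):.1f} GiB")
```

Actual usage during generation will be higher than this figure, so it is best read as a lower bound.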

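📚 Documentation

Compared with the previous generation of open-source CogVLM models, our open-source models achieve good results on many benchmarks, and their performance is competitive with some closed-source models, as shown in the table below.

Model	Open Source	LLM Size	TextVQA	DocVQA	ChartQA	OCRbench	VCR_EASY	VCR_HARD	MMMU	MMVet	MMBench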
CogVLM1.1	✅	7B	69.7	-	68.3	590	73.9	34.6	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	-	-	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	-	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	-	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	14.7	2.0	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	-	-	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	63.85	37.8	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	62.73	28.1	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	52.04	25.8	56.8	67.7	75.0
CogVLM2-LLaMA3	✅	8B	84.2	92.3	81.0	756	83.3	38.0	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese	✅	8B	85.0	88.4	74.7	780	79.9	25.1	42.8	60.5	78.9

All scores were obtained without using any external OCR tools ("pixel only").
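
To read the table programmatically, a few rows can be copied into a dictionary and compared per benchmark. The sketch below is illustrative only: it copies four rows and four columns verbatim from the table above, and `best_per_benchmark` is a hypothetical helper name. It reports which of the copied models has the highest score on each benchmark.

```python
from typing import Dict, List

BENCHMARKS = ["TextVQA", "DocVQA", "ChartQA", "OCRbench"]

# Scores copied from the benchmark table (four of its rows).
SCORES: Dict[str, List[float]] = {
    "InternVL-1.5": [80.6, 90.9, 83.8, 720],
    "GPT-4V": [78.0, 88.4, 78.5, 656],
    "CogVLM2-LLaMA3": [84.2, 92.3, 81.0, 756],
    "CogVLM2-LLaMA3-Chinese": [85.0, 88.4, 74.7, 780],
}


def best_per_benchmark(scores: Dict[str, List[float]], benchmarks: List[str]) -> Dict[str, str]:
    """Map each benchmark name to the model with the highest score on it."""
    best = {}
    for i, name in enumerate(benchmarks):
        best[name] = max(scores, key=lambda model: scores[model][i])
    return best


for benchmark, model in best_per_benchmark(SCORES, BENCHMARKS).items():
    print(f"{benchmark}: {model}")
```

Rows with missing entries ("-") are left out of this sketch; handling them would need an explicit rule, such as skipping a model for benchmarks where it has no score.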

📄 License

This model is released under the CogVLM2 license. For models built with Meta Llama 3, please also comply with the LLAMA3_LICENSE.

📚 Citation

If you find our work helpful, please consider citing the following papers.

@misc{hong2024cogvlm2,
  title={CogVLM2: Visual Language Models for Image and Video Understanding},
  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  year={2024},
  eprint={2408.16500},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}