CogVLM2-llama3-chat-19B-int4: an open-source multimodal dialogue model with bilingual support, high-resolution image input, and long conversations

Cogvlm2 Llama3 Chat 19B Int4

Developed by THUDM

CogVLM2 is a multimodal dialogue model built on Meta-Llama-3-8B-Instruct. It supports Chinese and English, an 8K context length, and images up to 1344 × 1344 pixels.

Image-to-text

Transformers

Language: English · Open-source license: Other · Tags: #multimodal dialogue #8K long text #high-resolution image understanding

Downloads: 467

Release date: 5/24/2024

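Model overview

A new-generation open model in the CogVLM2 series. It performs strongly on multiple benchmarks and supports high-resolution image understanding and long conversations.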

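Model features

Strong multimodal understanding

Strong results on benchmarks such as TextVQA and DocVQA, surpassing the previous-generation model

Long-context support

Supports conversations with an 8K context length

High-resolution image processing

Accepts image input up to 1344 × 1344 pixels

Bilingual support

Supports multimodal dialogue in both Chinese and English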

Model capabilities

Multimodal dialogue

Image content understanding

Long-form text generation

Document QA

Chart understanding

OCR

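Use cases

Document processing

Document QA

Understands the content of uploaded documents and answers questions about them

Scores 92.3 on the DocVQA benchmark

Image understanding

Image content QA

Describes image content and answers questions about it

Scores up to 85.0 on the TextVQA benchmark (the CogVLM2-LLaMA3-Chinese row in the benchmark table; CogVLM2-LLaMA3 scores 84.2)

Chart analysis

Chart understanding

Parses chart content and answers questions about it

Scores 81.0 on the ChartQA benchmark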

🚀 CogVLM2

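CogVLM2 is a new generation of models. It performs strongly on many benchmarks, supports an 8K context length, and handles images up to 1344 × 1344 pixels. An open-source model version that supports both Chinese and English is also available.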

👋 WeChat · 💡 Online Demo · 🎈 GitHub Page

📍 Try larger CogVLM models on the ZhipuAI Open Platform.

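✨ Key features

We have released the new-generation CogVLM2 series and open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of CogVLM open-source models, the CogVLM2 series brings the following improvements:

Significant gains on many benchmarks, such as TextVQA and DocVQA.
Support for an 8K context length.
Support for image resolutions up to 1344 × 1344.
An open-source model version that supports both Chinese and English.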
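
The two limits listed above (an 8K context and 1344 × 1344 images) can be sketched as simple pre-flight helpers. This is a minimal illustration, not part of the CogVLM2 code: `fit_within`, `trim_history`, and the `count_tokens` callable are hypothetical names, "8K" is assumed to mean 8192 tokens, and the token count here ignores whatever the image itself consumes. The source does not say that pre-shrinking images is required.

```python
from typing import Callable, List, Tuple

MAX_SIDE = 1344        # maximum image side stated in the feature list
CONTEXT_BUDGET = 8192  # assumption: "8K" is read as 8192 tokens


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Scale (width, height) down to fit in max_side x max_side, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    # Never round a side down to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


def trim_history(
    history: List[Tuple[str, str]],
    count_tokens: Callable[[str], int],
    budget: int = CONTEXT_BUDGET,
) -> List[Tuple[str, str]]:
    """Keep the most recent (query, response) turns whose token total fits the budget."""
    kept: List[Tuple[str, str]] = []
    used = 0
    # Walk from the newest turn backwards and stop at the first turn that does not fit.
    for query, response in reversed(history):
        cost = count_tokens(query) + count_tokens(response)
        if used + cost > budget:
            break
        kept.append((query, response))
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

In practice `count_tokens` could wrap a real tokenizer, for example `lambda s: len(tokenizer.encode(s))`, and the result of `fit_within` could be passed to `PIL.Image.Image.resize`.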

The CogVLM2 Int4 model requires 16 GB of GPU memory and runs only on Linux with an Nvidia GPU.

Model name	cogvlm2-llama3-chat-19B-int4	cogvlm2-llama3-chat-19B
GPU memory required	16G	42G
System required	Linux (with Nvidia GPU)	Linux (with Nvidia GPU)
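
The requirements table can be turned into a small selection helper. The sketch below is illustrative only: `pick_variant` is a hypothetical function, and it treats the "16G" and "42G" figures as plain gigabyte thresholds, which the source does not spell out.

```python
from typing import Optional

# GPU memory requirements in GB, copied from the requirements table.
REQUIRED_GPU_MEMORY_GB = {
    "cogvlm2-llama3-chat-19B-int4": 16,
    "cogvlm2-llama3-chat-19B": 42,
}


def pick_variant(
    gpu_memory_gb: float,
    system: str = "Linux",
    has_nvidia_gpu: bool = True,
) -> Optional[str]:
    """Return the largest variant that fits, or None if no variant can run."""
    # Both variants are listed as Linux-with-Nvidia-GPU only.
    if system != "Linux" or not has_nvidia_gpu:
        return None
    candidates = [
        (required, name)
        for name, required in REQUIRED_GPU_MEMORY_GB.items()
        if required <= gpu_memory_gb
    ]
    if not candidates:
        return None
    # Prefer the variant with the highest memory requirement (the unquantized model).
    return max(candidates)[1]


if __name__ == "__main__":
    print(pick_variant(24))  # a 24 GB card only fits the int4 variant
```

On a machine with PyTorch installed, the available memory could be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3` and passed in as `gpu_memory_gb`.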

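📊 Benchmarks

Compared with the previous generation of CogVLM open-source models, our open-source models achieve good results on many leaderboards, and their performance is competitive with some closed-source models, as shown in the table below.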

Model	Open source	LLM size	TextVQA	DocVQA	ChartQA	OCRbench	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	56.8	67.7	75.0
CogVLM2-LLaMA3 (Ours)	✅	8B	84.2	92.3	81.0	756	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese (Ours)	✅	8B	85.0	88.4	74.7	780	42.8	60.5	78.9

All evaluations were obtained without external OCR tools ("pixel only").
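
To make the comparison with the previous generation concrete, the gains can be computed from the CogVLM1.1 and CogVLM2-LLaMA3 rows of the table. This is a small arithmetic sketch; `score_gains` is a hypothetical helper, and DocVQA is left out because the table has no CogVLM1.1 score for it.

```python
from typing import Dict

# Scores copied from the CogVLM1.1 and CogVLM2-LLaMA3 rows of the benchmark table.
COGVLM_1_1: Dict[str, float] = {
    "TextVQA": 69.7, "ChartQA": 68.3, "OCRbench": 590,
    "MMMU": 37.3, "MMVet": 52.0, "MMBench": 65.8,
}
COGVLM2_LLAMA3: Dict[str, float] = {
    "TextVQA": 84.2, "DocVQA": 92.3, "ChartQA": 81.0, "OCRbench": 756,
    "MMMU": 44.3, "MMVet": 60.4, "MMBench": 80.5,
}


def score_gains(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Return new - old for every benchmark that both models report."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {name: round(new[name] - old[name], 1) for name in old if name in new}


if __name__ == "__main__":
    for name, gain in score_gains(COGVLM_1_1, COGVLM2_LLAMA3).items():
        print(f"{name}: {gain:+.1f}")
```

Note that OCRbench uses a different scale from the other columns, so its gain is not comparable with the percentage-style scores.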

🚀 Quick start

Below is a simple example of how to chat with the CogVLM2 model. For more use cases, see the GitHub repository.

Basic usage

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Use bfloat16 on GPUs with compute capability >= 8; otherwise fall back to float16.
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
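            # Drop the prompt tokens so that only newly generated tokens are decoded.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            # Keep only the text before the end-of-text marker.
            response = response.split("<|end_of_text|>")[0]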
            print("\nCogVLM2:", response)
        history.append((query, response))
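
The text-only prompt format used in the loop above can be isolated into plain string helpers, which makes it easy to inspect without loading the model. This is a sketch under assumptions: `build_text_only_prompt` and `strip_end_of_text` are hypothetical names, and unlike the loop, the builder takes raw (user, assistant) turns rather than already-formatted queries. The source does not state how `build_conversation_input_ids` combines its own `history` argument with such a prompt.

```python
from typing import List, Tuple

# Same wording as text_only_template in the quick-start example.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
END_OF_TEXT = "<|end_of_text|>"


def build_text_only_prompt(turns: List[Tuple[str, str]], query: str) -> str:
    """Build a text-only prompt from raw (user, assistant) turns and a new user query."""
    prompt = SYSTEM_PROMPT + " "
    # Each finished turn is written once, terminated by a newline.
    for user_text, assistant_text in turns:
        prompt += f"USER: {user_text} ASSISTANT: {assistant_text}\n"
    # The new query ends with an open ASSISTANT: slot for the model to fill.
    return prompt + f"USER: {query} ASSISTANT:"


def strip_end_of_text(decoded: str) -> str:
    """Cut the decoded output at the end-of-text marker and trim surrounding whitespace."""
    return decoded.split(END_OF_TEXT)[0].strip()


if __name__ == "__main__":
    print(build_text_only_prompt([("Hi", "Hello!")], "What can you do?"))
```

With an empty `turns` list the result is identical to `text_only_template.format(query)`. Trimming whitespace in `strip_end_of_text` is an addition; the loop above only splits on the marker.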

📄 License

This model is released under the CogVLM2 license. Models built with Meta Llama 3 must also comply with the LLAMA3_LICENSE.

📚 Citation

If you find our work helpful, please consider citing the following paper.

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}