CogVLM2-Llama3-Chat-19B Open-Source Multimodal Large Model - Image Understanding and Dialogue


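CogVLM2 Llama3 Chat 19B

Developed by THUDM

CogVLM2 is a multimodal large model built on Meta-Llama-3-8B-Instruct. It supports image understanding and dialogue tasks, with an 8K context length and image resolution up to 1344x1344.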

Image-to-Text

Transformers

Language: English · License: Other · #Multimodal dialogue #High-resolution image understanding #8K long-context support

Downloads: 7,805

Released: 5/16/2024

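Model Overview

A new-generation vision-language model that performs strongly on multiple benchmarks and supports multimodal interaction in Chinese and English.

Model Features

High-performance multimodal understanding

Markedly outperforms the previous generation on benchmarks such as TextVQA and DocVQA (for example, 84.2 vs. 69.7 for CogVLM1.1 on TextVQA)

Long-context support

Supports a context length of 8K

High-resolution image processing

Supports image input up to 1344x1344 pixels

Bilingual support

A Chinese-English bilingual version is available (cogvlm2-llama3-chinese-chat-19B)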

Model Capabilities

Image content understanding

Document question answering

Chart parsing

Multi-turn dialogue

Cross-modal reasoning

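Use Cases

Document processing

Document question answering

Parse PDF or image documents and answer questions about them

Scores 92.3 on the DocVQA benchmark

Visual question answering

Image content question answering

Answer complex questions about image content

Scores 84.2 on the TextVQA benchmark

Education support

Chart parsing

Explain and analyze various kinds of data charts

Scores 81.0 on the ChartQA benchmark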

🚀 CogVLM2

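We are launching the new-generation CogVLM2 series of models. The series brings significant improvements in image and text understanding, supports long content and high-resolution images, and includes an open-source version that supports both Chinese and English.

👋 WeChat · 💡 Online demo · 🎈 GitHub page · 📑 Paper

📍 Larger-scale CogVLM models are available to try on the Zhipu AI Open Platform.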

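✨ Key Features

We are launching the new-generation CogVLM2 series and have open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 open-source models offer the following improvements:

Significant gains on many benchmarks, such as TextVQA and DocVQA.
Support for 8K content length.
Support for image resolution up to 1344 * 1344.
An open-source model version that supports both Chinese and English.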
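
To make the 1344 * 1344 limit concrete, the sketch below computes a target size that fits an arbitrary image inside that bound while preserving its aspect ratio. The helper name `fit_within` is hypothetical and not part of the model's API, and whether the model's own preprocessing already rescales inputs is not stated here, so treat this as an optional pre-check under that assumption.

```python
MAX_SIDE = 1344  # maximum image side length stated in the feature list


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down, preserving aspect ratio, so neither side exceeds max_side.

    Hypothetical helper for illustration; it is not part of the CogVLM2 code base.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024))  # a typical phone photo, scaled down
print(fit_within(800, 600))    # small image, returned unchanged
```

The returned tuple can be passed to Pillow's `Image.resize` if you choose to downscale an image before handing it to the model.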

Details of the CogVLM2 open-source models are listed in the table below:

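Attribute	Details
Model name	cogvlm2-llama3-chat-19B, cogvlm2-llama3-chinese-chat-19B
Base model	Meta-Llama-3-8B-Instruct
Language support	English (cogvlm2-llama3-chat-19B); Chinese and English (cogvlm2-llama3-chinese-chat-19B)
Model size	19B
Task type	Image understanding, dialogue model
Text length	8K
Image resolution	1344 * 1344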
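
As a rough guide to what a 19B model size means for hardware, the sketch below estimates the memory needed for the weights alone at a few floating-point precisions. It assumes the nominal figure of exactly 19e9 parameters and ignores activations, the KV cache, and framework overhead, so real usage will be higher; the function name is hypothetical.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

NOMINAL_PARAMS = 19e9  # assumption: the "19B" in the table taken at face value


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights only, in GiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB for weights")
```

Under these assumptions the half-precision weights alone need roughly 35 GiB, which is worth checking against the available GPU memory before loading the model.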

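📚 Detailed Documentation

Benchmarks

Compared with the previous generation of open-source CogVLM models, our open-source models achieve strong results on many leaderboards, and their performance is competitive with some closed-source models, as shown in the table below: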

Model	Open source	LLM size	TextVQA	DocVQA	ChartQA	OCRbench	VCR_EASY	VCR_HARD	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	73.9	34.6	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	-	-	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	-	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	-	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	14.7	2.0	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	-	-	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	63.85	37.8	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	62.73	28.1	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	52.04	25.8	56.8	67.7	75.0
CogVLM2-LLaMA3	✅	8B	84.2	92.3	81.0	756	83.3	38.0	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese	✅	8B	85.0	88.4	74.7	780	79.9	25.1	42.8	60.5	78.9

All evaluations were run without any external OCR tools ("pixels only").
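
To quantify the improvement over the previous generation, the sketch below subtracts the CogVLM1.1 row from the CogVLM2-LLaMA3 row of the table above. The scores are copied from the table; DocVQA is left out because no CogVLM1.1 score is reported for it. The variable and function names are illustrative only.

```python
BENCHMARKS = ("TextVQA", "ChartQA", "OCRbench", "VCR_EASY",
              "VCR_HARD", "MMMU", "MMVet", "MMBench")

# Scores copied from the benchmark table (DocVQA omitted: CogVLM1.1 has no entry).
cogvlm_1_1 = dict(zip(BENCHMARKS, (69.7, 68.3, 590, 73.9, 34.6, 37.3, 52.0, 65.8)))
cogvlm2 = dict(zip(BENCHMARKS, (84.2, 81.0, 756, 83.3, 38.0, 44.3, 60.4, 80.5)))


def score_gains(new: dict, old: dict) -> dict:
    """Per-benchmark difference new - old, rounded to one decimal place."""
    return {name: round(new[name] - old[name], 1) for name in old}


gains = score_gains(cogvlm2, cogvlm_1_1)
for name, gain in gains.items():
    print(f"{name}: {gain:+}")
```

Note that OCRbench uses a different scale from the percentage-style benchmarks, so its gain should not be compared directly with the others.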

🚀 Quick Start

Below is a simple example of how to chat with the CogVLM2 model. More use cases can be found on our GitHub.

Basic Usage

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

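# Hugging Face model ID (a local checkpoint path also works with from_pretrained).
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Use bfloat16 on GPUs with compute capability >= 8; otherwise fall back to float16.
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16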

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('No image path entered; continuing as a plain-text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for old_query, response in history:
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
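
In the text-only branch of the example above, each stored query already contains the earlier turns, so the rebuilt prompt can repeat old text as the conversation grows. One possible alternative, sketched below, keeps the raw user turns in a separate list and rebuilds the prompt from scratch on every round. The helper `build_text_only_prompt` and the `turns` list are hypothetical and not part of the CogVLM2 code; the system prompt and the `USER:` / `ASSISTANT:` layout are copied from `text_only_template`.

```python
# System prompt copied from text_only_template in the example above.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_text_only_prompt(turns: list[tuple[str, str]], query: str) -> str:
    """Build a text-only prompt from raw (user_text, response) pairs plus the new query.

    Hypothetical helper: each earlier turn appears exactly once in the result.
    """
    prompt = SYSTEM_PROMPT + " "
    for user_text, response in turns:
        prompt += f"USER: {user_text} ASSISTANT: {response}\n"
    return prompt + f"USER: {query} ASSISTANT:"


turns = [("What is a vision-language model?", "A model that takes images and text as input.")]
print(build_text_only_prompt(turns, "Name one benchmark for it."))
```

With an empty `turns` list the result matches `text_only_template.format(query)`, so the first round behaves the same as in the original example.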

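📄 License

This model is released under the CogVLM2 license. For models built on Meta Llama 3, please also comply with the LLAMA3 license.

📑 Citation

If you find our work helpful, please consider citing the following papers: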

@misc{hong2024cogvlm2,
  title={CogVLM2: Visual Language Models for Image and Video Understanding},
  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  year={2024},
  eprint={2408.16500},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
  title={CogVLM: Visual Expert for Pretrained Language Models},
  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
  year={2023},
  eprint={2311.03079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}