CogVLM2-llama3-chat-19B-int4: open-source multimodal chat model with bilingual support, high-resolution image input, and long conversations

Cogvlm2 Llama3 Chat 19B Int4

Developed by THUDM

CogVLM2 is a multimodal chat model built on Meta-Llama-3-8B-Instruct. It supports Chinese and English, an 8K context length, and image inputs up to 1344 * 1344.

Image and text to text

Transformers

Language: English | License: Other | #MultimodalChat #8KLongContext #HighResolutionImageUnderstanding

Downloads: 467

Released: 5/24/2024

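Model Overview

A new generation of open-source models in the CogVLM2 series. It performs well on multiple benchmarks and supports high-resolution image understanding and long-text conversations.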

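Model Features

High-performance multimodal understanding

Performs well on TextVQA, DocVQA, and other benchmarks, surpassing the previous-generation model

Long context support

Supports conversations with a context length of up to 8K

High-resolution image processing

Supports image inputs up to 1344 * 1344

Bilingual support

Supports multimodal conversations in both Chinese and English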

Model Capabilities

Multimodal chat

Image content understanding

Long-text generation

Document question answering

Chart understanding

OCR

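Use Cases

Document processing

Document question answering

Understands uploaded documents and answers questions about them

Scores 92.3 on the DocVQA benchmark

Image understanding

Image question answering

Describes image content and answers questions about it

Scores 85.0 on the TextVQA benchmark (CogVLM2-LLaMA3-Chinese; the English CogVLM2-LLaMA3 scores 84.2)

Chart analysis

Chart understanding

Parses chart content and answers questions about it

Scores 81.0 on the ChartQA benchmark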

🚀 CogVLM2

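CogVLM2 is a new generation of models built on Meta-Llama-3-8B-Instruct. It performs well on multiple benchmarks, supports an 8K context length and high-resolution images, and is also available as an open-source version that supports both Chinese and English.

🚀 Quick Start

Below is a simple example of chatting with the CogVLM2 model. More use cases can be found on our GitHub.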

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
    0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        # Add a batch dimension and move every tensor to the target device.
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            # Keep only the newly generated tokens, dropping the prompt.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        # Record the turn so later prompts can replay it.
        history.append((query, response))
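
The text-only branch of the loop above builds its prompt by hand: the first turn is wrapped in the system template, and later turns replay earlier (query, response) pairs before the new question. The following is a minimal sketch of that logic as a standalone function. `build_text_only_prompt` and `TEXT_ONLY_TEMPLATE` are hypothetical names used for illustration, not part of the CogVLM2 code, and the sketch assumes `history` holds the already-formatted queries exactly as the loop stores them.

```python
TEXT_ONLY_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: {} ASSISTANT:"
)


def build_text_only_prompt(history, query):
    """Return the prompt string for a text-only turn (hypothetical helper).

    history: list of (formatted_query, response) tuples from earlier turns.
    query:   the raw text the user just typed.
    """
    if not history:
        # First turn: wrap the question in the system template.
        return TEXT_ONLY_TEMPLATE.format(query)
    # Later turns: replay every earlier exchange, then append the new question.
    old_prompt = "".join(
        f"{old_query} {response}\n" for old_query, response in history
    )
    return old_prompt + "USER: {} ASSISTANT:".format(query)


if __name__ == "__main__":
    first = build_text_only_prompt([], "What is OCR?")
    second = build_text_only_prompt([(first, "Optical character recognition.")],
                                    "Give an example.")
    print(second)
```

Because the loop stores the full prompt in `history`, earlier turns are repeated as the conversation grows. The sketch reproduces that behaviour rather than changing it, so long text-only sessions may approach the 8K context limit sooner than the visible conversation suggests.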

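✨ Key Features

Improved performance: significant gains over the previous-generation CogVLM open-source models on benchmarks such as TextVQA and DocVQA (for example, TextVQA rises from 69.7 to 84.2 in the benchmark table).
Long context support: supports a context length of 8K.
High-resolution image support: supports image resolutions up to 1344 * 1344.
Bilingual support: provides an open-source model version that supports both Chinese and English.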
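
The 1344 * 1344 limit listed above suggests keeping inputs at or below that size. Below is a minimal sketch of computing a target size that preserves aspect ratio. The function name `fit_within` is hypothetical, and the model's own preprocessing may already resize images, so treat this as an optional pre-step under that assumption rather than a requirement.

```python
MAX_SIDE = 1344  # maximum supported resolution per side, from the feature list


def fit_within(width, height, max_side=MAX_SIDE):
    """Scale (width, height) down to fit in max_side x max_side.

    The aspect ratio is preserved, and images already within the limit are
    returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_side / width, max_side / height, 1.0)
    # Round to whole pixels and never drop below 1 pixel per side.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(2688, 1344), (800, 600), (4000, 3000)]:
        print(size, "->", fit_within(*size))
```

With Pillow, the result can be passed straight to `Image.resize`, for example `image = image.resize(fit_within(*image.size))`.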

🔧 Technical Details

The CogVLM2 Int4 model requires 16 GB of GPU memory and must run on Linux with an Nvidia GPU.

Model name	cogvlm2-llama3-chat-19B-int4	cogvlm2-llama3-chat-19B
GPU memory required	16 GB	42 GB
System required	Linux (with Nvidia GPU)	Linux (with Nvidia GPU)
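
The gap between the two memory figures is roughly what weight precision alone would predict. The following is a back-of-the-envelope sketch, assuming about 19 billion parameters (taken from the "19B" in the model name), 4 bits per weight for the Int4 build, and 16 bits per weight for the full model. Activations, the KV cache, and any layers kept at higher precision are ignored, which likely accounts for the stated requirements being higher than these estimates.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 19e9  # assumption: read off the "19B" in the model name

int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights
bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # bfloat16 / float16 weights

print(f"Int4 weights:   ~{int4_gb:.1f} GB (stated requirement: 16G)")
print(f"16-bit weights: ~{bf16_gb:.1f} GB (stated requirement: 42G)")
```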

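📚 Documentation

Benchmarks

Compared with the previous generation of CogVLM open-source models, our open-source models achieve good results on many leaderboards. Their performance is competitive with some closed-source models, as shown in the table below: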

Model	Open source	LLM size	TextVQA	DocVQA	ChartQA	OCRbench	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	56.8	67.7	75.0
CogVLM2-LLaMA3 (ours)	✅	8B	84.2	92.3	81.0	756	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese (ours)	✅	8B	85.0	88.4	74.7	780	42.8	60.5	78.9

All evaluations were run without any external OCR tools ("pixel only").
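
To make the comparison with the previous generation concrete, the sketch below computes the per-benchmark gains of CogVLM2-LLaMA3 over CogVLM1.1 using only numbers from the table above. DocVQA is omitted because the table lists no CogVLM1.1 score for it. The dictionary and variable names are illustrative.

```python
# Scores copied from the benchmark table (DocVQA omitted: no CogVLM1.1 entry).
cogvlm1_1 = {"TextVQA": 69.7, "ChartQA": 68.3, "OCRbench": 590,
             "MMMU": 37.3, "MMVet": 52.0, "MMBench": 65.8}
cogvlm2 = {"TextVQA": 84.2, "ChartQA": 81.0, "OCRbench": 756,
           "MMMU": 44.3, "MMVet": 60.4, "MMBench": 80.5}

# Absolute gain on every benchmark both models report, rounded to one
# decimal place to hide floating-point noise.
gains = {name: round(cogvlm2[name] - cogvlm1_1[name], 1) for name in cogvlm1_1}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

Note that OCRbench uses a different scale from the other columns, so its gain should not be compared directly with the percentage-style scores.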

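📄 License

This model is released under the CogVLM2 license. For models built on Meta Llama 3, please also comply with the LLAMA3 license.

📚 Citation

If you find our work helpful, please consider citing the following paper: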

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}