CogVLM2-Llama3-Chat-19B: an open-source multimodal large model for image understanding and dialogue

Cogvlm2 Llama3 Chat 19B

Developed by THUDM

CogVLM2 is a multimodal large model built on Meta-Llama-3-8B-Instruct. It supports image understanding and dialogue, with an 8K context length and image inputs up to 1344x1344.

Image and text to text

Transformers

Language: English

License: Other

Tags: #multimodal-dialogue #high-resolution-image-understanding #8K-long-text-support

Downloads: 7,805

Released: May 16, 2024

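Model overview

A new-generation vision-language model that performs strongly on multiple benchmarks. The series supports multimodal interaction in English and, through cogvlm2-llama3-chinese-chat-19B, in Chinese.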

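Model features

High-performance multimodal understanding

Significantly outperforms the previous generation on benchmarks such as TextVQA and DocVQA

Long context support

Supports a context length of 8K

High-resolution image processing

Accepts image inputs up to 1344x1344 pixels

Bilingual support

A Chinese-English bilingual version is available (cogvlm2-llama3-chinese-chat-19B)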

Model capabilities

Image content understanding

Document question answering

Chart parsing

Multi-turn dialogue

Cross-modal reasoning

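Use cases

Document processing

Document question answering

Answers questions about document images, such as scanned pages or pages rendered from a PDF

Scores 92.3 on the DocVQA benchmark

Visual question answering

Image content question answering

Answers complex questions about the content of an image

Scores 84.2 on the TextVQA benchmark

Educational assistance

Chart parsing

Explains and analyzes data charts of various kinds

Scores 81.0 on the ChartQA benchmark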

🚀 CogVLM2

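The CogVLM2 series is a new generation of models with significant improvements in image and text understanding. It supports long content length and high-resolution images, and it includes an open-source version that supports both Chinese and English.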

📍 Larger-scale CogVLM models are available on the Zhipu AI Open Platform.

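✨ Key features

We are releasing the new-generation CogVLM2 series and have open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 open-source models offer the following improvements:

Significant gains on many benchmarks, including TextVQA and DocVQA.
Support for 8K content length.
Support for image resolutions up to 1344 * 1344.
An open-source model version that supports both Chinese and English.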
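
To make the 1344 * 1344 bound concrete, here is a minimal sketch of how a caller might compute a target size that fits inside that box while preserving aspect ratio. The function name `fit_within` is hypothetical and not part of CogVLM2. The model's own preprocessing may already resize inputs, so treat this as an optional illustration rather than a required step.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1344) -> Tuple[int, int]:
    """Return (width, height) scaled down to fit in a max_side x max_side box.

    Aspect ratio is preserved, and sizes that already fit are left unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    # Round to the nearest pixel, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2688, 1344))
print(fit_within(800, 600))
```

The resulting size could then be passed to Pillow's `Image.resize` before the image is handed to the model.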

The following table gives details of the CogVLM2 open-source models:

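Property	Details
Model names	cogvlm2-llama3-chat-19B, cogvlm2-llama3-chinese-chat-19B
Base model	Meta-Llama-3-8B-Instruct
Languages	English (cogvlm2-llama3-chat-19B); Chinese and English (cogvlm2-llama3-chinese-chat-19B)
Model size	19B
Task types	Image understanding, dialogue
Text length	8K
Image resolution	1344 * 1344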
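
Because the text length is limited to 8K, a long multi-turn conversation eventually has to drop old turns. The sketch below shows one way to do that. It assumes that "8K" means 8192 tokens and that 2048 tokens are reserved for the reply; both numbers, the `trim_history` helper, and the `approx_tokens` stand-in are illustrative assumptions, and the tokens consumed by an image are not modeled.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user query, model response)


def trim_history(
    history: List[Turn],
    query: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 8192,       # assumption: "8K" means 8192 tokens
    reserved_tokens: int = 2048,  # assumption: room kept for the reply
) -> List[Turn]:
    """Drop the oldest turns until history plus query fits the budget."""
    budget = max_tokens - reserved_tokens - count_tokens(query)
    kept: List[Turn] = []
    used = 0
    # Walk from the newest turn backwards so recent context survives.
    for old_query, response in reversed(history):
        cost = count_tokens(old_query) + count_tokens(response)
        if used + cost > budget:
            break
        kept.append((old_query, response))
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def approx_tokens(text: str) -> int:
    """Whitespace word count, standing in for a real tokenizer."""
    return len(text.split())
```

With a real tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`.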

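📚 Documentation

Benchmarks

Compared with the previous generation of open-source CogVLM models, our open-source models score well on several leaderboards and are competitive with some closed-source models, as shown in the table below:

Model	Open source	LLM size	TextVQA	DocVQA	ChartQA	OCRbench	VCR_EASY	VCR_HARD	MMMU	MMVet	MMBench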
CogVLM1.1	✅	7B	69.7	-	68.3	590	73.9	34.6	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	-	-	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	-	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	-	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	14.7	2.0	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	-	-	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	63.85	37.8	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	62.73	28.1	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	52.04	25.8	56.8	67.7	75.0
CogVLM2-LLaMA3	✅	8B	84.2	92.3	81.0	756	83.3	38.0	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese	✅	8B	85.0	88.4	74.7	780	79.9	25.1	42.8	60.5	78.9

All evaluations were run without any external OCR tools ("pixel only").
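
To quantify the improvement over the previous generation, the short script below computes the per-benchmark gain of CogVLM2-LLaMA3 over CogVLM1.1 from the scores in the table. DocVQA is omitted because the table lists no CogVLM1.1 score for it. The variable names are illustrative.

```python
# Scores copied from the benchmark table (DocVQA omitted: no CogVLM1.1 score).
BENCHMARKS = ["TextVQA", "ChartQA", "OCRbench", "VCR_EASY",
              "VCR_HARD", "MMMU", "MMVet", "MMBench"]
cogvlm_1_1 = [69.7, 68.3, 590, 73.9, 34.6, 37.3, 52.0, 65.8]
cogvlm2_llama3 = [84.2, 81.0, 756, 83.3, 38.0, 44.3, 60.4, 80.5]

# Absolute gain per benchmark, rounded to one decimal place.
gains = {
    name: round(new - old, 1)
    for name, old, new in zip(BENCHMARKS, cogvlm_1_1, cogvlm2_llama3)
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

On these numbers, every listed benchmark improves, with the largest gains on TextVQA (+14.5) and MMBench (+14.7) among the benchmarks scored out of 100.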

🚀 Quick start

The following is a simple example of chatting with the CogVLM2 model. More use cases are available on our GitHub.

Basic usage

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Use bfloat16 on GPUs with compute capability 8.0 or higher, float16 otherwise.
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for old_query, response in history:
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
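
The text-only prompt handling and the response clean-up in the loop above can be factored into two small pure functions, which makes them easy to test without loading the model. This is a sketch, not part of the CogVLM2 code: `build_text_only_query` and `trim_response` are hypothetical names, and an empty history is assumed to mean "first query", which matches the `text_only_first_query` flag in the example.

```python
from typing import List, Tuple

# Same prompt text as text_only_template in the example above.
TEXT_ONLY_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: {} ASSISTANT:"
)


def build_text_only_query(history: List[Tuple[str, str]], query: str) -> str:
    """Build the text-only prompt the way the interactive loop does."""
    if not history:
        # First turn: wrap the question in the full template.
        return TEXT_ONLY_TEMPLATE.format(query)
    # Later turns: replay earlier (query, response) pairs, then add the new question.
    old_prompt = "".join(f"{old_query} {response}\n" for old_query, response in history)
    return old_prompt + "USER: {} ASSISTANT:".format(query)


def trim_response(decoded: str, stop: str = "<|end_of_text|>") -> str:
    """Keep only the text before the first end-of-text marker."""
    return decoded.split(stop)[0]
```

In the loop, `query = build_text_only_query(history, query)` would replace the nested `if`/`else` for the text-only case, and `trim_response(tokenizer.decode(outputs[0]))` would replace the two-step decode and split.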

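📄 License

This model is released under the CogVLM2 license. For models built on Meta Llama 3, please also comply with the LLAMA3 license.

📑 Citation

If you find our work helpful, please consider citing the following papers: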

@misc{hong2024cogvlm2,
  title={CogVLM2: Visual Language Models for Image and Video Understanding},
  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  year={2024},
  eprint={2408.16500},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
  title={CogVLM: Visual Expert for Pretrained Language Models},
  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
  year={2023},
  eprint={2311.03079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}