Cogvlm2 Llama3 Chat 19B Int4

Developed by THUDM

CogVLM2 is a multimodal dialogue model based on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, with 8K context length and 1344*1344 resolution image processing capabilities.

Text-to-Image

Transformers

EnglishOpen Source License:Other #Multimodal Dialogue #8K Long Context #High-Resolution Image Understanding

Downloads 467

Release Time : 5/24/2024

Model Overview

The new generation of CogVLM2 open-source models performs excellently in multiple benchmarks, supporting high-resolution image understanding and long-context dialogue.

Model Features

High-Performance Multimodal Understanding

Excels in benchmarks like TextVQA and DocVQA, surpassing previous-generation models

Long Context Support

Supports 8K-length contextual dialogue

High-Resolution Image Processing

Supports image inputs up to 1344*1344 resolution

Bilingual Support

Supports multimodal dialogue in both Chinese and English

Model Capabilities

Multimodal dialogue

Image content understanding

Long-text generation

Document QA

Chart understanding

OCR capabilities

Use Cases

Document Processing

Document QA

Understand and answer questions about uploaded documents

Achieved 92.3 points on the DocVQA benchmark

Image Understanding

Image Content QA

Describe and answer questions about image content

Achieved 85.0 points on the TextVQA benchmark

Chart Analysis

Chart Understanding

Parse chart content and answer questions

Achieved 81.0 points on the ChartQA benchmark

license: other license_name: cogvlm2 license_link: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B-int4/blob/main/LICENSE

language:

en pipeline_tag: text-generation tags:
chat
cogvlm2

inference: false

CogVLM2

👋 Wechat · 💡Online Demo · 🎈Github Page

📍Experience the larger-scale CogVLM model on the ZhipuAI Open Platform.

Model introduction

We launch a new generation of CogVLM2 series of models and open source two models built with Meta-Llama-3-8B-Instruct. Compared with the previous generation of CogVLM open source models, the CogVLM2 series of open source models have the following improvements:

Significant improvements in many benchmarks such as TextVQA, DocVQA.
Support 8K content length.
Support image resolution up to 1344 * 1344.
Provide an open source model version that supports both Chinese and English.

CogVlM2 Int4 model requires 16G GPU memory and Must be run on Linux with Nvidia GPU.

Model name	cogvlm2-llama3-chat-19B-int4	cogvlm2-llama3-chat-19B
GPU Memory Required	16G	42G
System Required	Linux (With Nvidia GPU)	Linux (With Nvidia GPU)

Benchmark

Our open source models have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models, as shown in the table below:

Model	Open Source	LLM Size	TextVQA	DocVQA	ChartQA	OCRbench	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	56.8	67.7	75.0
CogVLM2-LLaMA3 (Ours)	✅	8B	84.2	92.3	81.0	756	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese (Ours)	✅	8B	85.0	88.4	74.7	780	42.8	60.5	78.9

All reviews were obtained without using any external OCR tools ("pixel only").

Quick Start

here is a simple example of how to use the model to chat with the CogVLM2 model. For More use case. Find in our github

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
    0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))

License

This model is released under the CogVLM2 LICENSE. For models built with Meta Llama 3, please also adhere to the LLAMA3_LICENSE.

Citation

If you find our work helpful, please consider citing the following papers

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご