CogVLM Open-Source Vision-Language Model - Free Deployment, Leading Performance in Multi-Cross-Modal Benchmark Tests

Cogvlm Chat Hf

Developed by THUDM

CogVLM is a powerful open-source vision-language model that achieves leading performance in multiple cross-modal benchmarks

Text-to-Image

Transformers

EnglishOpen Source License:Apache-2.0 #Multimodal Dialogue #Vision-Language Large Model #Cross-modal Reasoning

Downloads 4,816

Release Time : 11/16/2023

Model Overview

CogVLM is a Vision-Language Model (VLM) that combines visual and language processing capabilities, suitable for multimodal tasks

Model Features

Multimodal Fusion

Combines visual and language processing capabilities to achieve cross-modal understanding

High Performance

Achieves leading performance in 10 classic cross-modal benchmarks

Visual Expert Module

Unique visual expert module enhances visual understanding capabilities

Model Capabilities

Image caption generation

Visual question answering

Cross-modal understanding

Multimodal dialogue

Use Cases

Image Understanding

Image Caption Generation

Generates accurate natural language descriptions for images

Performs excellently in the Flicker30k caption generation task

Visual Question Answering

Image-based Question Answering

Answers natural language questions about image content

Ranked second in tasks such as VQAv2 and OKVQA

🚀 CogVLM

CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC. It ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. You can experience CogVLM's multimodal dialogue through the online demo.

The above weights are completely open for academic research. After filling out the questionnaire for registration, free commercial use is also permitted.

🚀 Quick Start

Hardware Requirement

Approximately 40GB of GPU VRAM is required for model inference. If you don't have a single GPU with more than 40GB of VRAM, you'll need to use the accelerate library to split the model across multiple GPUs with smaller VRAM.

Installation of Dependencies

pip install torch==2.1.0 transformers==4.35.0 accelerate==0.24.1 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.22.post7 triton==2.1.0

💻 Usage Examples

Basic Usage

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()


# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# This image captures a moment from a basketball game. Two players are prominently featured: one wearing a yellow jersey with the number
# 24 and the word 'Lakers' written on it, and the other wearing a navy blue jersey with the word 'Washington' and the number 34. The player
# in yellow is holding a basketball and appears to be dribbling it, while the player in navy blue is reaching out with his arm, possibly
# trying to block or defend. The background shows a filled stadium with spectators, indicating that this is a professional game.</s>


# vqa example
query = 'How many houses are there in this cartoon?'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/3.jpg?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='vqa')   # vqa mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# 4</s>

Advanced Usage

When the VRAM of a single GPU is insufficient, you can split the model across multiple GPUs with smaller VRAM. Here is an example for when you have two 24GB GPUs and 16GB of CPU memory. You can modify the parameters of infer_auto_device_map according to your configuration. Note that a bit less GPU VRAM is specified here to reserve some VRAM for intermediate states during inference.

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
device_map = infer_auto_device_map(model, max_memory={0:'20GiB',1:'20GiB','cpu':'16GiB'}, no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer'])
model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/chat/model',   # typical, '~/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/balabala'
    device_map=device_map,
)
model = model.eval()

# check device for weights if u want to
for n, p in model.named_parameters():
    print(f"{n}: {p.device}")

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

📚 Documentation

Method

The CogVLM model consists of four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. For more details, please refer to the Paper.

📄 License

The code in this repository is open source under the Apache-2.0 license. However, the use of the CogVLM model weights must comply with the Model License.

📖 Citation

If you find our work helpful, please consider citing the following papers:

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご