CogVLM Grounding Generalist HF

Developed by THUDM
CogVLM is a powerful open-source visual language model (VLM) that has achieved SOTA performance on multiple cross-modal benchmarks.
Downloads: 702
Release Time: 11/17/2023

Model Overview

CogVLM is a visual language model that understands images and generates related text descriptions, supporting multimodal dialogue and object localization (grounding).

Model Features

Multimodal Understanding
Capable of processing both visual and linguistic information, enabling deep interaction between images and text.
High Performance
Achieves SOTA performance on 10 classic cross-modal benchmarks, surpassing PaLI-X 55B in some tasks.
Object Localization Capability
Can provide coordinate locations for objects mentioned in an image.
Open-source Model
Code and model weights are open, facilitating research and applications.

Model Capabilities

Image caption generation
Visual question answering
Multimodal dialogue
Object detection and localization
Cross-modal understanding

Use Cases

Image Understanding
Automatic Image Annotation
Generates detailed descriptive text for images.
Performs strongly on captioning benchmarks such as COCO.
Visual Question Answering
Answers natural language questions about image content.
Ranked second on benchmarks like VQAv2 and OKVQA.
Human-Computer Interaction
Multimodal Dialogue
Natural language dialogue based on image content.
Supports complex image-related conversational interactions.
Computer Vision Assistance
Object Localization
Identifies objects in images and provides their coordinates.
Outputs bounding boxes inline in the generated text, in the format [[x0,y0,x1,y1]] (see the parsing sketch below).