🚀 CogVLM
CogVLM is a powerful open-source visual language model (VLM) with advanced object-detection (visual grounding) capabilities. This version is the grounding generalist model quantized to 4-bit precision with bitsandbytes. CogVLM-17B combines 10 billion vision parameters with 7 billion language parameters and achieves state-of-the-art performance on multiple cross-modal benchmarks, surpassing or matching PaLI-X 55B.
✨ Features
- Powerful Visual-Language Integration: CogVLM-17B effectively fuses vision and language representations, enabling high-quality cross-modal understanding.
- Benchmark-Leading Performance: It ranks first on 10 classic cross-modal benchmarks such as NoCaps and Flickr30K captioning, and second on VQAv2, OKVQA, and others.
- Quantized for Efficiency: Quantized to 4-bit precision with bitsandbytes, it balances performance against GPU memory use; a rough loading sketch follows this list.
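The checkpoint in this repository already ships pre-quantized, so no extra quantization step is needed. Purely as an illustration of how 4-bit loading with bitsandbytes works in transformers, a sketch along these lines could be applied to the full-precision grounding checkpoint; the repo id 'THUDM/cogvlm-grounding-generalist-hf' and the BitsAndBytesConfig settings below are assumptions, not the exact recipe used to produce this repository's weights.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit settings; the exact quantization recipe used for this
# repository's pre-quantized weights is not documented here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Assumed full-precision source checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-grounding-generalist-hf",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()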
📦 Installation
General Installation
pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 transformers==4.38.1 accelerate==0.27.2 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.24 protobuf==3.20.3 triton==2.1.0 bitsandbytes==0.43.0.dev0
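After installing, a quick sanity check (a generic sketch, not part of the CogVLM codebase) can confirm that the GPU is visible and the pinned libraries import correctly:

import torch
import transformers
import bitsandbytes

# Confirm CUDA is visible and the pinned versions were picked up
print("CUDA available:", torch.cuda.is_available())
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)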
Windows-Specific Installation
For triton and bitsandbytes on Windows, install the prebuilt wheels with the following commands:
pip install bitsandbytes-0.43.0.dev0-cp310-cp310-win_amd64.whl
pip install triton-2.1.0-cp310-cp310-win_amd64.whl
💻 Usage Examples
Basic Usage
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Use the Hugging Face repo id or a local model folder path
model_path = "Rodeszones/CogVLM-grounding-generalist-hf-quant4"  # or 'local/model/folder/path/here'

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()

query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')

# Build the multimodal conversation inputs and move them to the GPU
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

# Generate and decode only the newly produced tokens
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
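The decoded text interleaves the description with grounding boxes. The helper below is a hypothetical post-processing sketch: it assumes boxes appear in the [[x0,y0,x1,y1]] form requested by the prompt, with coordinates on a 0-999 normalized grid (CogVLM's usual grounding convention), and rescales them to pixel coordinates; the function name and scaling assumption are illustrative, not part of the model's API.

import re

def extract_boxes(text, image_width, image_height):
    """Parse [[x0,y0,x1,y1]] groundings from the decoded text and rescale to pixels.

    Assumes coordinates are normalized to a 0-999 grid; verify against your
    own outputs before relying on the scaling.
    """
    boxes = []
    for x0, y0, x1, y1 in re.findall(r'\[\[(\d+),(\d+),(\d+),(\d+)\]\]', text):
        boxes.append((
            int(x0) / 1000 * image_width,
            int(y0) / 1000 * image_height,
            int(x1) / 1000 * image_width,
            int(y1) / 1000 * image_height,
        ))
    return boxes

# Example usage with the variables from the snippet above:
# response = tokenizer.decode(outputs[0])
# print(extract_boxes(response, *image.size))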
📄 License
The code in this repository is open source under the Apache 2.0 license. However, use of the CogVLM model weights must comply with the Model License.
📚 Documentation
Citation
@article{wang2023cogvlm,
  title={CogVLM: Visual Expert for Pretrained Language Models},
  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
  year={2023},
  eprint={2311.03079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}