🚀 CogVLM
CogVLM is a powerful open-source visual language model (VLM) with advanced object-detection (visual grounding) capabilities. This version is the grounding generalist model quantized to 4-bit precision with bitsandbytes. CogVLM-17B combines 10 billion vision parameters with 7 billion language parameters and achieves state-of-the-art performance on multiple cross-modal benchmarks, surpassing or matching PaLI-X 55B.
✨ Features
- Powerful Visual-Language Integration: CogVLM-17B effectively fuses vision and language representations, enabling high-quality cross-modal understanding.
- Benchmark-Leading Performance: It ranks first on 10 classic cross-modal benchmarks such as NoCaps and Flickr30K captioning, and second on VQAv2, OKVQA, and others.
- Quantized for Efficiency: Quantized to 4-bit precision with bitsandbytes, it balances performance against GPU memory use; a rough loading sketch follows this list.
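The checkpoint in this repository already ships pre-quantized, so no extra quantization step is needed. Purely as an illustration of how 4-bit loading with bitsandbytes works in transformers, a sketch along these lines could be applied to the full-precision grounding checkpoint; the repo id 'THUDM/cogvlm-grounding-generalist-hf' and the BitsAndBytesConfig settings below are assumptions, not the exact recipe used to produce this repository's weights.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit settings; the exact quantization recipe used for this
# repository's pre-quantized weights is not documented here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Assumed full-precision source checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-grounding-generalist-hf",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()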
📦 Installation
General Installation
pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 transformers==4.38.1 accelerate==0.27.2 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.24 protobuf==3.20.3 triton==2.1.0 bitsandbytes==0.43.0.dev0
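After installing, a quick sanity check (a generic sketch, not part of the CogVLM codebase) can confirm that the GPU is visible and the pinned libraries import correctly:

import torch
import transformers
import bitsandbytes

# Confirm CUDA is visible and the pinned versions were picked up
print("CUDA available:", torch.cuda.is_available())
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)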
Windows-Specific Installation
For triton and bitsandbytes on Windows, install the prebuilt wheels with the following commands:
pip install bitsandbytes-0.43.0.dev0-cp310-cp310-win_amd64.whl
pip install triton-2.1.0-cp310-cp310-win_amd64.whl
💻 Usage Examples
Basic Usage
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Use the Hugging Face repo id or a local model folder path
model_path = "Rodeszones/CogVLM-grounding-generalist-hf-quant4"  # or 'local/model/folder/path/here'

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()

query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')

# Build the multimodal conversation inputs and move them to the GPU
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

# Generate and decode only the newly produced tokens
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
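The decoded text interleaves the description with grounding boxes. The helper below is a hypothetical post-processing sketch: it assumes boxes appear in the [[x0,y0,x1,y1]] form requested by the prompt, with coordinates on a 0-999 normalized grid (CogVLM's usual grounding convention), and rescales them to pixel coordinates; the function name and scaling assumption are illustrative, not part of the model's API.

import re

def extract_boxes(text, image_width, image_height):
    """Parse [[x0,y0,x1,y1]] groundings from the decoded text and rescale to pixels.

    Assumes coordinates are normalized to a 0-999 grid; verify against your
    own outputs before relying on the scaling.
    """
    boxes = []
    for x0, y0, x1, y1 in re.findall(r'\[\[(\d+),(\d+),(\d+),(\d+)\]\]', text):
        boxes.append((
            int(x0) / 1000 * image_width,
            int(y0) / 1000 * image_height,
            int(x1) / 1000 * image_width,
            int(y1) / 1000 * image_height,
        ))
    return boxes

# Example usage with the variables from the snippet above:
# response = tokenizer.decode(outputs[0])
# print(extract_boxes(response, *image.size))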
📄 License
The code in this repository is open source under the Apache 2.0 license. However, use of the CogVLM model weights must comply with the Model License.
📚 Documentation
Citation
@article{wang2023cogvlm,
  title={CogVLM: Visual Expert for Pretrained Language Models},
  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
  year={2023},
  eprint={2311.03079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}