CogVLM open-source visual language model - free support for object detection and visual question answering

Cogvlm Grounding Generalist Hf Quant4

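Developed by Rodeszones

CogVLM is a powerful open-source visual language model that supports tasks such as object detection and visual question answering. This release is quantized to 4-bit precision.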

Image-to-Text

Transformers

License: Apache-2.0 #visual-grounding #multimodal-dialogue #open-source-LLM

Downloads: 50

Released: 3/5/2024

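Model Overview

CogVLM is a visual language model with strong visual understanding and language generation capabilities. It supports tasks such as object detection and image captioning.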

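Model Features

Strong cross-modal performance

Achieves state-of-the-art results on 10 classic cross-modal benchmarks, on par with PaLI-X 55B

4-bit quantization

Quantized to 4-bit precision with bitsandbytes, lowering hardware requirements

Object grounding

Can output the coordinates of objects in an image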

Model Capabilities

Object detection

Image captioning

Visual question answering

Cross-modal understanding

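Use Cases

Image analysis

Object detection and grounding

Identifies objects in an image and reports their coordinates

Output format: object description[[x0,y0,x1,y1]]

Intelligent customer service

Visual question answering

Answers natural-language questions about image content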

🚀 CogVLM

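CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks: NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC. It ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. This repository provides the CogVLM grounding generalist model quantized to 4-bit precision with bitsandbytes.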
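To put the 4-bit quantization in perspective, the following back-of-the-envelope calculation compares weight storage at 16-bit and 4-bit precision. It is only a rough sketch: it assumes all 17 billion parameters are stored at the stated precision and ignores quantization metadata, activations, and any layers kept at higher precision, so actual GPU memory use will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# CogVLM-17B: roughly 10B visual + 7B language parameters.
total_params = 17e9

bf16_gb = weight_memory_gb(total_params, 16)  # unquantized bfloat16 weights
int4_gb = weight_memory_gb(total_params, 4)   # 4-bit quantized weights

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink to about a quarter of their bfloat16 size.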

🚀 Quick Start

Environment Setup

pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install transformers==4.38.1 accelerate==0.27.2 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.24 protobuf==3.20.3 triton==2.1.0 bitsandbytes==0.43.0.dev0

To use triton and bitsandbytes on Windows, install them from the following wheel files:

pip install bitsandbytes-0.43.0.dev0-cp310-cp310-win_amd64.whl

pip install triton-2.1.0-cp310-cp310-win_amd64.whl
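After installation, it can be useful to confirm that the environment matches the pinned versions. The sketch below uses the standard-library `importlib.metadata` module; the `PINNED` table is copied from the install command above, and the helper name `check_pins` is hypothetical. Local build tags such as `+cu121` are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins taken from the install command in this section.
PINNED = {
    "torch": "2.2.0",
    "transformers": "4.38.1",
    "accelerate": "0.27.2",
    "sentencepiece": "0.1.99",
    "einops": "0.7.0",
    "xformers": "0.0.24",
    "protobuf": "3.20.3",
    "triton": "2.1.0",
    "bitsandbytes": "0.43.0.dev0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Strip local build tags such as "+cu121" before comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is present at the expected version.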

Code Example

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

model_path = 'Rodeszones/CogVLM-grounding-generalist-hf-quant4'  # or the path to a local model folder

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()

# Chat example
query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# Example output
# a room with a ladder [[378,107,636,998]] and a blue and white towel [[073,000,346,905]].</s>
# Note: the model reports boxes on a 1000x1000 square, so keep this in mind when mapping coordinates back to the image.
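To use the boxes in practice, the generated text has to be parsed and the coordinates mapped back to pixels. The following is a minimal sketch, not part of the model's own code. It assumes each `[[...]]` group holds exactly one box and that coordinates are normalized independently along each axis to a 1000x1000 square, which should be verified against your own images. The names `parse_boxes`, `to_pixels`, and `GRID` are hypothetical.

```python
import re

# Matches a phrase followed by one box, e.g. "a ladder [[378,107,636,998]]".
BOX_PATTERN = re.compile(r"([^\[\]]+?)\s*\[\[(\d+),(\d+),(\d+),(\d+)\]\]")

GRID = 1000  # assumption: coordinates are relative to a 1000x1000 square


def parse_boxes(text):
    """Return a list of (label, (x0, y0, x1, y1)) pairs in grid coordinates.

    The label is the raw text preceding each box, so it can include
    connecting words such as "and".
    """
    results = []
    for match in BOX_PATTERN.finditer(text):
        label = match.group(1).strip()
        coords = tuple(int(value) for value in match.groups()[1:])
        results.append((label, coords))
    return results


def to_pixels(box, width, height):
    """Scale a grid-coordinate box to pixel coordinates for a width x height image."""
    x0, y0, x1, y1 = box
    return (
        round(x0 * width / GRID),
        round(y0 * height / GRID),
        round(x1 * width / GRID),
        round(y1 * height / GRID),
    )


if __name__ == "__main__":
    sample = ("a room with a ladder [[378,107,636,998]] and a blue and "
              "white towel [[073,000,346,905]].</s>")
    for label, box in parse_boxes(sample):
        # 640x480 is an arbitrary example image size.
        print(label, to_pixels(box, 640, 480))
```

With a PIL image, `image.size` supplies the `(width, height)` pair to pass to `to_pixels`.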

📄 License

The code in this repository is open-sourced under the Apache-2.0 license, while use of the CogVLM model weights must comply with the Model License.

📚 Citation

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}