CogVLM Open-Source Visual Language Model: Free Support for Object Detection and Visual Question Answering


Cogvlm Grounding Generalist Hf Quant4

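Developed by Rodeszones

CogVLM is a powerful open-source visual language model that supports tasks such as object detection and visual question answering. This release is quantized to 4-bit precision.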

Image-to-Text

Transformers

License: Apache-2.0 #visual-grounding #multimodal-dialogue #open-source-LLM

Downloads: 50

Released: 3/5/2024

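Model Overview

CogVLM is a visual language model with strong visual understanding and language generation capabilities. It supports tasks such as object detection and image captioning.

Model Features

High-performance cross-modal capability

Achieves state-of-the-art performance on 10 classic cross-modal benchmarks, on par with PaLI-X 55B

4-bit quantization

Quantized to 4-bit precision with bitsandbytes, which lowers hardware requirements

Object grounding

Can output the coordinates of objects in an image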

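Model Capabilities

Object detection

Image captioning

Visual question answering

Cross-modal understanding

Use Cases

Image analysis

Object detection and grounding

Identifies objects in an image and reports their coordinates

Output format: object description[[x0,y0,x1,y1]]

Intelligent customer service

Visual question answering

Answers natural-language questions about image content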
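The grounding output format listed under the use cases, a description followed by [[x0,y0,x1,y1]], can be split into text and box tuples with a regular expression. The following is a minimal sketch rather than part of the model's API: `parse_grounded_text` and `BOX_PATTERN` are hypothetical names, and the code assumes every box is written as exactly four comma-separated integers inside double square brackets. The sample string is the example output shown in this model card.

```python
import re

# Matches one box such as [[378,107,636,998]] (assumed format: four integers).
BOX_PATTERN = re.compile(r"\[\[(\d+),(\d+),(\d+),(\d+)\]\]")


def parse_grounded_text(text):
    """Return a list of (preceding_text, (x0, y0, x1, y1)) pairs.

    `preceding_text` is whatever the model wrote between the previous box
    (or the start of the string) and the current box.
    """
    results = []
    cursor = 0
    for match in BOX_PATTERN.finditer(text):
        phrase = text[cursor:match.start()].strip()
        # int() drops leading zeros, so "073" becomes 73.
        box = tuple(int(group) for group in match.groups())
        results.append((phrase, box))
        cursor = match.end()
    return results


sample = (
    "a room with a ladder [[378,107,636,998]] "
    "and a blue and white towel [[073,000,346,905]].</s>"
)
for phrase, box in parse_grounded_text(sample):
    print(phrase, box)
```

The text before each box is returned as-is, so connecting words such as "and" stay attached to the phrase. Trimming them is left to the caller.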

🚀 CogVLM

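CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks: NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC. It ranks second on VQAv2, OKVQA, TextVQA, and COCO captioning, surpassing or matching PaLI-X 55B. This repository contains the CogVLM grounding generalist model quantized to 4-bit precision with bitsandbytes.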
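To give a rough sense of what 4-bit quantization saves, the sketch below estimates the storage needed for the weights alone from the parameter counts quoted above. It is back-of-the-envelope arithmetic, not a measurement: `weight_memory_gb` is a hypothetical helper, 1 GB is taken as 1e9 bytes, and the figures ignore quantization metadata, any layers kept in higher precision, and activation memory, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter counts as stated in this model card.
VISION_PARAMS = 10e9
LANGUAGE_PARAMS = 7e9
total_params = VISION_PARAMS + LANGUAGE_PARAMS

for label, bits in [("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(total_params, bits):.1f} GB")
```

Under these assumptions the weights shrink by a factor of four when moving from 16-bit to 4-bit storage.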

🚀 Quick Start

Installation

pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install transformers==4.38.1 accelerate==0.27.2 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.24 protobuf==3.20.3 triton==2.1.0 bitsandbytes==0.43.0.dev0

On Windows, install triton and bitsandbytes from the following wheel files:

pip install bitsandbytes-0.43.0.dev0-cp310-cp310-win_amd64.whl

pip install triton-2.1.0-cp310-cp310-win_amd64.whl
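Since the install step pins exact versions, it can help to confirm what actually ended up in the environment before loading the model. The sketch below uses the standard-library `importlib.metadata` module for that. `PINNED` and `find_mismatches` are hypothetical names introduced for illustration, and the check strips a local build tag such as `+cu121` before comparing, which is an assumption about how the CUDA build of torch reports its version.

```python
from importlib import metadata

# Versions copied from the install command in this model card.
PINNED = {
    "torch": "2.2.0",
    "transformers": "4.38.1",
    "accelerate": "0.27.2",
    "sentencepiece": "0.1.99",
    "einops": "0.7.0",
    "xformers": "0.0.24",
    "protobuf": "3.20.3",
    "triton": "2.1.0",
    "bitsandbytes": "0.43.0.dev0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local build tag like "+cu121" before comparing.
        base = installed.split("+")[0] if installed is not None else None
        if base != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means every pinned package is present at the expected version. It does not verify that CUDA itself is usable.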

Code Example

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Hugging Face repo ID; replace with a local model folder path if you have one.
model_path = 'Rodeszones/CogVLM-grounding-generalist-hf-quant4'

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()

# Chat example
query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

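# Example output
# a room with a ladder [[378,107,636,998]] and a blue and white towel [[073,000,346,905]].</s>
# Note: the model works on a 1000x1000 square, so the box coordinates above are on that scale. Keep this in mind when mapping them onto your image.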
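Because the boxes are reported on a 1000x1000 square, they need rescaling before they can be drawn on the original image. The sketch below shows one way to do that, assuming x values scale with the image width and y values with the image height independently. `scale_box` is a hypothetical helper, and the 1920x1080 image size is made up for illustration. The box values are taken from the example output.

```python
def scale_box(box, image_width, image_height, grid=1000):
    """Map a box from the model's grid to pixel coordinates.

    Assumes x is proportional to image width and y to image height.
    """
    x0, y0, x1, y1 = box
    return (
        round(x0 * image_width / grid),
        round(y0 * image_height / grid),
        round(x1 * image_width / grid),
        round(y1 * image_height / grid),
    )


# Ladder box from the example output, mapped onto a hypothetical 1920x1080 image.
ladder_box = (378, 107, 636, 998)
print(scale_box(ladder_box, image_width=1920, image_height=1080))
```

The resulting tuple can be passed to a drawing routine such as Pillow's `ImageDraw.Draw(image).rectangle(...)` to check the result visually.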

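📄 License

The code in this repository is open source under the Apache-2.0 license, while use of the CogVLM model weights must comply with the Model License.

📚 Citation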

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}