CogVLM Open-Source Visual Language Model - Free to Deploy, SOTA Performance on Multiple Cross-Modal Benchmarks

Cogvlm Grounding Generalist Hf

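Developed by THUDM

CogVLM is a powerful open-source visual language model (VLM) that achieves state-of-the-art (SOTA) performance on multiple cross-modal benchmarks.

Image-to-Text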

Transformers

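#VisualLanguageModel #MultimodalBenchmarkSOTA #VisualExpertModule

Downloads: 702

Release date: 11/17/2023

Model Overview

CogVLM is a visual language model that understands images and generates text about them. It supports multimodal dialogue and object grounding.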

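Model Features

Multimodal understanding

Processes visual and language information together, enabling deep interaction between images and text

High performance

Achieves SOTA performance on 10 classic cross-modal benchmarks and surpasses PaLI-X 55B on some tasks

Object grounding

Returns the coordinates of objects mentioned in the image description

Open-source model

Code and model weights are openly available for research and applications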

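Model Capabilities

Image captioning

Visual question answering

Multimodal dialogue

Object detection and grounding

Cross-modal understanding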

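Use Cases

Image understanding

Automatic image annotation

Generates detailed text descriptions for images

Ranks second on the COCO captioning benchmark

Visual question answering

Answers natural-language questions about image content

Ranks second on benchmarks such as VQAv2 and OKVQA

Human-computer interaction

Multimodal dialogue

Natural-language conversation grounded in image content

Supports complex, multi-turn dialogue about an image

Computer vision assistance

Object grounding

Identifies objects in an image and reports their positions

Outputs bounding-box coordinates in the form [[x0,y0,x1,y1]]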
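Because the boxes arrive embedded in free text, downstream code usually has to pull them out first. The following is a minimal sketch of one way to do that with a regular expression. The function name `extract_boxes` and the sample reply are hypothetical, written for illustration, and the pattern only covers the single-box `[[x0,y0,x1,y1]]` form shown above.

```python
import re

# Matches one box written as [[x0,y0,x1,y1]], tolerating optional spaces.
BOX_PATTERN = re.compile(
    r"\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]"
)


def extract_boxes(text):
    """Return every [[x0,y0,x1,y1]] box found in `text` as a tuple of four ints."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


# Hypothetical model reply, written by hand for illustration only.
sample = "A dog [[120,340,560,900]] sits beside a ball [[600,700,720,820]]."
print(extract_boxes(sample))
```

If the model emits other bracket layouts in practice, the pattern would need to be extended accordingly.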

🚀 CogVLM

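CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters. It achieves state-of-the-art (SOTA) performance on 10 classic cross-modal benchmarks: NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC. It ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. You can try CogVLM's multimodal dialogue through the online demo.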
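The parameter counts give a rough lower bound on the memory needed to load the model. As a back-of-the-envelope sketch, assuming every weight is stored in bfloat16 (2 bytes each, the dtype used in the quick start) and ignoring activations, the image features, and framework overhead:

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


VISION_PARAMS = 10e9   # 10 billion visual parameters (from the text above)
LANGUAGE_PARAMS = 7e9  # 7 billion language parameters (from the text above)
BF16_BYTES = 2         # bfloat16 stores each value in 2 bytes

total_params = VISION_PARAMS + LANGUAGE_PARAMS
print(f"{weight_memory_gib(total_params, BF16_BYTES):.1f} GiB")  # about 31.7 GiB
```

Actual usage during generation will be higher than this estimate, so treat it as a floor rather than a hardware requirement.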

🚀 Quick Start

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# The tokenizer comes from the Vicuna checkpoint; the model code is fetched
# from the Hub, which is why trust_remote_code=True is required.
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-grounding-generalist-hf',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()

# Ask for a caption with a bounding box for every object it mentions.
query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/4.jpg?raw=true', stream=True).raw).convert('RGB')

# Build the model inputs, then add a batch dimension and move them to the GPU.
inputs = model.build_conversation_input_ids(tokenizer, query=query, images=[image])
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}  # greedy decoding

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
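This page does not state the units of the returned coordinates. The sketch below assumes they are normalized to a 000 to 999 grid relative to the image size, which should be verified against the upstream CogVLM documentation before relying on it. Under that assumption, a box can be mapped back to pixel coordinates as follows; `to_pixels` and its `scale` parameter are hypothetical names introduced here.

```python
def to_pixels(box, width, height, scale=1000):
    """Map a box from an assumed 0..scale-1 normalized grid to pixel coordinates.

    `box` is (x0, y0, x1, y1); `width` and `height` are the image size in pixels.
    The default scale of 1000 is an assumption, not something this page confirms.
    """
    x0, y0, x1, y1 = box
    return (
        round(x0 * width / scale),
        round(y0 * height / scale),
        round(x1 * width / scale),
        round(y1 * height / scale),
    )


# A hand-written box on a hypothetical 1000x500 image.
print(to_pixels((100, 200, 500, 900), width=1000, height=500))
```

With a PIL image, `width` and `height` would come from `image.size`.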

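📚 Documentation

CogVLM consists of four basic components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. See the paper for more details.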
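The text above only lists the components. One reading of the visual expert, suggested by the `token_type_ids` input in the quick start and by the paper's description, is that image tokens and text tokens are processed with separate weights inside each layer. The toy sketch below illustrates only that routing idea in plain Python. All names, the token-type values, and the 2x2 weights are hypothetical, and this is not the model's actual implementation.

```python
LANGUAGE_TOKEN, VISION_TOKEN = 0, 1  # assumed token-type labels for illustration


def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def route_by_token_type(hidden_states, token_types, text_weight, image_weight):
    """Apply image_weight to vision tokens and text_weight to language tokens."""
    return [
        matvec(image_weight if t == VISION_TOKEN else text_weight, h)
        for h, t in zip(hidden_states, token_types)
    ]


# Toy 2-dimensional example: identity for text tokens, doubling for image tokens.
text_w = [[1.0, 0.0], [0.0, 1.0]]
image_w = [[2.0, 0.0], [0.0, 2.0]]
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
types = [VISION_TOKEN, VISION_TOKEN, LANGUAGE_TOKEN]
print(route_by_token_type(hidden, types, text_w, image_w))
# [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0]]
```

The point of the design, under this reading, is that the language model's original weights stay in use for text tokens while image tokens get their own trainable path.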

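📄 License

The code in this repository is open source under the Apache-2.0 license, while use of the CogVLM model weights must comply with the Model License.

📖 Citation

If you find our work helpful, please consider citing the following paper: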

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}