CogVLM Open-Source Visual Language Model: Free to Deploy, Leading Results on Multiple Cross-Modal Benchmarks

Cogvlm Chat Hf

Developed by THUDM

CogVLM is a powerful open-source visual language model that achieves leading performance on multiple cross-modal benchmarks.

Image-to-Text

Transformers

Language: English

License: Apache-2.0

Tags: #multimodal-dialogue #visual-language-model #cross-modal-reasoning

Downloads: 4,816

Release date: 11/16/2023

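Model Overview

CogVLM is a visual language model (VLM) that combines visual and language processing for multimodal tasks.

Model Features

Multimodal fusion

Combines visual and language processing to enable cross-modal understanding.

High performance

Achieves state-of-the-art results on 10 classic cross-modal benchmarks.

Visual expert module

A dedicated visual expert module strengthens visual understanding.

Model Capabilities

Image captioning

Visual question answering

Cross-modal understanding

Multimodal dialogue

Use Cases

Image understanding

Image captioning

Generates accurate natural-language descriptions of images.

Performs strongly on the Flickr30k captioning task.

Visual question answering

Image-based question answering

Answers natural-language questions about image content.

Ranks second on VQAv2, OKVQA, and related tasks.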

🚀 CogVLM

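CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks: NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC. It ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. You can try multimodal dialogue with CogVLM through the online demo.

The model weights are fully open for academic research, and free commercial use is also permitted after registering through a questionnaire.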

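🚀 Quick Start

Hardware Requirements

Inference requires nearly 40 GB of GPU memory. If no single GPU has more than 40 GB, use accelerate to split the model across several GPUs with less memory each.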
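As a rough way to apply the rule above before downloading anything, the sketch below picks a loading strategy from the memory size of each GPU. The helper names `detect_gpu_memory_gib` and `plan_placement` are hypothetical, and the 40 GiB threshold is only the approximate figure quoted above, not a measured requirement.

```python
from typing import List

REQUIRED_GIB = 40.0  # approximate inference footprint quoted in the text


def detect_gpu_memory_gib() -> List[float]:
    """Return total memory of each visible CUDA device in GiB (empty if none)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / 2**30
        for i in range(torch.cuda.device_count())
    ]


def plan_placement(gpu_mem_gib: List[float], required_gib: float = REQUIRED_GIB) -> str:
    """Pick a loading strategy from per-GPU memory sizes (illustrative only)."""
    # One device large enough: load the whole model with .to('cuda').
    if any(mem >= required_gib for mem in gpu_mem_gib):
        return "single_gpu"
    # Enough memory in total: shard with accelerate. This ignores the
    # headroom that intermediate activations need, so treat it as optimistic.
    if sum(gpu_mem_gib) >= required_gib:
        return "shard_across_gpus"
    return "needs_cpu_offload_or_more_gpus"


print(plan_placement([80.0]))
print(plan_placement([24.0, 24.0]))
```

On a real machine, `plan_placement(detect_gpu_memory_gib())` gives the same decision from the detected devices.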

Install Dependencies

pip install torch==2.1.0 transformers==4.35.0 accelerate==0.24.1 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.22.post7 triton==2.1.0
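Because the command above pins exact versions, it can be useful to confirm what is actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the `find_mismatches` helper is a hypothetical name, and the pinned versions are copied from the install command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict

# Versions copied from the pip command above.
PINNED = {
    "torch": "2.1.0",
    "transformers": "4.35.0",
    "accelerate": "0.24.1",
    "sentencepiece": "0.1.99",
    "einops": "0.7.0",
    "xformers": "0.0.22.post7",
    "triton": "2.1.0",
}


def find_mismatches(pinned: Dict[str, str]) -> Dict[str, str]:
    """Return a message for every package that is missing or at another version."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (want {wanted})"
            continue
        # Local build tags such as "+cu121" are ignored when comparing.
        if installed.split("+")[0] != wanted:
            problems[name] = f"installed {installed} (want {wanted})"
    return problems


for name, problem in find_mismatches(PINNED).items():
    print(f"{name}: {problem}")
```

An empty result means every pinned package is present at the expected version.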

Code Examples

Basic Usage

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# This image captures a moment from a basketball game. Two players are prominently featured: one wearing a yellow jersey with the number
# 24 and the word 'Lakers' written on it, and the other wearing a navy blue jersey with the word 'Washington' and the number 34. The player
# in yellow is holding a basketball and appears to be dribbling it, while the player in navy blue is reaching out with his arm, possibly
# trying to block or defend. The background shows a filled stadium with spectators, indicating that this is a professional game.</s>

# vqa example
query = 'How many houses are there in this cartoon?'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/3.jpg?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='vqa')   # vqa mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# 4</s>
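Both sample outputs above end with the end-of-sequence marker `</s>`. One option is to pass `skip_special_tokens=True` to `tokenizer.decode`. Alternatively, a small post-processing step like the sketch below removes the marker from already decoded text; `strip_eos` is a hypothetical helper, not part of the model's API.

```python
def strip_eos(text: str, eos_token: str = "</s>") -> str:
    """Remove one trailing end-of-sequence marker and surrounding whitespace."""
    text = text.strip()
    # Guard against an empty marker, which would otherwise erase the text.
    if eos_token and text.endswith(eos_token):
        text = text[: -len(eos_token)]
    return text.strip()


print(strip_eos("4</s>"))  # prints 4
```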

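Advanced Usage

When a single GPU does not have enough memory, the model can be split across several smaller GPUs. The example below assumes two 24 GB GPUs and 16 GB of CPU memory; change the arguments to infer_auto_device_map to match your setup. Note that the GPU memory limits are deliberately set slightly below the physical capacity, which reserves memory for intermediate states during inference.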

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
device_map = infer_auto_device_map(model, max_memory={0:'20GiB',1:'20GiB','cpu':'16GiB'}, no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer'])
model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/chat/model',   # typical, '~/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/balabala'
    device_map=device_map,
)
model = model.eval()

# Optional: print the device each weight was placed on
for n, p in model.named_parameters():
    print(f"{n}: {p.device}")

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
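The `max_memory` argument in the example above understates each GPU's capacity to leave room for intermediate states. A minimal sketch of deriving that dictionary for other hardware follows; `build_max_memory` is a hypothetical helper, and the 4 GiB default headroom simply reproduces the 24 GB to 20 GiB reduction used above rather than a tuned value.

```python
from typing import Dict, List, Union


def build_max_memory(
    gpu_gib: List[int], cpu_gib: int, headroom_gib: int = 4
) -> Dict[Union[int, str], str]:
    """Build a max_memory mapping that reserves headroom on every GPU."""
    if headroom_gib < 0:
        raise ValueError("headroom_gib must not be negative")
    budget: Dict[Union[int, str], str] = {}
    for index, total in enumerate(gpu_gib):
        usable = total - headroom_gib
        if usable <= 0:
            raise ValueError(f"GPU {index} has no memory left after headroom")
        budget[index] = f"{usable}GiB"
    budget["cpu"] = f"{cpu_gib}GiB"
    return budget


print(build_max_memory([24, 24], cpu_gib=16))
# {0: '20GiB', 1: '20GiB', 'cpu': '16GiB'}
```

The returned dictionary can be passed as `max_memory=` to `infer_auto_device_map` in place of the hard-coded literal.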

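🔧 Technical Details

The CogVLM model consists of four basic components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. See the paper for more details.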
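The text above only names the visual expert module. As a toy illustration of the general idea, the sketch below routes each position in a sequence through one of two branches depending on whether it holds an image token or a text token, which is the role `token_type_ids` appears to play in the usage examples. Everything here is a simplification under assumptions: the label values 0 and 1, the `route_by_token_type` helper, and the scaling "branches" are hypothetical stand-ins for real attention and feed-forward weights, not the model's implementation.

```python
from typing import Callable, List, Sequence

LANGUAGE_TOKEN = 0  # assumed label for text positions
VISION_TOKEN = 1    # assumed label for image positions

Vector = List[float]
Branch = Callable[[Vector], Vector]


def route_by_token_type(
    hidden: Sequence[Vector],
    token_type_ids: Sequence[int],
    language_branch: Branch,
    vision_branch: Branch,
) -> List[Vector]:
    """Apply the vision branch to image positions and the language branch elsewhere."""
    if len(hidden) != len(token_type_ids):
        raise ValueError("hidden and token_type_ids must have the same length")
    return [
        vision_branch(vec) if kind == VISION_TOKEN else language_branch(vec)
        for vec, kind in zip(hidden, token_type_ids)
    ]


def scale(factor: float) -> Branch:
    """Stand-in for a learned transformation: multiply every element."""
    return lambda vec: [factor * x for x in vec]


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
token_types = [VISION_TOKEN, VISION_TOKEN, LANGUAGE_TOKEN]
mixed = route_by_token_type(hidden, token_types, scale(1.0), scale(10.0))
print(mixed)  # [[10.0, 20.0], [30.0, 40.0], [5.0, 6.0]]
```

The point of the sketch is only that image and text positions share one sequence while being transformed by different parameters.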

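📄 License

The code in this repository is open source under the Apache-2.0 license, while use of the CogVLM model weights must comply with the Model License.

📖 Citation

If you find our work helpful, please consider citing the following paper: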

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}