CogVLM open-source visual language model: free to deploy, with top performance on multiple cross-modal benchmarks

Cogvlm Chat Hf

Developed by THUDM

CogVLM is a powerful open-source visual language model that achieves state-of-the-art results on multiple cross-modal benchmarks.

Image-to-Text

Transformers

Language: English

Open-source license: Apache-2.0

Tags: #multimodal-dialogue #large-vision-language-model #cross-modal-reasoning

Downloads: 4,816

Release date: 11/16/2023

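Model overview

CogVLM is a visual language model (VLM) that integrates vision and language processing, making it suitable for multimodal tasks.

Model features

Multimodal fusion: integrates vision and language processing to enable cross-modal understanding.

High performance: achieves state-of-the-art results on 10 major cross-modal benchmarks.

Visual expert module: a dedicated visual expert module strengthens visual understanding.

Model capabilities

Image captioning

Visual question answering

Cross-modal understanding

Multimodal dialogue

Use cases

Image understanding

Image captioning: generates accurate natural-language descriptions of images. Performs strongly on the Flickr30k captioning task.

Visual question answering

Image-based question answering: answers natural-language questions about image content. Ranks second on VQAv2, OKVQA, and similar tasks.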

🚀 CogVLM

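CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks (NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC) and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. CogVLM can also chat with you about images.

The model weights are fully open for academic research. Free commercial use is also permitted after registering by filling out a questionnaire.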

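🚀 Quick start

Hardware requirements

Inference requires about 40 GB of GPU memory. If you do not have a single GPU with more than 40 GB of memory, use the `accelerate` library to distribute the model across several GPUs with less memory each.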
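As a rough sanity check on the 40 GB figure, the sketch below estimates the size of the weights alone and picks a loading strategy from the GPUs that are visible. It assumes the 17 billion parameters are stored in bfloat16 (2 bytes each), as in the usage examples, and treats "GB" as 10^9 bytes. The function names and the `required_gb` threshold are illustrative, not part of CogVLM or `accelerate`.

```python
from typing import List, Sequence

GB = 10 ** 9             # decimal gigabytes, matching the "40 GB" figure loosely
BYTES_PER_BF16 = 2       # assumption: weights are loaded as torch.bfloat16


def estimate_weight_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Memory taken by the weights alone (no activations, no KV cache)."""
    return num_params * bytes_per_param / GB


def choose_loading_strategy(gpu_memory_gb: Sequence[float], required_gb: float = 40.0) -> str:
    """Return 'single-gpu' if one device exceeds the requirement, else 'multi-gpu'."""
    if not gpu_memory_gb:
        return "no-gpu"
    if max(gpu_memory_gb) > required_gb:
        return "single-gpu"
    return "multi-gpu"


def detect_gpu_memory_gb() -> List[float]:
    """Total memory of each visible CUDA device, or [] if torch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / GB
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    # 10B vision + 7B language parameters, as stated above.
    print(f"bf16 weights alone: ~{estimate_weight_gb(17e9):.0f} GB")
    print("suggested strategy:", choose_loading_strategy(detect_gpu_memory_gb()))
```

The weights by themselves come to roughly 34 GB under these assumptions; the remaining headroom is needed for activations and generation state, which is consistent with the stated requirement of about 40 GB.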

Installing dependencies

pip install torch==2.1.0 transformers==4.35.0 accelerate==0.24.1 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.22.post7 triton==2.1.0

Usage examples

Basic usage

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()


# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# This image captures a moment from a basketball game. Two players are prominently featured: one wearing a yellow jersey with the number
# 24 and the word 'Lakers' written on it, and the other wearing a navy blue jersey with the word 'Washington' and the number 34. The player
# in yellow is holding a basketball and appears to be dribbling it, while the player in navy blue is reaching out with his arm, possibly
# trying to block or defend. The background shows a filled stadium with spectators, indicating that this is a professional game.</s>


# vqa example
query = 'How many houses are there in this cartoon?'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/3.jpg?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='vqa')   # vqa mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

# 4</s>
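Both sample outputs above end with the end-of-sequence marker `</s>`. Passing `skip_special_tokens=True` to `tokenizer.decode` is the usual Transformers way to drop it. As a dependency-free alternative, the sketch below trims the decoded string directly. The helper name `clean_generation` is hypothetical, and the `</s>` default is taken from the outputs shown above.

```python
def clean_generation(text: str, eos_token: str = "</s>") -> str:
    """Keep only the text before the first end-of-sequence marker, trimmed."""
    head, _, _ = text.partition(eos_token)
    return head.strip()


# Applied to the VQA output shown above.
print(clean_generation("4</s>"))  # prints: 4
```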

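Advanced usage

If no single GPU has enough memory, the model can be distributed across several GPUs with less memory each. The following example assumes two 24 GB GPUs and 16 GB of CPU memory; adjust the arguments to `infer_auto_device_map` to match your own setup.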

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
device_map = infer_auto_device_map(model, max_memory={0:'20GiB',1:'20GiB','cpu':'16GiB'}, no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer'])
model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/chat/model',   # typical, '~/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/balabala'
    device_map=device_map,
)
model = model.eval()

# Optional: check which device each weight was placed on
for n, p in model.named_parameters():
    print(f"{n}: {p.device}")

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
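The `max_memory` argument in the example above is a plain dictionary keyed by GPU index plus `'cpu'`. For a different number of GPUs it can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of `accelerate`. Note that the original example budgets 20GiB on each 24 GB card, presumably to leave room for activations; how much headroom you need is an assumption to tune for your own hardware.

```python
from typing import Dict, Union


def build_max_memory(num_gpus: int, per_gpu_gib: int, cpu_gib: int) -> Dict[Union[int, str], str]:
    """Build a max_memory mapping in the format used by infer_auto_device_map."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    budget: Dict[Union[int, str], str] = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    budget["cpu"] = f"{cpu_gib}GiB"
    return budget


# Reproduces the budget used in the two-GPU example above.
print(build_max_memory(num_gpus=2, per_gpu_gib=20, cpu_gib=16))
```

The result can be passed as `max_memory=build_max_memory(...)` in place of the literal dictionary.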

✨ Key features

The CogVLM model consists of four basic components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. See the paper for details.
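To make the role of the visual expert module more concrete, here is a minimal sketch of the routing idea described in the CogVLM paper: within a layer, image positions and text positions pass through separate projection weights. This is an illustration only, not the model's actual implementation. The token-type constants, the function names `linear` and `route_projection`, and the tiny 2x2 weights are all assumptions chosen for readability.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]

LANGUAGE_TOKEN = 0  # assumed marker for text positions
VISION_TOKEN = 1    # assumed marker for image positions


def linear(weight: Matrix, x: Sequence[float]) -> Vector:
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def route_projection(
    hidden_states: Sequence[Sequence[float]],
    token_type_ids: Sequence[int],
    text_weight: Matrix,
    image_weight: Matrix,
) -> List[Vector]:
    """Project each position with the weights that match its token type."""
    projected = []
    for state, token_type in zip(hidden_states, token_type_ids):
        weight = image_weight if token_type == VISION_TOKEN else text_weight
        projected.append(linear(weight, state))
    return projected


hidden = [[1.0, 2.0], [3.0, 4.0]]            # one image position, one text position
types = [VISION_TOKEN, LANGUAGE_TOKEN]
text_w = [[1.0, 0.0], [0.0, 1.0]]            # stand-in for the language model's weights
image_w = [[2.0, 0.0], [0.0, 2.0]]           # stand-in for the visual expert's weights

print(route_projection(hidden, types, text_w, image_w))  # [[2.0, 4.0], [3.0, 4.0]]
```

The `token_type_ids` tensor passed to `model.generate` in the usage examples presumably carries this same image/text distinction, which is why it has to be supplied alongside `input_ids`.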

📄 License

The code in this repository is open source under the Apache-2.0 license. Use of the CogVLM model weights, however, is governed by the Model License.

📚 Citation

If you find our work helpful, please consider citing the following paper.

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}