CogVLM Open-Source Visual Language Model - Free Support for Object Detection and Visual Question Answering Tasks

Cogvlm Grounding Generalist Hf Quant4

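Developed by Rodeszones

CogVLM is a powerful open-source visual language model that supports tasks such as object detection and visual question answering. This release is quantized to 4-bit precision.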

Image-to-Text

Transformers

Open-source license: Apache-2.0 #visual grounding #multimodal dialogue #open-source large model

Downloads: 50

Release date: 3/5/2024

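Model Overview

CogVLM is a visual language model with strong visual understanding and language generation capabilities. It supports tasks such as object detection and image captioning.

Model Features

High-performance cross-modal capability

Achieves state-of-the-art performance on 10 classic cross-modal benchmarks, on par with PaLI-X 55B

4-bit quantization

Uses bitsandbytes 4-bit precision quantization to lower hardware requirements

Object localization

Can generate coordinate positions for objects in an image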
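To give a rough sense of why 4-bit quantization lowers hardware requirements, the sketch below estimates the memory needed for the weights alone at two bit widths. It is a back-of-the-envelope calculation, not a measurement: it assumes about 17 billion parameters (the figure this card gives for CogVLM-17B), treats every parameter as stored at the stated width, and ignores quantization metadata, any layers left unquantized, activations, and framework overhead. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 17 billion parameters in total.
NUM_PARAMS = 17e9

print(f"bfloat16 (16-bit): {weight_memory_gb(NUM_PARAMS, 16):.1f} GB")  # 34.0 GB
print(f"4-bit quantized:   {weight_memory_gb(NUM_PARAMS, 4):.1f} GB")   # 8.5 GB
```

Actual GPU memory use will be higher than the 4-bit figure, since the estimate covers only the stored weights.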

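Model Capabilities

Object detection

Image captioning

Visual question answering

Cross-modal understanding

Use Cases

Image analysis

Object detection and localization

Identifies objects in an image and annotates their coordinates

Output format: object description [[x0,y0,x1,y1]]

Intelligent customer support

Visual question answering

Answers natural-language questions about image content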
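The output format listed under the object detection use case, a description followed by [[x0,y0,x1,y1]], can be split into phrases and boxes with a regular expression. The following is a minimal sketch, not part of the model's own tooling: the function name `parse_grounded_caption` is hypothetical, and it assumes every box holds exactly four comma-separated integers. The sample string is the example output shown in this card's usage code.

```python
import re

# Matches one box of the form [[x0,y0,x1,y1]] with integer coordinates.
BOX_PATTERN = re.compile(r"\[\[(\d+),(\d+),(\d+),(\d+)\]\]")


def parse_grounded_caption(text):
    """Return a list of (preceding_text, (x0, y0, x1, y1)) tuples."""
    results = []
    cursor = 0
    for match in BOX_PATTERN.finditer(text):
        # The phrase is whatever text sits between the previous box and this one.
        phrase = text[cursor:match.start()].strip()
        box = tuple(int(value) for value in match.groups())
        results.append((phrase, box))
        cursor = match.end()
    return results


sample = (
    "a room with a ladder [[378,107,636,998]] "
    "and a blue and white towel [[073,000,346,905]].</s>"
)
for phrase, box in parse_grounded_caption(sample):
    print(phrase, box)
```

Zero-padded values such as 073 are converted to plain integers (73). Text after the last box, including the end-of-sequence token, is ignored.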

🚀 CogVLM

CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks, namely NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, and COCO captioning, surpassing or matching PaLI-X 55B.

🚀 Quick Start

Environment Setup

Linux

pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install transformers==4.38.1 accelerate==0.27.2 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.24 protobuf==3.20.3 triton==2.1.0 bitsandbytes==0.43.0.dev0

Windows

For triton and bitsandbytes, install from the following wheel files.

pip install bitsandbytes-0.43.0.dev0-cp310-cp310-win_amd64.whl

pip install triton-2.1.0-cp310-cp310-win_amd64.whl
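After installation, it can help to confirm that the environment really contains the pinned versions. The sketch below is one way to do that with the standard library's `importlib.metadata`. The `PINNED` table simply repeats the versions from the install commands above; the helper name `check_versions` and its `lookup` parameter are hypothetical. It compares by prefix so that a local build tag such as `2.2.0+cu121` still counts as a match.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the install commands in this card.
PINNED = {
    "torch": "2.2.0",
    "transformers": "4.38.1",
    "accelerate": "0.27.2",
    "sentencepiece": "0.1.99",
    "einops": "0.7.0",
    "xformers": "0.0.24",
    "protobuf": "3.20.3",
    "triton": "2.1.0",
    "bitsandbytes": "0.43.0.dev0",
}


def check_versions(pinned, lookup=version):
    """Map each package name to (wanted, found, matches)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None  # package is not installed
        matches = found is not None and found.startswith(wanted)
        report[name] = (wanted, found, matches)
    return report


for name, (wanted, found, matches) in check_versions(PINNED).items():
    status = "ok" if matches else "MISMATCH"
    print(f"{name:15} wanted {wanted:12} found {found}  [{status}]")
```

A mismatch is not necessarily fatal, but it is the first thing worth checking if model loading fails.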

Code Usage Example

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Hugging Face Hub repo id; replace it with a local model folder path if the weights are already downloaded
model_path = 'Rodeszones/CogVLM-grounding-generalist-hf-quant4'


tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()


# chat example
query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
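# Greedy decoding, capped at 2048 tokens in total (prompt plus generated text)
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so that only the newly generated text is decoded
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))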

# example output
# a room with a ladder [[378,107,636,998]] and a blue and white towel [[073,000,346,905]].</s>
# NOTE: Box coordinates are expressed on a 1000 by 1000 grid rather than in image pixels, so rescale them before use.
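Because the boxes are given on a 1000 by 1000 grid, they need rescaling before they can be drawn on the original image. The sketch below shows one plausible conversion. It assumes that x values are relative to the image width and y values to the image height, each multiplied by 1000; if preprocessing pads or crops the image, the mapping would differ. The function name `to_pixel_box` and the 1280 x 720 image size are hypothetical.

```python
def to_pixel_box(box, image_width, image_height, grid=1000):
    """Rescale (x0, y0, x1, y1) from the model's grid to pixel coordinates."""
    x0, y0, x1, y1 = box
    return (
        round(x0 * image_width / grid),
        round(y0 * image_height / grid),
        round(x1 * image_width / grid),
        round(y1 * image_height / grid),
    )


# The ladder box from the example output, mapped onto a hypothetical 1280 x 720 image.
print(to_pixel_box((378, 107, 636, 998), 1280, 720))  # (484, 77, 814, 719)
```

The resulting tuple is in the (left, top, right, bottom) order that Pillow's `ImageDraw.Draw(image).rectangle(...)` accepts, should you want to draw the box.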

📄 License

The code in this repository is open source under the Apache-2.0 license. Use of the CogVLM model weights, however, must comply with the Model License.

📚 Citation

@article{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}