cogagent-vqa-hf: an open-source visual language model for single-turn visual question answering

Cogagent Vqa Hf

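Developed by THUDM

CogAgent is an open-source visual language model built on CogVLM. This checkpoint focuses on single-turn visual question answering.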

Visual question answering (image and text in, text out)

Transformers

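Language: English · License: Apache-2.0 · #ultra-high-resolution visual understanding #GUI agent operation #multi-turn visual dialogue

Downloads: 238

Released: 12/16/2023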

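Model overview

CogAgent is a visual language model. This checkpoint is tuned for single-turn visual question answering, accepts image input at 1120x1120 resolution, and performs strongly on several VQA benchmarks.

Model features

High-resolution image processing

Accepts image input at 1120x1120 resolution, which lets it capture finer visual detail.

Strong VQA performance

Reaches state-of-the-art generalist performance on 9 cross-modal benchmarks, including VQAv2 and MM-Vet.

Optimized for single-turn question answering

Tuned specifically for single-turn visual question answering, and performs better on VQA tasks than the chat version.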

Model capabilities

Visual question answering

Image understanding

Text generation

High-resolution image processing

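Use cases

Education

Question answering on textbook images

Answers questions about charts and illustrations in teaching materials.

Interprets the chart content and generates an answer grounded in it.

Business

Business chart analysis

Analyzes the charts and figures in business reports.

Extracts the information in a chart and generates an analysis.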

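🚀 CogAgent

CogAgent is an open-source visual language model built on CogVLM. It performs well at image understanding and as a GUI agent, supports high-resolution visual input and dialogue-style question answering, and can be applied to a range of cross-modal benchmarks and GUI operation tasks.

🚀 Quick start

Save the following Python code as cli_demo_hf.py to get started quickly: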

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-vqa-hf", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")

args = parser.parse_args()
MODEL_PATH = args.from_pretrained
TOKENIZER_PATH = args.local_tokenizer
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH)
if args.bf16:
    torch_type = torch.bfloat16
else:
    torch_type = torch.float16

print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE))

if args.quant:
    # A 4-bit quantized model is placed on the GPU at load time,
    # so it is not moved with .to(DEVICE) afterwards.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        load_in_4bit=True,
        trust_remote_code=True
    ).eval()
else:
    # Unquantized path: load in the selected half-precision dtype, then move to the device.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        trust_remote_code=True
    ).to(DEVICE).eval()

while True:
    image_path = input("image path >>>>> ")
    if image_path == "stop":
        break

    image = Image.open(image_path).convert('RGB')
    history = []
    while True:
        query = input("Human:")
        if query == "clear":
            break
        input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image])
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]],
        }
        if 'cross_images' in input_by_model and input_by_model['cross_images']:
            inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]]

        # add any transformers params here.
        # Decoding is greedy: with do_sample=False, sampling settings such as temperature have no effect.
        gen_kwargs = {"max_length": 2048,
                      "do_sample": False}
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("</s>")[0]
            print("\nCog:", response)
        history.append((query, response))

Then run:

python cli_demo_hf.py --bf16
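
The last steps of the loop in the script (drop the prompt tokens, decode, cut at the `</s>` marker) can be isolated into a small helper. The sketch below is illustrative only: `extract_response` and `toy_decode` are hypothetical names, the toy vocabulary stands in for the real `tokenizer.decode`, and the trailing `strip()` is an addition that the script itself does not perform.

```python
def extract_response(output_ids, prompt_length, decode, eos_marker="</s>"):
    """Drop the prompt tokens, decode the rest, and cut at the first EOS marker."""
    new_ids = output_ids[prompt_length:]   # generate() returns prompt + continuation
    text = decode(new_ids)
    return text.split(eos_marker)[0].strip()


# Toy stand-in for tokenizer.decode, used only to make the sketch runnable.
toy_vocab = {1: "What", 2: "color?", 3: "Red", 4: "</s>", 5: "ignored"}


def toy_decode(ids):
    return " ".join(toy_vocab[i] for i in ids)


print(extract_response([1, 2, 3, 4, 5], prompt_length=2, decode=toy_decode))
```

With the real model, `output_ids` would be one row of the tensor returned by `model.generate` and `prompt_length` would be `inputs['input_ids'].shape[1]`.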

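✨ Key features

Multiple versions

We have open-sourced 2 versions of the CogAgent checkpoint. Choose the one that fits your needs:

cogagent-chat: This model is strong at GUI agent tasks, multi-turn visual dialogue, and visual grounding. Use this version if you need GUI agent and visual grounding functions, or multi-turn dialogue about a given image.
cogagent-vqa: This model is stronger at single-turn visual dialogue. Use this version if you need to work on VQA benchmarks (such as MM-Vet and VQAv2).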
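
The choice between the two checkpoints can be written down as a small lookup. This is a sketch under assumptions: the task labels and the `pick_checkpoint` helper are hypothetical, and the repository ids are assumed to follow the `THUDM/cogagent-chat-hf` pattern used in the quick-start script.

```python
# Repository ids: the chat id appears in the quick-start script; the vqa id is
# assumed to follow the same naming pattern.
CHECKPOINTS = {
    "chat": "THUDM/cogagent-chat-hf",
    "vqa": "THUDM/cogagent-vqa-hf",
}

# Hypothetical task labels, mapped to the variant the text recommends.
TASK_TO_VARIANT = {
    "gui_agent": "chat",
    "visual_grounding": "chat",
    "multi_turn_dialogue": "chat",
    "single_turn_vqa": "vqa",
    "vqa_benchmark": "vqa",
}


def pick_checkpoint(task):
    """Return the checkpoint id recommended for a task label."""
    if task not in TASK_TO_VARIANT:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_TO_VARIANT)}")
    return CHECKPOINTS[TASK_TO_VARIANT[task]]


print(pick_checkpoint("vqa_benchmark"))
```

The returned string could then be passed to the script through `--from_pretrained`.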

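Strong performance

Cross-modal benchmarks: CogAgent-18B achieves state-of-the-art generalist performance on 9 cross-modal benchmarks: VQAv2, MM-Vet, POPE, ST-VQA, OK-VQA, TextVQA, ChartQA, InfoVQA, and DocVQA.
GUI operation datasets: CogAgent-18B significantly outperforms existing models on GUI operation datasets, including AITW and Mind2Web.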
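
For readers scoring single-turn answers themselves, the sketch below shows the commonly quoted simplified form of VQAv2-style accuracy: an answer counts as fully correct if at least 3 human annotators gave it. It is not the official evaluation script, which also normalizes answers and averages over annotator subsets, and the other benchmarks listed above use their own metrics. The function name is hypothetical.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQAv2-style accuracy: min(1, matching annotators / 3)."""
    pred = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == pred)
    return min(1.0, matches / 3)


# Ten annotator answers for one question (made-up data for illustration).
answers = ["red"] * 2 + ["dark red"] * 8
print(vqa_accuracy("Red", answers))
print(vqa_accuracy("dark red", answers))
```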

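New features

High-resolution support: Supports higher-resolution visual input and dialogue-style question answering, with image input up to 1120x1120.
Visual agent capability: Acts as a visual agent. For any given task on a GUI screenshot, it can return a plan, the next action, and a specific operation with coordinates.
Enhanced GUI question answering: Handles questions about any GUI screenshot, such as web pages, PC applications, and mobile applications.
Improved OCR capability: Improved pre-training and fine-tuning strengthen its performance on OCR-related tasks.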
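
To act on an operation "with coordinates", the coordinates in the response have to be mapped back to screenshot pixels. The text above does not specify the output format, so the sketch below assumes one: boxes written as `[[x0,y0,x1,y1]]` on a 0 to 999 grid relative to the image. The pattern, the grid size, and the `boxes_to_pixels` helper are all assumptions to check against the model's actual output.

```python
import re

# Assumed box format: [[x0,y0,x1,y1]] with integer coordinates on a relative grid.
BOX_PATTERN = re.compile(r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]")


def boxes_to_pixels(response, width, height, grid=1000):
    """Find every box in the response and scale it to pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(response):
        x0, y0, x1, y1 = (int(value) for value in match.groups())
        boxes.append((
            round(x0 * width / grid),
            round(y0 * height / grid),
            round(x1 * width / grid),
            round(y1 * height / grid),
        ))
    return boxes


# Made-up response text for illustration.
print(boxes_to_pixels("Next action: tap the button at [[100,200,300,400]]", 1000, 2000))
```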

Model parameters

CogAgent-18B has 11 billion visual parameters and 7 billion language parameters.
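
The parameter counts give a rough lower bound on the memory needed just to hold the weights, which is one way to decide between the `--bf16`/`--fp16` and `--quant 4` options of the quick-start script. The sketch below is back-of-the-envelope arithmetic only: it ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
VISUAL_PARAMS = 11e9     # from the text above
LANGUAGE_PARAMS = 7e9    # from the text above


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


total_params = VISUAL_PARAMS + LANGUAGE_PARAMS   # 18 billion in total
for label, bits in [("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label}: about {weight_memory_gib(total_params, bits):.1f} GiB for weights")
```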

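Model usage terms

The model weights in this repository are free to use for academic research. Users who want to use the model for commercial purposes must register first. Registered users may use the model commercially free of charge, but must comply with all terms and conditions of this license. The license notice must be included in all copies or substantial portions of the software.

📄 License

The code in this repository is open-sourced under the Apache-2.0 license, while use of the CogAgent and CogVLM model weights must comply with the model license.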

📚 Detailed documentation

Citation and acknowledgements

If you find our work helpful, please consider citing the following papers:

@misc{hong2023cogagent,
      title={CogAgent: A Visual Language Model for GUI Agents}, 
      author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2312.08914},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

The instruction fine-tuning stage of CogVLM used some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR), and Shikra projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.