cogagent-chat-hf: Open-Source Visual Language Model Supporting GUI Agents, Multi-Turn Dialogue, and Visual Grounding

Cogagent Chat Hf

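Developed by THUDM

CogAgent is an open-source visual language model built on CogVLM, with capabilities that include GUI agent operation, multi-turn visual dialogue, and visual grounding.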

Image-text-to-text

Transformers

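Language: English | License: Apache-2.0 | #Ultra-high-resolution image understanding #GUI agent operation #Multi-turn visual dialogue

Downloads: 503

Release date: 12/15/2023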

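Model Overview

CogAgent is a high-performance visual language model focused on GUI agent tasks and visual dialogue. It accepts high-resolution image input of 1120x1120 pixels.

Model Features

High-resolution visual processing

Accepts image input at 1120x1120 resolution, which allows finer-grained visual understanding

GUI agent functionality

Understands and operates a range of GUI interfaces, including web pages, PC applications, and mobile applications

Enhanced visual grounding

Locates objects in an image precisely and describes their positions

Multi-turn visual dialogue

Supports in-depth, multi-turn dialogue about an image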

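Model Capabilities

Visual question answering

GUI operation planning

Image content description

Visual grounding

Multi-turn dialogue

Enhanced OCR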

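Use Cases

GUI automation

Automated web page operation

Generates operation steps from a web page screenshot

Performs strongly on the AITW and Mind2Web datasets

Visual question answering

Complex image understanding

Answers questions about complex images

Achieves state-of-the-art results on 9 cross-modal benchmarks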

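🚀 CogAgent

CogAgent is an open-source visual language model built on CogVLM. It performs strongly in image understanding and GUI agent tasks, supports high-resolution visual input and dialogue-based question answering, has visual agent capabilities, and improves on GUI-related question answering and OCR-related tasks.

🚀 Quick Start

Save the following Python code as cli_demo_hf.py to get started quickly: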

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")

args = parser.parse_args()
MODEL_PATH = args.from_pretrained
TOKENIZER_PATH = args.local_tokenizer
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

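tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH)
# bfloat16 is used only when --bf16 is passed; otherwise float16 is used,
# whether or not --fp16 is given (the --fp16 flag is accepted but never checked).
if args.bf16:
    torch_type = torch.bfloat16
else:
    torch_type = torch.float16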

print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE))

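if args.quant:
    # 4-bit loading: the quantized weights are placed on the device during loading,
    # so there is no explicit .to(DEVICE) call in this branch.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        load_in_4bit=True,
        trust_remote_code=True
    ).eval()
else:
    # Half-precision loading (fp16 or bf16), then move the model to the target device.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        trust_remote_code=True
    ).to(DEVICE).eval()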

while True:
    image_path = input("image path >>>>> ")
    if image_path == "stop":
        break

    image = Image.open(image_path).convert('RGB')
    history = []
    while True:
        query = input("Human:")
        if query == "clear":
            break
        input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image])
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]],
        }
        if 'cross_images' in input_by_model and input_by_model['cross_images']:
            inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]]

        # Add any transformers generation parameters here.
        # With do_sample=False decoding is greedy, so "temperature" has no effect.
        gen_kwargs = {"max_length": 2048,
                      "temperature": 0.9,
                      "do_sample": False}
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("</s>")[0]
            print("\nCog:", response)
        history.append((query, response))

Then run:

python cli_demo_hf.py --bf16
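The command-line flags can be checked without downloading any weights. The sketch below rebuilds the same flags as the script above and reports what a given invocation resolves to. The `build_parser` and `describe` helpers are illustrative names, not part of the original script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the flags of the demo script (no model is loaded here)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--quant", choices=[4], type=int, default=None, help="quantization bits")
    parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf")
    parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5")
    parser.add_argument("--fp16", action="store_true")
    parser.add_argument("--bf16", action="store_true")
    return parser


def describe(argv):
    """Summarize how the script would interpret the given argument list."""
    args = build_parser().parse_args(argv)
    return {
        # Mirrors the script: bfloat16 only with --bf16, float16 otherwise.
        "dtype": "bfloat16" if args.bf16 else "float16",
        "four_bit": args.quant is not None,
        "model": args.from_pretrained,
    }


if __name__ == "__main__":
    print(describe(["--bf16"]))
    print(describe(["--quant", "4"]))
```

Note that `--quant` accepts only the value 4, and that float16 is selected whenever `--bf16` is absent.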

✨ Key Features

🔥 News

The new version, CogAgent-9B-20241220, has been released! Visit the CogAgent GitHub repository and the technical report to explore and use our latest model.

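Choosing a Model Version

We have open-sourced 2 versions of the CogAgent checkpoint. Choose the one that fits your needs:

cogagent-chat: This model has strong capabilities in GUI agent tasks, multi-turn visual dialogue, and visual grounding. If you need GUI agent and visual grounding functions, or need to hold a multi-turn dialogue about a given image, we recommend this version.
cogagent-vqa: This model is stronger at single-turn visual dialogue. If you need to work on VQA benchmarks (such as MMVET and VQAv2), we recommend this model.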
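The selection guidance above can be written as a small lookup, shown here as a minimal sketch. The task labels are hypothetical names chosen for illustration, and the repository id `THUDM/cogagent-vqa-hf` is an assumption modeled on the `THUDM/cogagent-chat-hf` id used in the quick-start script. Verify it before use.

```python
# Hypothetical task labels mapped to checkpoint ids.
# "THUDM/cogagent-chat-hf" appears in the quick-start script;
# "THUDM/cogagent-vqa-hf" is an ASSUMED id for the VQA variant.
CHECKPOINTS = {
    "gui_agent": "THUDM/cogagent-chat-hf",
    "visual_grounding": "THUDM/cogagent-chat-hf",
    "multi_turn_dialogue": "THUDM/cogagent-chat-hf",
    "vqa_benchmark": "THUDM/cogagent-vqa-hf",
}


def pick_checkpoint(task: str) -> str:
    """Return the checkpoint id recommended for a task label."""
    try:
        return CHECKPOINTS[task]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(pick_checkpoint("gui_agent"))
    print(pick_checkpoint("vqa_benchmark"))
```

The returned id could then be passed to the script through `--from_pretrained`.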

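Model Performance

CogAgent-18B has 11 billion visual parameters and 7 billion language parameters.
It performs strongly in image understanding and GUI agent tasks:
- CogAgent-18B achieves state-of-the-art generalist performance on 9 cross-modal benchmarks: VQAv2, MM-Vet, POPE, ST-VQA, OK-VQA, TextVQA, ChartQA, InfoVQA, and DocVQA.
- CogAgent-18B significantly surpasses existing models on GUI operation datasets such as AITW and Mind2Web.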
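The parameter counts above give a rough idea of how much memory the weights alone occupy at different precisions. The sketch below is back-of-the-envelope arithmetic only. It ignores activations, the KV cache, and quantization metadata, so real usage will be higher, and the helper name `weight_memory_gb` is illustrative.

```python
VISION_PARAMS = 11_000_000_000    # 11 billion visual parameters
LANGUAGE_PARAMS = 7_000_000_000   # 7 billion language parameters
TOTAL_PARAMS = VISION_PARAMS + LANGUAGE_PARAMS  # 18 billion in total


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    print("fp16/bf16:", weight_memory_gb(TOTAL_PARAMS, 16), "GB")
    print("4-bit:    ", weight_memory_gb(TOTAL_PARAMS, 4), "GB")
```

This is the motivation for the `--quant 4` option in the quick-start script: 16-bit weights need roughly four times the memory of 4-bit weights.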

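New Features

In addition to all the features already present in CogVLM (multi-turn visual dialogue and visual grounding), CogAgent offers the following:

Higher-resolution visual input and dialogue-based question answering: supports image input at 1120x1120 resolution.
Visual agent capabilities: for any task on any given GUI screenshot, it can return a plan, the next action, and specific operations with coordinates.
Enhanced GUI-related question answering: can handle questions about any GUI screenshot, such as web pages, PC applications, and mobile applications.
Enhanced OCR-related task capability: achieved through improved pre-training and fine-tuning.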
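Because the agent returns operations with coordinates, downstream code usually has to turn those coordinates into pixel positions on the screenshot. The sketch below shows one way to do that under a loudly stated assumption: that boxes appear in the response as `[[x0,y0,x1,y1]]` with each value on a 0 to 999 grid relative to the image. This page does not specify the output format, so check it against the project documentation. The sample response string is invented for illustration.

```python
import re

# ASSUMPTION: boxes look like [[x0,y0,x1,y1]], each value normalized to 0-999.
BOX_PATTERN = re.compile(
    r"\[\[\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\]\]"
)


def extract_boxes(response: str, width: int, height: int):
    """Return pixel-space (x0, y0, x1, y1) boxes found in a model response."""
    boxes = []
    for match in BOX_PATTERN.finditer(response):
        x0, y0, x1, y1 = (int(value) for value in match.groups())
        # Integer scaling from the assumed 1000-step grid to pixel coordinates.
        boxes.append((
            x0 * width // 1000,
            y0 * height // 1000,
            x1 * width // 1000,
            y1 * height // 1000,
        ))
    return boxes


if __name__ == "__main__":
    # Hypothetical response text, not real model output.
    sample = "Next action: click the search box at [[100,200,300,400]]."
    print(extract_boxes(sample, width=1000, height=500))
```

A caller could click the center of the returned box, for example `((x0 + x1) // 2, (y0 + y1) // 2)`.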

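Model Usage Terms

The model weights in this repository are free for academic research. Users who wish to use the model for commercial purposes must register first. Registered users may use the model for commercial activities free of charge, but must comply with all terms and conditions of this license. The license notice must be included in all copies or substantial portions of the software.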

📚 Documentation

Paper: https://arxiv.org/abs/2312.08914
Technical report: https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report-en
GitHub: https://github.com/THUDM/CogAgent
Model page: https://huggingface.co/THUDM/cogagent-9b-20241220

For more information on THUDM/cogagent-chat-hf, including demos, fine-tuning, and query prompts, see the GitHub repository.

📄 License

The code in this repository is open source under the Apache-2.0 license, while use of the CogAgent and CogVLM model weights must comply with the Model License.

🔗 Citation and Acknowledgements

If you find our work helpful, please consider citing the following papers:

@misc{hong2023cogagent,
      title={CogAgent: A Visual Language Model for GUI Agents}, 
      author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2312.08914},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

During the instruction fine-tuning stage of CogVLM, we used some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR), and Shikra projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.