🚀 Ferret-UI (Gemma-2B Version)
Ferret-UI is the first multimodal large language model (MLLM) centered on user interfaces (UI), designed for referring, grounding, and reasoning tasks. Built on Gemma-2B and Llama-3-8B, it can carry out complex UI tasks. This is the Gemma-2B version of Ferret-UI, inspired by this paper from Apple.
🚀 Quick Start
📦 Installation
First, download `builder.py`, `conversation.py`, `inference.py`, `model_UI.py`, and `mm_utils.py` to your local machine:
```bash
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py
```
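Alternatively, a minimal sketch of downloading the same files with the `huggingface_hub` library (assumed to be installed, e.g. `pip install huggingface_hub`):

```python
# Sketch: fetch the helper files with huggingface_hub instead of wget.
# Assumes `pip install huggingface_hub` has been run.
from huggingface_hub import hf_hub_download

repo_id = "jadechoghari/Ferret-UI-Gemma2b"
files = ["conversation.py", "builder.py", "inference.py", "model_UI.py", "mm_utils.py"]

for filename in files:
    # hf_hub_download returns the local path of the downloaded file
    local_path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=".")
    print(f"Downloaded {filename} -> {local_path}")
```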
💻 Usage Examples
Basic Usage
```python
from inference import inference_and_run

# Path to the UI screenshot and the question to ask about it
image_path = "appstore_reminders.png"
prompt = "Describe the image in details"

inference_text = inference_and_run(image_path, prompt, conv_mode="ferret_gemma_instruct", model_path="jadechoghari/Ferret-UI-Gemma2b")
print("Inference Text:", inference_text)
```
Advanced Usage
image_path = "appstore_reminders.png"
prompt = "What's inside the selected region?"
box = [189, 906, 404, 970]
inference_text = inference_and_run(
image_path=image_path,
prompt=prompt,
conv_mode="ferret_gemma_instruct",
model_path="jadechoghari/Ferret-UI-Gemma2b",
box=box
)
print("Inference Text:", inference_text)
Grounding Prompts
```python
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nInclude the coordinates for each mentioned object.',
    '\nLocate the objects with their coordinates.',
    '\nAnswer in [x1, y1, x2, y2] format.',
    '\nMention the objects and their locations using the format [x1, y1, x2, y2].',
    '\nDraw boxes around the mentioned objects.',
    '\nUse boxes to show where each thing is.',
    '\nTell me where the objects are with coordinates.',
    '\nList where each object is with boxes.',
    '\nShow me the regions with boxes.'
]
```
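A minimal sketch of how these templates might be used: append one to a regular prompt before calling `inference_and_run`. The example prompt text below is hypothetical.

```python
# Sketch: append a grounding template so the answer includes [x1, y1, x2, y2] boxes.
# Assumes GROUNDING_TEMPLATES from above is defined in the same session.
import random
from inference import inference_and_run

prompt = "Where is the 'Get' button?" + random.choice(GROUNDING_TEMPLATES)  # hypothetical prompt
inference_text = inference_and_run(
    image_path="appstore_reminders.png",
    prompt=prompt,
    conv_mode="ferret_gemma_instruct",
    model_path="jadechoghari/Ferret-UI-Gemma2b"
)
print("Inference Text:", inference_text)
```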