Open Source Ferret-UI-Llama8b Model - Perform Complex UI Tasks Such as Referencing, Localization, and Reasoning

Ferret UI Llama8b

Developed by jadechoghari

Ferret-UI is the first multimodal large language model (MLLM) focused on user interfaces, built on Llama-3-8B, capable of performing complex UI tasks such as referencing, localization, and reasoning.

Image-to-Text

Transformers

#UI Multimodal Understanding #UI Element Localization #Screen Content Reasoning

Downloads 256

Release Time: 10/9/2024

Model Overview

Ferret-UI is a multimodal large language model specifically designed for handling user interface-related tasks, including referencing, localization, and reasoning. It is based on the Llama-3-8B architecture and can understand and analyze UI images, providing detailed descriptions and localization information.

Model Features

Multimodal Capability

Combines visual and language processing abilities to understand and analyze UI images.

UI Task Optimization

Designed specifically for UI-related referencing, localization, and reasoning tasks, capable of efficiently handling complex UI analysis.

High-Precision Localization

Supports bounding box localization, enabling precise marking of UI element positions.

Model Capabilities

UI Image Analysis

Text Generation

Bounding Box Localization

Multimodal Reasoning

Use Cases

UI Automated Testing

UI Element Localization

Automatically identifies and locates specific elements in the UI, such as buttons, text boxes, etc.

Improves testing efficiency and accuracy.

Accessibility Features

UI Description Generation

Generates detailed descriptions of UIs for visually impaired users.

Makes the interface more accessible to users who rely on assistive technology.

🚀 Ferret-UI

Ferret-UI is the first UI-centric multimodal large language model (MLLM) designed for referring, grounding, and reasoning tasks. It is available in two versions, built on Gemma-2B and Llama-3-8B, and can execute complex UI tasks. This is the Llama-3-8B version of Ferret-UI, which follows the Ferret-UI paper by Apple.

🚀 Quick Start

📦 Installation

First, download builder.py, conversation.py, inference.py, model_UI.py, and mm_utils.py into your working directory:

wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py
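
If wget is not available, the same five files can be fetched from Python. The following is a minimal sketch using only the standard library; the base URL and file names are copied from the commands above, while the helper names `helper_urls` and `download_helpers` are hypothetical and not part of the model repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are taken from the wget commands above.
BASE_URL = "https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main"
HELPER_FILES = [
    "conversation.py",
    "builder.py",
    "inference.py",
    "model_UI.py",
    "mm_utils.py",
]


def helper_urls(base_url=BASE_URL):
    """Map each helper file name to its download URL (hypothetical helper)."""
    return {name: f"{base_url}/{name}" for name in HELPER_FILES}


def download_helpers(target_dir="."):
    """Download every helper file that is not already present in target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in helper_urls().items():
        destination = target / name
        if not destination.exists():
            urlretrieve(url, str(destination))  # requires network access
        saved.append(destination)
    return saved


# Uncomment to fetch the files into the current directory:
# download_helpers()
```

Files that already exist are skipped, so the function can be re-run safely after a partial download.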

💻 Usage Examples

🔍 Basic Usage

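from inference import inference_and_run

image_path = "appstore_reminders.png"
prompt = "Describe the image in detail"

# Call the function without a box; conv_mode and model_path select the Llama-3-8B checkpoint
inference_text = inference_and_run(
    image_path=image_path,
    prompt=prompt,
    conv_mode="ferret_llama_3",
    model_path="jadechoghari/Ferret-UI-Llama8b",
)

print("Inference Text:", inference_text)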

📦 Advanced Usage

# Task with bounding boxes
image_path = "appstore_reminders.png"
prompt = "What's inside the selected region?"
box = [189, 906, 404, 970]

inference_text = inference_and_run(
    image_path=image_path, 
    prompt=prompt, 
    conv_mode="ferret_llama_3", 
    model_path="jadechoghari/Ferret-UI-Llama8b", 
    box=box
)

print("Inference Text:", inference_text)

# GROUNDING PROMPTS
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nInclude the coordinates for each mentioned object.',
    '\nLocate the objects with their coordinates.',
    '\nAnswer in [x1, y1, x2, y2] format.',
    '\nMention the objects and their locations using the format [x1, y1, x2, y2].',
    '\nDraw boxes around the mentioned objects.',
    '\nUse boxes to show where each thing is.',
    '\nTell me where the objects are with coordinates.',
    '\nList where each object is with boxes.',
    '\nShow me the regions with boxes.'
]
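
The grounding templates above are written as suffixes (each starts with a newline), which suggests they are meant to be appended to a prompt to request coordinates in the answer. The sketch below shows one way to do that and to pull `[x1, y1, x2, y2]` boxes back out of a text response. It is an illustration under assumptions: the helper names `add_grounding` and `extract_boxes` are hypothetical, and the exact format of the model's output is not documented here, so the regular expression only covers the integer bracket format named in the templates.

```python
import random
import re

# Two of the grounding templates listed above, repeated so the sketch is self-contained.
GROUNDING_TEMPLATES = [
    "\nProvide the bounding boxes of the mentioned objects.",
    "\nAnswer in [x1, y1, x2, y2] format.",
]

# Matches four integers in square brackets, e.g. "[189, 906, 404, 970]".
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def add_grounding(prompt, rng=None):
    """Append one randomly chosen grounding template to the prompt."""
    chooser = rng if rng is not None else random
    return prompt + chooser.choice(GROUNDING_TEMPLATES)


def extract_boxes(text):
    """Return every [x1, y1, x2, y2] group found in text as a tuple of ints."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


# Example with a made-up response string (not real model output):
grounded_prompt = add_grounding("Where is the search button?")
boxes = extract_boxes("The search button is at [189, 906, 404, 970].")
```

The grounded prompt can then be passed as the `prompt` argument of `inference_and_run`, and `extract_boxes` applied to the returned text.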
