🚀 Fuyu-8B Model Card
Fuyu-8B is a compact version of the multimodal model powering our product. It's now available on Hugging Face. Here's why we're excited about it:
- It features a much simpler architecture and training process compared to other multimodal models. This simplicity makes it easier to understand, scale, and deploy.
- It's built from the ground up for digital agents. It can handle arbitrary image resolutions, answer questions about graphs and diagrams, respond to UI-related queries, and perform fine-grained localization on screen images.
- It's incredibly fast, capable of delivering responses for large images in under 100 milliseconds.
- Despite being optimized for our specific use case, it performs well on standard image-understanding benchmarks like visual question answering and natural-image captioning.
Please note that the released model is a base model. You'll likely need to fine-tune it for specific use cases, such as verbose captioning or multimodal chat. In our experience, the model responds well to few-shot learning and fine-tuning for various use cases.
✨ Features
- Simplified Architecture: A vanilla decoder-only transformer without an image encoder, allowing support for arbitrary image resolutions.
- Fast Inference: Can generate responses for large images in less than 100 milliseconds.
- Versatile Application: Suitable for digital agents, with the ability to handle various image-related tasks.
- Good Benchmark Performance: Performs well on standard image understanding benchmarks.
📦 Installation
Fuyu-8B runs through the Hugging Face `transformers` library. Installing the dependencies used in the examples below is likely all that's needed (assuming a `transformers` release recent enough to include Fuyu support): `pip install transformers pillow requests`.
💻 Usage Examples
Basic Usage
```python
from transformers import FuyuProcessor, FuyuForCausalLM
from PIL import Image
import requests

# Load the processor and model from the Hugging Face Hub.
model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

# Prepare inputs: a text prompt plus an image fetched from the model repo.
text_prompt = "Generate a coco-style caption.\n"
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")

# Generate and decode only the newly produced tokens.
generation_output = model.generate(**inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)
assert generation_text == ['A blue bus parked on the side of a road.']
```
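If GPU memory is tight, the model can likely be loaded in half precision instead; a minimal sketch, assuming a CUDA device and that `torch` is installed:

```python
import torch
from transformers import FuyuForCausalLM

# Half-precision weights roughly halve the memory footprint of the model.
model = FuyuForCausalLM.from_pretrained(
    "adept/fuyu-8b",
    device_map="cuda:0",
    torch_dtype=torch.float16,
)
```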
Advanced Usage
text_prompt = "What color is the bus?\n"
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")
generation_output = model.generate(**inputs, max_new_tokens=6)
generation_text = processor.batch_decode(generation_output[:, -6:], skip_special_tokens=True)
assert generation_text == ["The bus is blue.\n"]
text_prompt = "What is the highest life expectancy at birth of male?\n"
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/chart.png"
image = Image.open(requests.get(url, stream=True).raw)
model_inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")
generation_output = model.generate(**model_inputs, max_new_tokens=16)
generation_text = processor.batch_decode(generation_output[:, -16:], skip_special_tokens=True)
assert generation_text == ["The life expectancy at birth of males in 2018 is 80.7.\n"]
💡 Usage Tip
For best performance, it's recommended to end questions with `\n`, as shown in the examples above!
📚 Documentation
Model
Fuyu-8B is a multi-modal text and image transformer trained by Adept AI.
Architecturally, Fuyu is a vanilla decoder-only transformer with no image encoder. Image patches are linearly projected into the first layer of the transformer, bypassing the embedding lookup. We treat the transformer decoder like an image transformer (albeit with no pooling and with causal attention). See the diagram below for more details.

This simplification enables support for arbitrary image resolutions. We treat image token sequences like text token sequences, remove image-specific position embeddings, and feed in as many image tokens as needed in raster-scan order. A special image-newline character indicates line breaks. The model can use existing position embeddings to handle different image sizes, and we can use images of any size during training, eliminating the need for separate high- and low-resolution training stages.
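To make the raster-scan idea concrete, here is a minimal sketch of the patch-flattening scheme (illustrative only: the function name, the patch size, and the newline sentinel are assumptions, and the real model applies a learned linear projection and a dedicated image-newline token id):

```python
import torch

IMAGE_NEWLINE = object()  # stand-in for the special image-newline token

def flatten_patches(image: torch.Tensor, patch: int = 30) -> list:
    """Split a (C, H, W) image into patch vectors in raster-scan order.

    Each row of patches is followed by IMAGE_NEWLINE, mirroring how the
    model marks line breaks so it can handle arbitrary resolutions.
    Partial edge patches are dropped here for simplicity.
    """
    _, h, w = image.shape
    sequence = []
    for y in range(0, h - h % patch, patch):      # top-to-bottom
        for x in range(0, w - w % patch, patch):  # left-to-right
            sequence.append(image[:, y:y + patch, x:x + patch].reshape(-1))
        sequence.append(IMAGE_NEWLINE)
    return sequence

# Example: a 90x60 RGB image yields 3 rows of 2 patches, plus 3 newlines.
seq = flatten_patches(torch.rand(3, 90, 60))
assert len(seq) == 3 * 2 + 3
```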
Model Description
| Property | Details |
|----------|---------|
| Developed by | Adept AI |
| Model type | Decoder-only multi-modal transformer model |
| License | [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/deed.en) |
| Model description | This is a multi-modal model that can consume images and text and produce text. |
| Resources for more information | Check out our blog post. |
Evaluation
Although not the main focus of this model, we evaluated it on standard image understanding benchmarks:
| Eval Task | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B |
|-----------|---------|-------------|-------------------|---------------|--------------|------------|-------------|
| VQAv2 | 74.2 | 77.4 | 80 | 79.5 | 86.1 | 76.2 | 80.0 |
| OKVQA | 60.6 | 63.1 | n/a | 58.6 | 66.1 | 55.5 | 66.1 |
| COCO Captions | 141 | 138 | n/a | n/a | 149 | 135 | 138 |
| AI2D | 64.5 | 73.7 | n/a | 62.3 | 81.2 | n/a | n/a |
Uses
Direct Use
The model is for research purposes only. As this is a raw model release, we haven't added further fine-tuning, post-processing, or sampling strategies to control for undesirable outputs. You'll need to fine-tune the model for your use case.
Possible research areas and tasks include:
- Applications in computer control or digital agents.
- General research on multi-modal models.
Excluded uses are described below.
Out-of-Scope Use
The model wasn't trained to provide factual or accurate representations of people or events, so generating such content is out of scope for its abilities.
Limitations and Bias
Limitations
- Faces and people in general may not be generated properly.
Bias
While these models have impressive capabilities, they can also reinforce or exacerbate social biases.
📄 License
The model is released under the [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/deed.en) license.