🚀 FireLLaVA-13b Model
FireLLaVA-13b is a vision-language model that combines the power of text and image understanding, enabling a wide range of multimodal applications.
🚀 Quick Start
The use of this model is governed by the Meta license. To download the model weights and tokenizer, visit the website and accept the Llama 2 Community License Agreement before requesting access here.
The model is hosted on Fireworks.ai. You can test it here: FireLLaVA-13b on Fireworks.ai. API endpoints are also available, and the instructions can be found here: Querying Vision-Language Models.
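For the hosted API, a minimal sketch is shown below. It assumes the Fireworks.ai endpoint is OpenAI-compatible and that the model is exposed as `accounts/fireworks/models/firellava-13b`; both the base URL and the model identifier should be checked against the Querying Vision-Language Models guide before use.
```python
# Hedged sketch of querying the hosted model. Assumptions: the endpoint is
# OpenAI-compatible, the base URL is https://api.fireworks.ai/inference/v1,
# and the model is published as "accounts/fireworks/models/firellava-13b".
# Verify all three in the "Querying Vision-Language Models" guide.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed base URL
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/firellava-13b",  # assumed model id
    max_tokens=200,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the make of the car?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```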
If you want to run the model locally using the Hugging Face Transformers library, follow the instructions below.
✨ Features
- Multimodal Capability: The model supports multi-image and multi-prompt generation, so you can pass several images in a single prompt.
- Prompt Template: The model expects the prompt template (`USER: xxx\nASSISTANT:`), with an `<image>` token inserted wherever you want the model to look at an image (see the example below).
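For illustration, the prompts below follow that template; the questions are placeholders, and the two-image variant is an assumption extrapolated from the one-token-per-image rule above:
```python
# Illustrative prompt strings only -- the questions are placeholders.
single_image_prompt = "USER: <image>\nWhat is shown in this image?\n\nASSISTANT:"

# Assumed multi-image form: one <image> token per image, in the order the
# images are passed to the model.
two_image_prompt = "USER: <image>\n<image>\nWhat differs between these two images?\n\nASSISTANT:"
```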
📦 Installation
If you choose to run the model locally, make sure you have `transformers >= 4.35.3` installed.
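A typical local setup might look like the following; `torch`, `pillow`, and `requests` are not pinned by this card but are assumed here because the examples below import them:
```bash
pip install "transformers>=4.35.3" torch pillow requests
```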
💻 Usage Examples
Basic Usage
You can use the `pipeline` from the `transformers` library to interact with the model easily.
```python
from transformers import pipeline
from PIL import Image
import requests

model_id = "fireworks-ai/FireLLaVA-13b"
pipe = pipeline("image-to-text", model=model_id)

# Download an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <image> token marks where the image is placed in the prompt
prompt = "USER: <image>\nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': 'USER: \nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT: Volkswagen'}]
```
Advanced Usage
You can also use the pure `transformers` approach for more customized interactions.
```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "fireworks-ai/FireLLaVA-13b"
prompt = "USER: <image>\nWhat is this?\n\nASSISTANT:"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

# Load the model in half precision on GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
>>> "This is an early Volkswagen Beetle car, also known as a VW bug, parked on a brick street and next to a building with doors ..."
```
📚 Documentation
- Model type: LLaVA vision-language model trained on instruction-following data generated by OSS LLMs.
- Model date: FireLLaVA 13B was trained in December 2023.
- More information: For more details, refer to the official website: LLaVA.
| Property | Details |
|----------|---------|
| Model Type | LLaVA vision-language model trained on OSS LLM generated instruction following data |
| Training Time | December 2023 |
| Reference | LLaVA |
📄 License
This model is released under the Llama 2 Community License Agreement.
⚠️ Important Note
Model performance may degrade when multiple images are included in the input, since the model was not trained on multi-image inputs.