🚀 FireLLaVA-13b Model
FireLLaVA-13b is a vision-language model that combines the power of text and image understanding, enabling a wide range of multimodal applications.
🚀 Quick Start
The use of this model is governed by the Meta license. To download the model weights and tokenizer, visit the website and accept the Llama 2 Community License Agreement before requesting access here.
The model is hosted on Fireworks.ai. You can test it here: FireLLaVA-13b on Fireworks.ai. API endpoints are also available, and the instructions can be found here: Querying Vision-Language Models.
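For the hosted API, a minimal sketch is shown below. It assumes the Fireworks.ai endpoint is OpenAI-compatible and that the model is exposed as `accounts/fireworks/models/firellava-13b`; both the base URL and the model identifier should be checked against the Querying Vision-Language Models guide before use.
```python
# Hedged sketch of querying the hosted model. Assumptions: the endpoint is
# OpenAI-compatible, the base URL is https://api.fireworks.ai/inference/v1,
# and the model is published as "accounts/fireworks/models/firellava-13b".
# Verify all three in the "Querying Vision-Language Models" guide.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed base URL
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/firellava-13b",  # assumed model id
    max_tokens=200,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the make of the car?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```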
If you want to run the model locally using the Hugging Face Transformers library, follow the instructions below.
✨ Features
- Multimodal Capability: The model supports multi-image and multi-prompt generation, so you can pass several images in a single prompt.
- Prompt Template: The model expects the prompt template (`USER: xxx\nASSISTANT:`), with an `<image>` token inserted wherever you want the model to look at an image (see the example below).
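For illustration, the prompts below follow that template; the questions are placeholders, and the two-image variant is an assumption extrapolated from the one-token-per-image rule above:
```python
# Illustrative prompt strings only -- the questions are placeholders.
single_image_prompt = "USER: <image>\nWhat is shown in this image?\n\nASSISTANT:"

# Assumed multi-image form: one <image> token per image, in the order the
# images are passed to the model.
two_image_prompt = "USER: <image>\n<image>\nWhat differs between these two images?\n\nASSISTANT:"
```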
📦 Installation
If you choose to run the model locally, make sure you have `transformers >= 4.35.3` installed.
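A typical local setup might look like the following; `torch`, `pillow`, and `requests` are not pinned by this card but are assumed here because the examples below import them:
```bash
pip install "transformers>=4.35.3" torch pillow requests
```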
💻 Usage Examples
Basic Usage
You can use the `pipeline` from the `transformers` library to interact with the model easily.
```python
from transformers import pipeline
from PIL import Image
import requests

model_id = "fireworks-ai/FireLLaVA-13b"
pipe = pipeline("image-to-text", model=model_id)

# Download an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <image> token marks where the image is placed in the prompt
prompt = "USER: <image>\nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': 'USER: \nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT: Volkswagen'}]
```
Advanced Usage
You can also use the pure `transformers` approach for more customized interactions.
```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "fireworks-ai/FireLLaVA-13b"
prompt = "USER: <image>\nWhat is this?\n\nASSISTANT:"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

# Load the model in half precision on GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
>>> "This is an early Volkswagen Beetle car, also known as a VW bug, parked on a brick street and next to a building with doors ..."
```
📚 Documentation
- Model type: LLaVA vision-language model trained on instruction-following data generated by OSS LLMs.
- Model date: FireLLaVA 13B was trained in December 2023.
- More information: For more details, refer to the official website: LLaVA.
| Property | Details |
|----------|---------|
| Model Type | LLaVA vision-language model trained on OSS LLM generated instruction following data |
| Training Time | December 2023 |
| Reference | LLaVA |
📄 License
This model is released under the Llama 2 Community License Agreement.
⚠️ Important Note
Model performance may degrade when multiple images are included in the input, since the model was not trained on multi-image inputs.