🚀 Model Card for Aya Vision 8B
Cohere Labs Aya Vision 8B is an open-weights research release of an 8-billion-parameter model. It has advanced capabilities optimized for a variety of vision-language use cases, such as OCR, captioning, visual reasoning, summarization, question answering, and code processing. This multilingual model is trained to perform well in 23 languages for both vision and language tasks.
This model card pertains to the 8-billion-parameter version of the Aya Vision model. We also released a 32-billion-parameter version, which can be found here.
🚀 Quick Start
Try it: Aya Vision in Action
Before downloading the weights, you can try Aya Vision chat in the Cohere playground or our dedicated Hugging Face Space for interactive exploration.
WhatsApp Integration
You can also communicate with Aya Vision via the popular messaging service WhatsApp. Use this link to open a WhatsApp chatbox with Aya Vision.
If you don't have WhatsApp installed on your device, you may need to install it. If you have it on your phone, follow the on-screen instructions to link your phone with WhatsApp Web. You'll then see a text window you can use to chat with the model. More details about our WhatsApp integration are available here.
Example Notebook
You can also check out the following notebook to understand how to use Aya Vision for different use cases.
✨ Features
- Multilingual Capability: Trained to excel in 23 languages, including English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese (Simplified and Traditional), Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.
- Advanced Vision-Language Use Cases: Optimized for a variety of tasks such as OCR, captioning, visual reasoning, summarization, question answering, and code processing.
📦 Installation
Please install transformers from the source repository that includes the necessary changes for this model.
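The install command itself is not shown in this card; as a sketch, a source install typically looks like the following (the exact branch or tag that carries the Aya Vision changes is an assumption here):

```bash
# Hypothetical source install; the specific revision needed for Aya Vision may differ
pip install 'git+https://github.com/huggingface/transformers.git'
```

Once installed, the model and processor can be loaded: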
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereLabs/aya-vision-8b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)
```
💻 Usage Examples
Basic Usage
```python
# Format the message with the chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
            # Hindi: "What does the text in the image say?"
            {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

# Decode only the newly generated tokens, skipping the prompt
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
Advanced Usage
```python
from transformers import pipeline

pipe = pipeline(model="CohereLabs/aya-vision-8b", task="image-text-to-text", device_map="auto")

# Format the message with the chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
            # Turkish: "Which monument is shown in this image?"
            {"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},
        ],
    },
]

outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
print(outputs)
```
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Input | Model accepts input text and images. |
| Output | Model generates text. |
| Model Architecture | This is a vision-language model that uses a multilingual language model based on Command R7B, further post-trained with the Aya Expanse recipe, paired with the SigLIP2-patch14-384 vision encoder through a multimodal adapter for vision-language understanding. |
| Image Processing | We use 169 visual tokens to encode an image tile with a resolution of 364x364 pixels. Input images of arbitrary sizes are mapped to the nearest supported resolution based on the aspect ratio. Aya Vision uses up to 12 input tiles plus a thumbnail (resized to 364x364), for a maximum of 13 × 169 = 2,197 image tokens. See the sketch after this table. |
| Languages covered | The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese (Simplified and Traditional), Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. |
| Context length | Aya Vision 8B supports a context length of 16K tokens. |
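As referenced in the Image Processing row, the image token budget can be sketched as follows. The constants come from the table above; the helper function itself is purely illustrative (the actual preprocessing lives in the transformers image processor):

```python
# Illustrative arithmetic for Aya Vision's image token budget,
# based on the numbers in the table above.
TOKENS_PER_TILE = 169   # visual tokens per 364x364 tile
MAX_TILES = 12          # maximum number of input tiles
THUMBNAIL_TILES = 1     # one thumbnail of the full image, resized to 364x364

def image_token_count(num_tiles: int) -> int:
    """Image tokens used when an image is split into `num_tiles` tiles."""
    assert 1 <= num_tiles <= MAX_TILES
    return (num_tiles + THUMBNAIL_TILES) * TOKENS_PER_TILE

print(image_token_count(12))  # 2197, the maximum quoted in the table
```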
For more details about how the model was trained, check out our blog post.
Evaluation
We evaluated Aya Vision 8B against Pangea 7B, Llama-3.2 11B Vision, Molmo-D 7B, Qwen2.5-VL 7B, Pixtral 12B, and Gemini Flash 1.5 8B using the Aya Vision Benchmark and m-WildVision. Win-rates were determined using claude-3-7-sonnet-20250219 as a judge, chosen for its superior judging performance compared to other models.
We also evaluated Aya Vision 8B's performance on text-only input against the same models using m-ArenaHard, a challenging open-ended generation evaluation, with win-rates measured using gpt-4o-2024-11-20 as a judge.
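For reference, a win-rate over pairwise judge verdicts is commonly computed as wins plus half of ties, divided by all comparisons. This is a minimal sketch of that metric; the exact tie-handling used in these evaluations is not specified in this card and is an assumption:

```python
# Minimal win-rate sketch over pairwise judge verdicts.
# "win"/"loss"/"tie" are from the evaluated model's perspective.
# Counting each tie as half a win is an assumption, not the card's stated method.
def win_rate(verdicts: list[str]) -> float:
    wins = verdicts.count("win")
    ties = verdicts.count("tie")
    return (wins + 0.5 * ties) / len(verdicts)

print(win_rate(["win", "win", "loss", "tie"]))  # 0.625
```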

Model Card Contact
For errors or additional questions about details in this model card, contact labs@cohere.com.
Terms of Use
We hope that releasing the weights of this highly performant 8-billion-parameter vision-language model will make community-based research efforts more accessible to researchers all over the world.
This model is governed by a CC-BY-NC license and also requires adherence to Cohere Labs' Acceptable Use Policy.
📄 License
This model is released under the CC-BY-NC license and requires adherence to Cohere Labs' Acceptable Use Policy.