🚀 Aya Vision 32B
Cohere Labs Aya Vision 32B is an open-weights research release of a 32-billion-parameter model. It has advanced capabilities optimized for a variety of vision-language use cases, including OCR, captioning, visual reasoning, summarization, question answering, and code processing. The model is multilingual, trained to perform well in 23 languages for both vision and language tasks.
🚀 Quick Start
Try it: Aya Vision in Action
Before downloading the weights, you can try Aya Vision 32B chat in the Cohere playground or our dedicated Hugging Face Space for interactive exploration.
WhatsApp Integration
You can also talk to Aya Vision through the popular messaging service WhatsApp. Use this link to open a WhatsApp chat with Aya Vision.
If you don't have WhatsApp installed on your machine, you may need to install it first; alternatively, if you have it on your phone, follow the on-screen instructions to link your phone with WhatsApp Web. You should then see a text window you can use to chat with the model. More details about our WhatsApp integration are available here.
Example Notebook
You can check out the following notebook to understand how to use Aya Vision for different use cases.
📦 Installation
Please install transformers from the source repository, which includes the necessary changes for this model.
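Installing from source typically looks like the following; note that the specific branch or revision required for Aya Vision support is not stated here, so check this model card's installation notes if a particular revision is needed:

pip install 'git+https://github.com/huggingface/transformers.git'

Then load the processor and the model: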
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereLabs/aya-vision-32b"

# Load the chat/image processor and the model, sharding it across available
# devices in half precision.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)
💻 Usage Examples
Basic Usage
# Example prompt in Hindi: "What does the text in the image say?"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
            {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
        ],
    },
]

# Tokenize the chat (text plus image) and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Advanced Usage
from transformers import pipeline

pipe = pipeline(model="CohereLabs/aya-vision-32b", task="image-text-to-text", device_map="auto")

# Example prompt in Turkish: "Which monument is shown in this image?"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
            {"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},
        ],
    },
]

outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
print(outputs)
📚 Documentation
Model Details
Property | Details
Input | Model accepts input text and images.
Output | Model generates text.
Model Architecture | This is a vision-language model that pairs a state-of-the-art multilingual language model, Aya Expanse 32B (trained with the Aya Expanse recipe), with the [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.
Image Processing | We use 169 visual tokens to encode an image tile with a resolution of 364x364 pixels. Input images of arbitrary sizes are mapped to the nearest supported resolution based on the aspect ratio. Aya Vision uses up to 12 input tiles plus a thumbnail (resized to 364x364), for a maximum of 2197 image tokens (see the sketch below the table).
Languages covered | The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese (Simplified and Traditional), Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.
Context length | Aya Vision 32B supports a context length of 16K.
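The image-token budget in the table follows directly from the tiling scheme: 169 visual tokens per 364x364 tile, up to 12 tiles, plus one thumbnail. A minimal arithmetic sketch (illustrative only, not the actual preprocessing code) makes the 2197-token maximum explicit:

# Maximum image-token budget implied by the tiling scheme described above.
tokens_per_tile = 169   # visual tokens per 364x364 tile
max_tiles = 12          # maximum number of input tiles
thumbnail = 1           # one additional 364x364 thumbnail

max_image_tokens = tokens_per_tile * (max_tiles + thumbnail)
print(max_image_tokens)  # 2197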
For more details about how the model was trained, check out our blogpost.
Evaluation
We evaluated Aya Vision 32B against [Llama-3.2 90B Vision](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision), [Molmo 72B](https://huggingface.co/allenai/Molmo-72B-0924), and [Qwen2.5-VL 72B](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) using the Aya Vision Benchmark and [m-WildVision](https://huggingface.co/datasets/CohereLabs/m-WildVision). Win rates were determined with claude-3-7-sonnet-20250219 as a judge, chosen for its superior judging performance compared to other models.
We also evaluated Aya Vision 32B's performance on text-only input against the same models using [m-ArenaHard](https://huggingface.co/datasets/CohereLabs/m-ArenaHard), a challenging open-ended generation evaluation, with win rates measured using gpt-4o-2024-11-20 as a judge.

Model Card Contact
For errors or additional questions about details in this model card, contact labs@cohere.com
Terms of Use
We hope that releasing the weights of a highly performant 32-billion-parameter vision-language model to researchers all over the world will make community-based research efforts more accessible.
This model is governed by a CC-BY-NC license and also requires adherence to Cohere Labs' Acceptable Use Policy.
📄 License
This model is released under the CC-BY-NC license and requires adherence to Cohere Labs' Acceptable Use Policy.