EuroVLM-9B-Preview Open-Source Multimodal Model - Free Support for Multilingual Visual Task Applications!

EuroVLM-9B-Preview

Developed by utter-project

EuroVLM-9B-Preview is a multimodal vision-language model based on the long-context version of EuroLLM-9B, supporting multiple languages and visual tasks. It is currently a preview release.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multilingual Visual Question Answering #High-Resolution Image Understanding #Multimodal for European Languages

Downloads 156

Release Time: 6/9/2025

Model Overview

EuroVLM-9B-Preview is a multimodal model that combines text and visual processing capabilities, focusing on European language support and suitable for tasks such as image caption generation and visual question answering.

Model Features

Multilingual Support

Supports 35 languages: the 24 official EU languages plus 11 others, including Arabic, Chinese, Hindi, Japanese, and Korean.

Multimodal Processing

Can process text and image inputs simultaneously to perform cross-modal tasks.

Long Context Support

Extends the context window to 32K tokens for processing long inputs.

Efficient Inference

Uses Grouped Query Attention (GQA) with 8 key-value heads to reduce inference cost. The SwiGLU activation function is used for downstream quality rather than speed.

Model Capabilities

Multilingual Image Caption Generation

Visual Question Answering

Visual Instruction Execution

Multimodal Translation

Document Understanding

Use Cases

Education

Multilingual Learning Assistance

Pairs images with descriptions in different languages so that students can learn vocabulary in context.

Provides multilingual image captions to enhance the language learning experience.

Content Creation

Multilingual Content Generation

Generates multilingual descriptions or stories based on images for content creation.

Rapidly generates multilingual content to improve creation efficiency.

Customer Service

Multilingual Visual Support

Answers customers' cross-language questions about product images.

Provides multilingual visual question answering to improve the customer experience.

🚀 EuroVLM-9B-Preview

EuroVLM-9B-Preview is a multimodal vision-language model based on the long-context version of EuroLLM-9B, offering support for a wide range of languages and various vision-language tasks.

⚠️ Important Note

This is a preview version of EuroVLM-9B. The model is still under development and may have limitations in performance and stability. Use with caution in production environments.

🚀 Quick Start

This is the model card for EuroVLM-9B-Preview, a multimodal vision-language model based on the long-context version of EuroLLM-9B.

Developed by: Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.
Funded by: European Union.
Model type: A 9B+400M parameter multilingual multimodal transformer VLM (Vision-Language Model).
Language(s) (NLP): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
Modalities: Text and Vision (images).
License: Apache License 2.0.

✨ Features

Multilingual Image Captioning: Generate detailed descriptions of images in any of the supported languages
Visual Question Answering: Answer questions about image content in multilingual contexts
Visual Instruction Following: Execute complex instructions that involve both visual analysis and text generation
Multimodal Translation: Translate image captions and descriptions between supported languages
Document Understanding: Process and analyze documents, charts, and diagrams with multilingual text
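
The tasks above all reduce to a chat request that pairs one image with a text instruction in the target language. The following is a minimal sketch of a request builder, not part of the model's API: `build_messages` and `SUPPORTED_LANGUAGES` are hypothetical names, the "Please answer in ..." suffix is only one possible phrasing, and the `{"type": "image"}` content layout is assumed to follow the Transformers chat-template convention.

```python
# Languages listed on this model card (24 official EU languages plus 11 others).
SUPPORTED_LANGUAGES = frozenset({
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Irish", "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovenian", "Spanish", "Swedish",
    "Arabic", "Catalan", "Chinese", "Galician", "Hindi", "Japanese",
    "Korean", "Norwegian", "Russian", "Turkish", "Ukrainian",
})


def build_messages(instruction: str, language: str) -> list[dict]:
    """Build a single-turn, single-image chat request (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    return [
        {
            "role": "user",
            "content": [
                # Placeholder slot; the image itself is passed to the processor.
                {"type": "image"},
                {"type": "text", "text": f"{instruction} Please answer in {language}."},
            ],
        }
    ]


if __name__ == "__main__":
    captioning = build_messages("Describe this image in detail.", "Portuguese")
    vqa = build_messages("How many people are in the picture?", "German")
    print(captioning[0]["content"][1]["text"])
    print(vqa[0]["content"][1]["text"])
```

Validating the language up front is a design choice for this sketch: the model may still produce output for unlisted languages, but quality there is not covered by this card.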

💻 Usage Examples

Basic Usage

To use the model with Hugging Face's Transformers library:

from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Load the processor (tokenizer + image preprocessing) and the model weights
model_id = "utter-project/EuroVLM-9B-Preview"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

# Load an image
image = Image.open("/path/to/image.jpg")

# Chat-style conversation: a system prompt plus one user turn with an image slot
messages = [
    {
        "role": "system",
        "content": "You are EuroVLM --- a multimodal AI assistant specialized in European languages that provides safe, educational and helpful answers about images and text.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do you see in this image? Please describe it in Portuguese."}
        ]
    },
]

# Render the chat template to a prompt string, then tokenize it and preprocess the image
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Generate up to 1024 new tokens and decode the full output sequence
outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0], skip_special_tokens=True))
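
With decoder-only models, `generate` typically returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` directly also prints the prompt. A minimal sketch of one way to keep only the reply is shown below; `generate_reply` is a hypothetical helper, and it assumes the processor output exposes `input_ids` with a `(batch, sequence)` shape, as Transformers processors normally do.

```python
def generate_reply(model, processor, image, messages, max_new_tokens=256):
    """Run one generation and return only the newly generated text.

    Hypothetical helper: `model` and `processor` are expected to behave like
    the LlavaNext objects loaded in the basic usage example.
    """
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    # Number of prompt tokens, used to cut the prompt off the output sequence.
    prompt_length = inputs["input_ids"].shape[1]

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the tokens produced after the prompt.
    new_tokens = outputs[0][prompt_length:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

Called as `generate_reply(model, processor, image, messages)`, this returns the answer text without the echoed system and user turns.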

Advanced Usage

You can also run EuroVLM with vLLM!

from vllm import LLM, SamplingParams

# Initialize the model
model_id = "utter-project/EuroVLM-9B-Preview"
llm = LLM(model=model_id)

# Set up sampling parameters
sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

# URL of the input image (placeholder; replace with a URL the vLLM process can reach)
image_url = "/url/of/image.jpg"

messages = [
    {
        "role": "system",
        "content": "You are EuroVLM --- a multimodal AI assistant specialized in European languages that provides safe, educational and helpful answers about images and text.",
    },
    {
        "role": "user", 
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "What do you see in this image? Please describe it in Portuguese in one sentence."}
        ]
    },
]

# Generate response
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
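
When the image is a local file rather than a hosted one, one common option is to embed it as a base64 data URL in the `image_url` field. The sketch below uses only the standard library; the helper names are hypothetical, and whether a given vLLM version accepts data URLs in `chat` for this model is an assumption worth verifying against the vLLM documentation.

```python
import base64
import mimetypes
from pathlib import Path


def image_bytes_to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(path: str) -> str:
    """Read a local image file and return it as a data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    return image_bytes_to_data_url(Path(path).read_bytes(), mime_type)
```

The result can be used in place of the placeholder, for example `image_url = image_file_to_data_url("/path/to/image.jpg")`.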

📚 Documentation

Model Details

EuroVLM-9B is a 9B+400M parameter vision-language model that combines the multilingual capabilities of EuroLLM-9B with vision encoding components.

EuroVLM-9B was visually instruction-tuned on a combination of multilingual vision-language datasets covering image captioning, visual question answering, and multimodal reasoning across the supported languages.

Model Description

EuroVLM uses a multimodal architecture combining a vision encoder with the EuroLLM language model:

Language Model Component:

Based on the standard, dense Transformer architecture from EuroLLM-9B
Grouped query attention (GQA) with 8 key-value heads for efficient inference
Pre-layer normalization with RMSNorm for training stability
SwiGLU activation function, chosen for strong downstream performance
Rotary positional embeddings (RoPE) in every layer
Extended context size supporting up to 32K tokens
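
To illustrate why grouped query attention matters at a 32K context, the following back-of-the-envelope sketch estimates the key-value cache size for a single sequence. Only the 8 key-value heads and the 32K context come from this card; the layer count (42), head dimension (128), the 32 query heads used for the comparison, and 16-bit cache values are assumptions made for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV cache size: keys and values stored per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed architecture values (not stated on this card).
NUM_LAYERS = 42
HEAD_DIM = 128
NUM_QUERY_HEADS = 32

# Values taken from this card.
NUM_KV_HEADS = 8
CONTEXT_TOKENS = 32 * 1024

if __name__ == "__main__":
    gqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
    # Hypothetical baseline: one KV head per query head (standard multi-head attention).
    mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, CONTEXT_TOKENS)
    print(f"GQA cache at 32K tokens: {gqa / 2**30:.2f} GiB")
    print(f"MHA cache at 32K tokens: {mha / 2**30:.2f} GiB")
    print(f"Reduction factor: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads, which is where the inference saving comes from.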

Vision Component:

Vision Transformer (ViT) encoder, based on google/siglip2-so400m-patch14-384
Multimodal projector mapping vision representations to token embeddings
Support for high-resolution image inputs
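
As a rough illustration of how image resolution translates into sequence length, the sketch below counts vision patches for an encoder with 14-pixel patches and 384-pixel inputs, as the encoder name suggests. The tiling scheme (a downscaled base view plus a grid of full-resolution tiles, in the style of LLaVA-NeXT) is an assumption, and the real token count depends on the processor and projector, for example on unpadding or separator tokens.

```python
def patches_per_side(image_size: int = 384, patch_size: int = 14) -> int:
    """Number of non-overlapping patches along one side of a square input."""
    return image_size // patch_size


def patches_per_tile(image_size: int = 384, patch_size: int = 14) -> int:
    """Number of patch embeddings the encoder produces for one square tile."""
    side = patches_per_side(image_size, patch_size)
    return side * side


def estimate_visual_patches(tiles_x: int, tiles_y: int, include_base: bool = True) -> int:
    """Rough patch count for a tiled high-resolution image (assumed scheme)."""
    num_tiles = tiles_x * tiles_y + (1 if include_base else 0)
    return num_tiles * patches_per_tile()


if __name__ == "__main__":
    print(patches_per_tile())             # single 384x384 view
    print(estimate_visual_patches(2, 2))  # base view plus a 2x2 tile grid
```

Even as an estimate, this shows why high-resolution inputs benefit from the extended context: a few tiles already add thousands of positions before any text is included.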

⚠️ Bias, Risks, and Limitations

EuroVLM-9B has not been fully aligned to human preferences, so the model may generate problematic outputs in both text and image understanding contexts (e.g., hallucinations about image content, harmful content, biased interpretations, or false statements about visual information).

Additional considerations for multimodal models include:

Potential biases in visual interpretation across different cultural contexts
Limitations in understanding complex visual scenes or unusual image compositions
Possible inconsistencies between visual understanding and textual generation across languages
Privacy considerations when processing images that may contain personal information

Users should exercise caution and implement appropriate safety measures when deploying this model in production environments.

📄 License

The model is licensed under the Apache License 2.0.
