🚀 granite-vision-3.3-2b-embedding
Granite-vision-3.3-2b-embedding is an efficient embedding model based on granite-vision-3.3-2b. It is designed for multimodal document retrieval, enabling queries on complex, structured documents. By eliminating OCR-based text extraction, it simplifies and accelerates RAG pipelines.
🚀 Quick Start
Installation
```bash
pip install -q torch torchvision torchaudio
pip install transformers==4.50
```
Basic Usage
```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor.
model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Prepare an image and a text query.
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
print("\nFetching image...")
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
text = "A photo of a tiger"
print("Image and text inputs ready.")

# Preprocess and move the inputs to the target device.
print("Processing inputs...")
image_inputs = processor.process_images([image])
text_inputs = processor.process_queries([text])
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# Embed both modalities.
with torch.no_grad():
    print("🔍 Getting image embedding...")
    img_emb = model(**image_inputs)
    print("✍️ Getting text embedding...")
    txt_emb = model(**text_inputs)

# Score the similarity between the text and image embeddings.
print("Scoring similarity...")
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)

print("\n" + "=" * 50)
print(f"📊 Similarity between image and text: {similarity.item():.4f}")
print("=" * 50)
```
Advanced Usage
For an example of multimodal RAG (MM-RAG) using granite-vision-3.3-2b-embedding, refer to this notebook.
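In the meantime, the snippet below sketches the retrieval half of such a pipeline, reusing `model`, `processor`, and `device` from the Basic Usage example. The page list and query are placeholders, and the shape of the score tensor (one row per query, one column per page) is an assumption:

```python
# Minimal retrieval sketch reusing `model`, `processor`, and `device` from
# the Basic Usage example. `page_images` is a placeholder corpus of PIL
# images (e.g. rendered PDF pages).
page_images = [image]  # replace with your own document pages
query = "What does the report say about revenue?"

with torch.no_grad():
    # Embed every page once; these embeddings can be cached and reused.
    page_inputs = processor.process_images(page_images)
    page_inputs = {k: v.to(device) for k, v in page_inputs.items()}
    page_embs = model(**page_inputs)

    # Embed the incoming query.
    query_inputs = processor.process_queries([query])
    query_inputs = {k: v.to(device) for k, v in query_inputs.items()}
    query_emb = model(**query_inputs)

# Score the query against all pages and keep the best match.
# (Assumes `scores` has shape (num_queries, num_pages).)
scores = processor.score(query_emb, page_embs, batch_size=1, device=device)
best_page = scores.argmax(dim=-1).item()
print(f"Best matching page: {best_page} (score {scores.max().item():.4f})")
```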
✨ Features
- Specifically designed for multimodal document retrieval, supporting queries on documents with tables, charts, infographics, and complex layouts.
- Generates ColBERT-style multi-vector representations of pages.
- Simplifies and accelerates RAG pipelines by removing the need for OCR-based text extraction.
📚 Documentation
Evaluations
We evaluated granite-vision-3.3-2b-embedding alongside other top ColBERT-style multimodal embedding models in the 1B-4B parameter range using two benchmarks, ViDoRe 2 and Real-MM-RAG-Bench, which specifically target complex multimodal document retrieval tasks.
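For reference, NDCG@5 measures how well the top five retrieved pages are ranked against graded relevance labels. A minimal single-query sketch of the metric (not the benchmarks' own evaluation harnesses) looks like this:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """NDCG@k for one query: `relevances` are the graded relevance labels
    of the retrieved items, in the order the system ranked them."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: the single relevant page (label 1) was retrieved at rank 2.
print(ndcg_at_k([0, 1, 0, 0, 0]))  # ≈ 0.63
```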
NDCG@5 - ViDoRe V2
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|---|---|---|---|---|---|
| ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 62.3 |
| Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 48.3 |
| MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 | 60.0 |
| ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 | 54.0 |
| ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 | 53.5 |
| MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 53.6 |
| Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 | 60.0 |
| Avg (ViDoRe2) | 54.5 | 60.6 | 60.2 | 54.0 | 56.0 |
NDCG@5 - REAL-MM-RAG
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|---|---|---|---|---|---|
| FinReport | 55 | 66 | 78 | 65 | 70 |
| FinSlides | 68 | 79 | 81 | 55 | 74 |
| TechReport | 78 | 86 | 88 | 83 | 84 |
| TechSlides | 90 | 93 | 92 | 91 | 93 |
| Avg (REAL-MM-RAG) | 73 | 81 | 85 | 74 | 80 |
Model Architecture
The architecture of granite-vision-3.3-2b-embedding follows the [ColPali](https://arxiv.org/abs/2407.01449) approach and consists of the following components:
(1) Vision-language model: [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
(2) Projection layer: a linear layer that projects the hidden dimension of the vision-language model down to 128, yielding 729 embedding vectors per image.
Scoring is computed using a MaxSim-based late-interaction mechanism.
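To make the late-interaction step concrete, here is a minimal PyTorch sketch of MaxSim scoring, a simplified stand-in for `processor.score` that ignores batching and padding masks: each query vector is matched against its most similar page vector, and the matches are summed.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction for one query/page pair.

    query_emb: (num_query_tokens, 128); page_emb: (729, 128).
    For each query vector, take the maximum dot product over all page
    vectors, then sum those maxima over the query vectors.
    """
    sim = query_emb @ page_emb.T          # (num_query_tokens, 729) similarity matrix
    return sim.max(dim=-1).values.sum()   # best page match per query vector, summed

# Toy example with random embeddings of the documented shapes.
q = torch.randn(16, 128)
p = torch.randn(729, 128)
print(maxsim_score(q, p))
```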
Training Data
Our training data comes entirely from DocFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports.
Infrastructure
We trained granite-vision-3.3-2b-embedding on IBM's cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
Ethical Considerations and Limitations
The use of large vision-and-language models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses. Regarding ethics, a latent risk associated with all large language models is their malicious utilization. We urge the community to use granite-vision-3.3-2b-embedding with ethical intentions and in a responsible way.
Resources
- 📄 Granite Vision technical report here
- 📄 Real-MM-RAG-Bench paper (ACL 2025) here
- 📄 ViDoRe 2 paper here
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
📄 License
This project is licensed under the Apache 2.0 license.
🔧 Technical Details
The model is based on granite-vision-3.3-2b and uses a ColPali-style architecture. A projection layer maps the hidden dimension of the vision-language model to 128 and outputs 729 embedding vectors per image; scoring uses a MaxSim-based late-interaction mechanism. The model was trained on IBM's cognitive computing cluster with NVIDIA A100 GPUs, using training data from DocFM.