ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy
ColPali is a model leveraging a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images. The model was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository. The Hugging Face 🤗 transformers implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan).

🚀 Quick Start
⚠️ Important Note
This version of ColPali should be loaded with the 🤗 transformers release, not with `colpali-engine`. It was converted using the `convert_colpali_weights_to_hf.py` script from the `vidore/colpali-v1.3-merged` checkpoint.
✨ Features
- Based on Vision Language Models (VLMs), ColPali can efficiently index documents from their visual features.
- It extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images.
📦 Installation
ColPali is integrated natively into the 🤗 transformers library, so no separate package is required.
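A minimal setup sketch, covering only the dependencies used in the examples below. The version bound is an assumption based on when the ColPali classes landed in transformers, and is worth double-checking:

```bash
# Assumption: the ColPali classes require transformers v4.46 or later.
pip install "transformers>=4.46" torch pillow
```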
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.3-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs: document page images and text queries
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Preprocess the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward passes: one for the images, one for the queries
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction score of every query against every image
scores = processor.score_retrieval(query_embeddings.embeddings, image_embeddings.embeddings)
```
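The scoring call returns one score per (query, image) pair; assuming a tensor of shape (num_queries, num_images), the best-matching page for each query can be read off with a row-wise argmax:

```python
# `scores` holds one late-interaction score per (query, image) pair;
# higher means a better match.
best_image_per_query = scores.argmax(dim=1)
```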
📚 Documentation
Model Description
Read the 🤗 transformers model card: https://huggingface.co/docs/transformers/en/model_doc/colpali.
Model Training
Dataset
Our training dataset consists of 127,460 query-page pairs. It includes train sets from openly available academic datasets (63%) and a synthetic dataset composed of pages from web-crawled PDF documents, augmented with VLM-generated (Claude 3 Sonnet) pseudo-questions (37%). The training set is fully English, allowing us to study zero-shot generalization to non-English languages. We ensure no multi-page PDF document is used in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples for hyperparameter tuning.
Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and may occur during PaliGemma-3B's multimodal training.
Parameters
All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha = 32` and `r = 32` on the transformer layers of the language model as well as on the final, randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
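As a reading aid, the hyperparameters above translate into roughly the following `peft`/`transformers` configuration. This is a hedged sketch, not the official training script (which lives in the `colpali-engine` repository); in particular, the `target_modules` and `modules_to_save` names are assumptions:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers, plus the final projection
# layer trained from scratch. Module names below are assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    modules_to_save=["custom_text_proj"],  # assumed name of the randomly initialized projection
    task_type="FEATURE_EXTRACTION",
)

training_args = TrainingArguments(
    output_dir="./colpali-training",    # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=4,      # 8 GPUs x 4 samples = global batch size of 32
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,                 # 2.5% warmup steps
    bf16=True,                          # bfloat16 training
    optim="paged_adamw_8bit",
)
```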
🔧 Technical Details
The model is built on Vision Language Models (VLMs) with a novel architecture and training strategy: it extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images, which are compared using a late-interaction (MaxSim) scoring mechanism.
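To make the late-interaction mechanism concrete, here is a minimal sketch of ColBERT-style MaxSim scoring between one query and one document, assuming both are given as token-embedding matrices. This is an illustration only, not the library's actual implementation (which is exposed via `processor.score_retrieval`):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction between one query and one document.

    query_emb: (num_query_tokens, dim) multi-vector query representation.
    doc_emb:   (num_doc_tokens, dim) multi-vector document (page) representation.
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token (MaxSim),
    # then sum these maxima over the query tokens to get one scalar score.
    return sim.max(dim=1).values.sum()
```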
📄 License
ColPali's vision-language backbone model (PaliGemma) is under the `gemma` license, as specified in its model card. ColPali inherits this `gemma` license.
Resources
- The ColPali arXiv paper can be found here.
- The official blog post detailing ColPali can be found here.
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found here.
- Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity map generation can be found here.
Limitations
- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less-represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original paper as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```