ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy
ColPali is a model leveraging a novel architecture and training strategy based on Vision Language Models (VLMs). It efficiently indexes documents using their visual features, offering a powerful solution for visual document retrieval.
⚠️ Important Note
This version of ColPali should be loaded with the transformers 🤗 release, not with colpali-engine.
It was converted using the convert_colpali_weights_to_hf.py script from the vidore/colpali-v1.2-merged checkpoint.
✨ Features
- Novel Architecture: Based on PaliGemma-3B, it generates ColBERT-style multi-vector representations of text and images.
- Multilingual Potential: Although trained on English data, multilingual data is present in the pretraining corpus of the language model, allowing for potential zero-shot generalization to non-English languages.

📦 Installation
No specific installation steps were provided in the original README.
💻 Usage Examples
Basic Usage
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # use "cpu" or "mps" if no CUDA GPU is available
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_retrieval(query_embeddings.embeddings, image_embeddings.embeddings)
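Once scores are available, a typical next step is to rank the images for each query. The following is a minimal sketch, assuming `scores` holds one row per query and one column per image. It uses a hypothetical nested list, `example_scores`, as a stand-in for something like `scores.tolist()`, and `rank_images` is an illustrative helper, not part of transformers.

```python
# Hypothetical stand-in for `scores.tolist()`:
# one row per query, one column per image (made-up values).
example_scores = [
    [12.5, 7.1],
    [6.3, 9.8],
]


def rank_images(score_rows, top_k=1):
    """Return, for each query, the indices of the top_k highest-scoring images."""
    rankings = []
    for row in score_rows:
        # Sort image indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append(order[:top_k])
    return rankings


best = rank_images(example_scores, top_k=1)
print(best)
```

With the made-up values above, the first query ranks image 0 highest and the second query ranks image 1 highest.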
Documentation
Model Description
This model is built iteratively, starting from an off-the-shelf SigLIP model. SigLIP was fine-tuned to create BiSigLIP, and the patch embeddings output by SigLIP were then fed to an LLM, PaliGemma-3B, to create BiPali.
Model Training
Dataset
Our training dataset of 127,460 query-page pairs consists of train sets from openly available academic datasets (63%) and a synthetic dataset. The synthetic dataset is made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). The training set is fully English, enabling zero-shot generalization studies. A validation set with 2% of the samples is used for hyperparameter tuning.
Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and may occur during PaliGemma-3B's multimodal training.
Parameters
All models are trained for 1 epoch on the train set. We train models in bfloat16 format and use low-rank adapters (LoRA) with alpha=32 and r=32 on the transformer layers from the language model, as well as the final randomly initialized projection layer. We use a paged_adamw_8bit optimizer. Training is done on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
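To make the learning-rate description concrete, here is a minimal sketch of a linear schedule with 2.5% warmup and a peak of 5e-5. The exact shape is an assumption: this README does not say that warmup starts at zero or that decay ends at zero. The function name `linear_warmup_decay_lr` and the step count `total` are hypothetical.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.025):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay.

    Assumes warmup starts at 0 and decay reaches 0 at total_steps.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0.
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / decay_steps


total = 1000  # hypothetical number of optimizer steps, not taken from the README
for s in (0, 25, 500, 1000):
    print(s, linear_warmup_decay_lr(s, total))
```

With 1000 steps, the first 25 steps are warmup, the peak is reached at step 25, and the rate falls linearly afterwards.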
Resources
- The ColPali arXiv paper can be found here.
- The official blog post detailing ColPali can be found here.
- The original model implementation code for the ColPali model and for the colpali-engine package can be found here.
- Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity map generation can be found here.
Limitations
- Focus: The model mainly focuses on PDF-type documents and high-resource languages, which may limit its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks without native multi-vector support.
🔧 Technical Details
The model is based on a novel architecture and training strategy. It uses the ColBERT strategy to compute interactions between text tokens and image patches, which significantly improves performance compared to BiPali. The training dataset is carefully designed to prevent evaluation contamination and enable zero-shot generalization studies.
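The interaction between text tokens and image patches can be illustrated with the late-interaction (MaxSim) score as it is commonly described for ColBERT: for each query token vector, take the maximum dot product over all page patch vectors, then sum over query tokens. The sketch below uses plain Python lists and made-up 2-dimensional vectors. The helper names are hypothetical, and whether `score_retrieval` performs exactly this computation (for example with respect to normalization or batching) is not stated in this README.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Made-up embeddings: 2 query tokens and 2 patches per page, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.5, 0.5]]
page_b = [[0.0, 0.25], [0.125, 0.125]]

print(maxsim_score(query, page_a))
print(maxsim_score(query, page_b))
```

In this toy example, `page_a` scores higher than `page_b` because each query token finds a closer matching patch on it.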
License
ColPali's vision language backbone model (PaliGemma) is under the gemma license, as specified in its model card. ColPali inherits this gemma license.
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
| Property | Details |
|----------|---------|
| Library Name | transformers |
| Tags | colpali |
| License | gemma |
| Datasets | vidore/colpali_train_set |
| Language | en |
| Base Model | vidore/colpaligemma-3b-pt-448-base |
| New Version | vidore/colpali-v1.3-hf |
| Pipeline Tag | visual-document-retrieval |