🚀 ColQwen2.5: Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy
ColQwen2.5 is a model that leverages a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends Qwen2.5-VL-3B to generate ColBERT-style multi-vector representations of text and images. The approach was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

✨ Features
- Dynamic Image Resolution: The model accepts images at their native resolution without resizing them, so their aspect ratio is preserved (unlike ColPali, which resizes inputs). The maximum resolution is capped so that at most 768 image patches are created. Experiments show that a larger patch budget yields clear retrieval improvements, at the cost of higher memory requirements. A rough pixel-budget calculation is sketched after this list.
- Specific Training Version: This version is trained with `colpali-engine==0.3.7`.
- Same Training Data: The training data is the same as the ColPali data described in the paper.
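To make the 768-patch budget concrete: assuming the Qwen2.5-VL convention that each visual token covers a 28 x 28 pixel area (14-pixel patches merged 2 x 2), the cap corresponds to roughly 0.6 megapixels. The snippet below is a back-of-the-envelope sketch under that assumption, not the processor's exact resizing logic:

```python
# Back-of-the-envelope pixel budget for the 768-patch cap,
# assuming each visual token covers a 28x28-pixel area (Qwen2.5-VL convention).
MAX_PATCHES = 768
PIXELS_PER_PATCH = 28 * 28

max_pixels = MAX_PATCHES * PIXELS_PER_PATCH  # 602,112 pixels

def scale_factor(width: int, height: int) -> float:
    """Aspect-ratio-preserving downscale factor applied when an image exceeds the budget."""
    return min(1.0, (max_pixels / (width * height)) ** 0.5)

print(scale_factor(800, 600))    # 1.0 -> fits without downscaling
print(scale_factor(1654, 2339))  # ~0.39 -> an A4 scan at 200 dpi is scaled down
```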
📦 Installation
Make sure `colpali-engine` is installed from source or with a version greater than 0.3.1. The `transformers` version must be > 4.45.0.
```bash
pip install git+https://github.com/illuin-tech/colpali
```
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model = ColQwen2_5.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained("vidore/colqwen2.5-v0.2")

# Inputs: dummy images and example queries
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass to get multi-vector embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scores between all queries and all images
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
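The returned scores is a tensor of shape (n_queries, n_images) with one late-interaction score per query/image pair. As an illustrative follow-up (standard PyTorch only, reusing the variable names from the snippet above), you can rank the indexed pages for each query:

```python
# scores: (n_queries, n_images) late-interaction similarity matrix.
k = 2  # only two dummy images are indexed here
topk_scores, topk_indices = scores.topk(k, dim=1)

for query, idxs, vals in zip(queries, topk_indices.tolist(), topk_scores.tolist()):
    print(query)
    for rank, (image_idx, score) in enumerate(zip(idxs, vals), start=1):
        print(f"  {rank}. image {image_idx} (score={score:.2f})")
```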
📚 Documentation
Model Training
Dataset
Our training dataset consists of 127,460 query-page pairs. It is composed of train sets from openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude 3 Sonnet) pseudo-questions (37%). By design, our training set is fully English, which allows us to study zero-shot generalization to non-English languages. We explicitly ensure that no multi-page PDF document is used in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.
Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.
Parameters
All models are trained for 1 epoch on the train set. Unless otherwise specified, we train models in `bfloat16` format. We use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model, as well as on the final randomly initialized projection layer, and a `paged_adamw_8bit` optimizer. Training is conducted on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
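For readers who want to reproduce a comparable setup, the hyperparameters above map roughly onto a peft LoraConfig and transformers TrainingArguments as sketched below. This is a hedged approximation, not the colpali-engine training script: the target modules and output directory are illustrative assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers (alpha = r = 32), as described above.
# target_modules is an illustrative guess; exact module names depend on the backbone.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Global batch size 32 on 8 GPUs with data parallelism -> 4 samples per device.
training_args = TrainingArguments(
    output_dir="colqwen2.5-finetune",  # hypothetical output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,                # 2.5% warmup steps
    optim="paged_adamw_8bit",
    bf16=True,
)
```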
🔧 Technical Details
- Model Architecture: Based on Vision Language Models (VLMs), it extends Qwen2.5-VL-3B and generates ColBERT-style multi-vector representations of text and images.
- Input Handling: Accepts dynamic image resolutions without resizing, with a maximum resolution set to create at most 768 image patches.
📄 License
ColQwen2.5's vision-language backbone model (Qwen2.5-VL) is released under the Qwen RESEARCH LICENSE AGREEMENT. The adapters attached to the model are under the MIT license.
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original work as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
Limitations
⚠️ Important Note
- Focus: The model primarily targets PDF-type documents and high-resource languages, which may limit its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism (sketched below), which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
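For context on the late-interaction mechanism mentioned above, the MaxSim score between one multi-vector query embedding and one multi-vector page embedding can be written in a few lines of PyTorch. This is a conceptual sketch of what processor.score_multi_vector computes, not the batched, padding-aware colpali-engine implementation:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction (MaxSim) score.

    query_emb: (n_query_tokens, dim) multi-vector query embedding
    page_emb:  (n_page_tokens, dim)  multi-vector page embedding
    """
    # Token-level similarity matrix: (n_query_tokens, n_page_tokens).
    sim = query_emb @ page_emb.T
    # For each query token, keep its best-matching page token, then sum over query tokens.
    return sim.max(dim=1).values.sum()
```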