# 🚀 ColQwen2.5: Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy
ColQwen2.5 is a model based on a novel architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends Qwen2.5-VL-3B to generate ColBERT-style multi-vector representations of text and images. It was introduced in the paper *ColPali: Efficient Document Retrieval with Vision Language Models* and first released in this repository.

## ✨ Features
- **Dynamic Image Resolution**: This model accepts images at their native resolution without resizing, thus preserving the aspect ratio, unlike ColPali. The maximum resolution is capped so that at most 768 image patches are created. Experiments show that more image patches lead to better performance, at the cost of higher memory requirements (see the sketch after this list).
- **Trained with Specific Version**: This version is trained with `colpali-engine==0.3.7`.
- Same Training Data: The training data is the same as the ColPali data described in the paper.
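
A minimal sketch of how the dynamic resolution plays out in practice, assuming the `ColQwen2_5_Processor` API shown in the usage example below (the image sizes here are arbitrary placeholders):

```python
from PIL import Image
from colpali_engine.models import ColQwen2_5_Processor

processor = ColQwen2_5_Processor.from_pretrained("vidore/colqwen2.5-v0.1")

# Pages of different sizes are not resized to a fixed shape, so the number
# of visual tokens (and thus output multi-vectors) varies per image,
# up to the 768-patch cap.
small = Image.new("RGB", (448, 448), color="white")
large = Image.new("RGB", (1344, 1344), color="white")

for name, img in [("small", small), ("large", large)]:
    batch = processor.process_images([img])
    print(name, "sequence length:", batch["input_ids"].shape[-1])
```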
## 📦 Installation
Make sure `colpali-engine` is installed from source or with a version greater than 0.3.1. Also, the `transformers` version must be greater than 4.45.0.

```bash
pip install git+https://github.com/illuin-tech/colpali
```
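
As a quick sanity check that both requirements are met, the installed versions can be inspected; a minimal sketch, not part of the official instructions:

```python
from importlib.metadata import version

# Both packages must satisfy the version requirements above.
print("colpali-engine:", version("colpali-engine"))
print("transformers:", version("transformers"))
```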
## 💻 Usage Examples

### Basic Usage
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

# Load the model on GPU in bfloat16, using FlashAttention 2 if available.
model = ColQwen2_5.from_pretrained(
    "vidore/colqwen2.5-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2_5_Processor.from_pretrained("vidore/colqwen2.5-v0.1")

# Your inputs (placeholder images and queries).
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs.
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: one multi-vector embedding per input.
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scores, shape (n_queries, n_images).
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
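
To turn the score matrix into a ranking, one can take the argmax per query; a minimal sketch continuing from the block above:

```python
# scores has shape (n_queries, n_images); higher is better.
best = scores.argmax(dim=1)
for i, query in enumerate(queries):
    print(f"{query!r} -> image {best[i].item()} (score={scores[i, best[i]].item():.2f})")
```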
## 📚 Documentation

### Model Training

#### Dataset
Our training dataset consists of 127,460 query-page pairs. It includes train sets from openly available academic datasets (63%) and a synthetic dataset (37%) composed of pages from web-crawled PDF documents with pseudo-questions generated by a VLM (Claude-3 Sonnet). The training set is fully English, allowing us to study zero-shot generalization to non-English languages. We ensure that no multi-page PDF document appears in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples for hyperparameter tuning.
Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.
#### Parameters
All models are trained for 1 epoch on the train set. Unless otherwise specified, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model and the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
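
The setup above could be expressed roughly as follows with `peft` and `transformers`. This is a minimal sketch, not the authors' training script; the LoRA target modules and the per-device batch size are assumptions:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers (target modules are an assumption).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = TrainingArguments(
    output_dir="./colqwen2.5-train",
    num_train_epochs=1,
    per_device_train_batch_size=4,  # 8 GPUs x 4 = global batch size 32 (assumption)
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,             # 2.5% warmup steps
    bf16=True,
    optim="paged_adamw_8bit",
)
```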
### Limitations

- **Focus**: The model mainly focuses on PDF-type documents and high-resource languages, which may limit its generalization to other document types or less-represented languages.
- **Support**: The model relies on multi-vector retrieval based on the ColBERT late-interaction mechanism (sketched below). Adapting it to widely used vector retrieval frameworks without native multi-vector support may require engineering effort.
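
For reference, late interaction scores a query against a document by matching each query vector to its best document vector and summing the result; a minimal sketch, assuming normalized embeddings:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take the maximum
    similarity over all document tokens, then sum over query tokens.

    query_emb: (n_query_tokens, dim), doc_emb: (n_doc_tokens, dim)
    """
    sim = query_emb @ doc_emb.T  # (n_query_tokens, n_doc_tokens)
    return sim.max(dim=1).values.sum()
```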
## 📄 License

ColQwen2.5's vision-language backbone model (Qwen2.5-VL) is under the Qwen RESEARCH LICENSE AGREEMENT. The adapters attached to the model are under the MIT license.
## Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
## Citation

If you use any datasets or models from this organization in your research, please cite the original work as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
## 📋 Information Table

| Property | Details |
|----------|---------|
| Model Type | Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy |
| Training Data | 127,460 query-page pairs: 63% from openly available academic datasets, 37% synthetic |
| Training Version | `colpali-engine==0.3.7` |