🚀 IEIT-Systems ColQwen2-7B: Visual Retriever based on Qwen2-VL-7B-Instruct with ColBERT strategy
This project presents IEIT-Systems ColQwen2-7B, a visual retriever built on the Qwen2-VL-7B-Instruct model with the ColBERT strategy. It efficiently indexes documents from their visual features, offering a novel approach to visual document retrieval.
🚀 Quick Start
To get started, make sure `colpali-engine` is installed from source or at a version greater than 0.3.4, and that your `transformers` version is greater than 4.46.1. You can install the necessary packages with:

```bash
pip install git+https://github.com/illuin-tech/colpali
```
Here is a basic usage example:
```python
import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "yydxlv/colqwen2-7b-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColQwen2Processor.from_pretrained("yydxlv/colqwen2-7b-v1.0")

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
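The resulting `scores` tensor has one row per query and one column per image, with higher values indicating stronger relevance. A minimal ranking step, continuing from the snippet above, might look like this:

```python
# `scores` has shape (num_queries, num_images); higher means more relevant.
best = scores.argmax(dim=1)
for i, query in enumerate(queries):
    print(f"{query!r} -> image {best[i].item()} (score={scores[i, best[i]].item():.2f})")
```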
✨ Features
- Novel Architecture and Strategy: ColQwen is based on a novel model architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features.
- Multi-vector Representations: It extends Qwen2-VL-7B to generate ColBERT-style multi-vector representations of text and images.
- Dynamic Image Resolution: The model accepts images at their native resolution without resizing them, preserving aspect ratio. The maximum resolution is capped so that at most 768 image patches are created.
📦 Installation
To use this model, install the `colpali-engine` package from source:

```bash
pip install git+https://github.com/illuin-tech/colpali
```

Make sure your `transformers` version is greater than 4.46.1.
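To confirm the environment meets these requirements before loading the model, a quick standard-library check can help (a minimal sketch):

```python
from importlib.metadata import version

# The model card requires colpali-engine > 0.3.4 and transformers > 4.46.1.
print("colpali-engine:", version("colpali-engine"))
print("transformers:", version("transformers"))
```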
📚 Documentation
Version Specificity
This model accepts images at their native resolution and does not resize them, unlike ColPali, which changes the aspect ratio. The maximum resolution is capped so that at most 768 image patches are created. Experiments show that a larger patch budget yields clear improvements, at the cost of increased memory requirements.
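To get a feel for the 768-patch budget, the back-of-the-envelope sketch below estimates the patch count from image size. The ~28-pixel effective patch size is an assumption based on the Qwen2-VL architecture (14x14 patches merged 2x2), not something this card specifies:

```python
# Rough estimate of visual tokens for a page image, assuming ~28x28-pixel
# effective patches (Qwen2-VL's 14x14 patches after the 2x2 merge step).
def approx_patches(width: int, height: int, patch: int = 28) -> int:
    return (width // patch) * (height // patch)

print(approx_patches(1024, 1024))  # ~1296 patches -> exceeds 768, image is downscaled
print(approx_patches(768, 768))    # ~729 patches  -> fits within the budget
```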
This version is trained with `colpali-engine==0.3.4`. The training data is the same as the ColPali data described in the paper, and fine-tuning was also carried out with the ShareGPT4V dataset (https://sharegpt4v.github.io/).
Model Training
Parameters
We train models using low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model, as well as on the final, randomly initialized projection layer, using a `paged_adamw_8bit` optimizer.

Training is carried out on an 8xA100 GPU setup with distributed data parallelism (via accelerate). The learning rate is 5e-4 with linear decay and 1% warmup steps, the per-device batch size is 32, and the data is in `bfloat16` format.
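For readers who want to set up a comparable configuration, a hedged sketch of the corresponding peft config follows. The exact `target_modules` used for this checkpoint are not listed on the card, so the module names below are illustrative:

```python
from peft import LoraConfig

# Illustrative LoRA setup matching the hyperparameters described above
# (r=32, alpha=32 on the language-model transformer layers). The target
# module names are an assumption, not taken from this card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="FEATURE_EXTRACTION",
)
```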
🔧 Technical Details
The model is an extension of Qwen2-VL-7B and generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.
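The ColBERT-style score between a query and a document is the sum, over query tokens, of each token's maximum similarity to any document token ("late interaction"). A minimal PyTorch sketch of this MaxSim scoring, equivalent in spirit to what `processor.score_multi_vector` computes over batches:

```python
import torch

def maxsim(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query/document pair.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    """
    sim = query_emb @ doc_emb.T         # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed
```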
📄 License
ColQwen2's vision-language backbone (Qwen2-VL) is released under the Apache 2.0 license. This fine-tuned adapter is under the CC BY-NC 4.0 license; the model is therefore restricted to research use at the moment.
📚 Citation
If you use models from this organization in your research, please cite the original paper as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
Developed by: IEIT Systems
⚠️ Important Note
- Focus: The model primarily targets PDF-type documents and high-resource languages, which may limit its generalization to other document types or under-represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism, which may require engineering effort to adapt to widely used vector-retrieval frameworks that lack native multi-vector support; a common mitigation is sketched below.
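One common workaround when a vector database only supports single-vector search is to pool each multi-vector embedding into one vector for coarse first-stage retrieval, then re-rank the shortlist with full MaxSim. A hedged sketch of that idea; the pooling strategy is a generic technique, not something prescribed by this model:

```python
import torch

def pool_embedding(multi_vec: torch.Tensor) -> torch.Tensor:
    """Collapse a (num_tokens, dim) multi-vector embedding into a single
    dim-sized vector by mean pooling, for use with single-vector indexes.
    Expect a recall drop versus full late interaction; re-rank the top-k
    candidates with MaxSim to recover quality."""
    return multi_vec.mean(dim=0)
```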
| Property | Details |
|---|---|
| Model Type | Visual retriever based on Qwen2-VL-7B-Instruct with the ColBERT strategy |
| Training Data | vidore/colpali_train_set, ShareGPT4V (https://sharegpt4v.github.io/) |
| Base Model | Qwen/Qwen2-VL-7B-Instruct |
| Library Name | peft |
| Pipeline Tag | visual-document-retrieval |
| License | CC BY-NC 4.0 |