# 🚀 ColQwen2.5: Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy
ColQwen2.5 is a model based on a novel architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends Qwen2.5-VL-3B to generate ColBERT-style multi-vector representations of text and images. It was introduced in the paper *ColPali: Efficient Document Retrieval with Vision Language Models* and first released in this repository.

## ✨ Features
- **Dynamic Image Resolution**: This model accepts images at their native resolution without resizing, thus preserving the aspect ratio, unlike ColPali. The maximum resolution is capped so that at most 768 image patches are created. Experiments show that more image patches lead to better performance, at the cost of higher memory requirements (see the sketch after this list).
- **Trained with Specific Version**: This version is trained with `colpali-engine==0.3.7`.
- Same Training Data: The training data is the same as the ColPali data described in the paper.
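
A minimal sketch of how the dynamic resolution plays out in practice, assuming the `ColQwen2_5_Processor` API shown in the usage example below (the image sizes here are arbitrary placeholders):

```python
from PIL import Image
from colpali_engine.models import ColQwen2_5_Processor

processor = ColQwen2_5_Processor.from_pretrained("vidore/colqwen2.5-v0.1")

# Pages of different sizes are not resized to a fixed shape, so the number
# of visual tokens (and thus output multi-vectors) varies per image,
# up to the 768-patch cap.
small = Image.new("RGB", (448, 448), color="white")
large = Image.new("RGB", (1344, 1344), color="white")

for name, img in [("small", small), ("large", large)]:
    batch = processor.process_images([img])
    print(name, "sequence length:", batch["input_ids"].shape[-1])
```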
## 📦 Installation
Make sure `colpali-engine` is installed from source or with a version greater than 0.3.1. Also, the `transformers` version must be greater than 4.45.0.

```bash
pip install git+https://github.com/illuin-tech/colpali
```
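
As a quick sanity check that both requirements are met, the installed versions can be inspected; a minimal sketch, not part of the official instructions:

```python
from importlib.metadata import version

# Both packages must satisfy the version requirements above.
print("colpali-engine:", version("colpali-engine"))
print("transformers:", version("transformers"))
```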
## 💻 Usage Examples

### Basic Usage
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

# Load the model on GPU in bfloat16, using FlashAttention 2 if available.
model = ColQwen2_5.from_pretrained(
    "vidore/colqwen2.5-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2_5_Processor.from_pretrained("vidore/colqwen2.5-v0.1")

# Your inputs (placeholder images and queries).
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs.
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: one multi-vector embedding per input.
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scores, shape (n_queries, n_images).
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
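
To turn the score matrix into a ranking, one can take the argmax per query; a minimal sketch continuing from the block above:

```python
# scores has shape (n_queries, n_images); higher is better.
best = scores.argmax(dim=1)
for i, query in enumerate(queries):
    print(f"{query!r} -> image {best[i].item()} (score={scores[i, best[i]].item():.2f})")
```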
## 📚 Documentation

### Model Training

#### Dataset
Our training dataset consists of 127,460 query-page pairs. It includes train sets from openly available academic datasets (63%) and a synthetic dataset (37%) composed of pages from web-crawled PDF documents with pseudo-questions generated by a VLM (Claude-3 Sonnet). The training set is fully English, allowing us to study zero-shot generalization to non-English languages. We ensure that no multi-page PDF document appears in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples for hyperparameter tuning.
Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.
#### Parameters
All models are trained for 1 epoch on the train set. Unless otherwise specified, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model and the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
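
The setup above could be expressed roughly as follows with `peft` and `transformers`. This is a minimal sketch, not the authors' training script; the LoRA target modules and the per-device batch size are assumptions:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers (target modules are an assumption).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = TrainingArguments(
    output_dir="./colqwen2.5-train",
    num_train_epochs=1,
    per_device_train_batch_size=4,  # 8 GPUs x 4 = global batch size 32 (assumption)
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,             # 2.5% warmup steps
    bf16=True,
    optim="paged_adamw_8bit",
)
```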
### Limitations

- **Focus**: The model mainly focuses on PDF-type documents and high-resource languages, which may limit its generalization to other document types or less-represented languages.
- **Support**: The model relies on multi-vector retrieval based on the ColBERT late-interaction mechanism (sketched below). Adapting it to widely used vector retrieval frameworks without native multi-vector support may require engineering effort.
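
For reference, late interaction scores a query against a document by matching each query vector to its best document vector and summing the result; a minimal sketch, assuming normalized embeddings:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take the maximum
    similarity over all document tokens, then sum over query tokens.

    query_emb: (n_query_tokens, dim), doc_emb: (n_doc_tokens, dim)
    """
    sim = query_emb @ doc_emb.T  # (n_query_tokens, n_doc_tokens)
    return sim.max(dim=1).values.sum()
```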
## 📄 License

ColQwen2.5's vision-language backbone model (Qwen2.5-VL) is under the Qwen RESEARCH LICENSE AGREEMENT. The adapters attached to the model are under the MIT license.
## Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
## Citation

If you use any datasets or models from this organization in your research, please cite the original work as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
## 📋 Information Table

| Property | Details |
|----------|---------|
| Model Type | Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy |
| Training Data | 127,460 query-page pairs: 63% from openly available academic datasets, 37% synthetic |
| Training Version | `colpali-engine==0.3.7` |