🚀 ColQwen2.5: Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy
ColQwen2.5 is a model that leverages a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends Qwen2.5-VL-3B to generate ColBERT-style multi-vector representations of text and images. The approach was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

✨ Features
- Dynamic Image Resolution: The model accepts images at their native resolution without resizing them, so their aspect ratio is preserved (unlike ColPali, which resizes inputs). The maximum resolution is capped so that at most 768 image patches are created. Experiments show that a larger patch budget yields clear retrieval improvements, at the cost of higher memory requirements. A rough pixel-budget calculation is sketched after this list.
- Specific Training Version: This version is trained with `colpali-engine==0.3.7`.
- Same Training Data: The training data is the same as the ColPali data described in the paper.
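To make the 768-patch budget concrete: assuming the Qwen2.5-VL convention that each visual token covers a 28 x 28 pixel area (14-pixel patches merged 2 x 2), the cap corresponds to roughly 0.6 megapixels. The snippet below is a back-of-the-envelope sketch under that assumption, not the processor's exact resizing logic:

```python
# Back-of-the-envelope pixel budget for the 768-patch cap,
# assuming each visual token covers a 28x28-pixel area (Qwen2.5-VL convention).
MAX_PATCHES = 768
PIXELS_PER_PATCH = 28 * 28

max_pixels = MAX_PATCHES * PIXELS_PER_PATCH  # 602,112 pixels

def scale_factor(width: int, height: int) -> float:
    """Aspect-ratio-preserving downscale factor applied when an image exceeds the budget."""
    return min(1.0, (max_pixels / (width * height)) ** 0.5)

print(scale_factor(800, 600))    # 1.0 -> fits without downscaling
print(scale_factor(1654, 2339))  # ~0.39 -> an A4 scan at 200 dpi is scaled down
```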
📦 Installation
Make sure `colpali-engine` is installed from source or with a version greater than 0.3.1. The `transformers` version must be > 4.45.0.
```bash
pip install git+https://github.com/illuin-tech/colpali
```
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model = ColQwen2_5.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained("vidore/colqwen2.5-v0.2")

# Inputs: dummy images and example queries
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass to get multi-vector embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scores between all queries and all images
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
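The returned scores is a tensor of shape (n_queries, n_images) with one late-interaction score per query/image pair. As an illustrative follow-up (standard PyTorch only, reusing the variable names from the snippet above), you can rank the indexed pages for each query:

```python
# scores: (n_queries, n_images) late-interaction similarity matrix.
k = 2  # only two dummy images are indexed here
topk_scores, topk_indices = scores.topk(k, dim=1)

for query, idxs, vals in zip(queries, topk_indices.tolist(), topk_scores.tolist()):
    print(query)
    for rank, (image_idx, score) in enumerate(zip(idxs, vals), start=1):
        print(f"  {rank}. image {image_idx} (score={score:.2f})")
```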
📚 Documentation
Model Training
Dataset
Our training dataset consists of 127,460 query-page pairs. It is composed of train sets from openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude 3 Sonnet) pseudo-questions (37%). By design, our training set is fully English, which allows us to study zero-shot generalization to non-English languages. We explicitly ensure that no multi-page PDF document is used in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.
Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.
Parameters
All models are trained for 1 epoch on the train set. Unless otherwise specified, we train models in `bfloat16` format. We use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model, as well as on the final randomly initialized projection layer, and a `paged_adamw_8bit` optimizer. Training is conducted on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
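For readers who want to reproduce a comparable setup, the hyperparameters above map roughly onto a peft LoraConfig and transformers TrainingArguments as sketched below. This is a hedged approximation, not the colpali-engine training script: the target modules and output directory are illustrative assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers (alpha = r = 32), as described above.
# target_modules is an illustrative guess; exact module names depend on the backbone.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Global batch size 32 on 8 GPUs with data parallelism -> 4 samples per device.
training_args = TrainingArguments(
    output_dir="colqwen2.5-finetune",  # hypothetical output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,                # 2.5% warmup steps
    optim="paged_adamw_8bit",
    bf16=True,
)
```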
🔧 Technical Details
- Model Architecture: Based on Vision Language Models (VLMs), it extends Qwen2.5-VL-3B and generates ColBERT-style multi-vector representations of text and images.
- Input Handling: Accepts dynamic image resolutions without resizing, with a maximum resolution set to create at most 768 image patches.
📄 License
ColQwen2.5's vision-language backbone model (Qwen2.5-VL) is released under the Qwen RESEARCH LICENSE AGREEMENT. The adapters attached to the model are under the MIT license.
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original work as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
Limitations
⚠️ Important Note
- Focus: The model primarily targets PDF-type documents and high-resource languages, which may limit its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism (sketched below), which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
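For context on the late-interaction mechanism mentioned above, the MaxSim score between one multi-vector query embedding and one multi-vector page embedding can be written in a few lines of PyTorch. This is a conceptual sketch of what processor.score_multi_vector computes, not the batched, padding-aware colpali-engine implementation:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction (MaxSim) score.

    query_emb: (n_query_tokens, dim) multi-vector query embedding
    page_emb:  (n_page_tokens, dim)  multi-vector page embedding
    """
    # Token-level similarity matrix: (n_query_tokens, n_page_tokens).
    sim = query_emb @ page_emb.T
    # For each query token, keep its best-matching page token, then sum over query tokens.
    return sim.max(dim=1).values.sum()
```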