🚀 ColQwen2: Visual Retriever based on Qwen2-VL-2B-Instruct with ColBERT strategy
ColQwen2 is a model that leverages a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is an extension of Qwen2-VL-2B-Instruct that generates ColBERT-style multi-vector representations of text and images. The model was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.
It is initialized from an untrained base version to guarantee deterministic initialization of the projection layer.

✨ Features
Version specificity
This model accepts dynamic image resolutions without resizing, thus preserving the aspect ratio, unlike ColPali. The maximal resolution is set so that at most 768 image patches are created. Experiments show significant improvements with more image patches, at the cost of higher memory requirements. This version is trained with `colpali-engine==0.3.1` and uses the same data as described in the ColPali paper.
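As a rough illustration of how the 768-patch budget relates to image size, here is a minimal sketch. It assumes the Qwen2-VL vision encoder groups roughly 28×28-pixel areas into one visual token and that oversized images are rescaled while keeping their aspect ratio; the function name and exact values are illustrative, not taken from the processor's implementation.

```python
def approx_num_patches(width: int, height: int, patch_px: int = 28, max_patches: int = 768) -> int:
    """Approximate the number of image patches under a fixed patch budget (sketch)."""
    patches = (width // patch_px) * (height // patch_px)
    if patches <= max_patches:
        return patches
    # Assumption: when over budget, the image is rescaled (aspect ratio preserved)
    # so that the patch count fits within max_patches.
    scale = (max_patches / patches) ** 0.5
    return (int(width * scale) // patch_px) * (int(height * scale) // patch_px)

print(approx_num_patches(1240, 1754))  # an A4-sized page scan lands just under the 768-patch cap
```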
Model Training
Dataset
Our training dataset consists of 127,460 query-page pairs. It includes train sets from openly available academic datasets (63%) and a synthetic dataset composed of web-crawled PDF pages with pseudo-questions generated by a VLM (Claude-3 Sonnet) (37%). The training set is entirely in English, allowing us to study zero-shot generalization to non-English languages. We ensure no multi-page PDF document is used in both ViDoRe and the train set to avoid evaluation contamination. A 2% validation set is used for hyperparameter tuning.
Note: Multilingual data is present in the pretraining corpus of the language model and likely also in the multimodal training data.
Parameters
All models are trained for 1 epoch on the train set. Unless otherwise specified, we train models in `bfloat16` format, using low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model and the final randomly initialized projection layer. We use a `paged_adamw_8bit` optimizer. Training is done on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
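For reference, a minimal sketch of an adapter configuration matching the hyperparameters above, using the peft library. The `r` and `alpha` values come from this card; the target modules, dropout, and task type are illustrative assumptions, not the values used for the released adapters.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                      # rank, as stated above
    lora_alpha=32,             # alpha, as stated above
    lora_dropout=0.1,          # assumption: not specified in this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="FEATURE_EXTRACTION",
)
```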
Limitations
- Focus: The model mainly targets PDF-type documents and high-resource languages, which may limit its generalization to other document types or less-represented languages.
- Support: The model relies on multi-vector retrieval from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks without native multi-vector support (see the sketch below).
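For context, here is a simplified sketch of ColBERT-style late-interaction (MaxSim) scoring, which is the operation multi-vector frameworks need to support. It mirrors what `processor.score_multi_vector` computes for a single query-document pair, but it is an illustration, not the library's implementation.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction score between one query and one document.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Each query token is matched to its most similar document token,
    and the per-token maxima are summed.
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # MaxSim reduction
```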
📦 Installation
Make sure `colpali-engine` is installed from source or with a version greater than 0.3.1. The `transformers` version must be greater than 4.45.0.

```bash
pip install git+https://github.com/illuin-tech/colpali
```
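If your environment does not already satisfy the `transformers` requirement, a standard pip version specifier can be used (not part of the original instructions, just ordinary pip usage):

```bash
pip install "transformers>4.45.0"
```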
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

# Load the model and its processor; flash attention is used when available.
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs (dummy images and queries used for illustration)
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Preprocess the inputs and move them to the model's device
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass to obtain multi-vector embeddings, then score with late interaction
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
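The scores can then be used to rank pages per query. The sketch below assumes `score_multi_vector` returns a tensor of shape (num_queries, num_images), consistent with its usage above.

```python
# Pick the highest-scoring image for each query (illustrative sketch).
best_match_per_query = scores.argmax(dim=1)
print(best_match_per_query)  # index of the top-ranked image per query
```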
📄 License
ColQwen2's vision-language backbone model (Qwen2-VL) is under the `apache-2.0` license. The adapters attached to the model are under the MIT license.
📚 Documentation
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original work as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
| Property | Details |
| --- | --- |
| Library Name | colpali |
| Base Model | vidore/colqwen2-base |
| New Version | vidore/colqwen2-v1.0 |
| Pipeline Tag | visual-document-retrieval |
| License | apache-2.0 |