🚀 ColQwen2: Visual Retriever based on Qwen2-VL-2B-Instruct with ColBERT strategy
ColQwen2 is a model leveraging a novel architecture and training strategy based on Vision Language Models (VLMs). It can efficiently index documents from their visual features, offering a powerful solution for visual document retrieval.
🚀 Quick Start
Prerequisites
Make sure `colpali-engine` is installed from source or with a version newer than 0.3.4, and that your `transformers` version is newer than 4.46.1.
pip install git+https://github.com/illuin-tech/colpali
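Alternatively, assuming the released package on PyPI satisfies the version requirements above, a pinned install might look like:

pip install "colpali-engine>=0.3.4" "transformers>4.46.1"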
Basic Usage
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Your inputs (replace the placeholder images with your own document pages)
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
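The resulting `scores` tensor has one row per query and one column per image (late-interaction scores; higher means a better match). A minimal follow-up to pick the best page for each query:

best_pages = scores.argmax(dim=1)  # index of the highest-scoring image per query
print(best_pages)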
✨ Features
- Dynamic Image Resolution: This model accepts images at their native resolution and does not resize them, unlike ColPali, which changes their aspect ratio. The maximal resolution is capped so that at most 768 image patches are created. Experiments show clear improvements with larger numbers of image patches, at the cost of higher memory requirements (see the patch-budget sketch after this list).
- Trained with Specific Engine: This version is trained with `colpali-engine==0.3.1`.
- Same Training Data: Data is the same as the ColPali data described in the paper.
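As a rough illustration of the 768-patch budget, assuming Qwen2-VL's effective 28 × 28-pixel visual token (14-pixel patches merged 2 × 2; this is an assumption about the backbone, not something stated in this card):

max_patches = 768
patch_side = 28  # assumed effective patch size in pixels (Qwen2-VL backbone)
max_pixels = max_patches * patch_side * patch_side
print(max_pixels)  # 602112 -> e.g. a ~700 x 860 px page fits within the budget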
📚 Documentation
Model Training
Dataset
Our training dataset of 127,460 query-page pairs consists of train sets from openly available academic datasets (63%) and a synthetic dataset. The synthetic dataset is made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). The training set is fully English, allowing us to study zero-shot generalization to non-English languages. We ensure no multi-page PDF document is used in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.
Note: Multilingual data is present in the pretraining corpus of the language model and, most probably, in its multimodal training data as well.
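For concreteness, the approximate split sizes implied by these percentages (derived from the figures above, not reported separately in the card):

total_pairs = 127_460
academic = round(total_pairs * 0.63)    # ~80,300 pairs from academic datasets
synthetic = round(total_pairs * 0.37)   # ~47,160 synthetic pairs
validation = round(total_pairs * 0.02)  # ~2,549 pairs held out for validation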
Parameters
All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as on the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
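A minimal sketch of a matching adapter configuration with the `peft` library; the dropout value and target module names are assumptions for illustration, not the exact ones used in training:

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,              # rank, as reported above
    lora_alpha=32,     # alpha, as reported above
    lora_dropout=0.1,  # assumed value, not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="FEATURE_EXTRACTION",
)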
Limitations
- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
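For reference, the late interaction (MaxSim) scoring behind `score_multi_vector` can be sketched in plain PyTorch; this is an illustrative re-implementation, not the library's internal code:

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction between one query and one document.
    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)."""
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed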
📄 License
ColQwen2's vision-language backbone model (Qwen2-VL) is under the Apache 2.0 license. The adapters attached to the model are under the MIT license.
📞 Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
📚 Citation
If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
📋 Model Information
| Property | Details |
|----------|---------|
| Model Type | Visual Retriever based on Qwen2-VL-2B-Instruct with ColBERT strategy |
| Training Data | 127,460 query-page pairs: 63% from openly available academic datasets, 37% synthetic |
| Pipeline Tag | visual-document-retrieval |
| Base Model | vidore/colqwen2-base |
| Library Name | colpali |
| Tags | colpali, vidore-experimental, vidore |