🚀 ColQwen2: Visual Retriever based on Qwen2-VL-2B-Instruct with ColBERT strategy
ColQwen2 is a visual retriever that leverages a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends Qwen2-VL-2B and generates ColBERT-style multi-vector representations of text and images. The model was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

The HuggingFace 🤗 transformers implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan).
🚀 Quick Start
Important Notes
⚠️ Important Note
EXPERIMENTAL: Wait for https://github.com/huggingface/transformers/pull/35778 to be merged before using!
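Until that PR is merged, this checkpoint requires an unreleased transformers build. One possible (unverified) workaround is installing transformers directly from the PR branch, e.g. `pip install git+https://github.com/huggingface/transformers.git@refs/pull/35778/head`.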
💡 Usage Tip
This version of ColQwen2 should be loaded with the 🤗 transformers release, not with colpali-engine. It was converted using the convert_colqwen2_weights_to_hf.py script from the vidore/colqwen2-v1.0-merged checkpoint.
✨ Features
- Based on a novel model architecture and training strategy built on Vision Language Models (VLMs).
- Efficiently indexes documents from their visual features.
- Generates ColBERT-style multi-vector representations of text and images (see the scoring sketch below).
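To make the last point concrete, here is a minimal sketch of how ColBERT-style late-interaction (MaxSim) scoring works on multi-vector embeddings. This illustrates the general technique, not the exact implementation behind `processor.score_retrieval`; the tensor shapes and function name are assumptions for illustration.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction (MaxSim) score for one query/document pair.

    query_emb: (num_query_tokens, dim) L2-normalized query multi-vector.
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document multi-vector.
    """
    # Similarity between every query token and every document token/patch.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    # Each query token keeps its best-matching document vector; sum over query tokens.
    return sim.max(dim=1).values.sum()
```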
📚 Documentation
Model Description
Read the 🤗 transformers model card: https://huggingface.co/docs/transformers/en/model_doc/colqwen2.
Model Training
Dataset
Our training dataset consists of 127,460 query-page pairs. It is composed of train sets from openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). The training set is fully English by design, which allows us to study zero-shot generalization to non-English languages. We explicitly ensure that no multi-page PDF document is used in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available

model_name = "vidore/colqwen2-v1.0-hf"

# Load the model in bfloat16 and use FlashAttention 2 when available.
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Your inputs (dummy images here; replace them with your document pages).
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Preprocess the inputs and move them to the model's device.
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward passes: one for the image corpus, one for the queries.
with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# Score every query against every image with late interaction.
scores = processor.score_retrieval(query_embeddings, image_embeddings)
```
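The resulting `scores` pairs every query with every image. Assuming it is a `(num_queries, num_images)` tensor where higher values mean better matches (as in colpali-engine), you can rank pages like this:

```python
# scores: (num_queries, num_images); higher means a better match.
for i, query in enumerate(queries):
    best = scores[i].argmax().item()
    print(f"{query!r} -> image {best} (score: {scores[i, best].item():.2f})")
```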
🔧 Technical Details
Limitations
- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less-represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support (see the pooling sketch below).
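As a hedged illustration of that adaptation effort (an assumption about downstream usage, not part of this model card): frameworks that store a single vector per document can still be used by pooling the multi-vector output, at the cost of losing the late-interaction signal.

```python
import torch

def pool_to_single_vector(multi_vec: torch.Tensor) -> torch.Tensor:
    """Collapse a (num_tokens, dim) multi-vector embedding into one dense vector.

    Mean pooling trades retrieval quality for compatibility with
    single-vector stores; a common workaround, not the model's API.
    """
    pooled = multi_vec.mean(dim=0)
    return pooled / pooled.norm()  # re-normalize for cosine similarity
```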
📄 License
ColQwen2's vision language backbone model (Qwen2-VL) is under the apache-2.0 license. ColQwen2 inherits this apache-2.0 license.
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original work as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```