🚀 IEIT-Systems ColQwen2-7B: Visual Retriever based on Qwen2-VL-7B-Instruct with ColBERT strategy
This project presents IEIT-Systems ColQwen2-7B, a visual retriever built on the Qwen2-VL-7B-Instruct model with the ColBERT strategy. It efficiently indexes documents from their visual features, offering a novel approach to visual document retrieval.
🚀 Quick Start
To get started, make sure `colpali-engine` is installed from source or at a version greater than 0.3.4, and that your `transformers` version is greater than 4.46.1. You can install the necessary packages with:

```bash
pip install git+https://github.com/illuin-tech/colpali
```
Here is a basic usage example:
```python
import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "yydxlv/colqwen2-7b-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColQwen2Processor.from_pretrained("yydxlv/colqwen2-7b-v1.0")

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
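The resulting `scores` tensor has one row per query and one column per image, with higher values indicating stronger relevance. A minimal ranking step, continuing from the snippet above, might look like this:

```python
# `scores` has shape (num_queries, num_images); higher means more relevant.
best = scores.argmax(dim=1)
for i, query in enumerate(queries):
    print(f"{query!r} -> image {best[i].item()} (score={scores[i, best[i]].item():.2f})")
```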
✨ Features
- Novel Architecture and Strategy: ColQwen is based on a novel model architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features.
- Multi-vector Representations: It extends Qwen2-VL-7B to generate ColBERT-style multi-vector representations of text and images.
- Dynamic Image Resolution: The model accepts images at their native resolution without resizing them, preserving aspect ratio. The maximum resolution is capped so that at most 768 image patches are created.
📦 Installation
To use this model, install the `colpali-engine` package from source:

```bash
pip install git+https://github.com/illuin-tech/colpali
```

Make sure your `transformers` version is greater than 4.46.1.
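To confirm the environment meets these requirements before loading the model, a quick standard-library check can help (a minimal sketch):

```python
from importlib.metadata import version

# The model card requires colpali-engine > 0.3.4 and transformers > 4.46.1.
print("colpali-engine:", version("colpali-engine"))
print("transformers:", version("transformers"))
```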
📚 Documentation
Version Specificity
This model accepts images at their native resolution and does not resize them, unlike ColPali, which changes the aspect ratio. The maximum resolution is capped so that at most 768 image patches are created. Experiments show that a larger patch budget yields clear improvements, at the cost of increased memory requirements.
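To get a feel for the 768-patch budget, the back-of-the-envelope sketch below estimates the patch count from image size. The ~28-pixel effective patch size is an assumption based on the Qwen2-VL architecture (14x14 patches merged 2x2), not something this card specifies:

```python
# Rough estimate of visual tokens for a page image, assuming ~28x28-pixel
# effective patches (Qwen2-VL's 14x14 patches after the 2x2 merge step).
def approx_patches(width: int, height: int, patch: int = 28) -> int:
    return (width // patch) * (height // patch)

print(approx_patches(1024, 1024))  # ~1296 patches -> exceeds 768, image is downscaled
print(approx_patches(768, 768))    # ~729 patches  -> fits within the budget
```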
This version is trained with `colpali-engine==0.3.4`. The training data is the same as the ColPali data described in the paper, and fine-tuning was also carried out with the ShareGPT4V dataset (https://sharegpt4v.github.io/).
Model Training
Parameters
We train models using low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers of the language model, as well as on the final, randomly initialized projection layer, using a `paged_adamw_8bit` optimizer.

Training is carried out on an 8xA100 GPU setup with distributed data parallelism (via accelerate). The learning rate is 5e-4 with linear decay and 1% warmup steps, the per-device batch size is 32, and the data is in `bfloat16` format.
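For readers who want to set up a comparable configuration, a hedged sketch of the corresponding peft config follows. The exact `target_modules` used for this checkpoint are not listed on the card, so the module names below are illustrative:

```python
from peft import LoraConfig

# Illustrative LoRA setup matching the hyperparameters described above
# (r=32, alpha=32 on the language-model transformer layers). The target
# module names are an assumption, not taken from this card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="FEATURE_EXTRACTION",
)
```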
🔧 Technical Details
The model is an extension of Qwen2-VL-7B and generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.
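The ColBERT-style score between a query and a document is the sum, over query tokens, of each token's maximum similarity to any document token ("late interaction"). A minimal PyTorch sketch of this MaxSim scoring, equivalent in spirit to what `processor.score_multi_vector` computes over batches:

```python
import torch

def maxsim(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query/document pair.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    """
    sim = query_emb @ doc_emb.T         # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed
```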
📄 License
ColQwen2's vision-language backbone (Qwen2-VL) is released under the Apache 2.0 license. This fine-tuned adapter is under the CC BY-NC 4.0 license; the model is therefore restricted to research use at the moment.
📚 Citation
If you use models from this organization in your research, please cite the original paper as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
Developed by: IEIT Systems
⚠️ Important Note
- Focus: The model primarily targets PDF-type documents and high-resource languages, which may limit its generalization to other document types or under-represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism, which may require engineering effort to adapt to widely used vector-retrieval frameworks that lack native multi-vector support; a common mitigation is sketched below.
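One common workaround when a vector database only supports single-vector search is to pool each multi-vector embedding into one vector for coarse first-stage retrieval, then re-rank the shortlist with full MaxSim. A hedged sketch of that idea; the pooling strategy is a generic technique, not something prescribed by this model:

```python
import torch

def pool_embedding(multi_vec: torch.Tensor) -> torch.Tensor:
    """Collapse a (num_tokens, dim) multi-vector embedding into a single
    dim-sized vector by mean pooling, for use with single-vector indexes.
    Expect a recall drop versus full late interaction; re-rank the top-k
    candidates with MaxSim to recover quality."""
    return multi_vec.mean(dim=0)
```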
| Property | Details |
|---|---|
| Model Type | Visual retriever based on Qwen2-VL-7B-Instruct with the ColBERT strategy |
| Training Data | vidore/colpali_train_set, ShareGPT4V (https://sharegpt4v.github.io/) |
| Base Model | Qwen/Qwen2-VL-7B-Instruct |
| Library Name | peft |
| Pipeline Tag | visual-document-retrieval |
| License | CC BY-NC 4.0 |