Colpali-v1.2-hf Open-Source Visual Retrieval Model - Efficiently Index Documents Based on Visual Features

Colpali V1.2 Hf

Developed by vidore

ColPali is a visual retrieval model based on PaliGemma-3B and the ColBERT strategy, designed for efficient document indexing through visual features

Text-to-Image

Transformers

English #Document Visual Retrieval #Multi-vector Representation #PDF Document Processing

Downloads 5,075

Release Time: 11/28/2024

Model Overview

ColPali is an innovative vision-language model that extends PaliGemma-3B and adopts a ColBERT-style multi-vector representation strategy to efficiently generate joint representations of text and images for document retrieval tasks.

Model Features

Multi-vector Representation

Follows the ColBERT strategy: produces one embedding per text token and per image patch, so relevance is scored through token-to-patch interactions

Efficient Retrieval

Indexes documents through visual features for efficient document retrieval

Vision-Language Joint Modeling

Combines a SigLIP vision encoder with the Gemma-2B language model, the two components of PaliGemma-3B

LoRA Fine-tuning

Uses Low-Rank Adaptation (LoRA) for efficient fine-tuning, reducing training costs

Model Capabilities

Visual Document Retrieval

Multimodal Representation Learning

Cross-modal Matching

Document Content Understanding

Use Cases

Document Management

Enterprise Document Retrieval

Quickly locate relevant content in company internal documents based on queries

Academic Literature Search

Retrieve relevant information in academic papers through visual features

Knowledge Management

Knowledge Base Construction

Build searchable knowledge base systems for organizations

🚀 ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali is a model leveraging a novel architecture and training strategy based on Vision Language Models (VLMs). It efficiently indexes documents using their visual features, offering a powerful solution for visual document retrieval.

⚠️ Important Note

This version of ColPali should be loaded with the transformers 🤗 release, not with colpali-engine. It was converted using the convert_colpali_weights_to_hf.py script from the vidore/colpali-v1.2-merged checkpoint.

✨ Features

Novel Architecture: Based on PaliGemma-3B, it generates ColBERT-style multi-vector representations of text and images.
Multilingual Potential: Although trained on English data, multilingual data is present in the pretraining corpus of the language model, allowing for potential zero-shot generalization to non-English languages.

📦 Installation

The original README lists no installation steps. The usage example below imports torch, PIL (Pillow), and transformers, so those packages need to be available.

💻 Usage Examples

Basic Usage

import torch
from PIL import Image

from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings.embeddings, image_embeddings.embeddings)
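
The `scores` tensor above is expected to hold one row per query and one column per image. To turn it into a ranking, a minimal sketch could look like the following. The helper name `top_pages` and the example values are hypothetical, and the nested list stands in for something like `scores.float().tolist()`.

```python
def top_pages(score_rows, k=1):
    """Return, for each query, the indices of the k highest-scoring pages."""
    ranked = []
    for row in score_rows:
        # Sort page indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:k])
    return ranked


# Stand-in for the real scores: 2 queries x 2 pages, made-up values.
example_scores = [[12.5, 9.1], [7.3, 14.8]]
best = top_pages(example_scores, k=1)
print(best)
```

Plain lists keep the sketch independent of the tensor dtype. On large score matrices, a tensor operation such as `torch.topk` would likely be the better choice.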

📚 Documentation

Model Description

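The model was built iteratively, starting from an off-the-shelf SigLIP model. The authors first fine-tuned it to create BiSigLIP, then fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. ColPali adds the ColBERT strategy to this setup, computing interactions between text tokens and image patches.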

Model Training

Dataset

Our training dataset of 127,460 query-page pairs consists of train sets from openly available academic datasets (63%) and a synthetic dataset (37%). The synthetic dataset is made up of pages from web-crawled PDF documents, augmented with pseudo-questions generated by a VLM (Claude-3 Sonnet). The training set is fully English, which makes it possible to study zero-shot generalization to non-English languages. A validation set with 2% of the samples is used for hyperparameter tuning.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and may occur during PaliGemma-3B's multimodal training.

Parameters

All models are trained for 1 epoch on the train set. We train models in bfloat16 format, use low-rank adapters (LoRA) with alpha=32 and r=32 on the transformer layers from the language model, as well as the final randomly initialized projection layer. We use a paged_adamw_8bit optimizer. Training is done on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
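
To make these numbers concrete, the sketch below estimates the number of optimizer steps in one epoch and the shape of the learning-rate schedule. It rests on several assumptions not stated in the text: that 32 is the global batch size, that the 2% validation split is removed before training, and that "linear decay with warmup" follows the common linear-warmup then linear-decay-to-zero formula. The constant and function names are illustrative only.

```python
import math

# Values stated in the text above.
TOTAL_PAIRS = 127_460
VAL_FRACTION = 0.02
BATCH_SIZE = 32            # assumption: global batch size, not per GPU
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.025
LORA_ALPHA, LORA_R = 32, 32

train_pairs = round(TOTAL_PAIRS * (1 - VAL_FRACTION))
total_steps = math.ceil(train_pairs / BATCH_SIZE)   # one epoch
warmup_steps = round(total_steps * WARMUP_FRACTION)
lora_scale = LORA_ALPHA / LORA_R   # standard LoRA scaling factor alpha / r


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    decay_span = max(1, total_steps - warmup_steps)
    return PEAK_LR * max(0.0, (total_steps - step) / decay_span)


print(train_pairs, total_steps, warmup_steps, lora_scale)
print(learning_rate(0), learning_rate(warmup_steps), learning_rate(total_steps))
```

With alpha equal to r, the LoRA scaling factor is 1, so the low-rank update is added to the frozen weights without rescaling.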

Resources

The ColPali arXiv paper can be found at https://arxiv.org/abs/2407.01449. 📄
The official blog post detailing ColPali can be found here. 📝
The original model implementation code for the ColPali model and for the colpali-engine package can be found here. 🌎
Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity maps generation can be found here. 📚

Limitations

Focus: The model mainly focuses on PDF-type documents and high-resource languages, which may limit its generalization to other document types or less represented languages.
Support: The model relies on multi-vector retrieval from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks without native multi-vector support.
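
One possible way to work around the missing multi-vector support is to store every patch embedding as its own row, tagged with a page identifier, and aggregate the per-row similarities back to page level at query time. The sketch below illustrates that layout with a brute-force scan in plain Python. It is not how any particular vector database does it; the function names, page identifiers, and toy 2-D vectors are all hypothetical.

```python
from collections import defaultdict


def build_flat_index(pages):
    """Flatten {page_id: [patch_vec, ...]} into (page_id, vector) rows,
    the shape a single-vector store would hold."""
    return [(page_id, vec) for page_id, vecs in pages.items() for vec in vecs]


def search(flat_index, query_vecs):
    """For each query token, keep the best dot product per page, then sum
    those maxima over tokens. Brute force; a real store would use ANN search."""
    totals = defaultdict(float)
    for q in query_vecs:
        best = {}
        for page_id, vec in flat_index:
            sim = sum(a * b for a, b in zip(q, vec))
            if page_id not in best or sim > best[page_id]:
                best[page_id] = sim
        for page_id, sim in best.items():
            totals[page_id] += sim
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy data: two pages with two patch vectors each (made-up values).
pages = {
    "report-p1": [[0.7, 0.3], [0.1, 0.6]],
    "report-p2": [[0.2, 0.2], [0.3, 0.1]],
}
index = build_flat_index(pages)
results = search(index, [[1.0, 0.0], [0.0, 1.0]])
print(results)
```

The trade-off is index size: each page contributes as many rows as it has patch embeddings, and each query issues one lookup per query token.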

🔧 Technical Details

The model is based on a novel architecture and training strategy. It uses the ColBERT strategy to compute interactions between text tokens and image patches, which significantly improves performance compared to BiPali. The training dataset is carefully designed to prevent evaluation contamination and enable zero-shot generalization studies.
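
The interaction between text tokens and image patches is usually described as ColBERT's "MaxSim" operator: each query token is matched to its most similar patch, and the per-token maxima are summed. A minimal sketch in plain Python follows, using made-up 2-D vectors rather than real model outputs. It illustrates the scoring rule only and is not the batched tensor implementation behind `score_retrieval`.

```python
def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query token vector, take its best
    dot-product match among the page's patch vectors, then sum over tokens."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * pi for qi, pi in zip(q, p)) for p in page_vecs)
    return total


# Toy embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]        # 2 query tokens
page_a = [[0.9, 0.1], [0.2, 0.8]]       # 2 patches, each close to one token
page_b = [[0.5, 0.5], [0.4, 0.4]]       # 2 patches, no strong match

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)
```

Here `page_a` scores higher because each query token finds a well-aligned patch. A single pooled vector per page would blur that distinction.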

📄 License

ColPali's vision-language backbone model (PaliGemma) is released under the Gemma license, as specified in its model card. ColPali inherits this license.

Contact

Manuel Faysse: manuel.faysse@illuin.tech
Hugues Sibille: hugues.sibille@illuin.tech
Tony Wu: tony.wu@illuin.tech

Citation

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}

Property	Details
Library Name	transformers
Tags	colpali
License	gemma
Datasets	vidore/colpali_train_set
Language	en
Base Model	vidore/colpaligemma-3b-pt-448-base
New Version	vidore/colpali-v1.3-hf
Pipeline Tag	visual-document-retrieval
