ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy
ColPali is a model leveraging a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images. The model was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository. The Hugging Face 🤗 transformers implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan).

🚀 Quick Start
⚠️ Important Note
This version of ColPali should be loaded with the 🤗 transformers release, not with `colpali-engine`. It was converted using the `convert_colpali_weights_to_hf.py` script from the `vidore/colpali-v1.3-merged` checkpoint.
✨ Features
- Based on Vision Language Models (VLMs), ColPali can efficiently index documents from their visual features.
- It extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images.
📦 Installation
ColPali is integrated natively into the 🤗 transformers library, so no separate package is required.
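A minimal setup sketch, covering only the dependencies used in the examples below. The version bound is an assumption based on when the ColPali classes landed in transformers, and is worth double-checking:

```bash
# Assumption: the ColPali classes require transformers v4.46 or later.
pip install "transformers>=4.46" torch pillow
```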
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.3-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs: document page images and text queries
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Preprocess the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward passes: one for the images, one for the queries
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction score of every query against every image
scores = processor.score_retrieval(query_embeddings.embeddings, image_embeddings.embeddings)
```
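The scoring call returns one score per (query, image) pair; assuming a tensor of shape (num_queries, num_images), the best-matching page for each query can be read off with a row-wise argmax:

```python
# `scores` holds one late-interaction score per (query, image) pair;
# higher means a better match.
best_image_per_query = scores.argmax(dim=1)
```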
📚 Documentation
Model Description
Read the 🤗 transformers model card: https://huggingface.co/docs/transformers/en/model_doc/colpali.
Model Training
Dataset
Our training dataset consists of 127,460 query-page pairs. It includes train sets from openly available academic datasets (63%) and a synthetic dataset composed of pages from web-crawled PDF documents, augmented with VLM-generated (Claude 3 Sonnet) pseudo-questions (37%). The training set is fully English, allowing us to study zero-shot generalization to non-English languages. We ensure no multi-page PDF document is used in both ViDoRe and the train set to prevent evaluation contamination. A validation set is created with 2% of the samples for hyperparameter tuning.
Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and may occur during PaliGemma-3B's multimodal training.
Parameters
All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha = 32` and `r = 32` on the transformer layers of the language model as well as on the final, randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
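As a reading aid, the hyperparameters above translate into roughly the following `peft`/`transformers` configuration. This is a hedged sketch, not the official training script (which lives in the `colpali-engine` repository); in particular, the `target_modules` and `modules_to_save` names are assumptions:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers, plus the final projection
# layer trained from scratch. Module names below are assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    modules_to_save=["custom_text_proj"],  # assumed name of the randomly initialized projection
    task_type="FEATURE_EXTRACTION",
)

training_args = TrainingArguments(
    output_dir="./colpali-training",    # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=4,      # 8 GPUs x 4 samples = global batch size of 32
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,                 # 2.5% warmup steps
    bf16=True,                          # bfloat16 training
    optim="paged_adamw_8bit",
)
```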
🔧 Technical Details
The model is built on Vision Language Models (VLMs) with a novel architecture and training strategy: it extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images, which are compared using a late-interaction (MaxSim) scoring mechanism.
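To make the late-interaction mechanism concrete, here is a minimal sketch of ColBERT-style MaxSim scoring between one query and one document, assuming both are given as token-embedding matrices. This is an illustration only, not the library's actual implementation (which is exposed via `processor.score_retrieval`):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction between one query and one document.

    query_emb: (num_query_tokens, dim) multi-vector query representation.
    doc_emb:   (num_doc_tokens, dim) multi-vector document (page) representation.
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token (MaxSim),
    # then sum these maxima over the query tokens to get one scalar score.
    return sim.max(dim=1).values.sum()
```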
📄 License
ColPali's vision-language backbone model (PaliGemma) is under the `gemma` license, as specified in its model card. ColPali inherits this `gemma` license.
Resources
- The ColPali arXiv paper can be found here.
- The official blog post detailing ColPali can be found here.
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found here.
- Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity map generation can be found here.
Limitations
- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less-represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
Contact
- Manuel Faysse: manuel.faysse@illuin.tech
- Hugues Sibille: hugues.sibille@illuin.tech
- Tony Wu: tony.wu@illuin.tech
Citation
If you use any datasets or models from this organization in your research, please cite the original paper as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```