ReT-CLIP-ViT-L-14 Open Source Model - Supports Multimodal Queries and Enables Fine-Grained Document Retrieval

Developed by aimagelab

ReT is a retrieval method that supports multimodal queries and multimodal documents. It achieves fine-grained retrieval by fusing multi-level representations from vision and text backbone networks.

Multimodal Fusion

Transformers

Open Source License: Apache-2.0 #Multimodal Document Retrieval #Recurrence-Enhanced Transformer #Cross-Level Feature Fusion

Release Time: 3/25/2025

Model Overview

ReT employs Transformer-based recurrent units and sigmoid gating mechanisms to selectively regulate cross-level and cross-modal information flow. It can independently process multimodal queries and documents to generate latent token sets for similarity computation.
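
The gating idea can be illustrated with a minimal sketch. It is not the ReT implementation: the actual model uses Transformer-based recurrent cells with learned gates, whereas this toy version takes the gate logits as fixed inputs and works on plain Python lists. The names `gated_update`, `fuse_layers`, `forget_logits`, and `input_logits` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a logit into the (0, 1) range so it can act as a gate."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(state, candidate, forget_logits, input_logits):
    """One LSTM-style gated step, applied elementwise.

    new_state = sigmoid(f) * state + sigmoid(i) * candidate
    The forget gate decides how much of the carried state survives;
    the input gate decides how much of the new layer's features enter.
    """
    return [
        sigmoid(f) * s + sigmoid(i) * c
        for s, c, f, i in zip(state, candidate, forget_logits, input_logits)
    ]


def fuse_layers(layer_features, forget_logits, input_logits):
    """Carry a state across backbone layers, from first to last.

    Assumption: the gate logits are fixed here. In a trained model they
    would be predicted from the inputs by learned parameters.
    """
    state = [0.0] * len(layer_features[0])
    for feats in layer_features:
        state = gated_update(state, feats, forget_logits, input_logits)
    return state


# Toy features from three "layers", each with two dimensions.
layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_layers(layer_features, forget_logits=[0.0, 0.0], input_logits=[0.0, 0.0])
print(fused)
```

With all logits at zero, every gate equals 0.5, so each step averages the carried state with the new layer's features. Raising a forget logit would preserve more of the earlier layers, and raising an input logit would favor the current one.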

Model Features

Multi-Level Feature Fusion

Utilizes multi-level representations from vision and text backbone networks, not just final-layer features

Recurrent Gating Mechanism

LSTM-inspired sigmoid gating mechanism dynamically regulates cross-modal information flow

Independent Multimodal Processing

Encodes queries and documents independently of each other; each may contain both image and text content

Fine-Grained Similarity Computation

Generates latent token sets to support fine-grained late-interaction similarity matching
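
A late-interaction score compares two token sets token by token instead of collapsing each side to a single vector. The sketch below assumes a ColBERT-style sum-of-max dot product; the source does not state the exact formula ReT uses, so treat this as one plausible instance. The function names and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens, doc_tokens):
    """Assumed scoring rule: for each query token, take its best-matching
    document token (max dot product), then sum over query tokens."""
    return sum(
        max(dot(q, d) for d in doc_tokens)
        for q in query_tokens
    )


# Toy latent token sets (two tokens, two dimensions each).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

score_a = late_interaction_score(query_tokens, doc_a)
score_b = late_interaction_score(query_tokens, doc_b)
print(score_a, score_b)
```

Because each query token is matched separately, a document that covers only part of the query scores lower than one that covers all of it, which is what makes the comparison fine-grained.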

Model Capabilities

Multimodal Document Retrieval

Image-Text Joint Representation

Cross-Modal Similarity Computation

Vision-Language Feature Fusion

Use Cases

Information Retrieval

Cross-Modal Knowledge Retrieval

Retrieve documents containing relevant answers through image-text hybrid queries

Effectiveness validated on a customized version of the M2KR benchmark

Question Answering Systems

Visual Question Answering Support

Retrieves documents that contain Q&A pairs and the corresponding images for use by VQA systems

Supports visual QA scenarios such as OKVQA and E-VQA

🚀 Model Card for ReT - Multimodal Document Retrieval

ReT is a cutting-edge solution for multimodal document retrieval, supporting both multimodal queries and documents. Unlike traditional methods relying solely on final-layer features of vision-and-language backbones, ReT uses a Transformer-based recurrent cell. It leverages multi-level representations from different layers of visual and textual backbones. The model features sigmoidal gates, inspired by LSTM design, to selectively control information flow between layers and modalities. ReT processes multimodal queries and documents independently, generating latent tokens for fine-grained late interaction similarity computation. It is designed to handle images and text in both queries and documents, and has been trained and evaluated on a customized version of the challenging M2KR benchmark.
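
Because queries and documents are encoded independently, passage token sets can be computed once and stored, leaving only the query to be encoded at search time. The following is a rough sketch of that split, not code from the ReT repository: `PassageIndex` and its methods are hypothetical, the token sets are toy values standing in for encoder output, and the scoring rule is an assumed sum-of-max dot product.

```python
from typing import Dict, List, Tuple

TokenSet = List[List[float]]


def score(query: TokenSet, passage: TokenSet) -> float:
    """Assumed late-interaction rule: sum over query tokens of the
    maximum dot product against any passage token."""
    return sum(
        max(sum(a * b for a, b in zip(q, p)) for p in passage)
        for q in query
    )


class PassageIndex:
    """Hypothetical in-memory store of precomputed passage token sets."""

    def __init__(self) -> None:
        self._store: Dict[str, TokenSet] = {}

    def add(self, passage_id: str, tokens: TokenSet) -> None:
        # Offline step: store the passage's tokens, computed once.
        self._store[passage_id] = tokens

    def search(self, query: TokenSet, k: int = 1) -> List[Tuple[str, float]]:
        # Online step: score the query against every stored passage
        # and return the k best (id, score) pairs, highest first.
        ranked = sorted(
            ((pid, score(query, toks)) for pid, toks in self._store.items()),
            key=lambda item: item[1],
            reverse=True,
        )
        return ranked[:k]


index = PassageIndex()
index.add("p1", [[1.0, 0.0], [0.0, 1.0]])
index.add("p2", [[0.0, 1.0], [0.0, 1.0]])
index.add("p3", [[-1.0, 0.0], [0.0, -1.0]])

top = index.search([[1.0, 0.0], [0.0, 1.0]], k=2)
print(top)
```

This brute-force scan is only meant to show the offline and online phases; a real deployment over a large corpus would need an approximate or batched search.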

🚀 Quick Start

Prerequisites

Follow the instructions in the repository (https://github.com/aimagelab/ReT) to install the required environment.

Usage Example

import torch
from src.models import RetrieverModel, RetModel  # module from the ReT repository

# Run on GPU when available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# QUERY: encode a [text, image path] pair with the query-side model
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])


# PASSAGE: encode a [text, image path] pair with the passage-side model
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''  # empty string: no image is supplied for this passage

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])

✨ Features

Multimodal Support: ReT can handle both images and text in queries and documents.
Multi-Level Representation: It leverages multi-level features from different layers of visual and textual backbones.
Selective Information Flow: Sigmoidal gates inspired by LSTM control information flow between layers and modalities.

📦 Installation

Refer to the repository for detailed installation instructions.

📚 Documentation

Model Sources

Repository: https://github.com/aimagelab/ReT
Paper: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval (CVPR 2025)

Model Details

Property	Details
Model Type	ReT for multimodal document retrieval
Base Model	openai/clip-vit-large-patch14
Training Datasets	aimagelab/ReT-M2KR
License	apache-2.0
Pipeline Tag	visual-document-retrieval

📄 License

This project is licensed under the Apache 2.0 license.

📄 Citation

BibTeX:

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
