🚀 Model Card for ReT
ReT is a novel approach for multimodal document retrieval that supports both multimodal queries and multimodal documents. Unlike existing methods that only use features from the final layer of vision-and-language backbones, ReT employs a Transformer-based recurrent cell to leverage multi-level representations from different layers of both the visual and textual backbones. The cell features sigmoidal gates inspired by the LSTM design, which selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently, producing sets of latent tokens that are used for fine-grained late-interaction similarity computation. It is designed to handle images and text in both queries and documents. It has been trained and evaluated on a custom version of the challenging M2KR benchmark, with the following modifications: MSMARCO has been excluded as it contains no images, and the documents from OVEN, InfoSeek, E-VQA, and OKVQA have been enriched with images.
✨ Features
- Novel approach for multimodal document retrieval.
- Leverages multi-level representations from different layers of the visual and textual backbones.
- Uses sigmoidal gates to control information flow across layers and modalities.
- Processes multimodal queries and documents independently for fine-grained late-interaction similarity computation.
📦 Installation
Follow the instructions in the repository to set up the required environment.
💻 Usage Examples
Basic Usage
```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pre-trained retriever and extract features for a multimodal query.
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-G-14', device_map=device)
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()

q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'
ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # (batch_size, num_latent_tokens, embed_dim)

# Extract features for a text-only passage: pass an empty string as the image.
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''
ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # (batch_size, num_latent_tokens, embed_dim)
```
📚 Documentation
Model Sources
📄 License
This project is licensed under the Apache 2.0 license.
📜 Citation
BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```