Model Card for ReT
ReT is a novel solution for multimodal document retrieval that supports both multimodal queries and multimodal documents. Unlike existing methods that only use features from the final layer of vision-and-language backbones, ReT employs a Transformer-based recurrent cell to leverage multi-level representations from different layers of both the visual and textual backbones. The cell features sigmoidal gates, inspired by the LSTM design, that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently, producing sets of latent tokens that are compared through fine-grained late-interaction similarity. The model is designed to handle images and text on both the query and document side. To this end, it has been trained and evaluated on a custom version of the challenging M2KR benchmark with the following modifications: MSMARCO has been excluded, as it lacks images, and the documents from OVEN, InfoSeek, E-VQA, and OKVQA have been enriched with images.
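To make the gating idea concrete, here is a minimal, illustrative sketch of an LSTM-style sigmoidal gate applied recurrently over per-layer backbone features. `GatedFusionCell` and all names and dimensions in it are hypothetical and do not mirror the actual ReT code:

```python
import torch
import torch.nn as nn

class GatedFusionCell(nn.Module):
    """Toy recurrent cell: at each backbone layer, sigmoidal gates (as in an
    LSTM) decide how much of the carried state to keep and how much of the
    new layer's features to let in. Purely illustrative, not ReT's code."""

    def __init__(self, dim: int):
        super().__init__()
        self.forget_gate = nn.Linear(2 * dim, dim)  # gates the carried state
        self.input_gate = nn.Linear(2 * dim, dim)   # gates the incoming features
        self.candidate = nn.Linear(2 * dim, dim)    # proposes the state update

    def forward(self, state: torch.Tensor, layer_feats: torch.Tensor) -> torch.Tensor:
        both = torch.cat([state, layer_feats], dim=-1)
        f = torch.sigmoid(self.forget_gate(both))
        i = torch.sigmoid(self.input_gate(both))
        return f * state + i * torch.tanh(self.candidate(both))

# Run the cell over per-layer features from a (stand-in) backbone.
cell = GatedFusionCell(dim=64)
state = torch.zeros(1, 8, 64)  # running set of latent tokens
for layer_feats in [torch.randn(1, 8, 64) for _ in range(4)]:
    state = cell(state, layer_feats)
print(state.shape)  # torch.Size([1, 8, 64])
```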
Features
- Novel approach to multimodal document retrieval, supporting both multimodal queries and documents.
- A Transformer-based recurrent cell that leverages multi-level representations from the visual and textual backbones.
- Sigmoidal gates, inspired by the LSTM design, that control the flow of information between layers and modalities.
- Independent processing of queries and documents into latent tokens for fine-grained late-interaction similarity computation (see the sketch after this list).
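For reference, the sketch below shows the standard ColBERT-style MaxSim formulation of late interaction, the general scheme that fine-grained late-interaction similarity follows; ReT's exact scoring may differ in detail:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: match each query token to its most similar
    document token and sum the maxima.
    q: [num_q_tokens, dim], d: [num_d_tokens, dim], both L2-normalized."""
    sim = q @ d.T                        # [num_q_tokens, num_d_tokens]
    return sim.max(dim=-1).values.sum()  # max over doc tokens, sum over query tokens

# Random stand-ins for two sets of latent tokens.
q = F.normalize(torch.randn(32, 128), dim=-1)
d = F.normalize(torch.randn(32, 128), dim=-1)
print(late_interaction_score(q, d))
```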
Installation
Follow the instructions in the repository to set up the required environment.
Usage Examples
Basic Usage
```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the full retriever and pull out the query-side encoder.
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-H-14', device_map=device)
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()

# Encode a multimodal query (text + image).
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'
ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # one set of latent tokens per query

# Switch to the passage-side encoder.
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

# Encode a text-only passage (an empty string means no image).
p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''
ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # one set of latent tokens per passage
```
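To turn the two feature sets into relevance scores, apply the late-interaction scoring sketched earlier in batched form. The snippet below is a hypothetical follow-up, not part of the model card's API: it assumes the tensors returned by `get_ret_features` have shape `[batch, num_tokens, dim]` and uses normalized random stand-ins in their place; consult the repository for the scoring code ReT actually uses.

```python
import torch
import torch.nn.functional as F

# Stand-ins with the assumed shape [batch, num_tokens, dim]; replace them
# with the query/passage features returned by get_ret_features above.
q_feats = F.normalize(torch.randn(1, 32, 128), dim=-1)  # 1 query
p_feats = F.normalize(torch.randn(4, 32, 128), dim=-1)  # 4 candidate passages

sim = torch.einsum('qnd,pmd->qpnm', q_feats, p_feats)  # token-to-token cosine similarities
scores = sim.max(dim=-1).values.sum(dim=-1)            # MaxSim over passage tokens, sum over query tokens
print(scores.shape)       # torch.Size([1, 4]): one score per (query, passage) pair
print(scores.argmax(-1))  # index of the best-matching passage per query
```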
Documentation
Model Sources
License
This project is licensed under the Apache-2.0 license.
Citation
BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
Model Information
| Property | Details |
|----------|---------|
| Library Name | transformers |
| Base Model | laion/CLIP-ViT-H-14-laion2B-s32B-b79K |
| Datasets | aimagelab/ReT-M2KR |
| Pipeline Tag | visual-document-retrieval |