🚀 Model Card for ReT
ReT is a novel approach for multimodal document retrieval that supports both multimodal queries and multimodal documents. Unlike existing methods that only use features from the final layer of vision-and-language backbones, ReT employs a Transformer-based recurrent cell to leverage multi-level representations from different layers of both the visual and textual backbones. The cell features sigmoidal gates inspired by the LSTM design, which selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently, producing sets of latent tokens that are used for fine-grained late-interaction similarity computation. It is designed to handle images and text in both queries and documents. It has been trained and evaluated on a custom version of the challenging M2KR benchmark, with the following modifications: MSMARCO has been excluded as it contains no images, and the documents from OVEN, InfoSeek, E-VQA, and OKVQA have been enriched with images.
✨ Features
- Novel approach for multimodal document retrieval.
- Leverages multi-level representations from different layers of the visual and textual backbones.
- Uses sigmoidal gates to control information flow across layers and modalities.
- Processes multimodal queries and documents independently for fine-grained late-interaction similarity computation.
📦 Installation
Follow the instructions in the repository to set up the required environment.
💻 Usage Examples
Basic Usage
```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pre-trained retriever and extract features for a multimodal query.
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-G-14', device_map=device)
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()

q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'
ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # (batch_size, num_latent_tokens, embed_dim)

# Extract features for a text-only passage: pass an empty string as the image.
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''
ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # (batch_size, num_latent_tokens, embed_dim)
```
📚 Documentation
Model Sources
📄 License
This project is licensed under the Apache 2.0 license.
📜 Citation
BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```