Model Card for ReT
ReT is a novel solution for multimodal document retrieval that supports both multimodal queries and multimodal documents. Unlike existing methods that only use features from the final layer of vision-and-language backbones, ReT employs a Transformer-based recurrent cell to leverage multi-level representations from different layers of both the visual and textual backbones. The cell features sigmoidal gates, inspired by the LSTM design, that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently, producing sets of latent tokens that are compared through fine-grained late-interaction similarity. The model is designed to handle images and text on both the query and document side. To this end, it has been trained and evaluated on a custom version of the challenging M2KR benchmark with the following modifications: MSMARCO has been excluded, as it lacks images, and the documents from OVEN, InfoSeek, E-VQA, and OKVQA have been enriched with images.
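To make the gating idea concrete, here is a minimal, illustrative sketch of an LSTM-style sigmoidal gate applied recurrently over per-layer backbone features. `GatedFusionCell` and all names and dimensions in it are hypothetical and do not mirror the actual ReT code:

```python
import torch
import torch.nn as nn

class GatedFusionCell(nn.Module):
    """Toy recurrent cell: at each backbone layer, sigmoidal gates (as in an
    LSTM) decide how much of the carried state to keep and how much of the
    new layer's features to let in. Purely illustrative, not ReT's code."""

    def __init__(self, dim: int):
        super().__init__()
        self.forget_gate = nn.Linear(2 * dim, dim)  # gates the carried state
        self.input_gate = nn.Linear(2 * dim, dim)   # gates the incoming features
        self.candidate = nn.Linear(2 * dim, dim)    # proposes the state update

    def forward(self, state: torch.Tensor, layer_feats: torch.Tensor) -> torch.Tensor:
        both = torch.cat([state, layer_feats], dim=-1)
        f = torch.sigmoid(self.forget_gate(both))
        i = torch.sigmoid(self.input_gate(both))
        return f * state + i * torch.tanh(self.candidate(both))

# Run the cell over per-layer features from a (stand-in) backbone.
cell = GatedFusionCell(dim=64)
state = torch.zeros(1, 8, 64)  # running set of latent tokens
for layer_feats in [torch.randn(1, 8, 64) for _ in range(4)]:
    state = cell(state, layer_feats)
print(state.shape)  # torch.Size([1, 8, 64])
```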
Features
- Novel approach to multimodal document retrieval, supporting both multimodal queries and documents.
- A Transformer-based recurrent cell that leverages multi-level representations from the visual and textual backbones.
- Sigmoidal gates, inspired by the LSTM design, that control the flow of information between layers and modalities.
- Independent processing of queries and documents into latent tokens for fine-grained late-interaction similarity computation (see the sketch after this list).
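For reference, the sketch below shows the standard ColBERT-style MaxSim formulation of late interaction, the general scheme that fine-grained late-interaction similarity follows; ReT's exact scoring may differ in detail:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: match each query token to its most similar
    document token and sum the maxima.
    q: [num_q_tokens, dim], d: [num_d_tokens, dim], both L2-normalized."""
    sim = q @ d.T                        # [num_q_tokens, num_d_tokens]
    return sim.max(dim=-1).values.sum()  # max over doc tokens, sum over query tokens

# Random stand-ins for two sets of latent tokens.
q = F.normalize(torch.randn(32, 128), dim=-1)
d = F.normalize(torch.randn(32, 128), dim=-1)
print(late_interaction_score(q, d))
```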
Installation
Follow the instructions in the repository to set up the required environment.
Usage Examples
Basic Usage
```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the full retriever and pull out the query-side encoder.
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-H-14', device_map=device)
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()

# Encode a multimodal query (text + image).
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'
ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # one set of latent tokens per query

# Switch to the passage-side encoder.
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

# Encode a text-only passage (an empty string means no image).
p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''
ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # one set of latent tokens per passage
```
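To turn the two feature sets into relevance scores, apply the late-interaction scoring sketched earlier in batched form. The snippet below is a hypothetical follow-up, not part of the model card's API: it assumes the tensors returned by `get_ret_features` have shape `[batch, num_tokens, dim]` and uses normalized random stand-ins in their place; consult the repository for the scoring code ReT actually uses.

```python
import torch
import torch.nn.functional as F

# Stand-ins with the assumed shape [batch, num_tokens, dim]; replace them
# with the query/passage features returned by get_ret_features above.
q_feats = F.normalize(torch.randn(1, 32, 128), dim=-1)  # 1 query
p_feats = F.normalize(torch.randn(4, 32, 128), dim=-1)  # 4 candidate passages

sim = torch.einsum('qnd,pmd->qpnm', q_feats, p_feats)  # token-to-token cosine similarities
scores = sim.max(dim=-1).values.sum(dim=-1)            # MaxSim over passage tokens, sum over query tokens
print(scores.shape)       # torch.Size([1, 4]): one score per (query, passage) pair
print(scores.argmax(-1))  # index of the best-matching passage per query
```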
Documentation
Model Sources
License
This project is licensed under the Apache-2.0 license.
Citation
BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
Model Information
| Property | Details |
|----------|---------|
| Library Name | transformers |
| Base Model | laion/CLIP-ViT-H-14-laion2B-s32B-b79K |
| Datasets | aimagelab/ReT-M2KR |
| Pipeline Tag | visual-document-retrieval |