ReT OpenCLIP ViT H 14

Developed by aimagelab
ReT is a retrieval method for multimodal queries and documents. It achieves fine-grained retrieval by integrating multi-level representations from vision and text backbone networks.
Downloads 23
Release Time: 3/25/2025

Model Overview

ReT uses Transformer-based recurrent units with sigmoid gating mechanisms to selectively regulate the flow of information across backbone levels and across modalities. It encodes multimodal queries and documents into sets of latent tokens, which are then used for similarity computation.
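
The gating idea can be illustrated with a minimal sketch. It is not the ReT implementation: the function name `gated_update`, the fixed gate logits, and the toy feature vectors are all assumptions made for illustration. In a trained model the gate logits would be produced by learned layers, and the state would be a set of tokens rather than a single short vector.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a logit into the (0, 1) range so it can act as a gate."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(prev_state, layer_features, forget_logits, input_logits):
    """Hypothetical LSTM-style update, applied element-wise.

    The forget gate decides how much of the previous state to keep;
    the input gate decides how much of the new layer's features to admit.
    """
    return [
        sigmoid(f) * h + sigmoid(i) * x
        for h, x, f, i in zip(prev_state, layer_features, forget_logits, input_logits)
    ]


# Toy features from three backbone levels, shallow to deep (made-up values).
layers = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, -1.0],
    [0.0, 2.0, 1.0],
]

state = [0.0, 0.0, 0.0]
for feats in layers:
    # Logits of 0 give gates of exactly 0.5; a real model would learn these.
    state = gated_update(state, feats, [0.0] * 3, [0.0] * 3)

print(state)
```

Because each gate lies between 0 and 1, every step is a soft mix of what was accumulated so far and what the current level contributes, which is what allows information flow to be regulated selectively.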

Model Features

Recurrence-Enhanced Architecture
Utilizes LSTM-inspired sigmoid gating mechanisms to integrate multi-level features from vision and text networks.
Multimodal Hybrid Processing
Supports arbitrary combinations of images and texts in queries and documents as input.
Fine-Grained Similarity Computation
Generates latent token sets to support fine-grained matching with late interaction.
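
Late interaction over token sets can be sketched as follows. The page does not specify the scoring function, so this example assumes a ColBERT-style MaxSim score: each query token is matched to its most similar document token and the matches are summed. The function names and toy vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens, doc_tokens):
    """Assumed MaxSim scoring: sum over query tokens of the best
    dot-product match among the document tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy latent token sets (made-up 2-dimensional vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
documents = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5], [0.5, 0.5]],
}

scores = {name: late_interaction_score(query, toks) for name, toks in documents.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Scoring token against token, rather than comparing one pooled vector per item, is what makes the match fine-grained: a document ranks well only if it contains a good counterpart for each part of the query.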

Model Capabilities

Multimodal Document Retrieval
Image-Text Hybrid Query Processing
Cross-Modal Feature Fusion

Use Cases

Information Retrieval
Visual Question Answering Document Retrieval
Retrieve relevant image-text documents based on text queries containing visual questions.
Evaluated on the custom M2KR benchmark (including datasets like OVEN/InfoSeek).
Cross-Modal Search
Image-to-Text Retrieval
Use images as query conditions to retrieve relevant documents.