ReT CLIP ViT L 14

Developed by aimagelab
ReT is a retrieval method that supports multimodal queries and multimodal documents. It achieves fine-grained retrieval by fusing multi-level representations from vision and text backbone networks.
Downloads 523
Release Time: 3/25/2025

Model Overview

ReT employs Transformer-based recurrent units and sigmoid gating mechanisms to selectively regulate cross-level and cross-modal information flow. It can independently process multimodal queries and documents to generate latent token sets for similarity computation.
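
To illustrate the gating idea described above, here is a minimal sketch of an LSTM-style sigmoid gate that blends a running state with features arriving from successive backbone layers. It is not ReT's implementation: the function name gated_update, the fixed gate logits, and the toy feature values are all hypothetical, and in a trained model the gate logits would be learned functions of the inputs.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(state, candidate, forget_logits, input_logits):
    """Element-wise gated blend of the previous state and a new candidate.

    The forget gate scales how much of the old state is kept and the
    input gate scales how much of the candidate is written (hypothetical
    helper, loosely following the LSTM cell update).
    """
    return [
        sigmoid(f) * s + sigmoid(i) * c
        for s, c, f, i in zip(state, candidate, forget_logits, input_logits)
    ]


# Toy features from three backbone layers (made-up values).
layer_features = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]

state = [0.0, 0.0]
for feats in layer_features:
    # Logits of 0 give gate values of 0.5; a real model would predict them.
    state = gated_update(state, feats, [0.0, 0.0], [0.0, 0.0])

print(state)
```

Because each gate value lies strictly between 0 and 1, the state carries a weighted mix of shallow and deep layer features rather than the final layer alone.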

Model Features

Multi-Level Feature Fusion
Utilizes multi-level representations from vision and text backbone networks, not just final-layer features
Recurrent Gating Mechanism
LSTM-inspired sigmoid gating mechanism dynamically regulates cross-modal information flow
Independent Multimodal Processing
Encodes queries and documents independently; each may contain both image and text content
Fine-Grained Similarity Computation
Generates latent token sets to support fine-grained late-interaction similarity matching
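
The late-interaction matching mentioned in the last feature can be sketched as follows. This example uses the ColBERT-style MaxSim rule (each query token is matched to its best document token and the matches are summed), which is one common form of late interaction; the exact scoring function used by ReT may differ. The function names and the toy two-dimensional token vectors are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim score: for every query token take the highest dot product
    with any document token, then sum over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy latent token sets (made-up values, two tokens of dimension 2 each).
query = [[1.0, 0.0], [0.0, 1.0]]
documents = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "doc_b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query token
}

scores = {name: late_interaction_score(query, toks) for name, toks in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Scoring token sets this way keeps per-token detail that would be lost if each query and document were pooled into a single vector before comparison.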

Model Capabilities

Multimodal Document Retrieval
Image-Text Joint Representation
Cross-Modal Similarity Computation
Vision-Language Feature Fusion

Use Cases

Information Retrieval
Cross-Modal Knowledge Retrieval
Retrieve documents containing relevant answers through image-text hybrid queries
Effectiveness validated on a customized version of the M2KR benchmark
Question Answering Systems
Visual Question Answering Support
Retrieves documents that contain Q&A pairs and their corresponding images for use by VQA systems
Supports visual QA scenarios such as OKVQA and E-VQA