ReT-CLIP-ViT-L-14
Developed by aimagelab
ReT is a retrieval method that supports multimodal queries and multimodal documents. It achieves fine-grained retrieval by fusing multi-level representations from the vision and text backbone networks.
Downloads 523
Release Time: 3/25/2025
Model Overview
ReT employs Transformer-based recurrent units and sigmoid gating mechanisms to selectively regulate cross-level and cross-modal information flow. It can independently process multimodal queries and documents to generate latent token sets for similarity computation.
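The gating idea described above can be illustrated with a minimal sketch. It is not the ReT implementation: the function name `gated_merge`, the weight shapes, and the single-matrix gates are assumptions chosen for illustration, and the Transformer-based recurrent unit is omitted. It shows only how LSTM-style sigmoid gates can decide how much of a running state to keep and how much of a new backbone layer's features to admit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_merge(state, layer_feats, w_forget, w_input):
    """Blend a running state with features from one backbone layer.

    state, layer_feats: arrays of shape (tokens, dim)
    w_forget, w_input:  hypothetical gate weights of shape (2 * dim, dim)
    """
    joint = np.concatenate([state, layer_feats], axis=-1)
    forget_gate = sigmoid(joint @ w_forget)  # how much of the old state to keep
    input_gate = sigmoid(joint @ w_input)    # how much new information to admit
    return forget_gate * state + input_gate * layer_feats

# Toy run: fold three layers of random "backbone features" into one state.
rng = np.random.default_rng(0)
tokens, dim = 4, 8
w_forget = rng.normal(scale=0.1, size=(2 * dim, dim))
w_input = rng.normal(scale=0.1, size=(2 * dim, dim))
state = np.zeros((tokens, dim))
for _ in range(3):
    layer_feats = rng.normal(size=(tokens, dim))
    state = gated_merge(state, layer_feats, w_forget, w_input)
```

Because each gate is a sigmoid, every entry lies strictly between 0 and 1, so the merge is a soft, per-feature blend rather than a hard switch.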
Model Features
Multi-Level Feature Fusion
Utilizes multi-level representations from vision and text backbone networks, not just final-layer features
Recurrent Gating Mechanism
LSTM-inspired sigmoid gating mechanism dynamically regulates cross-modal information flow
Independent Multimodal Processing
Can simultaneously process image and text content in queries and documents
Fine-Grained Similarity Computation
Generates latent token sets to support fine-grained late-interaction similarity matching
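Late-interaction matching between two token sets can be sketched as follows. The scoring rule used here (for each query token, take the maximum dot product over document tokens, then sum) is the common ColBERT-style MaxSim formulation and is an assumption; this page does not state the exact formula ReT uses. The function names `late_interaction_score` and `rank_documents` are hypothetical.

```python
import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim-style score between two token sets.

    query_tokens: array of shape (n_query_tokens, dim)
    doc_tokens:   array of shape (n_doc_tokens, dim)
    """
    sims = query_tokens @ doc_tokens.T   # all pairwise dot products
    return float(sims.max(axis=1).sum()) # best match per query token, summed

def rank_documents(query_tokens, docs):
    """Return document indices sorted from highest to lowest score."""
    scores = [late_interaction_score(query_tokens, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Toy example with hand-written 2-dimensional tokens.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_partial = np.array([[0.5, 0.5]])
doc_exact = np.array([[1.0, 0.0], [0.0, 1.0]])
ranking = rank_documents(query, [doc_partial, doc_exact])
```

Keeping one vector per token, rather than pooling each side into a single embedding, lets individual query tokens match different parts of a document, which is what makes the comparison fine-grained.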
Model Capabilities
Multimodal Document Retrieval
Image-Text Joint Representation
Cross-Modal Similarity Computation
Vision-Language Feature Fusion
Use Cases
Information Retrieval
Cross-Modal Knowledge Retrieval
Retrieve documents containing relevant answers through image-text hybrid queries
Effectiveness validated on a customized version of the M2KR benchmark
Question Answering Systems
Visual Question Answering Support
Retrieves documents containing Q&A pairs and their corresponding images for VQA systems
Supports visual QA scenarios such as OK-VQA and E-VQA