ReT-OpenCLIP-ViT-G-14 open-source model: multimodal queries and fine-grained document retrieval

Ret OpenCLIP ViT G 14

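Developed by aimagelab

ReT is a method for retrieval with multimodal queries and documents. It achieves fine-grained retrieval by merging representations taken from multiple layers of the visual and textual backbones.

Multimodal fusion

Transformers

License: Apache-2.0 #multimodal-document-retrieval #gated-recurrent-Transformer #cross-layer-feature-fusion

Downloads: 77

Release date: 3/25/2025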

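Model overview

ReT uses a Transformer-based recurrent cell with sigmoid gating, accepts mixed image and text input, and targets visual document retrieval.

Model features

Multi-level feature integration

Unlike conventional methods that use only last-layer features, ReT merges representations from multiple layers of the visual and textual backbones.

Sigmoid gating

LSTM-inspired gates selectively control the flow of information across layers and across modalities.

Mixed-modality processing

Processes queries and documents independently, whether they contain an image, text, or both.

Model capabilities

Multimodal document retrieval

Joint image-text feature extraction

Fine-grained similarity computation

Use cases

Information retrieval

Document retrieval for visual question answering

Given a question and a reference image, retrieve the documents that contain the answer.

Validated on a custom version of the M2KR benchmark.

Cross-modal retrieval

Retrieve relevant image documents with a text query, or relevant text documents with an image query.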

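🚀 ReT - Multimodal Document Retrieval Model

ReT is a method for multimodal document retrieval that supports multimodal queries and documents. Unlike existing approaches, which use only the last-layer features of a vision-and-language backbone, ReT uses a Transformer-based recurrent cell that draws on multi-level representations from different layers of the visual and textual backbones. The cell has sigmoid gates, inspired by the LSTM design, that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently and produces a set of latent tokens for each, which are then used for fine-grained late-interaction similarity computation. ReT is designed to handle both images and text in queries and in documents. To that end, it was trained and evaluated on a custom version of the challenging M2KR benchmark, with the following modifications: MSMARCO was excluded because it contains no images, and images were added to the documents from OVEN, InfoSeek, E-VQA, and OKVQA.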
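The gated recurrence can be pictured with a toy NumPy sketch. It is not the ReT implementation. Single-head cross-attention stands in for the Transformer-based cell, the weights are random, and every name here (`gated_recurrent_merge`, `W_f`, `W_t`, `W_v`) and every shape is an assumption made for illustration. The only idea it tries to capture is that a fixed set of latent tokens is carried from layer to layer, with sigmoid gates deciding how much of the old state and of each modality to let through.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(state, feats):
    """Single-head cross-attention: state tokens (K, D) attend to layer features (N, D)."""
    scores = state @ feats.T / np.sqrt(state.shape[-1])  # (K, N)
    return softmax(scores, axis=-1) @ feats              # (K, D)

def gated_recurrent_merge(text_layers, image_layers, state, params):
    """Walk over backbone layers and update `state` with sigmoid-gated contributions."""
    for t_feats, v_feats, p in zip(text_layers, image_layers, params):
        cand_t = cross_attend(state, t_feats)   # what the text features of this layer offer
        cand_v = cross_attend(state, v_feats)   # what the image features of this layer offer
        f = sigmoid(state @ p["W_f"])           # forget gate: how much of the old state to keep
        g_t = sigmoid(cand_t @ p["W_t"])        # input gate for the text contribution
        g_v = sigmoid(cand_v @ p["W_v"])        # input gate for the image contribution
        state = f * state + g_t * cand_t + g_v * cand_v
    return state

rng = np.random.default_rng(0)
num_layers, K, D = 3, 32, 128                   # hypothetical sizes
text_layers = [rng.standard_normal((20, D)) for _ in range(num_layers)]
image_layers = [rng.standard_normal((50, D)) for _ in range(num_layers)]
params = [
    {name: rng.standard_normal((D, D)) * 0.02 for name in ("W_f", "W_t", "W_v")}
    for _ in range(num_layers)
]
state0 = rng.standard_normal((K, D))

latent_tokens = gated_recurrent_merge(text_layers, image_layers, state0, params)
print(latent_tokens.shape)  # (32, 128)
```

The number of latent tokens stays fixed however many text or image tokens each layer supplies, which is what makes a fixed-size token set available for late interaction afterwards.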

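🚀 Quick Start

ReT is a new approach to multimodal document retrieval that supports multimodal queries and documents. It builds on the Transformer architecture and extracts multi-level features from different layers of the visual and textual backbones.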
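To make "features from different layers" concrete, here is a small NumPy sketch that runs a toy stack of layers and keeps the outputs of a few of them instead of only the last one. The backbone is a stand-in (a `tanh` of a random linear map per layer), and `run_backbone` and the choice of tapped layers are hypothetical. It says nothing about which layers ReT actually reads.

```python
import numpy as np

def run_backbone(x, layer_weights, tap_layers):
    """Run a toy stack of layers and keep the outputs of the layers in `tap_layers`."""
    taps = {}
    for idx, w in enumerate(layer_weights):
        x = np.tanh(x @ w)          # stand-in for one Transformer block
        if idx in tap_layers:
            taps[idx] = x           # remember this intermediate representation
    return x, taps

rng = np.random.default_rng(0)
depth, dim = 12, 64                                # hypothetical backbone size
layer_weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
tokens = rng.standard_normal((10, dim))            # 10 input tokens

tap_layers = {3, 7, 11}                            # hypothetical choice of layers
last, taps = run_backbone(tokens, layer_weights, tap_layers)

print(sorted(taps))      # [3, 7, 11]
print(taps[3].shape)     # (10, 64)
```

A last-layer-only method would keep just `last`. A multi-level method keeps the whole `taps` dictionary and merges its entries.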

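✨ Key Features

Multimodal support: handles multimodal queries and documents, each containing images, text, or both.
Multi-level features: a Transformer-based recurrent cell draws on multi-level representations from different layers of the visual and textual backbones.
Sigmoid gates: LSTM-inspired sigmoid gates selectively control the flow of information between layers and modalities.
Fine-grained interaction: queries and documents are processed independently, each yielding a set of latent tokens used for fine-grained late-interaction similarity computation.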
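Late interaction compares two token sets token by token rather than as single pooled vectors. The sketch below uses the ColBERT-style MaxSim rule, one common formulation. The text above does not give ReT's exact scoring formula, so treat the rule and the function name `late_interaction_score` as assumptions.

```python
import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: for each query token take its best-matching document token, then sum."""
    sim = query_tokens @ doc_tokens.T        # (Nq, Nd) token-to-token dot products
    return float(sim.max(axis=1).sum())

# Two query tokens pointing along the two axes of a 2-D toy space
q = np.array([[1.0, 0.0], [0.0, 1.0]])

# A document with a token matching each query token exactly
d_good = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
# A document whose only token matches each query token halfway
d_partial = np.array([[0.5, 0.5]])

print(late_interaction_score(q, d_good))     # 2.0
print(late_interaction_score(q, d_partial))  # 1.0
```

Because every query token picks its own best match, a document is rewarded for covering each part of the query, which a single pooled vector cannot express.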

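📦 Installation

Follow the instructions in the repository (https://github.com/aimagelab/ReT) to set up the required environment.

💻 Usage Example

Basic usage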

# `src.models` is a module of the ReT repository; this assumes the script runs from the repository root
from src.models import RetrieverModel, RetModel
import torch

# Use the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-OpenCLIP-ViT-G-14', device_map=device)

# QUERY: encode one [text, image path] pair with the query-side model
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])


# PASSAGE: encode one [text, image path] pair with the passage-side model
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''  # empty image path: this passage is text-only

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
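The example stops at the feature tensors, each of shape [1, 32, 128]. As a sketch of what could come next, the NumPy snippet below ranks several passages against one query with a MaxSim-style late-interaction score. Random arrays stand in for real features so that it runs without the model weights. `rank_passages` is a hypothetical helper, not a function of the ReT repository, and the scoring rule is an assumption.

```python
import numpy as np

def rank_passages(query_feats, passage_feats, top_k=3):
    """Score every passage against one query and return the best indices and scores.

    query_feats:   (Nq, D)     latent tokens of one query
    passage_feats: (P, Nd, D)  latent tokens of P passages
    """
    # Dot product of every query token with every token of every passage
    sim = np.einsum("qd,pnd->pqn", query_feats, passage_feats)   # (P, Nq, Nd)
    # Best passage token per query token, summed over query tokens
    scores = sim.max(axis=2).sum(axis=1)                         # (P,)
    order = np.argsort(-scores)[:top_k]                          # highest score first
    return order, scores[order]

rng = np.random.default_rng(0)
query = rng.standard_normal((32, 128))          # one query, batch dimension dropped
passages = rng.standard_normal((5, 32, 128))    # five stand-in passages
passages[2] = query                             # plant an exact match to check the ranking

order, scores = rank_passages(query, passages)
print(order[0])  # 2
```

With real features, an expression such as `ret_feats[0].detach().cpu().float().numpy()` should give the 32 x 128 array for one query or passage, assuming `get_ret_features` returns an ordinary PyTorch tensor as the printed `torch.Size` suggests.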

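📚 Documentation

Model sources

Repository: https://github.com/aimagelab/ReT
Paper: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval (CVPR 2025)

Dataset

Training and evaluation use a custom version of the M2KR benchmark: MSMARCO is excluded because it contains no images, and images are added to the documents from OVEN, InfoSeek, E-VQA, and OKVQA.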

Model information

Property	Details
Library	transformers
Model type	Visual document retrieval
Base model	laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Training data	aimagelab/ReT-M2KR
License	apache-2.0

📄 License

This model is released under the Apache 2.0 license.

📚 Citation

If you use this model, please cite the following paper:

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}