ReT-CLIP-ViT-L-14 open-source model: multimodal queries for fine-grained document retrieval

ReT-CLIP-ViT-L-14

Developed by aimagelab

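ReT is a novel method for retrieval with multimodal queries and documents. It achieves fine-grained retrieval by fusing multi-level representations from the visual and textual backbones.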

Multimodal fusion

Transformers

License: Apache-2.0 #multimodal document retrieval #recurrence-enhanced Transformer #cross-level feature fusion

Downloads: 523

Released: March 25, 2025

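Model overview

ReT uses a Transformer-based recurrent cell with a sigmoid gating mechanism to selectively control the flow of information across layers and across modalities. It processes multimodal queries and documents independently, producing for each a set of latent tokens used for similarity computation.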

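Model features

Multi-level feature fusion

Uses multi-level representations from the visual and textual backbones rather than only the final-layer features.

Recurrent gating mechanism

An LSTM-inspired sigmoid gating mechanism dynamically regulates the flow of cross-modal information.

Independent multimodal processing

Handles both image and text content in queries and in documents.

Fine-grained similarity computation

Produces a set of latent tokens that supports fine-grained, late-interaction similarity matching.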

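Model capabilities

Multimodal document retrieval

Joint image-text representation

Cross-modal similarity computation

Vision-language feature fusion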

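Use cases

Information retrieval

Cross-modal knowledge retrieval

Retrieves documents containing relevant answers from queries that mix images and text.

Effectiveness validated on a customized version of the M2KR benchmark.

Question answering systems

Visual question answering support

Retrieves documents that contain the answer to a question, along with the corresponding images, for VQA systems.

Supports visual question answering scenarios such as OKVQA and E-VQA.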

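🚀 ReT: a visual document retrieval model

ReT is a novel method for multimodal document retrieval that supports multimodal queries and documents. Existing methods use only the last-layer features of the vision and language backbones. ReT instead uses a Transformer-based recurrent cell that exploits multi-level representations from different layers of both the visual and textual backbones. Inspired by the design of LSTMs, the cell is equipped with sigmoid gates that selectively control the flow of information between layers and modalities. ReT processes multimodal queries and documents independently, producing sets of latent tokens for fine-grained late-interaction similarity computation, and it can handle images and text in both queries and documents.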
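The gating idea can be illustrated with a minimal sketch in dependency-free Python. This is not the ReT implementation: the function names (`sigmoid`, `gated_update`, `fuse_layers`) are hypothetical, the gates are fixed scalar logits rather than learned functions of the inputs, and each layer's features are plain vectors rather than token sequences processed with attention. The sketch only shows how a forget gate and two per-modality input gates can decide how much of each layer's textual and visual features enters a running state.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_update(state, text_feat, image_feat,
                 forget_logit, text_logit, image_logit):
    """One recurrent step (illustrative only).

    The forget gate scales the previous state; the two input gates scale
    the current layer's textual and visual features before they are added.
    """
    f = sigmoid(forget_logit)
    g_t = sigmoid(text_logit)
    g_v = sigmoid(image_logit)
    return [f * s + g_t * t + g_v * v
            for s, t, v in zip(state, text_feat, image_feat)]


def fuse_layers(text_layers, image_layers, init_state, gate_logits):
    """Walk through backbone layers in order, updating the state at each one.

    gate_logits holds one (forget, text, image) triple per layer; in a real
    model these would be predicted from the inputs, which is assumed away here.
    """
    state = list(init_state)
    for t, v, (fl, tl, vl) in zip(text_layers, image_layers, gate_logits):
        state = gated_update(state, t, v, fl, tl, vl)
    return state


if __name__ == "__main__":
    text_layers = [[1.0, 2.0], [2.0, 2.0]]    # toy features from two text layers
    image_layers = [[3.0, 4.0], [0.0, 0.0]]   # toy features from two vision layers
    gates = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # logit 0 means every gate is 0.5
    print(fuse_layers(text_layers, image_layers, [0.0, 0.0], gates))  # [2.0, 2.5]
```

With a large positive forget logit and large negative input logits, the state passes through a layer almost unchanged, which is the sense in which the gates are selective.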

🚀 Quick start

Environment setup

Install the required environment by following the instructions in the repository.

Usage example

from src.models import RetrieverModel, RetModel
import torch

# Use the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# Query: encode a text instruction together with an image
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

# Each input item is a [text, image_path] pair
q_feats = ret.get_ret_features([[q_txt, q_img]])
print(q_feats.shape)  # torch.Size([1, 32, 128])


# Document: encode a passage; an empty string is passed in place of an image path
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

p_feats = ret.get_ret_features([[p_txt, p_img]])
print(p_feats.shape)  # torch.Size([1, 32, 128])
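The query and passage features printed above each have shape [1, 32, 128], which presumably corresponds to one input item represented by 32 latent tokens of 128 dimensions. The sketch below shows one common form of late interaction over such token sets, the ColBERT-style MaxSim score. Whether this matches the repository's own scoring function is an assumption, the function names (`dot`, `late_interaction_score`, `rank_documents`) are hypothetical, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: for each query token take its best-matching document token,
    then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens, docs):
    """Return document ids sorted from highest to lowest score.

    docs maps a document id to that document's list of token vectors.
    """
    scores = {doc_id: late_interaction_score(query_tokens, tokens)
              for doc_id, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy token sets with 2 tokens of 2 dimensions (the model uses 32 x 128)
    query = [[1.0, 0.0], [0.0, 1.0]]
    docs = {
        "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
        "doc_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
    }
    print(late_interaction_score(query, docs["doc_a"]))  # 2.0
    print(late_interaction_score(query, docs["doc_b"]))  # 1.0
    print(rank_documents(query, docs))  # ['doc_a', 'doc_b']
```

Because every query token is matched against every document token, this kind of scoring keeps token-level detail that a single pooled embedding per item would lose.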

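✨ Key features

Multimodal support: supports multimodal queries and documents, handling images and text together.
Multi-level feature use: a Transformer-based recurrent cell exploits multi-level representations from different layers of the visual and textual backbones.
Information flow control: LSTM-inspired sigmoid gates selectively control the flow of information between layers and modalities.
Fine-grained interaction: multimodal queries and documents are processed independently, producing sets of latent tokens for fine-grained late-interaction similarity computation.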

📚 Documentation

Model sources

Repository: https://github.com/aimagelab/ReT
Paper: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval (CVPR 2025)

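Training and evaluation

The model was trained and evaluated on a custom version of the challenging M2KR benchmark, with two modifications: MSMARCO was excluded because it contains no images, and images were added to the documents of OVEN, InfoSeek, E-VQA, and OKVQA.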

📄 License

This project is released under the Apache-2.0 license.

📝 Citation

If you use this model in your research, please cite it with the following BibTeX entry:

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}