Open-source VisRAG-Ret model: embeds document images directly to avoid information loss

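Developed by openbmb

VisRAG is a retrieval-augmented generation (RAG) system built on vision-language models (VLMs). It embeds documents directly as images, avoiding the information loss caused by traditional text parsing.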

Text-to-Image

Safetensors

English · License: Apache-2.0 · #Multimodal document retrieval #Vision-augmented generation #PDF information preservation

Downloads: 1,294

Release date: 10/14/2024

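Model Overview

VisRAG is a multimodal, retrieval-augmented generation system for documents. It processes document images directly with a vision-language model, preserving the full information of the original document and improving retrieval and generation quality.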

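Model Features

Visual document retrieval

Processes documents directly as images, avoiding the information loss caused by traditional text parsing

Multimodal enhancement

Combines visual and language information for a more complete understanding of documents

Efficient retrieval

Uses optimized embedding representations for fast, accurate document retrieval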

Model Capabilities

Document image embedding

Multimodal retrieval

Retrieval-augmented generation

Cross-modal understanding

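Use Cases

Document Processing

Academic paper retrieval

Retrieve relevant content from a large collection of academic paper PDFs based on a query

Preserves the original document's formatting and visual information, improving retrieval accuracy

Enterprise document management

Retrieve relevant information from an enterprise document repository

Processes the original files directly, with no need to parse documents beforehand

Knowledge Q&A

Document-based question answering

Retrieve relevant information from documents to generate answers

Produces more accurate answers while preserving the visual layout of the original documents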

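🚀 VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

VisRAG is a retrieval-augmented generation (RAG) pipeline built on vision-language models (VLMs). It embeds documents directly as images, which avoids the information loss of traditional text parsing and retains as much of the original document's information as possible.

🚀 Quick Start

VisRAG is a novel RAG pipeline built on vision-language models (VLMs). Rather than first parsing a document to extract text, the pipeline embeds the document directly as an image with a VLM and then retrieves the relevant pages to augment a VLM's generation. Compared with traditional text-based RAG, VisRAG retains and uses as much of the original document's information as possible and eliminates the information loss introduced by parsing.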
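The retrieval step described above can be sketched as a top-k search over embeddings. This is a minimal illustration, not the project's own retrieval code: the function name `retrieve_top_k` is hypothetical, and it assumes the query and page embeddings are already L2-normalized NumPy arrays, as the `encode` function in the usage example on this page returns them.

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, k=3):
    """Return (indices, scores) of the k best-matching documents for one query.

    Assumes both inputs are L2-normalized, so the dot product equals
    the cosine similarity.
    """
    query_emb = np.asarray(query_emb, dtype=np.float32)   # shape (dim,)
    doc_embs = np.asarray(doc_embs, dtype=np.float32)     # shape (num_docs, dim)
    scores = doc_embs @ query_emb                          # shape (num_docs,)
    k = min(k, scores.shape[0])                            # never ask for more than we have
    top = np.argsort(-scores)[:k]                          # highest score first
    return top.tolist(), scores[top].tolist()
```

The returned indices identify which page images to pass to the generator. For large collections an approximate nearest-neighbor index would likely replace the full matrix product, but that is outside what this page describes.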

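✨ Key Features

VisRAG-Ret

VisRAG-Ret is a document embedding model built on MiniCPM-V 2.0, a vision-language model that integrates SigLIP as the vision encoder and MiniCPM-2B as the language model.

VisRAG-Gen

In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as generators. In practice, you can use any VLM you like!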

📦 Installation

torch==2.1.2
torchvision==0.16.2
transformers==4.40.2
sentencepiece==0.1.99
decord==0.6.0
Pillow==10.1.0
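
Before loading the model, it can help to confirm that the installed packages match the pinned versions above. The following is a minimal sketch using only the standard library's `importlib.metadata`. The names `PINNED` and `find_mismatches` are hypothetical, and ignoring local build tags such as `+cu121` is an assumption about how strictly the pins should be read.

```python
from importlib import metadata

# Versions copied from the requirements list above.
PINNED = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.40.2",
    "sentencepiece": "0.1.99",
    "decord": "0.6.0",
    "Pillow": "10.1.0",
}

def find_mismatches(pinned, get_version=metadata.version):
    """Map each package that is missing or off-version to (wanted, installed)."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Drop a local build tag ("2.1.2+cu121" -> "2.1.2") before comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dictionary when every pinned package is installed at the listed version.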

💻 Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from io import BytesIO

def weighted_mean_pooling(hidden, attention_mask):
    # Position-weighted mean: the i-th real token gets weight i, padding gets weight 0.
    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)
    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)
    d = attention_mask_.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(text_or_image_list):
    # The model's forward pass takes parallel "text" and "image" lists plus the tokenizer.
    if isinstance(text_or_image_list[0], str):
        # Text input (queries): no image for any item.
        inputs = {
            "text": text_or_image_list,
            "image": [None] * len(text_or_image_list),
            "tokenizer": tokenizer
        }
    else:
        # Image input (document pages): empty text for every item.
        inputs = {
            "text": [""] * len(text_or_image_list),
            "image": text_or_image_list,
            "tokenizer": tokenizer
        }
    outputs = model(**inputs)
    attention_mask = outputs.attention_mask
    hidden = outputs.last_hidden_state

    # Pool token states into one vector per input, then L2-normalize.
    reps = weighted_mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

model_name_or_path = "openbmb/VisRAG-Ret"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# .cuda() moves the model to the GPU, so this line requires a CUDA-capable device.
model = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model.eval()

queries = ["What does a dog look like?"]
INSTRUCTION = "Represent this query for retrieving relevant documents: "
queries = [INSTRUCTION + query for query in queries]

print("Downloading images...")
passages = [
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/cat.jpeg'
    ).content)).convert('RGB'),
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/dog.jpg'
    ).content)).convert('RGB')
]
print("Images downloaded.")

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

# Embeddings are L2-normalized, so the dot product is the cosine similarity.
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
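
The pooling in `weighted_mean_pooling` can also be written as a formula. This is a reading of the code as shown, not a statement from the paper. The symbols are introduced here for illustration: $m_i \in \{0, 1\}$ is the attention mask at position $i$, $\mathbf{h}_i$ is the last hidden state at that position, and $T$ is the padded sequence length.

```latex
w_i = m_i \sum_{j \le i} m_j, \qquad
\mathbf{r} = \frac{\sum_{i=1}^{T} w_i \, \mathbf{h}_i}{\sum_{i=1}^{T} w_i}, \qquad
\mathbf{e} = \frac{\mathbf{r}}{\lVert \mathbf{r} \rVert_2}
```

For an input with $n$ real tokens the weights are $1, 2, \dots, n$ and the denominator is $n(n+1)/2$, so later tokens count for more. With $n = 3$, for example, the weights are $1, 2, 3$ and the last token contributes $3/6$, half of the pooled vector. Because $\mathbf{e}$ has unit length, each entry of `scores` is the cosine similarity between a query and a document image.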

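🔧 Technical Details

VisRAG-Ret

Our training dataset for VisRAG-Ret contains 362,110 query-document (Q-D) pairs, made up of the training sets of publicly available academic datasets (34%) and a synthetic dataset (66%). The synthetic dataset consists of pages from web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o). You can find it in the VisRAG collection on Hugging Face, which is referenced at the beginning of this page.

VisRAG-Gen

The generation part involves no fine-tuning; we use off-the-shelf LLMs/VLMs directly for generation.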

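📄 License

The code in this repository is released under the Apache-2.0 license.
Use of the VisRAG-Ret model weights must strictly follow the MiniCPM Model License.md.
The VisRAG-Ret model and weights are completely free for academic research. After registering by filling out a "questionnaire", the VisRAG-Ret weights are also available for free commercial use.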

📑 Citation

@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
      title={VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents}, 
      author={Shi Yu and Chaoyue Tang and Bokai Xu and Junbo Cui and Junhao Ran and Yukun Yan and Zhenghao Liu and Shuo Wang and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2410.10594},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.10594}, 
}

📧 Contact

Shi Yu: yus21@mails.tsinghua.edu.cn
Chaoyue Tang: tcy006@gmail.com

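🎉 News

November 4, 2024: We released the VisRAG pipeline on Hugging Face Space.
October 31, 2024: We released the VisRAG pipeline on Colab.
October 15, 2024: We released the training and test data on Hugging Face. You can find them in the VisRAG collection on Hugging Face, which is referenced at the beginning of this page.
October 14, 2024: We released the paper on arXiv, the models on Hugging Face, and the code on GitHub.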