bert-base-1024-biencoder-64M-pairs open-source model: free sentence and paragraph embeddings

Bert Base 1024 Biencoder 64M Pairs

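Developed by shreyansh26

A long-context bi-encoder for sentence and paragraph embeddings, built on MosaicML's BERT pretrained with a sequence length of 1024

Text Embedding

Transformers

#Long-text encoding #Semantic search #Dense vector retrieval

Downloads: 19

Released: 8/22/2023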

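Model Overview

The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.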

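Model Features

Long-context support

Handles sequences of up to 1024 tokens, which suits long documents and paragraphs

Large-scale training

Trained on 64M randomly sampled sentence/paragraph pairs

Efficient retrieval

Intended for semantic search and information retrieval tasks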

Model Capabilities

Sentence embedding

Paragraph embedding

Semantic similarity computation

Information retrieval

Document clustering

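Use Cases

Information Retrieval

Semantic search

Build the semantic retrieval component of a search engine

Evaluated on several retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval)

Question answering systems

Retrieve the document passages most relevant to a question

Text Analysis

Document clustering

Group documents with similar content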

🚀 bert-base-1024-biencoder-64M-pairs

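This is a long-context bi-encoder based on MosaicML's BERT, which was pretrained with a sequence length of 1024. The model maps sentences and paragraphs to a 768-dimensional dense vector space and is suitable for tasks such as clustering or semantic search.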
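To make the semantic-search use concrete, the following is a minimal sketch of ranking documents against a query by cosine similarity. The three-dimensional vectors are made-up stand-ins for the model's 768-dimensional embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of the model's repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (assumption: real embeddings would have 768 dimensions).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_by_similarity(query, docs)
print(ranking)  # document 1 ranks first, then 2, then 0
```

In a real setup the query and document vectors would come from the model, and the top-ranked indices would be mapped back to the original texts.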

🚀 Quick Start

📦 Installation

Download the model and its accompanying scripts:

git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs

💻 Usage Example

Basic usage

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
# mosaic_bert is expected to come from the scripts downloaded with the model (see the git clone step)
from mosaic_bert import BertModel

# If using PyTorch 2.0: pip install triton==2.0.0.dev20221202 --no-deps

class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super(AutoModelForSentenceEmbedding, self).__init__()

        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load the tokenizer first: the wrapper below takes it as an argument
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)

sentences = ["This is an example sentence", "Each sentence is converted"]

encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
with torch.no_grad():  # inference only, so skip gradient tracking
    embeddings = model(**encoded_input)

print(embeddings)
print(embeddings.shape)
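
As a dependency-free illustration of what `mean_pooling` and the normalization step in `forward` compute, here is a sketch for a single sequence using plain Python lists. The token vectors and mask are invented toy values, and `masked_mean_pool` and `l2_normalize` are hypothetical names introduced only for this example.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, skipping positions whose mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                summed[j] += vec[j]
    # Same guard against division by zero as torch.clamp(..., min=1e-9) above
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm, mirroring torch.nn.functional.normalize(p=2)."""
    norm = max(math.sqrt(sum(x * x for x in vec)), eps)
    return [x / norm for x in vec]


# Toy input: three token positions, the last one is padding (mask = 0).
tokens = [[2.0, 3.0], [4.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # the padding vector is ignored
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the normalized vectors have unit length, the dot product of two such embeddings equals their cosine similarity.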

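📚 Documentation

Training

The model was trained on 64 million randomly sampled sentence/paragraph pairs drawn from the same training set used by the Sentence Transformers models. Details of the training set are available here.

The training (including hyperparameters), inference, and data-loading scripts are all available in this GitHub repository.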

Evaluation

We ran the model on several retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval); the results are available here.
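
The text above does not say which metric was reported. Retrieval benchmarks of this kind are commonly scored with nDCG@10, so the sketch below assumes that metric purely for illustration. The document IDs and relevance labels are invented, and `dcg_at_k` and `ndcg_at_k` are hypothetical helper names.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k for one query: DCG of the ranking divided by the best achievable DCG."""
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: d1 and d3 are relevant to the query, d2 is not.
qrels = {"d1": 1, "d3": 1}
ranked = ["d1", "d2", "d3"]  # the relevant d3 is ranked below the irrelevant d2

score = ndcg_at_k(ranked, qrels)
print(round(score, 4))  # below 1.0 because d3 is ranked too low
```

A benchmark score is then the mean of this per-query value over all queries in the test set.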

🔧 Technical Details

Property	Details
Datasets	sentence-transformers/embedding-training-data, flax-sentence-embeddings/stackexchange_xml, snli, eli5, search_qa, multi_nli, wikihow, natural_questions, trivia_qa, ms_marco, gooaq, yahoo_answers_topics
Language	en
Inference	false
Task types	sentence-similarity, feature-extraction, text-retrieval
Tags	information retrieval, ir, documents retrieval, passage retrieval, beir, benchmark, sts, semantic search, sentence-transformers, feature-extraction, sentence-similarity, transformers