🚀 bert-base-1024-biencoder-64M-pairs
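This is a long-context bi-encoder model based on MosaicML's BERT pretrained at a sequence length of 1024. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.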
🚀 Quick Start
📦 Installation
Download the model and its accompanying scripts:
git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs
💻 Usage Example
Basic usage
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from mosaic_bert import BertModel  # expected to come from the cloned repository


class AutoModelForSentenceEmbedding(nn.Module):
    """Wraps a BERT encoder and returns one mean-pooled vector per input."""

    def __init__(self, model, tokenizer, normalize=True):
        super().__init__()
        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            # L2-normalize so that dot products equal cosine similarities
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        # The first element of the model output holds the token embeddings
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Average over real tokens only; padded positions are masked out
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Create the tokenizer first: the wrapper below takes it as an argument
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)

sentences = ["This is an example sentence", "Each sentence is converted"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")

# Gradients are not needed for inference
with torch.no_grad():
    embeddings = model(**encoded_input)

print(embeddings)
print(embeddings.shape)
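The mean-pooling step used above can be checked in isolation on the CPU. The following is a minimal sketch with made-up toy tensors rather than real model output; `mean_pool` is a hypothetical standalone copy of the pooling logic, written only for illustration.

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)  # guard against division by zero
    return summed / counts


# Toy batch (assumed values): 2 sequences, 3 token positions, hidden size 2.
token_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # → [[2., 3.], [4., 2.]]; the padded [100, 100] vector is ignored

# After L2 normalization every embedding has unit length
normalized = torch.nn.functional.normalize(pooled, p=2, dim=1)
print(normalized.norm(dim=1))
```

Because the padded position is multiplied by a zero mask, it contributes nothing to the sum or the token count, so sequences of different lengths in one batch yield comparable averages.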
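📚 Documentation
Training
The model was trained on 64 million randomly sampled sentence/paragraph pairs drawn from the same training set used by the Sentence Transformers models. Details of the training set are available here.
The training (including hyperparameters), inference, and data-loading scripts are all available in this GitHub repository.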
Evaluation
We ran the model on several retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval); the results are available here.
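Retrieval tasks of this kind generally come down to ranking passages by their similarity to a query embedding. The sketch below illustrates that ranking step with made-up 3-dimensional vectors standing in for the model's 768-dimensional output; `rank_passages` is a hypothetical helper, and the metrics actually reported for the benchmarks are not reproduced here.

```python
import torch


def rank_passages(query_emb: torch.Tensor, passage_embs: torch.Tensor, top_k: int = 2):
    """Return (scores, indices) of the top_k passages by cosine similarity."""
    query_emb = torch.nn.functional.normalize(query_emb, p=2, dim=-1)
    passage_embs = torch.nn.functional.normalize(passage_embs, p=2, dim=-1)
    scores = passage_embs @ query_emb  # one cosine similarity per passage
    return torch.topk(scores, k=top_k)


# Toy embeddings (assumed values, not produced by the model)
query = torch.tensor([1.0, 0.0, 0.0])
passages = torch.tensor([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])

scores, indices = rank_passages(query, passages)
print(indices.tolist())  # → [1, 2]
```

With embeddings that are already L2-normalized, as in the usage example, the normalization inside the helper is a no-op and the ranking reduces to a plain matrix-vector product.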
🔧 Technical Details
| Property | Details |
|---|---|
| Datasets | sentence-transformers/embedding-training-data, flax-sentence-embeddings/stackexchange_xml, snli, eli5, search_qa, multi_nli, wikihow, natural_questions, trivia_qa, ms_marco, gooaq, yahoo_answers_topics |
| Language | en |
| Inference | false |
| Task types | sentence-similarity, feature-extraction, text-retrieval |
| Tags | information retrieval, ir, documents retrieval, passage retrieval, beir, benchmark, sts, semantic search, sentence-transformers, feature-extraction, sentence-similarity, transformers |