🚀 bert-base-1024-biencoder-64M-pairs
This is a long-context bi-encoder model built on MosaicML's BERT pretrained with a sequence length of 1024. It maps sentences and paragraphs to a 768-dimensional dense vector space and is suitable for tasks such as clustering or semantic search.
🚀 Quick Start
📦 Installation
Download the model and associated scripts:

```shell
git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs
```
💻 Usage Examples
Basic usage
```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from mosaic_bert import BertModel  # local module from the cloned repository


class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super().__init__()
        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        # Average the token embeddings, masking out padding positions
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)

sentences = ["This is an example sentence", "Each sentence is converted"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
embeddings = model(**encoded_input)
print(embeddings)
print(embeddings.shape)  # torch.Size([2, 768])
```
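Because the embeddings are L2-normalized by default (`normalize=True`), cosine similarity between two sentences reduces to a plain dot product. A minimal sketch, using random vectors as stand-ins for the real 768-dimensional embeddings produced above:

```python
import torch

# Toy stand-ins for a batch of two L2-normalized sentence embeddings
emb = torch.nn.functional.normalize(torch.randn(2, 768), p=2, dim=1)

# For unit vectors, the pairwise cosine-similarity matrix is just emb @ emb.T
similarity = emb @ emb.T

# Diagonal entries are ~1.0 (each vector with itself); off-diagonal
# entries lie in [-1, 1], with higher values meaning more similar
print(similarity)
```

In practice you would pass the `embeddings` tensor returned by the model in place of `emb`.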
📚 Documentation
Training
The model was trained on 64 million randomly sampled sentence/paragraph pairs drawn from the same training data used by the Sentence Transformers models. Details of the training set are available here.
The training (including hyperparameters), inference, and data-loading scripts are all available in this GitHub repository.
Evaluation
We ran the model on several retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval); the results are available here.
🔧 Technical Details

| Property | Details |
| --- | --- |
| Datasets | sentence-transformers/embedding-training-data, flax-sentence-embeddings/stackexchange_xml, snli, eli5, search_qa, multi_nli, wikihow, natural_questions, trivia_qa, ms_marco, gooaq, yahoo_answers_topics |
| Language | en |
| Inference | false |
| Pipeline tags | sentence-similarity, feature-extraction, text-retrieval |
| Tags | information retrieval, ir, documents retrieval, passage retrieval, beir, benchmark, sts, semantic search, sentence-transformers, feature-extraction, sentence-similarity, transformers |