🚀 bert-base-1024-biencoder-6M-pairs
A long-context biencoder based on MosaicML's BERT, pretrained with a sequence length of 1024 tokens. This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
🚀 Quick Start
Datasets
- sentence-transformers/embedding-training-data
- flax-sentence-embeddings/stackexchange_xml
- snli
- eli5
- search_qa
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- ms_marco
- gooaq
- yahoo_answers_topics
Task Categories
- sentence-similarity
- feature-extraction
- text-retrieval
Tags
- information retrieval
- ir
- documents retrieval
- passage retrieval
- beir
- benchmark
- sts
- semantic search
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
📦 Installation
Download the model and related scripts
git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-6M-pairs
💻 Usage Examples
Basic Usage
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from mosaic_bert import BertModel  # shipped in the cloned model repo

class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super(AutoModelForSentenceEmbedding, self).__init__()
        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        # Average the token embeddings, weighted by the attention mask
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)
sentences = ["This is an example sentence", "Each sentence is converted"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
embeddings = model(**encoded_input)
print(embeddings)
print(embeddings.shape)
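Semantic Search
Because the wrapper normalizes embeddings by default (normalize=True), the dot product of two embeddings equals their cosine similarity. Below is a minimal semantic-search sketch building on the Basic Usage snippet above; the corpus and query strings are illustrative.
corpus = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
]
query = ["A person strums an instrument"]

with torch.no_grad():
    corpus_emb = model(**tokenizer(corpus, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda"))
    query_emb = model(**tokenizer(query, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda"))

# Dot products of unit vectors are cosine similarities
scores = query_emb @ corpus_emb.T
best = scores.argmax(dim=1).item()
print(f"Best match: {corpus[best]} (score={scores[0, best].item():.3f})")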
📚 Documentation
Training
This model has been trained on 6.4M randomly sampled pairs of sentences/paragraphs from the same training set that Sentence Transformers models use. Details of the training set are available here.
The training (along with hyperparameters), inference, and data-loading scripts can all be found in this GitHub repository.
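For orientation, the following is a hedged sketch of the in-batch-negatives contrastive objective typically used when training biencoders on (anchor, positive) pairs; the exact loss and hyperparameters for this model live in the repository above, and scale=20.0 is an illustrative value, not a confirmed setting.
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchor_emb, positive_emb, scale=20.0):
    # anchor_emb, positive_emb: (batch, dim) L2-normalized embeddings.
    # Row i's positive is positive_emb[i]; every other row in the batch
    # serves as a negative for it.
    scores = scale * anchor_emb @ positive_emb.T  # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)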
Evaluations
We ran the model on a few retrieval-based benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval), and the results are available here.