
Bert Base 1024 Biencoder 64M Pairs

Developed by shreyansh26
A long-context bi-encoder built on MosaicML's pre-trained BERT with a 1024-token sequence length, producing sentence and paragraph embeddings.
Downloads: 19
Release date: 8/22/2023

Model Overview

This model maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks like clustering or semantic search.
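Embeddings from this vector space are typically compared with cosine similarity. Below is a minimal sketch of the usual pooling-and-scoring step using NumPy; note that the pooling strategy (mean pooling over non-padding tokens) is an assumption for illustration, not a confirmed detail of this model, and the random arrays stand in for real encoder hidden states.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: (seq_len, 768) hidden states from the encoder
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(np.float64)
    return (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1e-9)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy demo: random "hidden states" of shape (seq_len=6, dim=768),
# where the last two positions are padding.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 768))
mask = np.array([1, 1, 1, 1, 0, 0])
emb = mean_pool(hidden, mask)
print(emb.shape)                      # (768,)
print(cosine_similarity(emb, emb))    # close to 1.0
```

In a real pipeline, `hidden` would be the encoder's last hidden state for a tokenized input, and two texts would be compared via `cosine_similarity` of their pooled vectors.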

Model Features

Long-context support
Supports a sequence length of 1024 tokens, making it suitable for processing long documents and paragraphs
Large-scale training
Trained on 64M randomly sampled sentence/paragraph pairs
Efficient retrieval
Optimized for semantic search and information retrieval tasks

Model Capabilities

Sentence embeddings
Paragraph embeddings
Semantic similarity computation
Information retrieval
Document clustering

Use Cases

Information retrieval
Semantic search
Building semantic retrieval functionality for search engines
Performs well on multiple retrieval benchmarks
Question answering systems
Used to retrieve the most relevant document passages for questions
Text analysis
Document clustering
Grouping documents with similar content
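The retrieval use cases above all reduce to nearest-neighbor search over the embedding space: embed the corpus once, embed the query, and rank documents by cosine similarity. A brute-force sketch with NumPy follows; the random vectors are hypothetical stand-ins for the bi-encoder's 768-dimensional embeddings.

```python
import numpy as np

def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 3) -> list:
    """Return indices of the k corpus vectors most similar to the query.

    Vectors are L2-normalized first, so the dot product equals cosine
    similarity; a matrix-vector product then scores the whole corpus at once.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    return list(np.argsort(-scores)[:k])

# Hypothetical demo: 100 random "document embeddings" and a query that is
# a slightly perturbed copy of document 42, so it should rank first.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(100, 768))
query = corpus[42] + 0.01 * rng.normal(size=768)
print(top_k(query, corpus, k=3)[0])  # 42
```

For large corpora, the same idea is usually served by an approximate nearest-neighbor index (e.g. FAISS) rather than a full matrix product per query.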