🚀 bert-base-1024-biencoder-6M-pairs
A long-context biencoder based on MosaicML's BERT, pretrained with a sequence length of 1024 tokens. This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
🚀 Quick Start
Datasets
- sentence-transformers/embedding-training-data
- flax-sentence-embeddings/stackexchange_xml
- snli
- eli5
- search_qa
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- ms_marco
- gooaq
- yahoo_answers_topics
Task Categories
- sentence-similarity
- feature-extraction
- text-retrieval
Tags
- information retrieval
- ir
- documents retrieval
- passage retrieval
- beir
- benchmark
- sts
- semantic search
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
📦 Installation
Download the model and related scripts
git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-6M-pairs
💻 Usage Examples
Basic Usage
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from mosaic_bert import BertModel  # shipped in the cloned model repo

class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super(AutoModelForSentenceEmbedding, self).__init__()
        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        # Average the token embeddings, weighted by the attention mask
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)
sentences = ["This is an example sentence", "Each sentence is converted"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
embeddings = model(**encoded_input)
print(embeddings)
print(embeddings.shape)
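Semantic Search
Because the wrapper normalizes embeddings by default (normalize=True), the dot product of two embeddings equals their cosine similarity. Below is a minimal semantic-search sketch building on the Basic Usage snippet above; the corpus and query strings are illustrative.
corpus = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
]
query = ["A person strums an instrument"]

with torch.no_grad():
    corpus_emb = model(**tokenizer(corpus, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda"))
    query_emb = model(**tokenizer(query, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda"))

# Dot products of unit vectors are cosine similarities
scores = query_emb @ corpus_emb.T
best = scores.argmax(dim=1).item()
print(f"Best match: {corpus[best]} (score={scores[0, best].item():.3f})")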
📚 Documentation
Training
This model has been trained on 6.4M randomly sampled pairs of sentences/paragraphs from the same training set that Sentence Transformers models use. Details of the training set are available here.
The training (along with hyperparameters), inference, and data-loading scripts can all be found in this GitHub repository.
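For orientation, the following is a hedged sketch of the in-batch-negatives contrastive objective typically used when training biencoders on (anchor, positive) pairs; the exact loss and hyperparameters for this model live in the repository above, and scale=20.0 is an illustrative value, not a confirmed setting.
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchor_emb, positive_emb, scale=20.0):
    # anchor_emb, positive_emb: (batch, dim) L2-normalized embeddings.
    # Row i's positive is positive_emb[i]; every other row in the batch
    # serves as a negative for it.
    scores = scale * anchor_emb @ positive_emb.T  # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)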
Evaluations
We ran the model on a few retrieval-based benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval), and the results are available here.