stackoverflow_mpnet-base Open-source Model - Free for Semantic Search and Sentence Similarity Calculation

Stackoverflow Mpnet Base

Developed by flax-sentence-embeddings

A sentence embedding model trained on StackOverflow data based on Microsoft's mpnet-base model, suitable for semantic search and sentence similarity calculation

Text Embedding

PyTorch

#StackOverflow Semantic Encoding #Q&A Pair Optimization #Contrastive Learning Fine-tuning

Downloads 35

Release Time : 3/2/2022

Model Overview

This sentence embedding model is based on Microsoft's mpnet-base and was trained on 18,562,443 StackOverflow (title, body) pairs. It generates vector representations that capture the semantic content of a sentence.

Model Features

Large-scale StackOverflow Data Training

Trained on 18,562,443 pairs of StackOverflow (title, body) data, optimized for technical Q&A scenarios

Efficient TPU Training

Trained on a TPU v3-8; the wider project had access to 7 TPU v3-8 accelerators and support from Google's Flax, JAX, and Cloud teams

Contrastive Learning Optimization

Utilizes a Siamese network architecture and contrastive learning objectives to enhance sentence embedding quality

Model Capabilities

Sentence Embedding Generation

Semantic Similarity Calculation

Text Feature Extraction

Semantic Search

Text Clustering

Use Cases

Technical Q&A Systems

StackOverflow Question Matching

Matching user questions with existing questions based on similarity

Improves question retrieval accuracy

Technical Document Retrieval

Retrieving relevant technical documents based on user queries

Enhances document search efficiency

Information Retrieval

Semantic Search

Search system based on semantic matching rather than keyword matching

Provides more relevant search results

🚀 stackoverflow_mpnet-base

This is a model based on microsoft/mpnet-base, trained on 18,562,443 (title, body) pairs from StackOverflow, which can be used for semantic search, clustering, and sentence similarity tasks.

🚀 Quick Start

This model is designed to serve as a sentence encoder for search engines. Given an input sentence, it outputs a vector that captures the semantic information of the sentence. This sentence vector can be used for semantic search, clustering, or sentence similarity tasks.

✨ Features

Trained on StackOverflow Data: Utilizes 18,562,443 (title, body) pairs from StackOverflow, making it suitable for tasks related to programming Q&A.
Sentence Encoding: Outputs vectors that capture sentence semantic information, useful for semantic search, clustering, and sentence similarity tasks.
Based on mpnet-base: Built upon the microsoft/mpnet-base model, benefiting from its pre-trained knowledge.

📦 Installation

To use this model, you need to install the SentenceTransformers library. You can install it via pip:

pip install sentence-transformers

💻 Usage Examples

Basic Usage

Here is how to use this model to get the features of a given text using the SentenceTransformers library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/stackoverflow_mpnet-base')
text = "Replace me by any question / answer you'd like."
text_embedding = model.encode(text)
# array([-0.01559514,  0.04046123,  0.1317083 ,  0.00085931,  0.04585106,
#        -0.05607086,  0.0138078 ,  0.03569756,  0.01420381,  0.04266302 ...],
#        dtype=float32)
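Embeddings like the one above are typically compared against each other to rank candidate texts. The following is a minimal sketch, assuming cosine similarity as the scoring function (the same measure the fine-tuning objective is built on). The helper names `rank_by_cosine` and `search_questions` and the example questions are illustrative assumptions, not part of the model or the SentenceTransformers API.

```python
import numpy as np


def rank_by_cosine(query_vec, corpus_vecs):
    """Return (indices, scores) of corpus rows, best cosine match first."""
    query = np.asarray(query_vec, dtype=np.float32)
    corpus = np.asarray(corpus_vecs, dtype=np.float32)
    # Normalise to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    order = np.argsort(-scores)  # descending similarity
    return order, scores[order]


def search_questions(query, questions):
    """Hypothetical helper: embed the texts with the model, then rank them."""
    # Imported lazily so rank_by_cosine stays usable without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('flax-sentence-embeddings/stackoverflow_mpnet-base')
    order, scores = rank_by_cosine(model.encode(query), model.encode(questions))
    return [(questions[i], float(s)) for i, s in zip(order, scores)]
```

Calling `search_questions("python reverse array order", ["How do I reverse a list in Python?", "How can I sort a dictionary by value?"])` would download the model on first use and return the questions paired with their scores, most similar first.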

📚 Documentation

Training Procedure

Pre-training

We use the pretrained microsoft/mpnet-base. Please refer to its model card for more detailed information about the pre-training procedure.

Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for every possible sentence pair in the batch, then apply a cross-entropy loss that compares these similarities against the true pairs.
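The objective described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the project's training code: the function name `in_batch_contrastive_loss` is hypothetical, and the `scale` multiplier applied to the similarities is an assumption, since the text does not state one.

```python
import numpy as np


def in_batch_contrastive_loss(title_emb, body_emb, scale=20.0):
    """Cross-entropy over cosine similarities; row i's true pair is column i.

    `scale` is an assumed multiplier on the similarities (not given in the text).
    """
    a = np.asarray(title_emb, dtype=np.float64)
    b = np.asarray(body_emb, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # logits[i, j] = scaled cosine similarity between title i and body j.
    logits = scale * (a @ b.T)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The true (title, body) pairs lie on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: correctly aligned pairs should give a lower loss than misaligned ones.
rng = np.random.default_rng(0)
titles = rng.normal(size=(4, 8))
aligned_loss = in_batch_contrastive_loss(titles, titles)
misaligned_loss = in_batch_contrastive_loss(titles, titles[::-1])
```

Every other body in the batch acts as a negative for a given title, which is why a larger batch gives the model more negatives to separate from each true pair.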

Hyperparameters

We trained the model on a TPU v3-8 for 80k steps using a batch size of 1024 (128 per TPU core), with a learning rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with a 2e-5 learning rate. The full training script is available in the model's repository.
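To make these numbers concrete, the sketch below works through the arithmetic they imply and shows one possible warm-up schedule. Two points are assumptions rather than facts from the text: that the warm-up of 500 is counted in steps and is linear, and that the learning rate stays constant afterwards. The pass count also assumes every step drew a full batch from this dataset alone.

```python
# Values taken from the hyperparameter description above.
TOTAL_STEPS = 80_000
GLOBAL_BATCH = 1024
PER_CORE_BATCH = 128
WARMUP_STEPS = 500           # assumed to be counted in optimizer steps
PEAK_LR = 2e-5
NUM_TRAIN_PAIRS = 18_562_443

cores = GLOBAL_BATCH // PER_CORE_BATCH    # 8 cores, matching one TPU v3-8
pairs_seen = TOTAL_STEPS * GLOBAL_BATCH   # 81,920,000 pairs processed
passes = pairs_seen / NUM_TRAIN_PAIRS     # roughly 4.4 passes over the data


def learning_rate(step):
    """Linear warm-up to PEAK_LR, then constant (both shapes are assumptions)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR
```

Under these assumptions the model saw each (title, body) pair about four times during fine-tuning.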

Training data

We used 18,562,443 (title, body) pairs from StackOverflow as training data.

Property	Details
Model Type	Based on microsoft/mpnet-base
Training Data	18,562,443 (title, body) pairs from StackOverflow

Development Background

We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. The project benefited from efficient hardware infrastructure (7 TPU v3-8s) and from guidance on efficient deep learning frameworks provided by members of Google's Flax, JAX, and Cloud teams.
