Bert-MLM_arXiv-MP-class_zbMath

Developed by math-similarity

A sentence-transformers model designed for computing the similarity of short mathematical texts. It maps sentences and paragraphs into a 768-dimensional dense vector space.

Tags: Text Embedding, Transformers, Mathematical Text Similarity, Short Text Vectorization, Academic Paper Matching

Downloads: 415

Release Time: 5/18/2024

Model Overview

This model processes text in the mathematical domain. It is particularly suited to computing the semantic similarity of short texts such as abstracts of mathematical papers and theorem descriptions, and can be used for tasks like clustering or semantic search.

Model Features

Specialized for Mathematical Texts

Optimized specifically for texts in the mathematical domain, effectively handling short texts containing mathematical formulas and terminology.

High-Dimensional Semantic Encoding

Maps text into a 768-dimensional dense vector space, capturing deep semantic relationships.

Sentence Transformer Compatibility

Built on the sentence-transformers framework, so it is easy to integrate into existing NLP workflows.

Model Capabilities

Mathematical Text Similarity Calculation

Semantic Vector Generation

Short Text Clustering

Academic Literature Retrieval

Use Cases

Academic Research

Mathematical Paper Similarity Search

Search for papers similar to a given abstract in mathematical literature databases.

Improves the accuracy of relevant literature retrieval.

Theorem Classification

Automatic classification based on the semantic similarity of theorem descriptions.

Assists in the construction of mathematical knowledge bases.

Educational Technology

Exercise Similarity Matching

Match similar mathematical problems on educational platforms.

Supports personalized learning recommendations.

🚀 Bert-MLM_arXiv-MP-class_zbMath

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It is designed for computing similarities of short mathematical texts and can be used for tasks such as clustering or semantic search.

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Specifically designed for computing similarities of short mathematical texts.
Can be used for tasks like clustering or semantic search.

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

model = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
embeddings = model.encode(sentences)
print(embeddings)
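
The embeddings are usually compared with cosine similarity. The following is a minimal sketch of that comparison in plain Python, not part of the original model card. The helper name cosine_similarity and the toy 3-dimensional vectors are assumptions made for illustration; with the real model one would pass two rows of the embeddings array instead. The sentence-transformers package also provides util.cos_sim for the same purpose.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Toy stand-ins for the 768-dimensional embeddings (hypothetical values).
# With the real model: cosine_similarity(embeddings[0].tolist(), embeddings[1].tolist())
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [3.0, 0.0, -1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # close to 0: orthogonal
```

Scores near 1 indicate texts the model considers similar, and scores near 0 indicate unrelated texts.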

Advanced Usage

Without sentence-transformers, you can use the model like this: first pass the input through the transformer model, then apply the appropriate pooling operation (here, mean pooling) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    # Expand the mask to the embedding dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over the token axis and divide by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
model = AutoModel.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
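
To see why the attention mask matters in the pooling step, here is a small illustration in plain Python with made-up numbers. The function masked_mean and the toy token vectors are assumptions for this sketch; it mirrors the logic of mean_pooling for a single sentence and is not a replacement for the torch version above.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                sums[i] += x
    denom = max(count, 1e-9)  # same role as the clamp: avoid division by zero
    return [s / denom for s in sums]


# One "sentence" with two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]

print(pooled)  # average over the two real tokens only
print(naive)   # unmasked average, distorted by the padding vector
```

Because sentences in a batch are padded to a common length, an unmasked average would mix padding positions into the sentence embedding.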

📚 Documentation

Intended uses

Our model is intended to be used as a sentence and short paragraph encoder for mathematical texts. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks. By default, input text longer than 256 word pieces is truncated.
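
As an illustration of the retrieval use mentioned above, the following sketch ranks a small corpus of precomputed vectors by cosine similarity to a query vector. The function top_k and the 2-dimensional toy vectors are hypothetical; in practice the vectors would come from model.encode, and a dedicated vector index may be preferable for large collections.

```python
import math


def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    scored = [(i, cos(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Toy corpus and query (hypothetical values standing in for real embeddings).
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.1]

results = top_k(query, corpus, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```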

Training procedure

Domain adaptation: We start from the domain-adapted math-similarity/Bert-MLM_arXiv model. Please refer to its model card for more detailed information about the domain-adaptation procedure.
Pooling: We add a mean-pooling layer on top of the domain-adapted model.
Fine-tuning: We fine-tune the model using a cosine-similarity objective. Formally, it computes the vectors u = model(sentence_A) and v = model(sentence_B) and measures the cosine similarity between the two. By default, it minimizes the following loss: ||input_label - cos_score_transformation(cosine_sim(u,v))||_2, with MSE as the loss function. We use title pairs from zbMath as the fine-tuning dataset and model semantic similarity with their MSC codes. Two titles are defined as similar if they share their primary MSC₅ and another secondary MSC₅. Otherwise, they are defined as semantically dissimilar. The training set contains 351,472 title pairs and the evaluation set contains 43,935 pairs. See the training notebook for more information. Unfortunately, we cannot include a dataset with titles due to licensing issues. However, we have created a dataset that contains only the respective zbMath identifiers (also known as "an") with primary and secondary MSC classification but without titles. It is available as datasets/math-similarity/class-zbmath-identifier.
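
The labeling rule and the loss described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: pair_label encodes one reading of the rule (identical primary code and at least one shared secondary code), cosine_mse_loss assumes the identity as score transformation, and the function names, MSC codes, and vectors are chosen for the example only.

```python
import math


def pair_label(primary_a, secondary_a, primary_b, secondary_b):
    """1.0 if both titles share the primary code and at least one secondary code, else 0.0."""
    same_primary = primary_a == primary_b
    shared_secondary = bool(set(secondary_a) & set(secondary_b))
    return 1.0 if same_primary and shared_secondary else 0.0


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_mse_loss(pairs):
    """Mean squared error between each label and cos(u, v) (identity transformation assumed)."""
    errors = [(label - cosine(u, v)) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)


# Illustrative MSC codes; the pairing of codes here is made up for the example.
label_pos = pair_label("11M06", ["11M26", "11N05"], "11M06", ["11N05"])
label_neg = pair_label("11M06", ["11M26"], "05A15", ["11M26"])

# Toy (u, v, label) triples standing in for embedded title pairs.
pairs = [
    ([1.0, 0.0], [1.0, 0.0], label_pos),  # identical direction, labeled similar
    ([1.0, 0.0], [0.0, 1.0], label_neg),  # orthogonal, labeled dissimilar
    ([1.0, 0.0], [0.0, 1.0], 1.0),        # orthogonal but labeled similar: contributes error
]

loss = cosine_mse_loss(pairs)
print(label_pos, label_neg, loss)
```

Only the third pair contributes to the loss here, since its embeddings disagree with its label; fine-tuning pushes such pairs closer together in the vector space.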

🔧 Technical Details

The model is a sentence and short paragraph encoder for mathematical texts. It builds on the domain-adapted math-similarity/Bert-MLM_arXiv model, adds a mean-pooling layer on top, and is fine-tuned with a cosine-similarity objective. The fine-tuning data consists of title pairs from zbMath: 351,472 pairs for training and 43,935 pairs for evaluation.

📄 License

This model is an additional resource for the CICM'24 submission "On modelling similarity of short mathematical texts".
