# BGE base Financial Matryoshka
This model, based on sentence-transformers, is fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
## Features
- Maps sentences and paragraphs to a 768-dimensional vector space.
- Suitable for multiple NLP tasks like semantic similarity, search, and classification.
## Installation

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## Usage Examples

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("philschmid/bge-base-financial-matryoshka")

# Encode a few financial questions and statements
sentences = [
    "What was Gilead's total revenue in 2023?",
    "What was the total revenue for the year ended December 31, 2023?",
    "How much was the impairment related to the CAT loan receivable in 2023?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute pairwise similarities (cosine similarity by default for this model)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
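### Truncating Embedding Dimensions

Because the model is evaluated at several embedding sizes (768, 512, 256, 128, and 64; see the tables below), Matryoshka-style truncation can be applied at inference time. The snippet below is a minimal sketch using the `truncate_dim` argument of `SentenceTransformer`; the choice of 256 dimensions is only an example.

```python
from sentence_transformers import SentenceTransformer

# Load the model so that embeddings are truncated to their first 256 dimensions.
# 256 is an illustrative choice; any of the evaluated sizes (768, 512, 256, 128, 64) can be used.
model = SentenceTransformer(
    "philschmid/bge-base-financial-matryoshka",
    truncate_dim=256,
)

embeddings = model.encode([
    "What was Gilead's total revenue in 2023?",
    "What was the total revenue for the year ended December 31, 2023?",
])
print(embeddings.shape)  # (2, 256)

# Similarity still works on the truncated embeddings
print(model.similarity(embeddings, embeddings).shape)  # (2, 2)
```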
## Documentation

### Model Details

#### Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base model | BAAI/bge-base-en-v1.5 |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Language | en |
| License | apache-2.0 |
#### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
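For illustration, the stack above roughly corresponds to the following module composition in Sentence Transformers. This is a hedged sketch of an equivalent architecture built from the base model, not the code used to produce this checkpoint.

```python
from sentence_transformers import SentenceTransformer, models

# Illustrative reconstruction of the architecture shown above:
# a BERT encoder with 512-token inputs, CLS-token pooling, and L2 normalization of the output.
word_embedding = models.Transformer("BAAI/bge-base-en-v1.5", max_seq_length=512)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768
    pooling_mode="cls",
)
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding, pooling, normalize])
```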
## Technical Details

### Evaluation

#### Information Retrieval

The model was evaluated on multiple datasets using the `InformationRetrievalEvaluator`.
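A minimal sketch of how such an evaluation can be run is shown below. The query and corpus contents are illustrative toy data, not the evaluation set behind the numbers reported in the tables that follow.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("philschmid/bge-base-financial-matryoshka")

# Toy retrieval setup: ids mapped to texts (illustrative data only)
queries = {"q1": "What was the total revenue for the year ended December 31, 2023?"}
corpus = {
    "d1": "Total revenue for the year ended December 31, 2023 is discussed in the revenue section of the annual report.",
    "d2": "As of May 31, 2022, FedEx Office had approximately 12,000 employees.",
}
relevant_docs = {"q1": {"d1"}}  # ground-truth mapping from query id to relevant document ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="financial-ir-sketch",
)
results = evaluator(model)
print(results)  # metric names and values, e.g. cosine_accuracy@k, cosine_ndcg@10, cosine_map@100
```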
##### Dataset: `basline_768`

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7086 |
| cosine_accuracy@3 | 0.8514 |
| cosine_accuracy@5 | 0.8843 |
| cosine_accuracy@10 | 0.9271 |
| cosine_precision@1 | 0.7086 |
| cosine_precision@3 | 0.2838 |
| cosine_precision@5 | 0.1769 |
| cosine_precision@10 | 0.0927 |
| cosine_recall@1 | 0.7086 |
| cosine_recall@3 | 0.8514 |
| cosine_recall@5 | 0.8843 |
| cosine_recall@10 | 0.9271 |
| cosine_ndcg@10 | 0.8215 |
| cosine_mrr@10 | 0.7874 |
| cosine_map@100 | 0.7907 |
##### Dataset: `basline_512`

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7114 |
| cosine_accuracy@3 | 0.85 |
| cosine_accuracy@5 | 0.8829 |
| cosine_accuracy@10 | 0.9229 |
| cosine_precision@1 | 0.7114 |
| cosine_precision@3 | 0.2833 |
| cosine_precision@5 | 0.1766 |
| cosine_precision@10 | 0.0923 |
| cosine_recall@1 | 0.7114 |
| cosine_recall@3 | 0.85 |
| cosine_recall@5 | 0.8829 |
| cosine_recall@10 | 0.9229 |
| cosine_ndcg@10 | 0.8209 |
| cosine_mrr@10 | 0.7879 |
| cosine_map@100 | 0.7916 |
##### Dataset: `basline_256`

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7057 |
| cosine_accuracy@3 | 0.8414 |
| cosine_accuracy@5 | 0.88 |
| cosine_accuracy@10 | 0.9229 |
| cosine_precision@1 | 0.7057 |
| cosine_precision@3 | 0.2805 |
| cosine_precision@5 | 0.176 |
| cosine_precision@10 | 0.0923 |
| cosine_recall@1 | 0.7057 |
| cosine_recall@3 | 0.8414 |
| cosine_recall@5 | 0.88 |
| cosine_recall@10 | 0.9229 |
| cosine_ndcg@10 | 0.8162 |
| cosine_mrr@10 | 0.7818 |
| cosine_map@100 | 0.7854 |
##### Dataset: `basline_128`

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7029 |
| cosine_accuracy@3 | 0.8343 |
| cosine_accuracy@5 | 0.8743 |
| cosine_accuracy@10 | 0.9171 |
| cosine_precision@1 | 0.7029 |
| cosine_precision@3 | 0.2781 |
| cosine_precision@5 | 0.1749 |
| cosine_precision@10 | 0.0917 |
| cosine_recall@1 | 0.7029 |
| cosine_recall@3 | 0.8343 |
| cosine_recall@5 | 0.8743 |
| cosine_recall@10 | 0.9171 |
| cosine_ndcg@10 | 0.8109 |
| cosine_mrr@10 | 0.7769 |
| cosine_map@100 | 0.7803 |
##### Dataset: `basline_64`

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.6729 |
| cosine_accuracy@3 | 0.8171 |
| cosine_accuracy@5 | 0.8614 |
| cosine_accuracy@10 | 0.9014 |
| cosine_precision@1 | 0.6729 |
| cosine_precision@3 | 0.2724 |
| cosine_precision@5 | 0.1723 |
| cosine_precision@10 | 0.0901 |
| cosine_recall@1 | 0.6729 |
| cosine_recall@3 | 0.8171 |
| cosine_recall@5 | 0.8614 |
| cosine_recall@10 | 0.9014 |
| cosine_ndcg@10 | 0.79 |
| cosine_mrr@10 | 0.754 |
| cosine_map@100 | 0.7582 |
### Training Details

#### Training Dataset

##### Unnamed Dataset

- Size: 6,300 training samples
- Columns: `positive` and `anchor`
- Approximate statistics based on the first 1000 samples:

| | positive | anchor |
|---|---|---|
| type | string | string |
| details | min: 10 tokens, mean: 46.11 tokens, max: 289 tokens | min: 7 tokens, mean: 20.26 tokens, max: 43 tokens |

- Samples (a fine-tuning sketch based on such pairs follows the table):

| positive | anchor |
|---|---|
| Fiscal 2023 total gross profit margin of 35.1% represents an increase of 1.7 percentage points as compared to the respective prior year period. | What was the total gross profit margin for Hewlett Packard Enterprise in fiscal 2023? |
| Noninterest expense increased to $65.8 billion in 2023, primarily due to higher investments in people and technology and higher FDIC expense, including $2.1 billion for the estimated special assessment amount arising from the closure of Silicon Valley Bank and Signature Bank. | What was the total noninterest expense for the company in 2023? |
| As of May 31, 2022, FedEx Office had approximately 12,000 employees. | How many employees did FedEx Office have as of May 31, 2022? |
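The exact training procedure is not documented in this card. As a hedged illustration only, an `anchor`/`positive` dataset like the samples above is commonly fine-tuned with a Matryoshka-style objective; the sketch below assumes `MatryoshkaLoss` wrapping `MultipleNegativesRankingLoss` and uses illustrative hyperparameters, and it is not the exact recipe used for this checkpoint.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Start from the base model this checkpoint was fine-tuned from
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Tiny stand-in dataset; the real training set has 6,300 pairs.
# Column order matters for MultipleNegativesRankingLoss: anchor first, then positive.
train_dataset = Dataset.from_dict({
    "anchor": [
        "What was the total gross profit margin for Hewlett Packard Enterprise in fiscal 2023?",
    ],
    "positive": [
        "Fiscal 2023 total gross profit margin of 35.1% represents an increase of 1.7 percentage points as compared to the respective prior year period.",
    ],
})

# Matryoshka objective: apply the ranking loss at each of the evaluated embedding sizes
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

# Illustrative hyperparameters only
args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-financial-matryoshka",
    num_train_epochs=1,
    per_device_train_batch_size=32,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```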
## License

This model is released under the apache-2.0 license.