# 🚀 SentenceTransformer based on intfloat/multilingual-e5-large

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## 📚 Documentation

## ✨ Features
- Model Type: Sentence Transformer
- Base model: intfloat/multilingual-e5-large
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
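
These properties can be checked directly on the loaded model. A minimal sketch using standard Sentence Transformers attributes (requires the installation step below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("meandyou200175/e5_large_finetune_word")

# Both values should match the configuration listed above.
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024
```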
## 📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("meandyou200175/e5_large_finetune_word")

# Run inference
sentences = [
    'A long appendage protruding from the lower back. Often covered in fur or scales. A common feature of animal girls.',
    'tail',
    'stomach day',
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 1024)

# Get the similarity scores between all pairs of embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
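
The same API supports a simple semantic-search flow. The following is a minimal sketch; the corpus and query strings are illustrative stand-ins:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("meandyou200175/e5_large_finetune_word")

# Illustrative corpus of short tags, with a free-text description as the query.
corpus = ["tail", "stomach day", "computer"]
query = "A machine that manipulates data according to a list of instructions."

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

# Cosine similarity between the query and every corpus entry.
scores = model.similarity(query_embedding, corpus_embeddings)  # shape: (1, 3)
best = int(scores.argmax())
print(corpus[best], float(scores[0, best]))  # expected best match: "computer"
```

Note that the base E5 models were trained with `query: ` / `passage: ` prefixes; this card's own example encodes raw text without them, so the sketch above does the same.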
## 📊 Evaluation

### Information Retrieval

Evaluated with the `InformationRetrievalEvaluator`.
| Metric               | Value  |
|:---------------------|:-------|
| cosine_accuracy@1    | 0.9073 |
| cosine_accuracy@2    | 0.9739 |
| cosine_accuracy@5    | 0.9942 |
| cosine_accuracy@10   | 0.999  |
| cosine_accuracy@100  | 1.0    |
| cosine_precision@1   | 0.9073 |
| cosine_precision@2   | 0.487  |
| cosine_precision@5   | 0.1988 |
| cosine_precision@10  | 0.0999 |
| cosine_precision@100 | 0.01   |
| cosine_recall@1      | 0.9073 |
| cosine_recall@2      | 0.9739 |
| cosine_recall@5      | 0.9942 |
| cosine_recall@10     | 0.999  |
| cosine_recall@100    | 1.0    |
| cosine_ndcg@10       | 0.9602 |
| cosine_mrr@1         | 0.9073 |
| cosine_mrr@2         | 0.9406 |
| cosine_mrr@5         | 0.9463 |
| cosine_mrr@10        | 0.947  |
| cosine_mrr@100       | 0.9471 |
| cosine_map@100       | 0.9471 |
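
As a hedged sketch of how such an evaluation is wired up, the toy queries and corpus below are stand-ins for the real word_embedding evaluation split described in the next section:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("meandyou200175/e5_large_finetune_word")

# Toy stand-ins; the real split maps each query description to its tag.
queries = {"q1": "A long appendage protruding from the lower back."}
corpus = {"d1": "tail", "d2": "stomach day", "d3": "computer"}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="word_embedding-dev",  # hypothetical name
)
results = evaluator(model)  # dict of accuracy@k, precision@k, recall@k, NDCG, MRR, MAP
print(results)
```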
## 🔧 Technical Details

### Training Dataset

#### Unnamed Dataset

### Evaluation Dataset

#### word_embedding
- Dataset: word_embedding at af76b11
- Size: 1,036 evaluation samples
- Columns: `query` and `positive`
- Approximate statistics based on the first 1000 samples:

  |         | query                                              | positive                                         |
  |:--------|:---------------------------------------------------|:--------------------------------------------------|
  | type    | string                                              | string                                             |
  | details | min: 4 tokens, mean: 35.89 tokens, max: 164 tokens  | min: 3 tokens, mean: 5.38 tokens, max: 14 tokens   |
- Samples:

  | query | positive |
  |:------|:---------|
  | A machine that manipulates data according to a list of instructions. The ability to store and execute lists of instructions called programs make computers extremely versatile. On Danbooru's images they are most often used for drawing, playing games and accessing the internet. | computer |
  | A playing card with two clubs. | two of clubs |
  | Yebisu (ヱビス, Ebisu) is a beer produced by Sapporo Breweries. It is one of the most popular beers in Japan. | Yebisu beer |
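
The card does not state the training objective. For (query, positive) pairs like these, `MultipleNegativesRankingLoss` with the `SentenceTransformerTrainer` is a common setup; the sketch below is an assumption-labeled illustration, not the author's recorded recipe:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Start from the base model named at the top of this card.
model = SentenceTransformer("intfloat/multilingual-e5-large")

# Illustrative rows mirroring the (query, positive) samples above.
train_dataset = Dataset.from_dict({
    "query": ["A playing card with two clubs."],
    "positive": ["two of clubs"],
})

# ASSUMPTION: the actual loss is not documented; MNRL is typical for pair data.
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```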
### Model Sources

### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
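
The three modules above mean `encode` runs the XLM-R encoder, mean-pools token states under the attention mask, and L2-normalizes the result. A sketch reproducing this by hand with plain transformers (assuming the repo stores standard Hugging Face weights, as sentence-transformers repos do):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meandyou200175/e5_large_finetune_word")
encoder = AutoModel.from_pretrained("meandyou200175/e5_large_finetune_word")

batch = tokenizer(["tail"], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, 1024)

# (1) Pooling: attention-masked mean over token embeddings.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# (2) Normalize: unit-length vectors, so dot product equals cosine similarity.
embedding = F.normalize(pooled, p=2, dim=1)
print(embedding.shape)  # torch.Size([1, 1024])
```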