🚀 SentenceTransformer based on sentence-transformers/paraphrase-MiniLM-L6-v2
This SentenceTransformer model is fine-tuned from sentence-transformers/paraphrase-MiniLM-L6-v2 on the en-pt-br, en-es, and en-pt datasets. It maps sentences and paragraphs to a 384-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
✨ Features
- Maps sentences and paragraphs to a 384-dimensional dense vector space.
- Applicable for various NLP tasks such as semantic textual similarity, semantic search, etc.
📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("jvanhoof/all-MiniLM-L6-multilingual-v2-en-es-pt-pt-br")

# Run inference on a mix of English and Portuguese sentences
sentences = [
    'We now call this place home.',
    'Moramos ali. Agora é aqui a nossa casa.',
    'É mais fácil do que se possa imaginar.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
```
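With three input sentences, the embeddings have shape (3, 384) and the similarity matrix has shape (3, 3); `model.similarity` uses the cosine similarity listed under Model Details below.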
📚 Documentation
Model Details
Model Description
| Property | Details |
| --- | --- |
| Model Type | Sentence Transformer |
| Base model | sentence-transformers/paraphrase-MiniLM-L6-v2 |
| Maximum Sequence Length | 128 tokens |
| Output Dimensionality | 384 dimensions |
| Similarity Function | Cosine Similarity |
| Training Datasets | en-pt-br, en-es, en-pt |
| Languages | en, multilingual, es, pt |
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
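The three modules are a BERT encoder, mean pooling over non-padding tokens, and L2 normalization (so dot products between embeddings equal cosine similarities). As a rough illustration only, the sketch below reproduces that pipeline with the plain transformers API; loading the checkpoint via AutoModel and these exact pooling details are assumptions of the example, not statements from this card.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "jvanhoof/all-MiniLM-L6-multilingual-v2-en-es-pt-pt-br"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

batch = tokenizer(
    ["We now call this place home."],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 384)

# (1) Pooling: average the real (non-padding) tokens, per pooling_mode_mean_tokens
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# (2) Normalize: unit-length vectors, so dot product == cosine similarity
sentence_embeddings = F.normalize(pooled, p=2, dim=1)  # (batch, 384)
```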
Evaluation
Metrics
Knowledge Distillation
- Datasets: en-pt-br, en-es, and en-pt
- Evaluated with MSEEvaluator
| Metric | en-pt-br | en-es | en-pt |
| --- | --- | --- | --- |
| negative_mse | -4.0617 | -4.2473 | -4.2555 |
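Here negative_mse is the negated mean squared error between the teacher's embeddings of the English sentences and this model's embeddings of the translations, so values closer to zero are better. A minimal sketch of the computation, assuming `teacher` and `student` are loaded SentenceTransformer models and the ×100 scaling that MSEEvaluator applies when reporting:

```python
import numpy as np

def negative_mse(teacher, student, english_sentences, translated_sentences):
    # The teacher encodes the English source; the student is scored on how
    # closely it reproduces those vectors from the translated sentences.
    teacher_emb = teacher.encode(english_sentences)
    student_emb = student.encode(translated_sentences)
    # Negated so higher is better; scaled by 100 as in the table above.
    return -float(np.mean((teacher_emb - student_emb) ** 2)) * 100
```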
Translation
| Metric | en-pt-br | en-es | en-pt |
| --- | --- | --- | --- |
| src2trg_accuracy | 0.9859 | 0.908 | 0.8951 |
| trg2src_accuracy | 0.9808 | 0.898 | 0.8824 |
| mean_accuracy | 0.9834 | 0.903 | 0.8888 |
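src2trg_accuracy is the fraction of English sentences whose nearest neighbour (by cosine similarity) among all candidate translations is the correct one; trg2src_accuracy is the reverse direction, and mean_accuracy averages the two. A sketch in the spirit of sentence-transformers' TranslationEvaluator, with illustrative function and argument names:

```python
import numpy as np

def translation_accuracy(model, src_sentences, trg_sentences):
    # Unit-normalized embeddings make the dot product a cosine similarity.
    src = model.encode(src_sentences, normalize_embeddings=True)
    trg = model.encode(trg_sentences, normalize_embeddings=True)
    sims = src @ trg.T  # (n, n) pairwise cosine similarities

    # A retrieval is correct when the best match sits on the diagonal.
    src2trg = float((sims.argmax(axis=1) == np.arange(len(src))).mean())
    trg2src = float((sims.argmax(axis=0) == np.arange(len(trg))).mean())
    return src2trg, trg2src, (src2trg + trg2src) / 2
```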
Semantic Similarity
| Metric | Value |
| --- | --- |
| pearson_cosine | 0.7714 |
| spearman_cosine | 0.7862 |
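pearson_cosine and spearman_cosine are the Pearson and Spearman correlations between the cosine similarity of each embedding pair and a gold similarity score, as computed by sentence-transformers' EmbeddingSimilarityEvaluator; the evaluation set behind the numbers above is not identified on this card. A hedged sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_correlations(model, sentences1, sentences2, gold_scores):
    emb1 = model.encode(sentences1, normalize_embeddings=True)
    emb2 = model.encode(sentences2, normalize_embeddings=True)
    cosine_scores = np.sum(emb1 * emb2, axis=1)  # row-wise cosine similarity
    return pearsonr(gold_scores, cosine_scores)[0], spearmanr(gold_scores, cosine_scores)[0]
```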
Training Details
Training Datasets
en-pt-br
- Dataset: en-pt-br at 0c70bc6
- Size: 405,807 training samples
- Columns: english, non_english, and label
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
| --- | --- | --- | --- |
| type | string | string | list |
| details | min: 4 tokens, mean: 23.98 tokens, max: 128 tokens | min: 6 tokens, mean: 36.86 tokens, max: 128 tokens | |
- Samples:

| english | non_english | label |
| --- | --- | --- |
| And then there are certain conceptual things that can also benefit from hand calculating, but I think they're relatively small in number. | E também existem alguns aspectos conceituais que também podem se beneficiar do cálculo manual, mas eu acho que eles são relativamente poucos. | [-0.2655501961708069, 0.2715710997581482, 0.13977409899234772, 0.007375418208539486, -0.09395705163478851, ...] |
| One thing I often ask about is ancient Greek and how this relates. | Uma coisa sobre a qual eu pergunto com frequencia é grego antigo e como ele se relaciona a isto. | [0.34961527585983276, -0.01806497573852539, 0.06103038787841797, 0.11750973761081696, -0.34720802307128906, ...] |
| See, the thing we're doing right now is we're forcing people to learn mathematics. | Vejam, o que estamos fazendo agora, é que estamos forçando as pessoas a aprender matemática. | [0.031645823270082474, -0.1787087768316269, -0.30170342326164246, 0.1304805874824524, -0.29176947474479675, ...] |
- Loss: MSELoss (see the sketch below)
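The label column stores the teacher's 384-dimensional embedding of the English sentence, and MSELoss trains the student to emit that same vector for both the English text and its translation. The sketch below shows distillation with the sentence-transformers v3 trainer; the base checkpoint, the `train_dataset` variable, and the hyperparameter defaults are illustrative assumptions, not the exact recipe used for this model:

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MSELoss

# Student starts from the base model listed under Model Details.
student = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

# MSELoss regresses the student's embeddings onto the precomputed teacher
# vectors stored in the dataset's "label" column.
loss = MSELoss(model=student)

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=train_dataset,  # columns: english, non_english, label (assumed preloaded)
    loss=loss,
)
trainer.train()
```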
en-es
- Dataset: en-es
- Size: 6,889,042 training samples
- Columns: english, non_english, and label
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
| --- | --- | --- | --- |
| type | string | string | list |
| details | min: 4 tokens, mean: 24.04 tokens, max: 128 tokens | min: 5 tokens, mean: 35.11 tokens, max: 128 tokens | |
- Samples:

| english | non_english | label |
| --- | --- | --- |
| And then there are certain conceptual things that can also benefit from hand calculating, but I think they're relatively small in number. | Y luego hay ciertas aspectos conceptuales que pueden beneficiarse del cálculo a mano pero creo que son relativamente pocos. | [-0.2655501961708069, 0.2715710997581482, 0.13977409899234772, 0.007375418208539486, -0.09395705163478851, ...] |
| One thing I often ask about is ancient Greek and how this relates. | Algo que pregunto a menudo es sobre el griego antiguo y cómo se relaciona. | [0.34961527585983276, -0.01806497573852539, 0.06103038787841797, 0.11750973761081696, -0.34720802307128906, ...] |
| See, the thing we're doing right now is we're forcing people to learn mathematics. | Vean, lo que estamos haciendo ahora es forzar a la gente a aprender matemáticas. | [0.031645823270082474, -0.1787087768316269, -0.30170342326164246, 0.1304805874824524, -0.29176947474479675, ...] |
- Loss: MSELoss
en-pt
- Dataset: en-pt
- Size: 6,636,095 training samples
- Columns: english, non_english, and label
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
| --- | --- | --- | --- |
| type | string | string | list |
| details | min: 4 tokens, mean: 23.5 tokens, max: 128 tokens | min: 5 tokens, mean: 35.23 tokens, max: 128 tokens | |
- Samples:

| english | non_english | label |
| --- | --- | --- |
| And the country that does this first will, in my view, leapfrog others in achieving a new economy even, an improved economy, an improved outlook. | E o país que fizer isto primeiro vai, na minha opinião, ultrapassar outros em alcançar uma nova economia até uma economia melhorada, uma visão melhorada. | [-0.13...] |