🚀 SentenceTransformer based on sentence-transformers/paraphrase-MiniLM-L6-v2
This SentenceTransformer model is fine-tuned from sentence-transformers/paraphrase-MiniLM-L6-v2 on the en-pt-br, en-es, and en-pt datasets. It maps sentences and paragraphs to a 384-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
✨ Features
- Maps sentences and paragraphs to a 384-dimensional dense vector space.
- Applicable for various NLP tasks such as semantic textual similarity, semantic search, etc.
📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("jvanhoof/all-MiniLM-L6-multilingual-v2-en-es-pt-pt-br")

# Run inference on a mix of English and Portuguese sentences
sentences = [
    'We now call this place home.',
    'Moramos ali. Agora é aqui a nossa casa.',
    'É mais fácil do que se possa imaginar.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
```
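With three input sentences, the embeddings have shape (3, 384) and the similarity matrix has shape (3, 3); `model.similarity` uses the cosine similarity listed under Model Details below.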
📚 Documentation
Model Details
Model Description
| Property | Details |
| --- | --- |
| Model Type | Sentence Transformer |
| Base model | sentence-transformers/paraphrase-MiniLM-L6-v2 |
| Maximum Sequence Length | 128 tokens |
| Output Dimensionality | 384 dimensions |
| Similarity Function | Cosine Similarity |
| Training Datasets | en-pt-br, en-es, en-pt |
| Languages | en, multilingual, es, pt |
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
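The three modules are a BERT encoder, mean pooling over non-padding tokens, and L2 normalization (so dot products between embeddings equal cosine similarities). As a rough illustration only, the sketch below reproduces that pipeline with the plain transformers API; loading the checkpoint via AutoModel and these exact pooling details are assumptions of the example, not statements from this card.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "jvanhoof/all-MiniLM-L6-multilingual-v2-en-es-pt-pt-br"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

batch = tokenizer(
    ["We now call this place home."],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 384)

# (1) Pooling: average the real (non-padding) tokens, per pooling_mode_mean_tokens
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# (2) Normalize: unit-length vectors, so dot product == cosine similarity
sentence_embeddings = F.normalize(pooled, p=2, dim=1)  # (batch, 384)
```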
Evaluation
Metrics
Knowledge Distillation
- Datasets: en-pt-br, en-es, and en-pt
- Evaluated with MSEEvaluator
| Metric | en-pt-br | en-es | en-pt |
| --- | --- | --- | --- |
| negative_mse | -4.0617 | -4.2473 | -4.2555 |
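Here negative_mse is the negated mean squared error between the teacher's embeddings of the English sentences and this model's embeddings of the translations, so values closer to zero are better. A minimal sketch of the computation, assuming `teacher` and `student` are loaded SentenceTransformer models and the ×100 scaling that MSEEvaluator applies when reporting:

```python
import numpy as np

def negative_mse(teacher, student, english_sentences, translated_sentences):
    # The teacher encodes the English source; the student is scored on how
    # closely it reproduces those vectors from the translated sentences.
    teacher_emb = teacher.encode(english_sentences)
    student_emb = student.encode(translated_sentences)
    # Negated so higher is better; scaled by 100 as in the table above.
    return -float(np.mean((teacher_emb - student_emb) ** 2)) * 100
```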
Translation
| Metric | en-pt-br | en-es | en-pt |
| --- | --- | --- | --- |
| src2trg_accuracy | 0.9859 | 0.908 | 0.8951 |
| trg2src_accuracy | 0.9808 | 0.898 | 0.8824 |
| mean_accuracy | 0.9834 | 0.903 | 0.8888 |
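src2trg_accuracy is the fraction of English sentences whose nearest neighbour (by cosine similarity) among all candidate translations is the correct one; trg2src_accuracy is the reverse direction, and mean_accuracy averages the two. A sketch in the spirit of sentence-transformers' TranslationEvaluator, with illustrative function and argument names:

```python
import numpy as np

def translation_accuracy(model, src_sentences, trg_sentences):
    # Unit-normalized embeddings make the dot product a cosine similarity.
    src = model.encode(src_sentences, normalize_embeddings=True)
    trg = model.encode(trg_sentences, normalize_embeddings=True)
    sims = src @ trg.T  # (n, n) pairwise cosine similarities

    # A retrieval is correct when the best match sits on the diagonal.
    src2trg = float((sims.argmax(axis=1) == np.arange(len(src))).mean())
    trg2src = float((sims.argmax(axis=0) == np.arange(len(trg))).mean())
    return src2trg, trg2src, (src2trg + trg2src) / 2
```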
Semantic Similarity
| Metric | Value |
| --- | --- |
| pearson_cosine | 0.7714 |
| spearman_cosine | 0.7862 |
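pearson_cosine and spearman_cosine are the Pearson and Spearman correlations between the cosine similarity of each embedding pair and a gold similarity score, as computed by sentence-transformers' EmbeddingSimilarityEvaluator; the evaluation set behind the numbers above is not identified on this card. A hedged sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_correlations(model, sentences1, sentences2, gold_scores):
    emb1 = model.encode(sentences1, normalize_embeddings=True)
    emb2 = model.encode(sentences2, normalize_embeddings=True)
    cosine_scores = np.sum(emb1 * emb2, axis=1)  # row-wise cosine similarity
    return pearsonr(gold_scores, cosine_scores)[0], spearmanr(gold_scores, cosine_scores)[0]
```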
Training Details
Training Datasets
en-pt-br
- Dataset: en-pt-br at 0c70bc6
- Size: 405,807 training samples
- Columns: english, non_english, and label
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
| --- | --- | --- | --- |
| type | string | string | list |
| details | min: 4 tokens, mean: 23.98 tokens, max: 128 tokens | min: 6 tokens, mean: 36.86 tokens, max: 128 tokens | |
- Samples:

| english | non_english | label |
| --- | --- | --- |
| And then there are certain conceptual things that can also benefit from hand calculating, but I think they're relatively small in number. | E também existem alguns aspectos conceituais que também podem se beneficiar do cálculo manual, mas eu acho que eles são relativamente poucos. | [-0.2655501961708069, 0.2715710997581482, 0.13977409899234772, 0.007375418208539486, -0.09395705163478851, ...] |
| One thing I often ask about is ancient Greek and how this relates. | Uma coisa sobre a qual eu pergunto com frequencia é grego antigo e como ele se relaciona a isto. | [0.34961527585983276, -0.01806497573852539, 0.06103038787841797, 0.11750973761081696, -0.34720802307128906, ...] |
| See, the thing we're doing right now is we're forcing people to learn mathematics. | Vejam, o que estamos fazendo agora, é que estamos forçando as pessoas a aprender matemática. | [0.031645823270082474, -0.1787087768316269, -0.30170342326164246, 0.1304805874824524, -0.29176947474479675, ...] |
- Loss: MSELoss (see the sketch below)
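The label column stores the teacher's 384-dimensional embedding of the English sentence, and MSELoss trains the student to emit that same vector for both the English text and its translation. The sketch below shows distillation with the sentence-transformers v3 trainer; the base checkpoint, the `train_dataset` variable, and the hyperparameter defaults are illustrative assumptions, not the exact recipe used for this model:

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MSELoss

# Student starts from the base model listed under Model Details.
student = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

# MSELoss regresses the student's embeddings onto the precomputed teacher
# vectors stored in the dataset's "label" column.
loss = MSELoss(model=student)

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=train_dataset,  # columns: english, non_english, label (assumed preloaded)
    loss=loss,
)
trainer.train()
```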
en-es
- Dataset: en-es
- Size: 6,889,042 training samples
- Columns: english, non_english, and label
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
| --- | --- | --- | --- |
| type | string | string | list |
| details | min: 4 tokens, mean: 24.04 tokens, max: 128 tokens | min: 5 tokens, mean: 35.11 tokens, max: 128 tokens | |
- Samples:

| english | non_english | label |
| --- | --- | --- |
| And then there are certain conceptual things that can also benefit from hand calculating, but I think they're relatively small in number. | Y luego hay ciertas aspectos conceptuales que pueden beneficiarse del cálculo a mano pero creo que son relativamente pocos. | [-0.2655501961708069, 0.2715710997581482, 0.13977409899234772, 0.007375418208539486, -0.09395705163478851, ...] |
| One thing I often ask about is ancient Greek and how this relates. | Algo que pregunto a menudo es sobre el griego antiguo y cómo se relaciona. | [0.34961527585983276, -0.01806497573852539, 0.06103038787841797, 0.11750973761081696, -0.34720802307128906, ...] |
| See, the thing we're doing right now is we're forcing people to learn mathematics. | Vean, lo que estamos haciendo ahora es forzar a la gente a aprender matemáticas. | [0.031645823270082474, -0.1787087768316269, -0.30170342326164246, 0.1304805874824524, -0.29176947474479675, ...] |
- Loss: MSELoss
en-pt
- Dataset: en-pt
- Size: 6,636,095 training samples
- Columns: english, non_english, and label
- Approximate statistics based on the first 1000 samples:

| | english | non_english | label |
| --- | --- | --- | --- |
| type | string | string | list |
| details | min: 4 tokens, mean: 23.5 tokens, max: 128 tokens | min: 5 tokens, mean: 35.23 tokens, max: 128 tokens | |
- Samples:

| english | non_english | label |
| --- | --- | --- |
| And the country that does this first will, in my view, leapfrog others in achieving a new economy even, an improved economy, an improved outlook. | E o país que fizer isto primeiro vai, na minha opinião, ultrapassar outros em alcançar uma nova economia até uma economia melhorada, uma visão melhorada. | [-0.13...] |