upskyy/e5-small-korean
This model is a fine-tuned version of intfloat/multilingual-e5-small on the korsts and kornli datasets. It maps sentences and paragraphs into a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Quick Start
Features
- Multilingual Support: Supports multiple languages including Korean, which broadens its application scope.
- Dense Embeddings: Outputs 384-dimensional dense vectors for sentences and paragraphs.
- Cosine Similarity: Uses cosine similarity as the similarity function, which is effective for measuring semantic similarity (see the sketch below).
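Because the similarity function is cosine similarity, the scores returned by `model.similarity` in the usage example below can also be reproduced with `sentence_transformers.util.cos_sim`. A minimal sketch (the two Korean sentences are just sample inputs):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("upskyy/e5-small-korean")

# Sample inputs: "Two people walk on the beach." / "A man walks a dog on the beach."
embeddings = model.encode(["두 사람이 해변을 걷는다.", "한 남자가 해변에서 개를 산책시킨다."])

print(model.similarity(embeddings, embeddings))  # 2x2 cosine-similarity matrix
print(util.cos_sim(embeddings, embeddings))      # same values, computed explicitly
```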
Installation
First, you need to install the Sentence Transformers library:
pip install -U sentence-transformers
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("upskyy/e5-small-korean")
sentences = [
    '아이를 가진 엄마가 해변을 걷는다.',      # "A mother with a child walks on the beach."
    '두 사람이 해변을 걷는다.',              # "Two people walk on the beach."
    '한 남자가 해변에서 개를 산책시킨다.',    # "A man walks a dog on the beach."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
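With the three example sentences above, embeddings.shape prints (3, 384), one 384-dimensional vector per sentence, and similarities.shape prints (3, 3), the pairwise cosine-similarity matrix.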
Advanced Usage
Without the sentence-transformers library, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
def mean_pooling(model_output, attention_mask):
    # Mean pooling: average the token embeddings, ignoring padding tokens via the attention mask.
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]  # "Hello?", "This is a BERT model for Korean sentence embeddings."

tokenizer = AutoTokenizer.from_pretrained("upskyy/e5-small-korean")
model = AutoModel.from_pretrained("upskyy/e5-small-korean")

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print("Sentence embeddings:")
print(sentence_embeddings)
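If you want cosine similarities from these pooled embeddings (matching the Sentence Transformers example above), a common follow-up step is to L2-normalize them so that a plain dot product becomes cosine similarity. A minimal sketch continuing from the snippet above; the normalization step is a standard convention rather than something prescribed by this model card:

```python
import torch.nn.functional as F

# L2-normalize so the dot product of two embeddings equals their cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
similarities = sentence_embeddings @ sentence_embeddings.T
print(similarities)  # 2x2 cosine-similarity matrix for the two example sentences
```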
Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | intfloat/multilingual-e5-small |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 384 dimensions |
| Similarity Function | Cosine Similarity |
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
Evaluation
Semantic Similarity
| Metric | Value |
|--------|-------|
| pearson_cosine | 0.848 |
| spearman_cosine | 0.8467 |
| pearson_manhattan | 0.8309 |
| spearman_manhattan | 0.8373 |
| pearson_euclidean | 0.8328 |
| spearman_euclidean | 0.8395 |
| pearson_dot | 0.8212 |
| spearman_dot | 0.8226 |
| pearson_max | 0.848 |
| spearman_max | 0.8467 |
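These figures are Pearson/Spearman correlations between the model's similarity scores and human-annotated scores on a Korean STS benchmark (presumably the KorSTS evaluation split). A minimal sketch of how such metrics can be computed with the Sentence Transformers EmbeddingSimilarityEvaluator; the sentence pairs and gold scores below are placeholders (reusing the example sentences from the Basic Usage section), not the actual benchmark data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/e5-small-korean")

# Placeholder pairs and gold scores (normalized to [0, 1]); substitute the real
# evaluation pairs to reproduce the reported correlations.
sentences1 = ["아이를 가진 엄마가 해변을 걷는다.", "두 사람이 해변을 걷는다."]
sentences2 = ["두 사람이 해변을 걷는다.", "한 남자가 해변에서 개를 산책시킨다."]
gold_scores = [0.6, 0.2]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-eval")
print(evaluator(model))  # Pearson/Spearman correlations for cosine (and other) similarities
```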
Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.0+cu121
- Accelerate: 0.30.1
- Datasets: 2.16.1
- Tokenizers: 0.19.1
Technical Details
The model is built on the SentenceTransformer framework and consists of a Transformer layer followed by a Pooling layer. The Transformer layer uses a BertModel to generate contextualized word embeddings, and the Pooling layer aggregates these embeddings (mean pooling) to obtain sentence-level embeddings.
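As a quick sanity check of these details, the loaded model exposes the sequence-length limit, the embedding dimensionality, and the module stack directly (a minimal sketch using standard Sentence Transformers accessors):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("upskyy/e5-small-korean")

print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 384
print(model)                                     # Transformer (BertModel) + mean Pooling modules
```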
License
This model is released under the MIT license.
Citation
BibTeX
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}