upskyy/e5-small-korean
This model is a fine-tuned version of intfloat/multilingual-e5-small on the korsts and kornli datasets. It maps sentences and paragraphs into a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Quick Start
Features
- Multilingual Support: Supports multiple languages including Korean, which broadens its application scope.
- Dense Embeddings: Outputs 384-dimensional dense vectors for sentences and paragraphs.
- Cosine Similarity: Uses cosine similarity as the similarity function, which is effective for measuring semantic similarity (see the sketch below).
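Because the similarity function is cosine similarity, the scores returned by `model.similarity` in the usage example below can also be reproduced with `sentence_transformers.util.cos_sim`. A minimal sketch (the two Korean sentences are just sample inputs):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("upskyy/e5-small-korean")

# Sample inputs: "Two people walk on the beach." / "A man walks a dog on the beach."
embeddings = model.encode(["두 사람이 해변을 걷는다.", "한 남자가 해변에서 개를 산책시킨다."])

print(model.similarity(embeddings, embeddings))  # 2x2 cosine-similarity matrix
print(util.cos_sim(embeddings, embeddings))      # same values, computed explicitly
```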
Installation
First, you need to install the Sentence Transformers library:
pip install -U sentence-transformers
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("upskyy/e5-small-korean")
sentences = [
    '아이를 가진 엄마가 해변을 걷는다.',      # "A mother with a child walks on the beach."
    '두 사람이 해변을 걷는다.',              # "Two people walk on the beach."
    '한 남자가 해변에서 개를 산책시킨다.',    # "A man walks a dog on the beach."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
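With the three example sentences above, embeddings.shape prints (3, 384), one 384-dimensional vector per sentence, and similarities.shape prints (3, 3), the pairwise cosine-similarity matrix.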
Advanced Usage
Without the sentence-transformers library, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
def mean_pooling(model_output, attention_mask):
    # Mean pooling: average the token embeddings, ignoring padding tokens via the attention mask.
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]  # "Hello?", "This is a BERT model for Korean sentence embeddings."

tokenizer = AutoTokenizer.from_pretrained("upskyy/e5-small-korean")
model = AutoModel.from_pretrained("upskyy/e5-small-korean")

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print("Sentence embeddings:")
print(sentence_embeddings)
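If you want cosine similarities from these pooled embeddings (matching the Sentence Transformers example above), a common follow-up step is to L2-normalize them so that a plain dot product becomes cosine similarity. A minimal sketch continuing from the snippet above; the normalization step is a standard convention rather than something prescribed by this model card:

```python
import torch.nn.functional as F

# L2-normalize so the dot product of two embeddings equals their cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
similarities = sentence_embeddings @ sentence_embeddings.T
print(similarities)  # 2x2 cosine-similarity matrix for the two example sentences
```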
Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | intfloat/multilingual-e5-small |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 384 dimensions |
| Similarity Function | Cosine Similarity |
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
Evaluation
Semantic Similarity
| Metric | Value |
|--------|-------|
| pearson_cosine | 0.848 |
| spearman_cosine | 0.8467 |
| pearson_manhattan | 0.8309 |
| spearman_manhattan | 0.8373 |
| pearson_euclidean | 0.8328 |
| spearman_euclidean | 0.8395 |
| pearson_dot | 0.8212 |
| spearman_dot | 0.8226 |
| pearson_max | 0.848 |
| spearman_max | 0.8467 |
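These figures are Pearson/Spearman correlations between the model's similarity scores and human-annotated scores on a Korean STS benchmark (presumably the KorSTS evaluation split). A minimal sketch of how such metrics can be computed with the Sentence Transformers EmbeddingSimilarityEvaluator; the sentence pairs and gold scores below are placeholders (reusing the example sentences from the Basic Usage section), not the actual benchmark data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/e5-small-korean")

# Placeholder pairs and gold scores (normalized to [0, 1]); substitute the real
# evaluation pairs to reproduce the reported correlations.
sentences1 = ["아이를 가진 엄마가 해변을 걷는다.", "두 사람이 해변을 걷는다."]
sentences2 = ["두 사람이 해변을 걷는다.", "한 남자가 해변에서 개를 산책시킨다."]
gold_scores = [0.6, 0.2]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-eval")
print(evaluator(model))  # Pearson/Spearman correlations for cosine (and other) similarities
```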
Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.0+cu121
- Accelerate: 0.30.1
- Datasets: 2.16.1
- Tokenizers: 0.19.1
Technical Details
The model is built on the SentenceTransformer framework and consists of a Transformer layer followed by a Pooling layer. The Transformer layer uses a BertModel to generate contextualized word embeddings, and the Pooling layer aggregates these embeddings (mean pooling) to obtain sentence-level embeddings.
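As a quick sanity check of these details, the loaded model exposes the sequence-length limit, the embedding dimensionality, and the module stack directly (a minimal sketch using standard Sentence Transformers accessors):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("upskyy/e5-small-korean")

print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 384
print(model)                                     # Transformer (BertModel) + mean Pooling modules
```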
License
This model is released under the MIT license.
Citation
BibTeX
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}