# upskyy/gte-korean-base
This model is fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on KorSTS and KorNLI. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Quick Start
This model maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
## Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Applicable to multiple natural language processing tasks, such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
## Installation
First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## Usage Examples
### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# trust_remote_code=True is required because the base GTE architecture ships custom modeling code
model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

# Run inference
sentences = [
    '아이를 가진 엄마가 해변을 걷는다.',    # A mother with her child walks on the beach.
    '두 사람이 해변을 걷는다.',             # Two people walk on the beach.
    '한 남자가 해변에서 개를 산책시킨다.',  # A man walks a dog on the beach.
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# (3, 3)
print(similarities)
```
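The same embeddings can drive the other tasks mentioned above. Below is a minimal semantic-search sketch using the `sentence_transformers.util` helpers; the query and corpus sentences are illustrative placeholders, not part of the original card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

corpus = [
    '두 사람이 해변을 걷는다.',             # Two people walk on the beach.
    '한 남자가 해변에서 개를 산책시킨다.',  # A man walks a dog on the beach.
]
query = '아이를 가진 엄마가 해변을 걷는다.'  # A mother with her child walks on the beach.

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode([query], convert_to_tensor=True)

# Rank the corpus sentences by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```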
### Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel


# Mean pooling: average the token embeddings, ignoring padding positions
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = [
    "안녕하세요?",                                # Hello?
    "한국어 문장 임베딩을 위한 버트 모델입니다.",  # This is a BERT model for Korean sentence embeddings.
]

tokenizer = AutoTokenizer.from_pretrained("upskyy/gte-korean-base")
model = AutoModel.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)

# Tokenize the sentences and compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get one fixed-size vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
```
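When using the raw `transformers` path, cosine similarities between the pooled embeddings can be computed by L2-normalizing them first. This is a small follow-up sketch that reuses `sentence_embeddings` from the example above; it is one way to mirror what `model.similarity` does in the Sentence Transformers API.

```python
import torch.nn.functional as F

# L2-normalize so that a dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_scores = normalized @ normalized.T
print(cosine_scores)
```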
## Documentation
### Model Details
#### Model Description
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
#### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
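Once the model is loaded with Sentence Transformers, the sequence length and embedding size listed above can be checked programmatically. A minimal sketch, assuming the same `SentenceTransformer` loading call as in the usage examples:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
```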
### Evaluation
#### Metrics
##### Semantic Similarity
| Metric | Value |
|--------|-------|
| pearson_cosine | 0.8681 |
| spearman_cosine | 0.8689 |
| pearson_manhattan | 0.7794 |
| spearman_manhattan | 0.7817 |
| pearson_euclidean | 0.781 |
| spearman_euclidean | 0.7836 |
| pearson_dot | 0.718 |
| spearman_dot | 0.7553 |
| pearson_max | 0.8681 |
| spearman_max | 0.8689 |
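These are KorSTS-style semantic-similarity correlations. As a rough illustration of how such metrics can be computed with the Sentence Transformers `EmbeddingSimilarityEvaluator`, here is a sketch in which the sentence pairs and gold scores are placeholders rather than the actual evaluation data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

# Placeholder sentence pairs with gold similarity scores normalized to [0, 1]
sentences1 = ["두 사람이 해변을 걷는다.", "안녕하세요?"]
sentences2 = ["아이를 가진 엄마가 해변을 걷는다.", "한 남자가 해변에서 개를 산책시킨다."]
gold_scores = [0.7, 0.0]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="korsts-dev")
print(evaluator(model))  # Pearson/Spearman correlations for cosine and other distance metrics
```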
### Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.0+cu121
- Accelerate: 0.30.1
- Datasets: 2.16.1
- Tokenizers: 0.19.1
## License

This model is licensed under the Apache 2.0 license.
## Technical Details
The model is a Sentence Transformer fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on KorSTS and KorNLI. It uses cosine similarity as its similarity function, supports a maximum sequence length of 8192 tokens, and produces 768-dimensional embeddings.
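Because every sentence is mapped into the same 768-dimensional space, the embeddings can be passed directly to downstream tools such as clustering. As one illustration of the clustering use case mentioned above (using scikit-learn's KMeans as an assumed external dependency, with illustrative sentences):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

sentences = [
    "두 사람이 해변을 걷는다.",             # Two people walk on the beach.
    "한 남자가 해변에서 개를 산책시킨다.",  # A man walks a dog on the beach.
    "안녕하세요?",                           # Hello?
]

# Encode the sentences and group them into 2 clusters
embeddings = model.encode(sentences)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(labels)  # the two beach-related sentences should share a cluster label
```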
## Citation
### BibTeX
```bibtex
@misc{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```