# upskyy/gte-korean-base
This model is fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on KorSTS and KorNLI. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Quick Start
This model maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
## Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Applicable to multiple natural language processing tasks, such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
## Installation
First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## Usage Examples
### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# trust_remote_code=True is required because the base GTE architecture ships custom modeling code
model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

# Run inference
sentences = [
    '아이를 가진 엄마가 해변을 걷는다.',    # A mother with her child walks on the beach.
    '두 사람이 해변을 걷는다.',             # Two people walk on the beach.
    '한 남자가 해변에서 개를 산책시킨다.',  # A man walks a dog on the beach.
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# (3, 3)
print(similarities)
```
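The same embeddings can drive the other tasks mentioned above. Below is a minimal semantic-search sketch using the `sentence_transformers.util` helpers; the query and corpus sentences are illustrative placeholders, not part of the original card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

corpus = [
    '두 사람이 해변을 걷는다.',             # Two people walk on the beach.
    '한 남자가 해변에서 개를 산책시킨다.',  # A man walks a dog on the beach.
]
query = '아이를 가진 엄마가 해변을 걷는다.'  # A mother with her child walks on the beach.

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode([query], convert_to_tensor=True)

# Rank the corpus sentences by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```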
### Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel


# Mean pooling: average the token embeddings, ignoring padding positions
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = [
    "안녕하세요?",                                # Hello?
    "한국어 문장 임베딩을 위한 버트 모델입니다.",  # This is a BERT model for Korean sentence embeddings.
]

tokenizer = AutoTokenizer.from_pretrained("upskyy/gte-korean-base")
model = AutoModel.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)

# Tokenize the sentences and compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get one fixed-size vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
```
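When using the raw `transformers` path, cosine similarities between the pooled embeddings can be computed by L2-normalizing them first. This is a small follow-up sketch that reuses `sentence_embeddings` from the example above; it is one way to mirror what `model.similarity` does in the Sentence Transformers API.

```python
import torch.nn.functional as F

# L2-normalize so that a dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_scores = normalized @ normalized.T
print(cosine_scores)
```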
## Documentation
### Model Details
#### Model Description
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
#### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
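Once the model is loaded with Sentence Transformers, the sequence length and embedding size listed above can be checked programmatically. A minimal sketch, assuming the same `SentenceTransformer` loading call as in the usage examples:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
```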
### Evaluation
#### Metrics
##### Semantic Similarity
| Metric | Value |
|--------|-------|
| pearson_cosine | 0.8681 |
| spearman_cosine | 0.8689 |
| pearson_manhattan | 0.7794 |
| spearman_manhattan | 0.7817 |
| pearson_euclidean | 0.781 |
| spearman_euclidean | 0.7836 |
| pearson_dot | 0.718 |
| spearman_dot | 0.7553 |
| pearson_max | 0.8681 |
| spearman_max | 0.8689 |
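These are KorSTS-style semantic-similarity correlations. As a rough illustration of how such metrics can be computed with the Sentence Transformers `EmbeddingSimilarityEvaluator`, here is a sketch in which the sentence pairs and gold scores are placeholders rather than the actual evaluation data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

# Placeholder sentence pairs with gold similarity scores normalized to [0, 1]
sentences1 = ["두 사람이 해변을 걷는다.", "안녕하세요?"]
sentences2 = ["아이를 가진 엄마가 해변을 걷는다.", "한 남자가 해변에서 개를 산책시킨다."]
gold_scores = [0.7, 0.0]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="korsts-dev")
print(evaluator(model))  # Pearson/Spearman correlations for cosine and other distance metrics
```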
### Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.1
- Transformers: 4.42.4
- PyTorch: 2.3.0+cu121
- Accelerate: 0.30.1
- Datasets: 2.16.1
- Tokenizers: 0.19.1
## License

This model is licensed under the Apache 2.0 license.
## Technical Details
The model is a Sentence Transformer fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on KorSTS and KorNLI. It uses cosine similarity as its similarity function, supports a maximum sequence length of 8192 tokens, and produces 768-dimensional embeddings.
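Because every sentence is mapped into the same 768-dimensional space, the embeddings can be passed directly to downstream tools such as clustering. As one illustration of the clustering use case mentioned above (using scikit-learn's KMeans as an assumed external dependency, with illustrative sentences):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

sentences = [
    "두 사람이 해변을 걷는다.",             # Two people walk on the beach.
    "한 남자가 해변에서 개를 산책시킨다.",  # A man walks a dog on the beach.
    "안녕하세요?",                           # Hello?
]

# Encode the sentences and group them into 2 clusters
embeddings = model.encode(sentences)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(labels)  # the two beach-related sentences should share a cluster label
```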
## Citation
### BibTeX
```bibtex
@misc{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```