gte-vi-base-v1: Open-Source Model for Vietnamese Semantic Text Similarity

Gte Vi Base V1

Developed by haiFrHust

This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-multilingual-base, supporting Vietnamese for tasks like semantic text similarity.

Text Embedding

Safetensors

Open Source License: MIT

Tags: Vietnamese sentence similarity, high-precision semantic matching, multilingual vector embeddings

Downloads 272

Release Time: 4/4/2025

Model Overview

The model maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as semantic text similarity, semantic search, paraphrase mining, text classification, and clustering.

Model Features

Multilingual support

Based on the Alibaba-NLP/gte-multilingual-base model, specifically optimized for Vietnamese.

Long text processing

Supports sequences up to 8192 tokens, making it suitable for handling long texts.

Efficient semantic representation

Maps text into a 768-dimensional dense vector space, capturing deep semantic information.

Model Capabilities

Semantic text similarity calculation

Semantic search

Paraphrase mining

Text classification

Text clustering
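
Several of the capabilities above, semantic search in particular, reduce to ranking stored embeddings by their cosine similarity to a query embedding. The following is a minimal sketch of that ranking step, not the library's own search routine. The helper names `cosine` and `top_k` and the 3-dimensional toy vectors are assumptions made for illustration; in practice the vectors would be the 768-dimensional outputs of `model.encode`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real sentence embeddings (hypothetical data).
corpus = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]

hits = top_k(query, corpus, k=2)
for index, score in hits:
    print(index, round(score, 4))
```

For large corpora an approximate nearest-neighbor index would usually replace the linear scan, but the scoring rule stays the same.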

Use Cases

Text matching

🚀 SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This is a Sentence Transformer model fine-tuned from Alibaba-NLP/gte-multilingual-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

🚀 Quick Start

This model is a fine-tuned version of Alibaba-NLP/gte-multilingual-base, trained with the Sentence Transformers library. It can be used for NLP tasks such as semantic similarity calculation and clustering.

✨ Features

Semantic Understanding: Maps sentences and paragraphs to a 768-dimensional dense vector space, enabling semantic similarity calculations.
Multilingual Support: Built on a multilingual base model and fine-tuned on Vietnamese data.
High Accuracy: Reaches a cosine accuracy of 0.9982 on the xnli-vi-test triplet evaluation.

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

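# Download from the 🤗 Hub.
# "sentence_transformers_model_id" is a placeholder: replace it with this model's Hub ID.
# trust_remote_code=True is likely needed because the base model ships custom modeling code (NewModel).
model = SentenceTransformer("sentence_transformers_model_id", trust_remote_code=True)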
# Run inference
sentences = [
    'Dù thế nào đi nữa , tôi sẽ biết ấn độ cuối cùng đã tăng lên từ trạng thái thế giới thứ ba khi tôi quay về thăm người thân của mình và những lời đầu tiên họ nói là không ăn , ăn , gầy trai .',
    'Khi người thân của tôi nói ít về thức ăn , tôi sẽ biết ấn độ đang được cải thiện .',
    'Ernie Lewis đã trải qua 15 năm trong chính sách tư nhân trước khi tham gia hội đồng appalred .',
]
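embeddings = model.encode(sentences)  # NumPy array, one row per sentence
print(embeddings.shape)
# (3, 768)

# Get the pairwise similarity scores for the embeddings (cosine similarity for this model)
similarities = model.similarity(embeddings, embeddings)  # torch.Tensor
print(similarities.shape)
# torch.Size([3, 3])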

📚 Documentation

Model Details

Model Information

Property	Details
Model Type	Sentence Transformer
Base model	Alibaba-NLP/gte-multilingual-base
Maximum Sequence Length	8192 tokens
Output Dimensionality	768 dimensions
Similarity Function	Cosine Similarity
Training Dataset	Vietnamese subset of the facebook/xnli dataset (about 130k triplets)
Language	Vietnamese
License	MIT
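
The table lists cosine similarity as the model's similarity function. As a rough illustration of what a pairwise similarity matrix contains, the sketch below normalizes each vector to unit length and takes dot products. The function names and the 2-dimensional toy vectors are assumptions for illustration only; this is not the library's implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_matrix(vectors):
    """Pairwise cosine similarities, computed as dot products of unit-length vectors."""
    unit = [normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Toy vectors (hypothetical): the third points in the opposite direction to the first.
toy = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
matrix = cosine_matrix(toy)
for row in matrix:
    print([round(x, 2) for x in row])
```

Scores range from -1 (opposite direction) to 1 (same direction), and the matrix is symmetric with ones on the diagonal.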

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
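
The Pooling module above has `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings. A minimal sketch of that idea follows, assuming padding positions are marked with 0 in an attention mask and excluded from the average. The function name `mean_pool` and the toy 2-dimensional token vectors are hypothetical; the real module works on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, x in enumerate(vec):
                total[j] += x
    # Guard against an all-padding input to avoid division by zero.
    return [x / max(count, 1) for x in total]

# Two real tokens followed by one padding position (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```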

Evaluation

Metrics - Triplet

Dataset: xnli-vi-test
Evaluated with: TripletEvaluator

Metric	Value
cosine_accuracy	0.9982
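
As a rough guide to reading this metric: triplet cosine accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. The sketch below computes it on toy vectors. The function names and data are assumptions for illustration, not the evaluator's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)

# Toy (anchor, positive, negative) embeddings; the third triplet is ranked wrongly.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)  # 0.75
```

A value of 0.9982 therefore means that roughly 2 in every 1,000 test triplets are ranked the wrong way round.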

Training Details

Training Dataset

Unnamed Dataset
- Size: 130,899 training samples
- Columns: sentence_0, sentence_1, and sentence_2
- Approximate statistics based on the first 1000 samples:

Statistic	sentence_0	sentence_1	sentence_2
type	string	string	string
min	3 tokens	6 tokens	6 tokens
mean	35.19 tokens	18.96 tokens	19.34 tokens
max	167 tokens	64 tokens	57 tokens

- Samples:

sentence_0	sentence_1	sentence_2
Trong thời gian đó , julius đã lấy được đo lường của anh ta .	Julius đã làm việc của mình rồi .	Tôi biết anh đã nói với tôi rằng anh không bao giờ đi nhà hàng , bởi vì anh sợ họ sẽ nhổ vào thức ăn của anh .
Khi hoàn thiện , các công cụ sẽ cho phép các ứng dụng để đánh giá các dự án công nghệ của họ trong thời gian triển khai để cả hai đảm bảo hoàn thành thành công và cuối cùng , để xác định xem mục tiêu của họ	Một khi hoàn thiện các công cụ sẽ được đánh giá nếu mục tiêu được đạt được .	Có thường xuyên chiến đấu quanh khu vực của penrith .
H ' s , thân yêu tôi .	À , con yêu bé nhỏ của ta .	Đúng rồi đó .
- Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
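
To make the two parameters concrete, here is a simplified sketch of the idea behind this loss, under the assumption that every positive and negative in the batch acts as a candidate for every anchor. Cosine similarities are multiplied by `scale`, and cross-entropy rewards anchor i for scoring its own positive highest. The function `mnrl_loss` and the toy vectors are hypothetical; the library version operates on batched tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i should select positive i among all candidates."""
    candidates = positives + negatives  # in-batch positives double as negatives for other anchors
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

well_aligned = mnrl_loss(anchors, positives, negatives)
misaligned = mnrl_loss(anchors, positives[::-1], negatives)  # positives swapped between anchors
print(well_aligned < misaligned)  # True
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity translate into larger differences in loss.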

Training Hyperparameters

Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 1
- fp16: True
- multi_dataset_batch_sampler: round_robin

Training Logs

Epoch	Step	Training Loss	xnli-vi-test_cosine_accuracy
0.1222	500	0.3095	-
0.2444	1000	0.1216	0.9976
0.3667	1500	0.1093	-
0.4889	2000	0.103	0.9988
0.6111	2500	0.0934	-
0.7333	3000	0.0929	0.9982
0.8555	3500	0.0847	-
0.9778	4000	0.0966	0.9982
1.0	4091	-	0.9982
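
The Epoch and Step columns can be cross-checked against the dataset size and batch size reported earlier. The short calculation below assumes training ran on a single device without gradient accumulation, which the source does not state explicitly but which is consistent with the logged step count.

```python
import math

train_samples = 130_899  # "Size" listed under Training Dataset
batch_size = 32          # per_device_train_batch_size

# Assumption: one device, no gradient accumulation; the last batch is partial.
steps_per_epoch = math.ceil(train_samples / batch_size)

def epoch_fraction(step):
    """Fraction of the epoch completed after `step` optimizer steps, rounded like the log."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)       # 4091
print(epoch_fraction(500))   # 0.1222
print(epoch_fraction(4000))  # 0.9778
```

Both the final step (4091) and the logged epoch fractions match the table.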

Framework Versions

Python: 3.10.12
Sentence Transformers: 4.0.2
Transformers: 4.50.3
PyTorch: 2.6.0+cu124
Accelerate: 0.26.1
Datasets: 3.5.0
Tokenizers: 0.21.1

📄 License

This model is licensed under the MIT license.

📚 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
