🚀 Vietnamese Document Embedding Model
This is a Document Embedding Model for the Vietnamese language, with a context length of up to 8192 tokens, designed to generate precise sentence embeddings.
✨ Features
- Long Context Support: The model can handle context lengths of up to 8192 tokens, making it suitable for long Vietnamese documents (see the sketch after this list).
- Specialized for Vietnamese: It is specifically trained for the Vietnamese language, providing more accurate sentence embeddings.
- Multiple Loss Functions: Trained using Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss to enhance performance.
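As a quick check of the long-context support, here is a minimal sketch (the repeated sentence is just illustrative filler); inputs longer than the model's maximum sequence length are truncated by the tokenizer:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)
print(model.max_seq_length)  # 8192

# A long "document" built from repeated filler text (illustrative only)
long_document = " ".join(["Hà Nội là thủ đô của Việt Nam."] * 500)
embedding = model.encode(long_document)
print(embedding.shape)  # (768,)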
📦 Installation
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer
sentences = ["Hà Nội là thủ đô của Việt Nam", "Đà Nẵng là thành phố du lịch"]
model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
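To compare the resulting embeddings, a minimal sketch using the cos_sim helper from sentence-transformers:
from sentence_transformers import SentenceTransformer, util

sentences = ["Hà Nội là thủ đô của Việt Nam", "Đà Nẵng là thành phố du lịch"]
model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]).item())
Since the model ends with a Normalize() module (see the architecture below), embeddings are unit-length, so cosine similarity and dot product give the same result.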
Advanced Usage - Evaluation
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset

def convert_dataset(dataset):
    dataset_samples = []
    for df in dataset:
        # Normalize the 0-5 gold similarity score to the [0, 1] range
        score = float(df['score']) / 5.0
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)

# Load the Vietnamese STS benchmark and select the dev and test portions
vi_sts = load_dataset("doanhieung/vi-stsbenchmark")["train"]
df_dev = vi_sts.filter(lambda example: example['split'] == 'dev')
df_test = vi_sts.filter(lambda example: example['split'] == 'test')

# Evaluate on the dev set
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# Evaluate on the test set
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
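Depending on the installed sentence-transformers version, calling the evaluator returns either the main Spearman correlation as a float or a dictionary of metrics; in both cases a results CSV is written to output_path.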
📚 Documentation
Model Description
vietnamese-document-embedding is a Document Embedding Model for the Vietnamese language with a context length of up to 8192 tokens. It is a specialized long-text embedding model trained specifically for Vietnamese, built upon [gte-multilingual](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) and trained using Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss.
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: VietnameseModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
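A minimal sketch to sanity-check this architecture from Python (the expected values follow from the module configuration above):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)
emb = model.encode("Hà Nội là thủ đô của Việt Nam")

print(emb.shape)            # (768,): CLS-pooled embedding dimension
print(np.linalg.norm(emb))  # ~1.0, because of the final Normalize() module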
Training and Fine-tuning process
The model went through a rigorous two-stage training and fine-tuning process to enhance its ability to generate precise and contextually relevant sentence embeddings for the Vietnamese language. The following is an outline of these stages:
Stage 1: NLI training on the XNLI dataset
- Dataset: XNLI-vn
- Method: Training using Multi-Negative Ranking Loss and Matryoshka2dLoss (a training sketch follows). This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics.
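The exact training script is not published; the following is a minimal sketch of this stage with the sentence-transformers fit API, assuming XNLI-vn is converted into (premise, entailment) pairs so that MultipleNegativesRankingLoss (the library's name for Multi-Negative Ranking Loss) can use in-batch negatives. The two example pairs are made up:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)

# Illustrative (premise, entailment) pairs; other pairs in the batch act as negatives
train_examples = [
    InputExample(texts=["Hà Nội là thủ đô của Việt Nam", "Việt Nam có thủ đô là Hà Nội"]),
    InputExample(texts=["Đà Nẵng là thành phố du lịch", "Du khách thích đến Đà Nẵng"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Wrap the ranking loss in Matryoshka2dLoss so truncated embedding prefixes
# (and shallower layers) are also trained to be useful
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.Matryoshka2dLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)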
Stage 2: Fine-tuning for Semantic Textual Similarity on the STS Benchmark
- Dataset: [STSB-vn](https://huggingface.co/datasets/doanhieung/vi-stsbenchmark)
- Method: Fine-tuning specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the sentence-transformers library (a training sketch follows). This stage honed the model's precision in capturing semantic similarity across various types of Vietnamese texts.
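Likewise a hedged sketch of Stage 2, using CosineSimilarityLoss as a stand-in for the "SimilarityLoss" named above, and assuming the benchmark's split column also contains 'train' rows:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from datasets import load_dataset

model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)

# Build (sentence1, sentence2, score in [0, 1]) training pairs
vi_sts = load_dataset("doanhieung/vi-stsbenchmark")["train"]
train_split = vi_sts.filter(lambda ex: ex['split'] == 'train')
train_examples = [
    InputExample(texts=[ex['sentence1'], ex['sentence2']], label=float(ex['score']) / 5.0)
    for ex in train_split
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regress the cosine similarity of each pair toward its gold score
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)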
Evaluation
The model can be evaluated on the [Vietnamese STS Benchmark data](https://huggingface.co/datasets/doanhieung/vi-stsbenchmark) as shown in the code example above.
Spearman correlation scores on all datasets of the [Semantic Textual Similarity benchmark](https://huggingface.co/datasets/anti-ai/ViSTS):
| Model | STSB | STS12 | STS13 | STS14 | STS15 | STS16 | SICK | Mean |
|---|---|---|---|---|---|---|---|---|
| [dangvantuan/vietnamese-embedding](https://huggingface.co/dangvantuan/vietnamese-embedding) | 84.84 | 79.04 | 85.30 | 81.38 | 87.06 | 79.95 | 79.58 | 82.45 |
| [dangvantuan/vietnamese-embedding-LongContext](https://huggingface.co/dangvantuan/vietnamese-embedding-LongContext) | 85.25 | 75.77 | 83.82 | 81.69 | 88.48 | 81.50 | 78.20 | 82.10 |
🔧 Technical Details
The model is built upon [gte-multilingual](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) and uses Multi-Negative Ranking Loss, Matryoshka2dLoss, and SimilarityLoss during training. Its architecture combines CLS-token pooling with L2 normalization to generate high-quality sentence embeddings.
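Because Matryoshka2dLoss trains prefixes of the embedding to stay informative, truncating embeddings for cheaper storage should work; here is a minimal sketch (the 256-dimension cut-off is an assumption, not a published recommendation):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dangvantuan/vietnamese-document-embedding', trust_remote_code=True)
sentences = ["Hà Nội là thủ đô của Việt Nam", "Đà Nẵng là thành phố du lịch"]
embeddings = model.encode(sentences)  # shape (2, 768)

# Keep the first 256 dimensions and re-normalize each vector
truncated = embeddings[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Cosine similarity on the truncated embeddings
print(truncated[0] @ truncated[1])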
📄 License
This model is licensed under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.
📚 Citation
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  journal={arXiv preprint arXiv:2407.19669},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}

@article{li20242d,
  title={2D Matryoshka Sentence Embeddings},
  author={Li, Xianming and Li, Zongxi and Li, Jing and Xie, Haoran and Li, Qing},
  journal={arXiv preprint arXiv:2402.14776},
  year={2024}
}