🚀 bkai-foundation-models/vietnamese-bi-encoder
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
Key Information
| Property | Details |
|---|---|
| Model Type | Sentence-transformers |
| Training Data | MS MARCO (translated into Vietnamese), SQuAD v2 (translated into Vietnamese), and 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge |
| License | apache-2.0 |
Results on the Remaining 20% of the Training Set from the Legal Text Retrieval Zalo 2021 Challenge
| Pretrained Model | Training Datasets | Acc@1 | Acc@10 | Acc@100 | Pre@10 | MRR@10 |
|---|---|---|---|---|---|---|
| Vietnamese-SBERT | - | 32.34 | 52.97 | 89.84 | 7.05 | 45.30 |
| PhoBERT-base-v2 | MS MARCO | 47.81 | 77.19 | 92.34 | 7.72 | 58.37 |
| PhoBERT-base-v2 | MS MARCO + SQuAD v2.0 + 80% Zalo | 73.28 | 93.59 | 98.85 | 9.36 | 80.73 |
🚀 Quick Start
This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. It was trained on a merged dataset consisting of MS MARCO (translated into Vietnamese), SQuAD v2 (translated into Vietnamese), and 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge, with phobert-base-v2 as the pre-trained backbone.
✨ Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Can be used for clustering or semantic search.
- Trained on a merged Vietnamese dataset (translated MS MARCO, translated SQuAD v2, and Zalo 2021 legal retrieval data) for better retrieval performance.
📦 Installation
To use this model, you need to install sentence-transformers:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Input sentences should be word-segmented (note the underscore in "vui_tính")
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
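The embeddings can be compared directly for semantic search or clustering. A minimal follow-up sketch (not from the original card) that scores the two example sentences with cosine similarity using the sentence-transformers util helpers:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')

# Word-segmented example sentences from above
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]
embeddings = model.encode(sentences)

# Cosine similarity between the two embeddings (higher = more similar)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```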
Advanced Usage
Usage (HuggingFace Widget)
The widget uses a custom pipeline on top of the default one, adding a word segmenter in front of PhobertTokenizer, so you do not need to segment words yourself before calling the API. An example can be seen in the Hosted Inference API.
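When you load the model yourself (as in the examples below), this segmentation step is not applied automatically, so the input should already be word-segmented. A minimal sketch of one way to do that, assuming the pyvi segmenter (`pip install pyvi`), which is not bundled with this model:

```python
from pyvi import ViTokenizer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')

# Plain (unsegmented) Vietnamese input
raw_sentences = ["Cô ấy là một người vui tính.", "Cô ấy cười nói suốt cả ngày."]

# Segment words first, e.g. "vui tính" -> "vui_tính", then encode
segmented = [ViTokenizer.tokenize(s) for s in raw_sentences]
embeddings = model.encode(segmented)
print(embeddings.shape)  # (2, 768)
```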
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average the token embeddings, taking the attention mask into account
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for (word-segmented)
sentences = ['Cô ấy là một người vui_tính .', 'Cô ấy cười nói suốt cả ngày .']

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
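If you want similarity scores from these raw transformers embeddings, a common follow-up (an illustration, not part of the original card) is to L2-normalize them so that dot products equal cosine similarities:

```python
import torch.nn.functional as F

# L2-normalize so the dot product of two rows equals their cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)
```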
🔧 Technical Details
Training
The model was trained with the following parameters:
DataLoader:
`torch.utils.data.dataloader.DataLoader` of length 17584 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
Loss:
`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
Parameters of the fit()-Method:
```
{
    "epochs": 15,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
```
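For reference, a minimal sketch of how these hyperparameters map onto the sentence-transformers `fit()` API. The training pairs below are placeholders rather than the actual merged dataset, and `vinai/phobert-base-v2` is assumed to be the Hub path of the backbone:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed Hub path for the phobert-base-v2 backbone named in this card
model = SentenceTransformer('vinai/phobert-base-v2')

# Placeholder (query, relevant passage) pairs; the real data is the merged
# MS MARCO / SQuAD v2 / Zalo 2021 corpus described above
train_examples = [
    InputExample(texts=["câu truy_vấn ví_dụ", "đoạn văn liên_quan ví_dụ"]),
    # ... more pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives loss with cosine similarity and scale 20.0
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    warmup_steps=1000,
    optimizer_params={"lr": 2e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)
```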
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
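The architecture implies a 256-token input limit (longer inputs are truncated) and 768-dimensional mean-pooled outputs; a quick, illustrative way to confirm this after loading the model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
print(model.max_seq_length)                      # 256
print(model.get_sentence_embedding_dimension())  # 768
```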
📄 License
This project is licensed under the apache-2.0 license.
Citation
Please cite our manuscript if this dataset is used for your work:
```bibtex
@article{duc2024towards,
  title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
  author={Nguyen Quang Duc and Le Hai Son and Nguyen Duc Nhan and Nguyen Dich Nhat Minh and Le Thanh Huong and Dinh Viet Sang},
  journal={arXiv preprint arXiv:2403.01616},
  year={2024}
}
```