🚀 bkai-foundation-models/vietnamese-bi-encoder
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
Key Information
| Property | Details |
|---|---|
| Model Type | Sentence-transformers |
| Training Data | MS MARCO (translated into Vietnamese), SQuAD v2 (translated into Vietnamese), and 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge |
| License | apache-2.0 |
Results on the Remaining 20% of the Training Set from the Legal Text Retrieval Zalo 2021 Challenge
| Pretrained Model | Training Datasets | Acc@1 | Acc@10 | Acc@100 | Pre@10 | MRR@10 |
|---|---|---|---|---|---|---|
| Vietnamese-SBERT | - | 32.34 | 52.97 | 89.84 | 7.05 | 45.30 |
| PhoBERT-base-v2 | MS MARCO | 47.81 | 77.19 | 92.34 | 7.72 | 58.37 |
| PhoBERT-base-v2 | MS MARCO + SQuAD v2.0 + 80% Zalo | 73.28 | 93.59 | 98.85 | 9.36 | 80.73 |
🚀 Quick Start
This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. It was trained on a merged dataset consisting of MS MARCO (translated into Vietnamese), SQuAD v2 (translated into Vietnamese), and 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge, with phobert-base-v2 as the pre-trained backbone.
✨ Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Can be used for clustering or semantic search.
- Trained on a merged Vietnamese dataset (translated MS MARCO, translated SQuAD v2, and Zalo 2021 legal retrieval data) for better retrieval performance.
📦 Installation
To use this model, you need to install sentence-transformers:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Input sentences should be word-segmented (note the underscore in "vui_tính")
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
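The embeddings can be compared directly for semantic search or clustering. A minimal follow-up sketch (not from the original card) that scores the two example sentences with cosine similarity using the sentence-transformers util helpers:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')

# Word-segmented example sentences from above
sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]
embeddings = model.encode(sentences)

# Cosine similarity between the two embeddings (higher = more similar)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```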
Advanced Usage
Usage (HuggingFace Widget)
The widget uses a custom pipeline on top of the default one, adding a word segmenter in front of PhobertTokenizer, so you do not need to segment words yourself before calling the API. An example can be seen in the Hosted Inference API.
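When you load the model yourself (as in the examples below), this segmentation step is not applied automatically, so the input should already be word-segmented. A minimal sketch of one way to do that, assuming the pyvi segmenter (`pip install pyvi`), which is not bundled with this model:

```python
from pyvi import ViTokenizer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')

# Plain (unsegmented) Vietnamese input
raw_sentences = ["Cô ấy là một người vui tính.", "Cô ấy cười nói suốt cả ngày."]

# Segment words first, e.g. "vui tính" -> "vui_tính", then encode
segmented = [ViTokenizer.tokenize(s) for s in raw_sentences]
embeddings = model.encode(segmented)
print(embeddings.shape)  # (2, 768)
```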
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average the token embeddings, taking the attention mask into account
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for (word-segmented)
sentences = ['Cô ấy là một người vui_tính .', 'Cô ấy cười nói suốt cả ngày .']

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
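If you want similarity scores from these raw transformers embeddings, a common follow-up (an illustration, not part of the original card) is to L2-normalize them so that dot products equal cosine similarities:

```python
import torch.nn.functional as F

# L2-normalize so the dot product of two rows equals their cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)
```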
🔧 Technical Details
Training
The model was trained with the following parameters:
DataLoader:
`torch.utils.data.dataloader.DataLoader` of length 17584 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
Loss:
`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
Parameters of the fit()-Method:
```
{
    "epochs": 15,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
```
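For reference, a minimal sketch of how these hyperparameters map onto the sentence-transformers `fit()` API. The training pairs below are placeholders rather than the actual merged dataset, and `vinai/phobert-base-v2` is assumed to be the Hub path of the backbone:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed Hub path for the phobert-base-v2 backbone named in this card
model = SentenceTransformer('vinai/phobert-base-v2')

# Placeholder (query, relevant passage) pairs; the real data is the merged
# MS MARCO / SQuAD v2 / Zalo 2021 corpus described above
train_examples = [
    InputExample(texts=["câu truy_vấn ví_dụ", "đoạn văn liên_quan ví_dụ"]),
    # ... more pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives loss with cosine similarity and scale 20.0
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    warmup_steps=1000,
    optimizer_params={"lr": 2e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)
```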
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
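The architecture implies a 256-token input limit (longer inputs are truncated) and 768-dimensional mean-pooled outputs; a quick, illustrative way to confirm this after loading the model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
print(model.max_seq_length)                      # 256
print(model.get_sentence_embedding_dimension())  # 768
```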
📄 License
This project is licensed under the apache-2.0 license.
Citation
Please cite our manuscript if this dataset is used for your work:
```bibtex
@article{duc2024towards,
  title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
  author={Nguyen Quang Duc and Le Hai Son and Nguyen Duc Nhan and Nguyen Dich Nhat Minh and Le Thanh Huong and Dinh Viet Sang},
  journal={arXiv preprint arXiv:2403.01616},
  year={2024}
}
```