Vietnamese_Embedding Open-Source Vietnamese Embedding Model - Enhance Vietnamese Information Retrieval Capability


Vietnamese Embedding

Developed by AITeamVN

Vietnamese embedding model fine-tuned on BGE-M3, enhancing Vietnamese retrieval capabilities

Text Embedding

Safetensors

Tags: Vietnamese retrieval enhancement, Long text embedding, Legal domain optimization

Downloads 14.26k

Release date: 3/17/2025

Model Overview

Vietnamese_Embedding is an embedding model fine-tuned from BGE-M3 and optimized for Vietnamese retrieval tasks. It was trained on approximately 300,000 triplets, each consisting of a Vietnamese query, a positive document, and a negative document.

Model Features

Vietnamese optimization

Fine-tuned specifically for Vietnamese retrieval tasks, improving the embedding quality of Vietnamese text

Long text support

Supports sequences up to 2048 tokens, suitable for processing long documents

High performance

Outperforms the base model BGE-M3 and the BKAI Vietnamese bi-encoder on legal text retrieval, as measured on the Legal Zalo 2021 dataset

Model Capabilities

Vietnamese text embedding

Sentence similarity calculation

Document retrieval

Use Cases

Information retrieval

Legal document retrieval

Achieves high-accuracy document retrieval on legal text datasets

Accuracy@1 reaches 0.7274 on the Legal Zalo 2021 dataset

General document retrieval

Applicable to various Vietnamese document retrieval tasks

🚀 Vietnamese_Embedding

Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model to enhance retrieval capabilities for Vietnamese. It addresses the need for effective Vietnamese text retrieval and provides high-quality embedding representations for Vietnamese language processing.

🚀 Quick Start

from sentence_transformers import SentenceTransformer

# Load the model and use the same maximum sequence length as in training.
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
model.max_seq_length = 2048

# Queries: "What is artificial intelligence" and "Benefits of sleep".
sentences_1 = ["Trí tuệ nhân tạo là gì", "Lợi ích của giấc ngủ"]
# Candidate documents: one passage answering each query.
sentences_2 = ["Trí tuệ nhân tạo là công nghệ giúp máy móc suy nghĩ và học hỏi như con người. Nó hoạt động bằng cách thu thập dữ liệu, nhận diện mẫu và đưa ra quyết định.",
               "Giấc ngủ giúp cơ thể và não bộ nghỉ ngơi, hồi phục năng lượng và cải thiện trí nhớ. Ngủ đủ giấc giúp tinh thần tỉnh táo và làm việc hiệu quả hơn."]

# Encode both sides; each call returns one embedding row per input text.
query_embedding = model.encode(sentences_1)
doc_embeddings = model.encode(sentences_2)

# Dot-product similarity: rows are queries, columns are documents.
similarity = query_embedding @ doc_embeddings.T
print(similarity)

'''
array([[0.66212064, 0.33066642],
       [0.25866613, 0.5865289 ]], dtype=float32)
'''
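
To turn a similarity matrix like the one above into retrieval results, each query's row can be sorted by descending score. The following is a minimal sketch in plain Python that reuses the scores shown in the output; the helper name `rank_documents` is hypothetical and is not part of sentence-transformers.

```python
# Similarity scores copied from the Quick Start output
# (rows are queries, columns are documents).
similarity = [
    [0.66212064, 0.33066642],
    [0.25866613, 0.5865289],
]


def rank_documents(scores, top_k=None):
    """Return document indices sorted by descending score (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order if top_k is None else order[:top_k]


# One ranking per query row.
rankings = [rank_documents(row) for row in similarity]
for query_index, ranking in enumerate(rankings):
    print(f"query {query_index}: best document = {ranking[0]}, ranking = {ranking}")
```

With these scores, each query ranks its matching passage first: query 0 prefers document 0 and query 1 prefers document 1.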

✨ Features

Fine-tuned from the BGE-M3 model to enhance Vietnamese retrieval capabilities.
Trained on approximately 300,000 triplets of Vietnamese queries, positive documents, and negative documents.
Trained with a maximum sequence length of 2048 tokens.
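
Inputs longer than the maximum sequence length are typically truncated at encoding time, so very long documents may need to be split before embedding. The sketch below shows one possible approach using overlapping windows. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: real token counts come from the model's tokenizer and will usually be higher. The function name `split_into_windows` and its default values are hypothetical.

```python
def split_into_windows(text, max_units=2048, overlap=128):
    """Split text into overlapping windows of at most max_units words.

    Words are whitespace-separated and only approximate model tokens.
    """
    if not 0 <= overlap < max_units:
        raise ValueError("overlap must be non-negative and smaller than max_units")
    units = text.split()
    if not units:
        return []
    step = max_units - overlap
    windows = []
    for start in range(0, len(units), step):
        windows.append(" ".join(units[start:start + max_units]))
        # Stop once the current window reaches the end of the text.
        if start + max_units >= len(units):
            break
    return windows


# Small demonstration with a tiny window size so the overlap is visible.
chunks = split_into_windows("a b c d e f g h i j", max_units=4, overlap=1)
print(chunks)
```

Each window can then be embedded separately, and the per-window scores combined (for example by taking the maximum) when ranking whole documents.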

📦 Installation

Install the sentence-transformers library with the following command:

pip install sentence-transformers

📚 Documentation

Model Details

Property	Details
Model Type	Sentence Transformer
Base model	BAAI/bge-m3
Maximum Sequence Length	2048 tokens
Output Dimensionality	1024 dimensions
Similarity Function	Dot product Similarity
Language	Vietnamese
License	Apache 2.0
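
The table lists dot product as the similarity function. As a minimal sketch, the plain-Python snippet below computes dot product and cosine similarity on toy 3-dimensional vectors (the model's real embeddings have 1024 dimensions) and shows that the two agree once both vectors are scaled to unit length. Whether the model's outputs are already unit-normalized is not stated here, so treat that as something to verify; the helper names and vectors are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a query and a document embedding.
query = [3.0, 4.0, 0.0]
document = [1.0, 2.0, 2.0]

raw_dot = dot(query, document)
unit_dot = dot(l2_normalize(query), l2_normalize(document))
cos = cosine(query, document)
print(raw_dot, unit_dot, cos)
```

On unnormalized vectors the raw dot product also reflects vector length, which is why it can differ from the cosine value.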

Evaluation

Dataset: the entire training split of Legal Zalo 2021, used here as the evaluation set. Our model was not trained on this dataset.

Model	Accuracy@1	Accuracy@3	Accuracy@5	Accuracy@10	MRR@10
Vietnamese_Reranker	0.7944	0.9324	0.9537	0.9740	0.8672
Vietnamese_Embedding_v2	0.7262	0.8927	0.9268	0.9578	0.8149
Vietnamese_Embedding (public)	0.7274	0.8992	0.9305	0.9568	0.8181
Vietnamese-bi-encoder (BKAI)	0.7109	0.8680	0.9014	0.9299	0.7951
BGE-M3	0.5682	0.7728	0.8382	0.8921	0.6822

Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets. Vietnamese_Embedding_v2 scores slightly lower than Vietnamese_Embedding on most metrics in this legal-domain evaluation, but because its training data is much larger, the authors note that it performs very well on other domains.

Both models are available on their own model pages: Vietnamese_Embedding_v2 and Vietnamese_Reranker.

You can reproduce the evaluation results by running the Python script evaluation_model.py (the data can be downloaded from Kaggle).
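
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute Accuracy@k and MRR@k from ranked retrieval results. It is not the authors' evaluation_model.py. The function names, the toy data, and the convention that a query counts as correct when any relevant document appears in the top k are assumptions made for illustration.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for position, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / position
                break  # Only the first relevant hit counts.
    return total / len(ranked_ids)


# Toy data: three queries, each with a ranked list of document ids
# and a set of ids considered relevant.
ranked_ids = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d9", "d8", "d7"]]
relevant_ids = [{"d1"}, {"d4"}, {"d0"}]

acc1 = accuracy_at_k(ranked_ids, relevant_ids, 1)
acc3 = accuracy_at_k(ranked_ids, relevant_ids, 3)
mrr10 = mrr_at_k(ranked_ids, relevant_ids, 10)
print(acc1, acc3, mrr10)
```

In the toy data, the first query finds its relevant document at rank 1, the second at rank 2, and the third not at all, so MRR@10 averages 1, 0.5, and 0.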

📄 License

This model is licensed under the Apache 2.0 license.

Contact

Email: nguyennhotrung3004@gmail.com

Developer

Members: Nguyễn Nho Trung, Nguyễn Nhật Quang

Citation

@misc{Vietnamese_Embedding,
  title={Vietnamese_Embedding: Embedding model in Vietnamese language.},
  author={Nguyen Nho Trung and Nguyen Nhat Quang},
  year={2025},
  publisher={Huggingface},
}
