Vietnamese_Embedding: An Open-Source Vietnamese Embedding Model for Improved Vietnamese Information Retrieval

Vietnamese Embedding

Developed by AITeamVN

A Vietnamese embedding model fine-tuned from BGE-M3 to improve Vietnamese retrieval

Text Embedding

Safetensors

Other: #Vietnamese retrieval enhancement #long-text embedding #legal-domain optimization

Downloads: 14.26k

Release date: 3/17/2025

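Model Overview

Vietnamese_Embedding is an embedding model fine-tuned from BGE-M3 and optimized for Vietnamese retrieval tasks. It was trained on roughly 300,000 Vietnamese triplets, each consisting of a query, a positive document, and a negative document.

Model Features

Optimized for Vietnamese

Fine-tuned specifically for Vietnamese retrieval tasks to improve the quality of Vietnamese text embeddings

Long-Text Support

Supports sequences of up to 2048 tokens, which makes it suitable for long documents

High Performance

Outperforms the base model BGE-M3 and other Vietnamese embedding models on legal text retrieval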

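Model Capabilities

Vietnamese text embedding

Sentence similarity computation

Document retrieval

Use Cases

Information Retrieval

Legal document retrieval

Retrieves documents with high accuracy on a legal text dataset

Accuracy@1 of 0.7274 on the Legal Zalo 2021 dataset

General document retrieval

Applicable to retrieval tasks over a wide range of Vietnamese documents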

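🚀 Vietnamese Embedding Model

Vietnamese_Embedding is an embedding model fine-tuned from BGE-M3 (https://huggingface.co/BAAI/bge-m3) to improve retrieval for Vietnamese.

🚀 Quick Start

The example below encodes two queries and two documents with the model, then scores every query-document pair with a dot product: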

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub and allow inputs of up to 2048 tokens.
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
model.max_seq_length = 2048

# Queries.
sentences_1 = ["Trí tuệ nhân tạo là gì", "Lợi ích của giấc ngủ"]
# Documents; the first answers the first query, the second answers the second.
sentences_2 = ["Trí tuệ nhân tạo là công nghệ giúp máy móc suy nghĩ và học hỏi như con người. Nó hoạt động bằng cách thu thập dữ liệu, nhận diện mẫu và đưa ra quyết định.",
               "Giấc ngủ giúp cơ thể và não bộ nghỉ ngơi, hồi phục năng lượng và cải thiện trí nhớ. Ngủ đủ giấc giúp tinh thần tỉnh táo và làm việc hiệu quả hơn."]

# Each encode() call returns a NumPy array with one row per input text.
query_embedding = model.encode(sentences_1)
doc_embeddings = model.encode(sentences_2)

# Dot-product similarity: entry [i, j] scores query i against document j.
similarity = query_embedding @ doc_embeddings.T
print(similarity)

'''
array([[0.66212064, 0.33066642],
       [0.25866613, 0.5865289 ]], dtype=float32)
'''
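
For retrieval, the similarity matrix is usually turned into a ranking of documents for each query. The following is a minimal sketch of that step, using the scores printed above as a hard-coded array so that it runs without downloading the model. The helper name `rank_documents` and its `top_k` parameter are illustrative assumptions, not part of the model or of sentence-transformers.

```python
import numpy as np

# Scores copied from the quick-start output: rows are queries, columns are documents.
similarity = np.array(
    [[0.66212064, 0.33066642],
     [0.25866613, 0.5865289]],
    dtype=np.float32,
)


def rank_documents(sim: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Return, for each query (row), the indices of its top_k highest-scoring documents.

    Hypothetical helper for illustration only.
    """
    # np.argsort sorts ascending, so negate the scores to order from best to worst.
    order = np.argsort(-sim, axis=1)
    return order[:, :top_k]


top1 = rank_documents(similarity, top_k=1)
print(top1.ravel().tolist())  # [0, 1]: each query retrieves its own document first
```

With real data, the same function can be applied to the `similarity` array produced by the quick-start code.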

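✨ Key Features

The model was trained on roughly 300,000 Vietnamese triplets, each consisting of a query, a positive document, and a negative document.
The maximum sequence length used during training was 2048.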
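
Documents longer than the 2048-token limit are generally truncated when encoded, so one common workaround is to split them into overlapping windows and embed each window separately. The sketch below illustrates that idea on a plain list of token ids. The function `chunk_token_ids` and the overlap of 128 tokens are assumptions made for illustration, not something the model card prescribes, and a real tokenizer may also add special tokens that count toward the limit.

```python
from typing import List

MAX_SEQ_LENGTH = 2048  # limit stated in this model card


def chunk_token_ids(
    token_ids: List[int],
    max_len: int = MAX_SEQ_LENGTH,
    overlap: int = 128,  # assumed value, chosen only for illustration
) -> List[List[int]]:
    """Split token ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears in both windows. Hypothetical helper.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window already reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
    return chunks


# Stand-in for the token ids of a 5000-token document.
fake_ids = list(range(5000))
chunks = chunk_token_ids(fake_ids)
print([len(c) for c in chunks])  # [2048, 2048, 1160]
```

How the per-window embeddings are combined afterwards (for example, keeping the best-scoring window per document) is a separate design choice that the source does not address.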

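📚 Documentation

Model Details

Property	Details
Model type	Sentence Transformer
Base model	[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
Maximum sequence length	2048 tokens
Output dimension	1024
Similarity function	Dot product
Language	Vietnamese
License	Apache 2.0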
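
The table lists the dot product as the similarity function. On unit-length vectors the dot product equals cosine similarity, while on unnormalized vectors it also depends on vector magnitude. The sketch below demonstrates this with random 1024-dimensional stand-ins for real embeddings. Whether the model's own outputs are already unit-length is not stated in this document, so the normalization step here is an assumption shown only for illustration.

```python
import numpy as np

DIM = 1024  # output dimension listed in the table above

# Random stand-ins for real embeddings (illustrative assumption).
rng = np.random.default_rng(0)
a = rng.normal(size=DIM).astype(np.float32)
b = rng.normal(size=DIM).astype(np.float32)


def l2_normalize(u: np.ndarray) -> np.ndarray:
    """Scale a vector to unit Euclidean length."""
    return u / np.linalg.norm(u)


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity computed from the raw vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


dot_raw = float(np.dot(a, b))                               # depends on magnitudes
dot_unit = float(np.dot(l2_normalize(a), l2_normalize(b)))  # magnitude-free

# After normalization, the dot product matches cosine similarity.
print(bool(np.isclose(dot_unit, cosine(a, b), atol=1e-4)))
```

To check real embeddings, `np.linalg.norm(embeddings, axis=1)` shows whether each row already has length close to 1.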

Evaluation

Dataset: the entire training set of Legal Zalo 2021. This model was not trained on that dataset.

模型	Accuracy@1	Accuracy@3	Accuracy@5	Accuracy@10	MRR@10
Vietnamese_Reranker	0.7944	0.9324	0.9537	0.9740	0.8672
Vietnamese_Embedding_v2	0.7262	0.8927	0.9268	0.9578	0.8149
Vietnamese_Embedding (public)	0.7274	0.8992	0.9305	0.9568	0.8181
Vietnamese-bi-encoder (BKAI)	0.7109	0.8680	0.9014	0.9299	0.7951
BGE-M3	0.5682	0.7728	0.8382	0.8921	0.6822

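Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets. Vietnamese_Embedding_v2 scores slightly lower than Vietnamese_Embedding on most legal-domain metrics in the table above (for example, Accuracy@1 of 0.7262 versus 0.7274), but it performs well in other domains because more data was used at that stage.

Both models are available: Vietnamese_Embedding_v2 and Vietnamese_Reranker. To reproduce the evaluation results, run the Python script evaluation_model.py (the data can be downloaded from Kaggle).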
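
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute Accuracy@k and MRR@k from ranked retrieval results. It is not the authors' evaluation_model.py. The definitions are an assumption about how the reported numbers were obtained: Accuracy@k is taken as the fraction of queries with at least one relevant document in the top k, and MRR@k as the mean reciprocal rank of the first relevant document within the top k. The function names and the toy data are hypothetical.

```python
from typing import Sequence, Set


def accuracy_at_k(
    ranked: Sequence[Sequence[int]], relevant: Sequence[Set[int]], k: int
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for docs, rel in zip(ranked, relevant) if any(d in rel for d in docs[:k])
    )
    return hits / len(ranked)


def mrr_at_k(
    ranked: Sequence[Sequence[int]], relevant: Sequence[Set[int]], k: int
) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    A query with no relevant document in the top k contributes 0.
    """
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy data: ranked document ids per query, and the relevant ids per query.
ranked = [[3, 1, 2], [0, 2, 1], [1, 0, 3]]
relevant = [{3}, {1}, {7}]

print(accuracy_at_k(ranked, relevant, 1))  # 1 of 3 queries hit at rank 1
print(accuracy_at_k(ranked, relevant, 3))  # 2 of 3 queries hit within the top 3
print(mrr_at_k(ranked, relevant, 10))      # (1 + 1/3 + 0) / 3
```

In this toy example the first query finds its relevant document at rank 1, the second at rank 3, and the third not at all, which is why MRR@10 falls between Accuracy@1 and Accuracy@3.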

📄 License

This model is released under the Apache 2.0 license.

👥 Contact

Email: nguyennhotrung3004@gmail.com
Developers: Nguyễn Nho Trung, Nguyễn Nhật Quang

📖 Citation

@misc{Vietnamese_Embedding,
  title={Vietnamese_Embedding: Embedding model in Vietnamese language.},
  author={Nguyen Nho Trung, Nguyen Nhat Quang},
  year={2025},
  publisher={Huggingface},
}