# 🚀 DEk21_hcmute_embedding
DEk21_hcmute_embedding is a Vietnamese text embedding model built for Retrieval-Augmented Generation (RAG) and production efficiency. It addresses the need for accurate, efficient text representation in real-world legal applications, enabling faster and more effective information retrieval.
## 🚀 Quick Start

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

# Query: "What are the conditions for a lawful marriage?"
query = "Điều kiện để kết hôn hợp pháp là gì?"

# Candidate passages citing Vietnamese statutes on marriage and civil law.
docs = [
    "Điều 8 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của công dân trong quan hệ gia đình.",
    "Điều 18 Luật Hôn nhân và gia đình 2014 quy định về độ tuổi kết hôn của nam và nữ.",
    "Điều 14 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của cá nhân khi tham gia hợp đồng.",
    "Điều 27 Luật Hôn nhân và gia đình 2014 quy định về các trường hợp không được kết hôn.",
    "Điều 51 Luật Hôn nhân và gia đình 2014 quy định về việc kết hôn giữa công dân Việt Nam và người nước ngoài.",
]

# Encode the query and the candidate documents.
query_embedding = model.encode([query])
doc_embeddings = model.encode(docs)

# Rank documents by cosine similarity to the query.
similarities = torch.nn.functional.cosine_similarity(
    torch.tensor(query_embedding), torch.tensor(doc_embeddings)
).flatten()
sorted_indices = torch.argsort(similarities, descending=True)

for idx in sorted_indices:
    print(f"Document: {docs[idx]} - Cosine Similarity: {similarities[idx]:.4f}")
```
## ✨ Features

- 📚 Training data: The model was trained on an in-house dataset of approximately 100,000 legal questions paired with their related contexts.
- 🪆 Efficiency: Trained with a Matryoshka loss, so embeddings can be truncated with minimal performance loss. Smaller embeddings are faster to store and compare, making the model efficient for real-world production use (see the truncation sketch below).
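The snippet below is a minimal sketch of Matryoshka-style truncation, not a documented recipe for this model: the 256-dimension cut-off is an illustrative choice, and the renormalization simply keeps cosine similarity well-scaled on the shortened vectors.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

# Full-size embeddings: shape (1, 768).
full = model.encode(["Điều kiện để kết hôn hợp pháp là gì?"])

# Matryoshka-style truncation: keep only the first `dim` dimensions
# (dim=256 is an arbitrary example), then re-normalize to unit length.
dim = 256
truncated = full[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (1, 256)
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument in the `SentenceTransformer` constructor, which performs this truncation for you at encode time.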
## 📦 Installation

To use this model, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

query = "A simple legal query"
doc = "A related legal document"

# Encode both texts and compute their cosine similarity.
query_embedding = model.encode([query])
doc_embedding = model.encode([doc])
similarity = torch.nn.functional.cosine_similarity(
    torch.tensor(query_embedding), torch.tensor(doc_embedding)
).item()
print(f"Similarity between query and document: {similarity:.4f}")
```
## 📚 Documentation

### Model Details

#### Model Description

| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Language | Vietnamese |
| License | apache-2.0 |
#### Model Sources

#### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
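The listing above shows a RoBERTa encoder followed by mean pooling. As a rough illustration of what that pooling step does, the sketch below reproduces it with the raw `transformers` Auto classes; it assumes the checkpoint ships standard RoBERTa weights and tokenizer files, which the card does not explicitly confirm.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "huyydangg/DEk21_hcmute_embedding"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

inputs = tokenizer(
    ["Điều kiện để kết hôn hợp pháp là gì?"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (batch, seq, 768)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```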
## 🔧 Technical Details

### Evaluation

#### Metrics: Information Retrieval

- Dataset: [another-symato/VMTEB-Zalo-legel-retrieval-wseg](https://huggingface.co/datasets/another-symato/VMTEB-Zalo-legel-retrieval-wseg)
- Evaluated with `InformationRetrievalEvaluator`
| Model | Type | NDCG@3 | NDCG@5 | NDCG@10 | MRR@3 | MRR@5 | MRR@10 |
|-------|------|--------|--------|---------|-------|-------|--------|
| huyydangg/DEk21_hcmute_embedding_wseg | dense | 0.908405 | 0.914792 | 0.917742 | 0.889583 | 0.893099 | 0.894266 |
| AITeamVN/Vietnamese_Embedding | dense | 0.842687 | 0.854993 | 0.865006 | 0.822135 | 0.829010 | 0.833389 |
| bkai-foundation-models/vietnamese-bi-encoder | hybrid | 0.827247 | 0.844781 | 0.846937 | 0.799219 | 0.809505 | 0.806771 |
| bkai-foundation-models/vietnamese-bi-encoder | dense | 0.814116 | 0.829650 | 0.839567 | 0.796615 | 0.805286 | 0.809572 |
| AITeamVN/Vietnamese_Embedding | hybrid | 0.788724 | 0.810062 | 0.820797 | 0.758333 | 0.772240 | 0.776461 |
| BAAI/bge-m3 | dense | 0.784056 | 0.806650 | 0.817016 | 0.763281 | 0.775859 | 0.780293 |
| BAAI/bge-m3 | hybrid | 0.775239 | 0.797382 | 0.811962 | 0.747656 | 0.763333 | 0.771280 |
| huyydangg/DEk21_hcmute_embedding | dense | 0.752173 | 0.769259 | 0.785101 | 0.724740 | 0.734427 | 0.741076 |
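For reference, a run of the evaluator named above can be set up as in the sketch below. This uses made-up toy data, not the actual evaluation protocol; the reported numbers come from the VMTEB-Zalo legal retrieval dataset linked earlier.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

# Toy data: query and corpus IDs map to texts; relevant_docs maps each
# query ID to the set of corpus IDs that count as relevant for it.
queries = {"q1": "Điều kiện để kết hôn hợp pháp là gì?"}
corpus = {
    "d1": "Điều 8 Luật Hôn nhân và gia đình 2014 quy định điều kiện kết hôn.",
    "d2": "Điều 14 Bộ luật Dân sự 2015 quy định về năng lực pháp luật dân sự.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy-legal-ir",
)
# Computes retrieval metrics such as NDCG@k and MRR@k (a dict in
# sentence-transformers v3+; earlier versions return the primary score).
print(evaluator(model))
```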
## 📄 License

This model is released under the Apache 2.0 license.