# 🚀 DEk21_hcmute_embedding
DEk21_hcmute_embedding is a Vietnamese text embedding model built for Retrieval-Augmented Generation (RAG) and production efficiency. It addresses the need for accurate, efficient text representation in real-world legal applications, enabling faster and more effective information retrieval.
## 🚀 Quick Start

### Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

# Query: "What are the conditions for a lawful marriage?"
query = "Điều kiện để kết hôn hợp pháp là gì?"

# Candidate passages citing Vietnamese statutes on marriage and civil law.
docs = [
    "Điều 8 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của công dân trong quan hệ gia đình.",
    "Điều 18 Luật Hôn nhân và gia đình 2014 quy định về độ tuổi kết hôn của nam và nữ.",
    "Điều 14 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của cá nhân khi tham gia hợp đồng.",
    "Điều 27 Luật Hôn nhân và gia đình 2014 quy định về các trường hợp không được kết hôn.",
    "Điều 51 Luật Hôn nhân và gia đình 2014 quy định về việc kết hôn giữa công dân Việt Nam và người nước ngoài.",
]

# Encode the query and the candidate documents.
query_embedding = model.encode([query])
doc_embeddings = model.encode(docs)

# Rank documents by cosine similarity to the query.
similarities = torch.nn.functional.cosine_similarity(
    torch.tensor(query_embedding), torch.tensor(doc_embeddings)
).flatten()
sorted_indices = torch.argsort(similarities, descending=True)

for idx in sorted_indices:
    print(f"Document: {docs[idx]} - Cosine Similarity: {similarities[idx]:.4f}")
```
## ✨ Features

- 📚 Training data: The model was trained on an in-house dataset of approximately 100,000 legal questions paired with their related contexts.
- 🪆 Efficiency: Trained with a Matryoshka loss, so embeddings can be truncated with minimal performance loss. Smaller embeddings are faster to store and compare, making the model efficient for real-world production use (see the truncation sketch below).
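The snippet below is a minimal sketch of Matryoshka-style truncation, not a documented recipe for this model: the 256-dimension cut-off is an illustrative choice, and the renormalization simply keeps cosine similarity well-scaled on the shortened vectors.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

# Full-size embeddings: shape (1, 768).
full = model.encode(["Điều kiện để kết hôn hợp pháp là gì?"])

# Matryoshka-style truncation: keep only the first `dim` dimensions
# (dim=256 is an arbitrary example), then re-normalize to unit length.
dim = 256
truncated = full[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (1, 256)
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument in the `SentenceTransformer` constructor, which performs this truncation for you at encode time.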
## 📦 Installation

To use this model, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

query = "A simple legal query"
doc = "A related legal document"

# Encode both texts and compute their cosine similarity.
query_embedding = model.encode([query])
doc_embedding = model.encode([doc])
similarity = torch.nn.functional.cosine_similarity(
    torch.tensor(query_embedding), torch.tensor(doc_embedding)
).item()
print(f"Similarity between query and document: {similarity:.4f}")
```
## 📚 Documentation

### Model Details

#### Model Description

| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Language | Vietnamese |
| License | apache-2.0 |
#### Model Sources

#### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
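The listing above shows a RoBERTa encoder followed by mean pooling. As a rough illustration of what that pooling step does, the sketch below reproduces it with the raw `transformers` Auto classes; it assumes the checkpoint ships standard RoBERTa weights and tokenizer files, which the card does not explicitly confirm.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "huyydangg/DEk21_hcmute_embedding"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

inputs = tokenizer(
    ["Điều kiện để kết hôn hợp pháp là gì?"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (batch, seq, 768)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```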
## 🔧 Technical Details

### Evaluation

#### Metrics: Information Retrieval

- Dataset: [another-symato/VMTEB-Zalo-legel-retrieval-wseg](https://huggingface.co/datasets/another-symato/VMTEB-Zalo-legel-retrieval-wseg)
- Evaluated with `InformationRetrievalEvaluator`
| Model | Type | NDCG@3 | NDCG@5 | NDCG@10 | MRR@3 | MRR@5 | MRR@10 |
|-------|------|--------|--------|---------|-------|-------|--------|
| huyydangg/DEk21_hcmute_embedding_wseg | dense | 0.908405 | 0.914792 | 0.917742 | 0.889583 | 0.893099 | 0.894266 |
| AITeamVN/Vietnamese_Embedding | dense | 0.842687 | 0.854993 | 0.865006 | 0.822135 | 0.829010 | 0.833389 |
| bkai-foundation-models/vietnamese-bi-encoder | hybrid | 0.827247 | 0.844781 | 0.846937 | 0.799219 | 0.809505 | 0.806771 |
| bkai-foundation-models/vietnamese-bi-encoder | dense | 0.814116 | 0.829650 | 0.839567 | 0.796615 | 0.805286 | 0.809572 |
| AITeamVN/Vietnamese_Embedding | hybrid | 0.788724 | 0.810062 | 0.820797 | 0.758333 | 0.772240 | 0.776461 |
| BAAI/bge-m3 | dense | 0.784056 | 0.806650 | 0.817016 | 0.763281 | 0.775859 | 0.780293 |
| BAAI/bge-m3 | hybrid | 0.775239 | 0.797382 | 0.811962 | 0.747656 | 0.763333 | 0.771280 |
| huyydangg/DEk21_hcmute_embedding | dense | 0.752173 | 0.769259 | 0.785101 | 0.724740 | 0.734427 | 0.741076 |
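For reference, a run of the evaluator named above can be set up as in the sketch below. This uses made-up toy data, not the actual evaluation protocol; the reported numbers come from the VMTEB-Zalo legal retrieval dataset linked earlier.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")

# Toy data: query and corpus IDs map to texts; relevant_docs maps each
# query ID to the set of corpus IDs that count as relevant for it.
queries = {"q1": "Điều kiện để kết hôn hợp pháp là gì?"}
corpus = {
    "d1": "Điều 8 Luật Hôn nhân và gia đình 2014 quy định điều kiện kết hôn.",
    "d2": "Điều 14 Bộ luật Dân sự 2015 quy định về năng lực pháp luật dân sự.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy-legal-ir",
)
# Computes retrieval metrics such as NDCG@k and MRR@k (a dict in
# sentence-transformers v3+; earlier versions return the primary score).
print(evaluator(model))
```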
## 📄 License

This model is released under the Apache 2.0 license.