
Vn Law Embedding

Developed by truro7

Embedding model specifically designed for Vietnamese legal text retrieval, supporting efficient legal document retrieval and Q&A

Text Embedding

Safetensors

Open Source License: Apache-2.0

#Vietnamese legal retrieval #RAG-optimized embedding #Matryoshka dimensionality reduction

Downloads 21.87k

Release Time: 9/7/2024

Model Overview

A text embedding model trained on Vietnamese legal questions and their related legal documents, so that it can retrieve the documents needed to answer a legal question. It is trained with a Matryoshka loss, which allows embeddings to be truncated to fewer dimensions.

Model Features

Legal domain optimization

Trained and optimized specifically for Vietnamese legal texts

Matryoshka dimensionality truncation

Embeddings can be truncated to fewer dimensions (e.g., 128) to speed up retrieval

Efficient retrieval capability

Reaches 90% Accuracy@10 on the Vietnamese legal evaluation data

Model Capabilities

Legal text feature extraction

Legal document similarity calculation

Legal Q&A retrieval enhancement

Vietnamese language semantic understanding

Use Cases

Intelligent legal Q&A

Legal clause retrieval

Quickly locate relevant legal provisions based on natural language questions

Achieves 82.3% Recall@10 on the test set

Legal document management

Similar case search

Find similar precedents or legal documents through semantic retrieval

🚀 VN Law Embedding

VN Law Embedding is a Vietnamese text embedding model built for Retrieval-Augmented Generation (RAG). Its primary purpose is to retrieve the legal documents that answer a legal query. The model is trained on a dataset of Vietnamese legal questions and their corresponding legal documents, and it is evaluated with an Information Retrieval Evaluator. It is trained with Matryoshka loss, so its embeddings can be truncated to fewer dimensions (for example 128), which makes comparisons between queries and documents faster while aiming to preserve retrieval quality.
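
To make the truncation idea concrete, here is a minimal sketch of what shortening an embedding amounts to: keep the leading values and compare the shortened vectors. It uses plain Python lists and made-up 4-dimensional vectors so that it runs without downloading the model. The helper names `truncate_and_normalize` and `cosine_similarity` are hypothetical, and the re-normalization step is an assumption added for illustration, not something the model card describes.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values of an embedding and rescale to unit length."""
    head = [float(x) for x in vector[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    # Guard against an all-zero prefix to avoid dividing by zero.
    return head if norm == 0.0 else [x / norm for x in head]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings (made-up values).
query_full = [3.0, 4.0, 1.0, 2.0]
doc_full = [4.0, 3.0, -2.0, 1.0]

# Truncate both vectors to their first 2 dimensions.
query_small = truncate_and_normalize(query_full, 2)
doc_small = truncate_and_normalize(doc_full, 2)

print(cosine_similarity(query_full, doc_full))    # score with all dimensions
print(cosine_similarity(query_small, doc_small))  # score after truncation
```

In this toy case the two scores differ (about 0.80 with all dimensions, 0.96 after truncation), which shows that truncation is an approximation in general. Matryoshka training is meant to concentrate the most useful information in the leading dimensions so that rankings stay close to those of the full embedding.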

📦 Installation

The model can be used directly through sentence-transformers. Install the library with the following command:

pip install sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F

model = SentenceTransformer("truro7/vn-law-embedding", truncate_dim = 128)

query = "Trộm cắp sẽ bị xử lý như thế nào?" 

corpus = """

[100_2015_QH13]

LUẬT HÌNH SỰ
Điều 173. Tội trộm cắp tài sản

Khoản 1:

1. Người nào trộm cắp tài sản của người khác trị giá từ 2.000.000 đồng đến dưới 50.000.000 đồng hoặc dưới 2.000.000 đồng nhưng thuộc một trong các trường hợp sau đây, thì bị phạt cải tạo không giam giữ đến 03 năm hoặc phạt tù từ 06 tháng đến 03 năm:
a) Đã bị xử phạt vi phạm hành chính về hành vi chiếm đoạt tài sản mà còn vi phạm;
b) Đã bị kết án về tội này hoặc về một trong các tội quy định tại các điều 168, 169, 170, 171, 172, 174, 175 và 290 của Bộ luật này, chưa được xóa án tích mà còn vi phạm;
c) Gây ảnh hưởng xấu đến an ninh, trật tự, an toàn xã hội;
d) Tài sản là phương tiện kiếm sống chính của người bị hại và gia đình họ; tài sản là kỷ vật, di vật, đồ thờ cúng có giá trị đặc biệt về mặt tinh thần đối với người bị hại.
    
"""

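# Encode the query and the passage as tensors
embedding = model.encode(query, convert_to_tensor=True)
corpus_embedding = model.encode(corpus, convert_to_tensor=True)

# Both embeddings are 1-D vectors, so compare them along dim 0
cosine_similarity = F.cosine_similarity(embedding, corpus_embedding, dim=0)

print(cosine_similarity.item())  # 0.81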

Advanced Usage

from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F

model = SentenceTransformer("truro7/vn-law-embedding", truncate_dim=128)

# read_all_docs() and vectordb_path are placeholders that you must provide.
all_docs = read_all_docs()  # Read all legal documents -> list of document contents
top_k = 3
# Vector database: precomputed 128-dim embeddings of all_docs, in the same order
embedding_docs = torch.load(vectordb_path, weights_only=False).to(model.device)
embedding_docs = embedding_docs.reshape(len(all_docs), -1)  # One row per document

query = "Trộm cắp sẽ bị xử lý như thế nào?"
embedding = model.encode(query, convert_to_tensor=True)

# Broadcasting compares the query with every row, giving one score per document
cosine_similarities = F.cosine_similarity(embedding.unsqueeze(0), embedding_docs, dim=-1)
top_results = cosine_similarities.topk(top_k)

top_k_indices = top_results.indices
top_k_values = top_results.values

print(top_k_values)  # Similarity scores

for i in top_k_indices.tolist():  # Show the top-k most relevant documents
    print(all_docs[i])
    print("___________________________________________")

📄 License

This project is licensed under the Apache-2.0 license.

📊 Model Information

Property	Details
Model Type	VN Law Embedding
Base Model	hiieu/halong_embedding
Library Name	sentence-transformers
Pipeline Tag	sentence-similarity
Training Datasets	truro7/vn-law-questions-and-corpus
Metrics	cosine_accuracy@1, cosine_accuracy@3, cosine_accuracy@5, cosine_accuracy@10, cosine_precision@1, cosine_precision@3, cosine_precision@5, cosine_precision@10, cosine_recall@1, cosine_recall@3, cosine_recall@5, cosine_recall@10, cosine_ndcg@10, cosine_mrr@10, cosine_map@100
Tags	legal, sentence-transformers, sentence-similarity, feature-extraction, generated_from_trainer, loss:MatryoshkaLoss, loss:MultipleNegativesRankingLoss

📈 Model Results

Task	Metric	Value
Information Retrieval	Cosine Accuracy@1	0.623
Information Retrieval	Cosine Accuracy@3	0.792
Information Retrieval	Cosine Accuracy@5	0.851
Information Retrieval	Cosine Accuracy@10	0.900
Information Retrieval	Cosine Precision@1	0.623
Information Retrieval	Cosine Precision@3	0.412
Information Retrieval	Cosine Precision@5	0.310
Information Retrieval	Cosine Precision@10	0.184
Information Retrieval	Cosine Recall@1	0.353
Information Retrieval	Cosine Recall@3	0.608
Information Retrieval	Cosine Recall@5	0.722
Information Retrieval	Cosine Recall@10	0.823
Information Retrieval	Cosine NDCG@10	0.706
Information Retrieval	Cosine MRR@10	0.717
Information Retrieval	Cosine MAP@100	0.645
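
As a reading aid for the table above, the following sketch computes accuracy@k, precision@k, recall@k and the reciprocal rank for a single query, using the usual information-retrieval definitions. The function name `retrieval_metrics_at_k` and the document IDs are hypothetical, and this is not the evaluator's actual code. The reported scores would be averages of such per-query values over the whole evaluation set.

```python
def retrieval_metrics_at_k(ranked_ids, relevant_ids, k):
    """Compute per-query retrieval metrics over the top-k ranked document IDs."""
    top = ranked_ids[:k]
    hits = [doc_id in relevant_ids for doc_id in top]
    num_hits = sum(hits)

    # Reciprocal rank of the first relevant document within the top k (0 if none).
    reciprocal_rank = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "accuracy": 1.0 if num_hits > 0 else 0.0,  # at least one relevant hit
        "precision": num_hits / k,                 # share of the top k that is relevant
        "recall": num_hits / len(relevant_ids),    # share of relevant docs retrieved
        "mrr": reciprocal_rank,
    }

# Hypothetical ranking for one query that has three relevant documents.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print(retrieval_metrics_at_k(ranked, relevant, k=3))
```

These definitions are consistent with the pattern in the table: at k=1, accuracy and precision coincide (both 0.623), while recall@1 is lower (0.353), which suggests that many questions have more than one relevant document.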
