gte-multilingual-legal-1e Open-source Sentence Transformer - Multilingual Semantic Similarity Processing for Legal and Administrative Texts

Developed by anhtuansh

This is a fine-tuned sentence transformer model derived from Alibaba-NLP/gte-multilingual-base, specifically optimized for semantic similarity tasks in legal and administrative texts, with support for multilingual processing.

Task: Text Embedding
Format: Safetensors
Tags: Legal Text Similarity, Multilingual Semantic Matching, High-Precision Vectorization
Downloads: 26
Release date: 2/11/2025

Model Overview

The model maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for semantic text similarity, semantic search, paraphrase mining, text classification, clustering, and other tasks, particularly effective for legal and administrative document processing.

Model Features

Legal Text Optimization

Fine-tuned for legal and administrative documents, excelling in processing formal documents and legal provisions.

Long Text Processing Capability

Supports sequences up to 8192 tokens, ideal for handling lengthy legal texts.

High-Precision Semantic Matching

Achieves a cosine accuracy of 0.9997 on the public_administrative triplet evaluation set: in almost every evaluated triplet, the anchor is scored closer to its matching passage than to the non-matching one.

Model Capabilities

Calculate sentence similarity

Semantic search

Text classification

Document clustering

Legal clause matching

Multilingual text processing

Use Cases

Legal Document Processing

Legal Clause Matching

Automatically matches relevant legal clauses to assist in legal research and document drafting.

Accurately identifies different expressions of clauses with similar legal effects.

Contract Clause Review

Compares the similarity between contract clauses and standard legal texts.

Detects deviations between contract clauses and standard legal texts.
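
One way to picture the deviation check described above is a threshold on each clause's best cosine match against the standard texts. The sketch below is illustrative only: `flag_deviations`, the 0.8 threshold, and the toy 3-dimensional vectors are assumptions made for this example, and in practice the vectors would be 768-dimensional embeddings produced by the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flag_deviations(clause_vecs, standard_vecs, threshold=0.8):
    """Return (clause_index, best_standard_index, best_score) for every
    clause whose closest standard text scores below the threshold.
    The threshold is a hypothetical value and would need tuning."""
    flagged = []
    for i, clause in enumerate(clause_vecs):
        scores = [cosine(clause, std) for std in standard_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            flagged.append((i, best, scores[best]))
    return flagged

# Toy stand-ins for embeddings of standard texts and contract clauses.
standard = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
clauses = [[0.9, 0.1, 0.0], [0.5, 0.4, 0.7]]

flagged = flag_deviations(clauses, standard, threshold=0.8)
print(flagged)  # only the second clause falls below the threshold
```

A flagged clause is not necessarily wrong. It is simply far enough from every standard text that a human reviewer may want to look at it.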

Administrative Document Management

Policy Document Classification

Automatically classifies government documents based on content similarity.

Improves document management efficiency and reduces manual classification errors.
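
Classification by content similarity can be sketched as a nearest-centroid rule: average the embeddings of a few labeled documents per category, then assign a new document to the category whose centroid it is most similar to. This is one possible approach, not something the model card prescribes; the category names, helper functions, and toy 3-dimensional vectors are all assumptions for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(vectors):
    """Unit-length mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return normalize(mean)

def classify(doc_vec, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    doc = normalize(doc_vec)
    scores = {
        label: sum(a * b for a, b in zip(doc, c))
        for label, c in centroids.items()
    }
    return max(scores, key=scores.get)

# Toy stand-ins for embeddings of already-labeled documents.
labeled = {
    "tax": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "notarization": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}
centroids = {label: centroid(vecs) for label, vecs in labeled.items()}

predicted = classify([0.7, 0.3, 0.0], centroids)
print(predicted)  # tax
```

With real data, the labeled vectors would come from encoding a handful of representative documents per category.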

🚀 SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This model is a fine-tuned sentence-transformers model derived from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences and paragraphs into a 768-dimensional dense vector space, which can be applied to semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Applicable to various natural language processing tasks such as semantic textual similarity, semantic search, etc.

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

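# Download from the 🤗 Hub. The base model ships a custom architecture
# ("NewModel"), so loading typically requires trust_remote_code=True.
model = SentenceTransformer("anhtuansh/gte-multilingual-legal-1e", trust_remote_code=True)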
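# Run inference on a Vietnamese legal query followed by two candidate passages
# (the text is pre-segmented: underscores join the syllables of a word)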
sentences = [
    'người tiếp_nhận hồ_sơ có trách_nhiệm gì trong quá_trình chứng_thực hợp_đồng , giao_dịch ?',
    'điều 20 . chứng_thực hợp_đồng , giao_dịch tại bộ_phận tiếp_nhận và trả kết_quả theo cơ_chế một cửa , một cửa liên_thông \n 1 . trường_hợp người yêu_cầu chứng_thực hợp_đồng , giao_dịch nộp hồ_sơ trực_tiếp tại bộ_phận tiếp_nhận và trả kết_quả theo cơ_chế một cửa , một cửa liên_thông , thì các bên phải ký trước mặt người tiếp_nhận hồ_sơ . trường_hợp người giao_kết_hợp_đồng , giao_dịch là đại_diện của tổ_chức tín_dụng , doanh_nghiệp đã đăng_ký chữ_ký mẫu tại cơ_quan thực_hiện chứng_thực , thì người đó có_thể ký trước vào hợp_đồng , giao_dịch . người tiếp_nhận hồ_sơ có trách_nhiệm đối_chiếu chữ_ký trong hợp_đồng , giao_dịch với chữ_ký mẫu . nếu thấy chữ_ký trong hợp_đồng , giao_dịch khác chữ_ký mẫu , thì yêu_cầu người đó ký trước mặt người tiếp_nhận hồ_sơ . người tiếp_nhận hồ_sơ phải chịu trách_nhiệm về việc các bên đã ký trước mặt mình . \n 2 . người tiếp_nhận hồ_sơ có trách_nhiệm kiểm_tra giấy_tờ , hồ_sơ .',
    'điều 8 . trị_giá tính thuế , thời_điểm tính thuế \n 1 . trị_giá tính thuế_xuất_khẩu , thuế_nhập_khẩu là trị_giá hải_quan theo quy_định của luật hải_quan . \n 2 . thời_điểm tính thuế_xuất_khẩu , thuế_nhập_khẩu là thời_điểm đăng_ký tờ khai hải_quan . đối_với hàng_hóa xuất_khẩu , nhập_khẩu thuộc đối_tượng không chịu thuế , miễn thuế_xuất_khẩu , thuế_nhập_khẩu hoặc áp_dụng thuế_suất , mức thuế tuyệt_đối trong hạn_ngạch thuế_quan nhưng được thay_đổi về đối_tượng không chịu thuế , miễn thuế , áp_dụng thuế_suất , mức thuế tuyệt_đối trong hạn_ngạch thuế_quan theo quy_định của pháp_luật thì thời_điểm tính thuế là thời_điểm đăng_ký tờ khai hải_quan mới . thời_điểm đăng_ký tờ khai hải_quan thực_hiện theo quy_định của pháp_luật về hải_quan .',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
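
The usage example above places a query first and two candidate passages after it, which is the shape of a semantic search task. As a rough sketch of the ranking step, the function below orders passages by cosine similarity to a query. `top_k` and the toy 3-dimensional vectors are hypothetical stand-ins; with the real model, the first row of `embeddings` would be the query and the remaining rows the passages.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, passage_vecs, k=3):
    """Return up to k (passage_index, score) pairs, best match first."""
    scored = [(cosine(query_vec, p), idx) for idx, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy stand-ins for a query embedding and three passage embeddings.
query = [0.2, 0.9, 0.1]
passages = [
    [0.1, 0.8, 0.2],
    [0.9, 0.1, 0.0],
    [0.3, 0.7, 0.6],
]

results = top_k(query, passages, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

For large corpora, a vector index would normally replace this linear scan, but the scoring idea is the same.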

📚 Documentation

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Base model	[Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
Maximum Sequence Length	8192 tokens
Output Dimensionality	768 dimensions
Similarity Function	Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
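
The architecture listing shows three stages: a transformer producing one vector per token, a pooling layer with `pooling_mode_cls_token` enabled, and a normalization layer. The sketch below imitates the last two stages on toy data to show what they do conceptually. It is not the library's implementation, and the 4-dimensional token vectors are made up for the example.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence vector."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length, mirroring the Normalize() stage."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy per-token output: 3 tokens, 4 dimensions (the real model uses 768).
tokens = [
    [3.0, 4.0, 0.0, 0.0],  # first ([CLS]) position
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]

sentence_vec = l2_normalize(cls_pool(tokens))
print(sentence_vec)  # [0.6, 0.8, 0.0, 0.0]
```

Because the output vectors have unit length, the dot product of two embeddings equals their cosine similarity.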

Evaluation

Metrics

Triplet

Dataset: public_administrative
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.9997
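
To make the metric concrete, the sketch below computes a triplet cosine accuracy on toy data, assuming a triplet counts as correct when the anchor is more similar to the positive than to the negative. The function name and the 2-dimensional vectors are illustrative and do not reproduce the actual evaluation data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets in which the
    anchor is closer to the positive than to the negative."""
    correct = sum(
        1 for anchor, pos, neg in triplets
        if cosine(anchor, pos) > cosine(anchor, neg)
    )
    return correct / len(triplets)

# Toy triplets: (anchor, positive, negative).
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # wrong: the negative is closer
    ([0.5, 0.5], [0.6, 0.4], [-1.0, 0.0]),  # correct
]

accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)  # 0.75
```

Read this way, the reported 0.9997 means that nearly every evaluated triplet was ordered correctly.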
