tsdae-lemone-mbert-base Open-source Model - Free Conversion of French Legal Texts into 768-dimensional Vectors

Tsdae Lemone Mbert Base

Developed by louisbrulenaudet

This is a sentence transformer model based on mBERT, specifically optimized for the French legal domain, capable of converting legal texts into 768-dimensional vector representations.

Text Embedding | French | Open Source | License: Apache-2.0 | #French Legal Semantic Analysis #Multi-Code Adaptation #Denoising Autoencoder

Downloads 22

Release Time: 12/17/2023

Model Overview

The model builds on the multilingual BERT (mBERT) architecture and is domain-adapted to French legal texts. It is primarily used for semantic similarity calculation and feature extraction on legal texts.

Model Features

Legal Domain Adaptation

Optimized specifically for French legal texts, so that it better captures legal terminology and expressions.

Multi-Code Training

Training data covers 10 major French legal codes, spanning a wide range of legal domains.

Denoising Autoencoder

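Trained with the TSDAE (Transformer-based Sequential Denoising Auto-Encoder) method, which learns sentence representations by reconstructing sentences from corrupted input and is intended to make the model more robust.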

Model Capabilities

Legal text feature extraction

Legal document semantic search

Legal text clustering analysis

Legal document similarity calculation

Use Cases

Legal Intelligence

Legal Document Retrieval

Quickly find legal provisions semantically similar to the query.

Improves efficiency in legal research and consultation.

Legal Text Classification

Classify legal documents based on semantic features.

Automates document management workflows.

Legal Technology

Smart Legal Assistant

Provides legal professionals with relevant provision recommendations.

Enhances the quality of legal services.

🚀 Domain-adapted mBERT for French Legal Practice

This model is designed for sentence similarity tasks. It maps sentences and paragraphs to a 768-dimensional dense vector space that can be used for clustering, semantic search, and related tasks. The underlying mBERT model was pre-trained with a masked language modeling (MLM) objective on the 102 languages with the largest Wikipedias. It was then fine-tuned for French legal domain adaptation, which lets it learn an internal representation of French legal language and extract features useful for downstream tasks.

🚀 Quick Start

Prerequisites

You need sentence-transformers installed to use this model conveniently. You can install it with the following command:

pip install -U sentence-transformers

Usage

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("louisbrulenaudet/tsdae-lemone-mbert-base")
embeddings = model.encode(sentences)
print(embeddings)
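
The vectors returned by encode() are typically compared with cosine similarity for semantic search or similarity scoring. The following is a minimal sketch in plain Python, not part of the original model card: the helper names cosine_similarity and rank_by_similarity are hypothetical, and short stand-in vectors replace real 768-dimensional embeddings so the example runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in vectors (assumption): with the real model these would be the rows
# of model.encode(sentences), each of length 768.
corpus = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(query, corpus)
print([index for index, _ in ranking])  # corpus indices, most similar first
```

With real embeddings, the query would be encoded with the same model as the corpus, and the top-ranked indices would point at the most semantically similar legal provisions.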

Advanced Usage

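Without sentence-transformers, you can use the model as follows: pass the input through the transformer model, then apply the pooling operation (here, CLS pooling) to the token embeddings.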

from transformers import AutoTokenizer, AutoModel
import torch


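def cls_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state, shaped (batch, seq_len, hidden).
    # Index 0 along seq_len selects the [CLS] token of each sentence.
    # attention_mask is unused here; it is kept for a uniform pooling signature.
    return model_output[0][:, 0]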


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("louisbrulenaudet/tsdae-lemone-mbert-base")
model = AutoModel.from_pretrained("louisbrulenaudet/tsdae-lemone-mbert-base")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)

✨ Features

Multilingual Pretraining: The underlying mBERT model was pretrained with a masked language modeling (MLM) objective on the 102 languages with the largest Wikipedias.
Domain Adaptation: Adapted specifically to the French legal domain, so it better understands and processes French legal texts.
Feature Extraction: Extracts features useful for downstream tasks, such as training classifiers.

🔧 Technical Details

Training Parameters

DataLoader: torch.utils.data.dataloader.DataLoader of length 25000 with parameters:

{'batch_size': 4, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.DenoisingAutoEncoderLoss.DenoisingAutoEncoderLoss

Parameters of the fit() method:

{
    "epochs": 1,
    "evaluation_steps": 0,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 3e-05
    },
    "scheduler": "constantlr",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0
}
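
The logged values can be cross-checked with a little arithmetic. The sketch below only recombines numbers reported above; it assumes one optimizer step per batch (no gradient accumulation), which is not stated in the original.

```python
# Values copied from the training log above.
num_batches = 25_000   # DataLoader length
batch_size = 4
epochs = 1
warmup_steps = 10_000

# Assumption: one optimizer step per batch, no gradient accumulation.
sentences_per_epoch = num_batches * batch_size
total_steps = num_batches * epochs
warmup_fraction = warmup_steps / total_steps

print(sentences_per_epoch)  # 100000
print(total_steps)          # 25000
print(warmup_fraction)      # 0.4
```

The 100,000 sentences per epoch agree with the training set size reported under Training Data. Note that in sentence-transformers the "constantlr" scheduler name appears to select a constant schedule without warm-up, in which case the warmup_steps value would have had no effect; treat that reading as an assumption to verify against the library version used for training.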

Training Data

The training set consisted of 100,000 randomly sampled sentences, each longer than 40 characters, drawn from the following French legal codes:

French Intellectual Property Code (Code de la propriété intellectuelle)
French Civil Code (Code civil)
French Labor Code (Code du travail)
French Monetary and Financial Code (Code monétaire et financier)
French Commercial Code (Code de commerce)
French Penal Code (Code pénal)
French Consumer Code (Code de la consommation)
French Environment Code (Code de l'environnement)
French General Tax Code (Code général des impôts)
French Code of Civil Procedure (Code de procédure civile)

No more than 15,000 sentences were taken from any single code.
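
The selection rule described above (sentences longer than 40 characters, a per-code cap of 15,000, 100,000 sentences overall) could be sketched as follows. This is an illustrative reconstruction, not the author's actual script: the function name sample_training_sentences, the seed, and the toy input are all assumptions.

```python
import random


def sample_training_sentences(sentences_by_code, total=100_000,
                              per_code_cap=15_000, min_chars=40, seed=0):
    """Pool sentences longer than min_chars, capped per code, then draw up to total."""
    rng = random.Random(seed)
    pool = []
    for sentences in sentences_by_code.values():
        eligible = [s for s in sentences if len(s) > min_chars]
        rng.shuffle(eligible)                 # random choice within each code
        pool.extend(eligible[:per_code_cap])  # enforce the per-code cap
    rng.shuffle(pool)                         # mix the codes together
    return pool[:total]


# Toy input (assumption): two "codes" with a handful of fake sentences each.
toy = {
    "Code civil": ["x" * 50 + str(i) for i in range(5)] + ["trop court"],
    "Code pénal": ["y" * 50 + str(i) for i in range(5)],
}
sample = sample_training_sentences(toy, total=6, per_code_cap=4)
print(len(sample))  # 6
```

With ten codes and a cap of 15,000 each, the pool can hold up to 150,000 sentences, so a 100,000-sentence draw is consistent with the cap.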

The DenoisingAutoEncoderDataset provides pairs of noisy and clean sentences. Training on these pairs teaches the denoising autoencoder to reconstruct the clean sentence from its corrupted version.
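
To make the noisy/clean pairing concrete, here is a simplified sketch of deletion noise in plain Python. It is not the library's implementation: it assumes whitespace tokenization (the sentence-transformers dataset uses a proper tokenizer) and a deletion ratio of 0.6, which is, to the best of my knowledge, the library default rather than a value confirmed by this model card. The helper names are hypothetical.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Drop each whitespace token with probability del_ratio, keeping at least one."""
    if rng is None:
        rng = random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    kept = [t for t in tokens if rng.random() > del_ratio]
    if not kept:
        kept = [rng.choice(tokens)]  # never return an empty input
    return " ".join(kept)


def make_pairs(sentences, del_ratio=0.6, seed=0):
    """Build (noisy, clean) pairs; the model learns to recover clean from noisy."""
    rng = random.Random(seed)
    return [(delete_tokens(s, del_ratio, rng), s) for s in sentences]


clean_sentences = [
    "Les conventions légalement formées tiennent lieu de loi à ceux qui les ont faites."
]
pairs = make_pairs(clean_sentences)
noisy, clean = pairs[0]
print(noisy)
print(clean)
```

During training, the encoder compresses the noisy sentence into a single vector and a decoder tries to regenerate the clean sentence from it, which pushes the sentence embedding to carry as much of the meaning as possible.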

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
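
The Pooling configuration above enables only pooling_mode_cls_token, meaning the sentence embedding is the vector of the first ([CLS]) token rather than an average over tokens. The toy sketch below contrasts the two modes on a single four-token sequence with 2-dimensional vectors; the function names and data are illustrative assumptions, not library code.

```python
def cls_pool(token_vectors):
    """Return the first token's vector ([CLS] in BERT-style inputs)."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# One sequence: [CLS], two word pieces, one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(cls_pool(tokens))         # [1.0, 2.0]
print(mean_pool(tokens, mask))  # [3.0, 4.0]
```

CLS pooling ignores the attention mask entirely, which is why the mask argument goes unused in the cls_pooling helper of the Advanced Usage example.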

📚 Documentation

Citing & Authors

If you use this code in your research, please cite it using the following BibTeX entry.

@misc{louisbrulenaudet2023,
  author =       {Louis Brulé Naudet},
  title =        {Domain-adapted mBERT for French Legal Practice},
  year =         {2023},
  howpublished = {\url{https://huggingface.co/louisbrulenaudet/tsdae-lemone-mbert-base}},
}

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.

📄 License

This project is licensed under the Apache-2.0 license.

📦 Model Information

Property	Details
Model Type	Domain-adapted mBERT for French Legal Practice
Training Data	French Intellectual Property Code, French Civil Code, French Labor Code, French Monetary and Financial Code, French Commercial Code, French Penal Code, French Consumer Code, French Environment Code, French General Tax Code, French Code of Civil Procedure
