🚀 Domain-adapted mBERT for French Legal Practice
This model is designed for sentence-similarity tasks. It maps sentences and paragraphs to a 768-dimensional dense vector space that can be used for clustering, semantic search, and related tasks. The base model was pre-trained on the 102 languages with the largest Wikipedias using a masked language modeling (MLM) objective. It was then adapted to the French legal domain, so that it learns an internal representation of French legal language and yields features useful for downstream tasks.
🚀 Quick Start
Prerequisites
You need sentence-transformers installed to use this model conveniently. Install it with the following command:
pip install -U sentence-transformers
Usage
Basic Usage
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer("louisbrulenaudet/tsdae-lemone-mbert-base")
embeddings = model.encode(sentences)
print(embeddings)
Advanced Usage
Without sentence-transformers, you can load the model with Hugging Face transformers and apply CLS pooling to the token embeddings yourself:
from transformers import AutoTokenizer, AutoModel
import torch

def cls_pooling(model_output, attention_mask):
    # CLS pooling: keep the first token's hidden state (attention_mask is unused)
    return model_output[0][:, 0]

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the tokenizer and the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("louisbrulenaudet/tsdae-lemone-mbert-base")
model = AutoModel.from_pretrained("louisbrulenaudet/tsdae-lemone-mbert-base")

# Tokenize the sentences into padded PyTorch tensors
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool to one vector per sentence
sentence_embeddings = cls_pooling(model_output, encoded_input["attention_mask"])
print("Sentence embeddings:")
print(sentence_embeddings)
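Once sentence embeddings are available, semantic search comes down to ranking vectors by cosine similarity. The sketch below illustrates that step using only the standard library. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings the model would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): real inputs would be rows of model.encode(...)
query = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query, corpus)
print([index for index, _ in ranking])
```

With real embeddings, each row returned by `model.encode(sentences)` can be converted with `.tolist()` and passed in directly. For large corpora, a vectorized implementation is likely a better fit than this pure-Python loop.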
✨ Features
- Multilingual Pretraining: Pretrained on the 102 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
- Domain Adaptation: Specifically adapted to the French legal domain, enabling it to better understand and process French legal texts.
- Feature Extraction: Can extract features useful for downstream tasks, such as training classifiers.
🔧 Technical Details
Training Parameters
DataLoader: torch.utils.data.dataloader.DataLoader of length 25000 with parameters:
{'batch_size': 4, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss: sentence_transformers.losses.DenoisingAutoEncoderLoss.DenoisingAutoEncoderLoss
Parameters of the fit() method:
{
    "epochs": 1,
    "evaluation_steps": 0,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 3e-05
    },
    "scheduler": "constantlr",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0
}
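The DataLoader length reported above follows from the batch size and the 100,000-sentence training set described under Training Data. A small worked check, where the helper name `steps_per_epoch` is hypothetical and the last partial batch is assumed to be kept:

```python
import math

def steps_per_epoch(num_sentences, batch_size):
    """Number of batches per epoch, assuming the last partial batch is kept."""
    return math.ceil(num_sentences / batch_size)

num_sentences = 100_000  # training set size reported in this card
batch_size = 4           # from the DataLoader parameters
epochs = 1               # from the fit() parameters

total_steps = steps_per_epoch(num_sentences, batch_size) * epochs
print(total_steps)  # 25000, matching the reported DataLoader length
```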
Training Data
The training dataset consists of 100,000 randomly sampled sentences, each longer than 40 characters, drawn from the following French legal codes:
- French Intellectual Property Code (Code de la propriété intellectuelle)
- French Civil Code (Code civil)
- French Labor Code (Code du travail)
- French Monetary and Financial Code (Code monétaire et financier)
- French Commercial Code (Code de commerce)
- French Penal Code (Code pénal)
- French Consumer Code (Code de la consommation)
- French Environment Code (Code de l'environnement)
- French General Tax Code (Code général des impôts)
- French Code of Civil Procedure (Code de procédure civile)
No more than 15,000 sentences were taken from any single code.
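The selection rules above (a minimum length, a per-code cap, and a fixed total) could be sketched as follows. This is an illustration under assumptions, not the author's actual sampling script: the function name `sample_sentences`, the input layout (a dict mapping code names to sentence lists), and the shuffle-then-truncate strategy are all hypothetical.

```python
import random

def sample_sentences(sentences_by_code, total=100_000, min_chars=40,
                     cap_per_code=15_000, seed=0):
    """Draw a random training set: long-enough sentences, capped per code."""
    rng = random.Random(seed)
    pool = []
    for code_name, sentences in sentences_by_code.items():
        # Keep only sentences strictly longer than min_chars characters
        eligible = [s for s in sentences if len(s) > min_chars]
        rng.shuffle(eligible)
        # Enforce the per-code cap before mixing the codes together
        pool.extend((code_name, s) for s in eligible[:cap_per_code])
    rng.shuffle(pool)
    return pool[:total]

# Toy input (assumption): placeholder strings instead of real legal text
toy = {
    "Code civil": ["Court.", "x" * 41, "y" * 50, "z" * 60],
    "Code pénal": ["a" * 45, "trop court"],
}
sample = sample_sentences(toy, total=3, cap_per_code=2)
print(len(sample))
```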
The DenoisingAutoEncoderDataset provides pairs of noisy and clean sentences. Training on these pairs teaches the denoising autoencoder to reconstruct the clean sentence from its corrupted version.
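To make the noisy/clean pairing concrete, here is a simplified sketch of token-deletion noise, the corruption commonly used in the TSDAE recipe. It is not the library's implementation: it splits on whitespace, the helper names `delete_tokens` and `make_pair` are hypothetical, and the deletion ratio of 0.6 is an assumption, since the card does not state the ratio used for this model.

```python
import random

def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Randomly drop whitespace-separated tokens, always keeping at least one."""
    if rng is None:
        rng = random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Each token survives with probability (1 - del_ratio)
    kept = [tok for tok in tokens if rng.random() > del_ratio]
    if not kept:
        kept = [rng.choice(tokens)]  # avoid returning an empty input
    return " ".join(kept)

def make_pair(sentence, rng=None):
    """Return (noisy, clean): the encoder sees noisy, the decoder targets clean."""
    return delete_tokens(sentence, rng=rng), sentence

rng = random.Random(42)
clean = "Les conventions légalement formées tiennent lieu de loi à ceux qui les ont faites."
noisy, target = make_pair(clean, rng=rng)
print(noisy)
```

The surviving tokens keep their original order, so the model has to infer the missing words from what remains.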
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
📚 Documentation
Citing & Authors
If you use this model in your research, please cite it with the following BibTeX entry.
@misc{louisbrulenaudet2023,
  author = {Louis Brulé Naudet},
  title = {Domain-adapted mBERT for French Legal Practice},
  year = {2023},
  howpublished = {\url{https://huggingface.co/louisbrulenaudet/tsdae-lemone-mbert-base}},
}
Feedback
If you have any feedback, please reach out at louisbrulenaudet@icloud.com.
📄 License
This project is licensed under the Apache-2.0 license.
📦 Model Information
| Property | Details |
|---|---|
| Model Type | Domain-adapted mBERT for French Legal Practice |
| Training Data | French Intellectual Property Code, French Civil Code, French Labor Code, French Monetary and Financial Code, French Commercial Code, French Penal Code, French Consumer Code, French Environment Code, French General Tax Code, French Code of Civil Procedure |