# camembert-base-lleqa
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search. The model was trained on the LLeQA dataset for legal information retrieval in French.
## Quick Start

### Installation
To use this model, you need to install sentence-transformers:
```bash
pip install -U sentence-transformers
```
## Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub
model = SentenceTransformer('maastrichtlawtech/camembert-base-lleqa')

# Compute the 768-dimensional sentence embeddings
embeddings = model.encode(sentences)
print(embeddings)
```
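Since the model targets legal information retrieval, a typical use is to embed a French legal question and rank candidate articles by cosine similarity. Below is a minimal sketch of that pattern; the question and articles are made-up placeholders, and `util.cos_sim` comes from sentence-transformers:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('maastrichtlawtech/camembert-base-lleqa')

# Hypothetical query and candidate articles, for illustration only
question = "Quels sont les droits du locataire en cas de vente du logement ?"
articles = [
    "Art. 1. En cas de vente du bien loué, le bail est opposable à l'acquéreur ...",
    "Art. 2. Le juge de paix est compétent pour les litiges locatifs ...",
]

q_emb = model.encode(question, convert_to_tensor=True)
a_embs = model.encode(articles, convert_to_tensor=True)

# Rank articles by cosine similarity to the question
scores = util.cos_sim(q_emb, a_embs)[0]
for article, score in sorted(zip(articles, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {article[:60]}")
```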
### Advanced Usage
Without sentence-transformers, you can use the model as follows: first, pass your input through the transformer model, then apply the appropriate pooling operation (here, mean pooling) on top of the contextualized word embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('maastrichtlawtech/camembert-base-lleqa')
model = AutoModel.from_pretrained('maastrichtlawtech/camembert-base-lleqa')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
```
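As a sanity check, the manual pipeline above should produce (near-)identical vectors to the sentence-transformers one, assuming the checkpoint's pooling configuration is mean pooling as shown; the tolerance below is illustrative:

```python
from sentence_transformers import SentenceTransformer
import torch

# `sentences` and `sentence_embeddings` come from the snippet above
st_model = SentenceTransformer('maastrichtlawtech/camembert-base-lleqa')
st_embeddings = torch.from_numpy(st_model.encode(sentences))

print(torch.allclose(st_embeddings, sentence_embeddings, atol=1e-5))  # expect True
```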
## Evaluation
We evaluated the model on the test set of LLeQA, which consists of 195 legal questions and a knowledge corpus of 27.9K candidate articles. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
| MRR@10 | NDCG@10 | MAP@10 | R@10 | R@100 | R@500 |
|:------:|:-------:|:------:|:----:|:-----:|:-----:|
| 36.55  | 39.27   | 30.64  | 58.27 | 82.43 | 92.41 |
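For reference, the rank-based metrics above can be computed from per-question rankings along these lines; this is a generic sketch rather than the exact evaluation script, and `ranked_ids`/`relevant_ids` are illustrative inputs:

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant article in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of a question's relevant articles retrieved in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```

Each metric is then averaged over the 195 test questions.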
## Technical Details

### Background
We utilized the camembert-base model and fine-tuned it on 9.3K question-article pairs in French. We used a contrastive learning objective: given a short legal question, the model should predict which of a set of sampled legal articles was actually paired with it in the dataset. Formally, we compute the cosine similarity for every possible question-article pair in the batch and then apply the cross-entropy loss with a temperature of 0.05, comparing against the true pairs.
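As a rough illustration of that objective (a minimal sketch, not the actual training code; tensor shapes and the temperature handling are assumptions based on the description above):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(question_embs, article_embs, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss over cosine similarities.

    question_embs, article_embs: (batch_size, dim) tensors where row i of
    each is a true question-article pair; every other article in the batch
    serves as an in-batch negative.
    """
    q = F.normalize(question_embs, dim=-1)
    a = F.normalize(article_embs, dim=-1)
    # (batch_size, batch_size) matrix of cosine similarities for all pairs
    sims = q @ a.T / temperature
    # The true article for question i sits on the diagonal
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, labels)
```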
### Hyperparameters
We trained the model on a single Tesla V100 GPU with 32GB of memory for 20 epochs (i.e., 5.4K steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-5, a weight decay of 0.01, learning-rate warmup over the first 50 steps, and linear decay thereafter. The sequence length was limited to 384 tokens.
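This setup roughly corresponds to the following sentence-transformers training sketch. It is an assumption on our part, not the released training script: the in-batch objective maps to `MultipleNegativesRankingLoss` with `scale = 1/0.05 = 20`, and `train_pairs` is a placeholder for the LLeQA question-article pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('camembert-base')
model.max_seq_length = 384  # truncate inputs to 384 tokens

# train_pairs: list of (question, article) string pairs (placeholder)
train_examples = [InputExample(texts=[q, a]) for q, a in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch contrastive loss; scale=20 corresponds to a 0.05 temperature
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=20,
    warmup_steps=50,                 # linear warmup, then linear decay
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
)
```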
### Data
We used the Long-form Legal Question Answering (LLeQA) dataset to fine-tune the model. LLeQA is a French-native dataset for studying legal information retrieval and question answering. It consists of a knowledge corpus of 27,941 statutory articles collected from Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
## License
This model is licensed under the Apache-2.0 license.
## Citation
```bibtex
@article{louis2023interpretable,
  author     = {Louis, Antoine and van Dijck, Gijs and Spanakis, Gerasimos},
  title      = {Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models},
  journal    = {CoRR},
  volume     = {abs/2309.17050},
  year       = {2023},
  url        = {https://arxiv.org/abs/2309.17050},
  eprinttype = {arXiv},
  eprint     = {2309.17050},
}
```
## Model Information
| Property | Details |
|----------|---------|
| Model Type | sentence-transformers |
| Training Data | LLeQA |
| Metrics | recall |
| Tags | feature-extraction, sentence-similarity |
| Library Name | sentence-transformers |
| Inference | true |