# camembert-base-lleqa
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search. The model was trained on the LLeQA dataset for legal information retrieval in French.
## Quick Start

### Installation
To use this model, you need to install sentence-transformers:
```bash
pip install -U sentence-transformers
```
## Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub
model = SentenceTransformer('maastrichtlawtech/camembert-base-lleqa')

# Compute the 768-dimensional sentence embeddings
embeddings = model.encode(sentences)
print(embeddings)
```
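Since the model targets legal information retrieval, a typical use is to embed a French legal question and rank candidate articles by cosine similarity. Below is a minimal sketch of that pattern; the question and articles are made-up placeholders, and `util.cos_sim` comes from sentence-transformers:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('maastrichtlawtech/camembert-base-lleqa')

# Hypothetical query and candidate articles, for illustration only
question = "Quels sont les droits du locataire en cas de vente du logement ?"
articles = [
    "Art. 1. En cas de vente du bien loué, le bail est opposable à l'acquéreur ...",
    "Art. 2. Le juge de paix est compétent pour les litiges locatifs ...",
]

q_emb = model.encode(question, convert_to_tensor=True)
a_embs = model.encode(articles, convert_to_tensor=True)

# Rank articles by cosine similarity to the question
scores = util.cos_sim(q_emb, a_embs)[0]
for article, score in sorted(zip(articles, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {article[:60]}")
```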
### Advanced Usage
Without sentence-transformers, you can use the model as follows: first, pass your input through the transformer model, then apply the appropriate pooling operation (here, mean pooling) on top of the contextualized word embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('maastrichtlawtech/camembert-base-lleqa')
model = AutoModel.from_pretrained('maastrichtlawtech/camembert-base-lleqa')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
```
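As a sanity check, the manual pipeline above should produce (near-)identical vectors to the sentence-transformers one, assuming the checkpoint's pooling configuration is mean pooling as shown; the tolerance below is illustrative:

```python
from sentence_transformers import SentenceTransformer
import torch

# `sentences` and `sentence_embeddings` come from the snippet above
st_model = SentenceTransformer('maastrichtlawtech/camembert-base-lleqa')
st_embeddings = torch.from_numpy(st_model.encode(sentences))

print(torch.allclose(st_embeddings, sentence_embeddings, atol=1e-5))  # expect True
```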
## Evaluation
We evaluated the model on the test set of LLeQA, which consists of 195 legal questions and a knowledge corpus of 27.9K candidate articles. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
| MRR@10 | NDCG@10 | MAP@10 | R@10 | R@100 | R@500 |
|:------:|:-------:|:------:|:----:|:-----:|:-----:|
| 36.55  | 39.27   | 30.64  | 58.27 | 82.43 | 92.41 |
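For reference, the rank-based metrics above can be computed from per-question rankings along these lines; this is a generic sketch rather than the exact evaluation script, and `ranked_ids`/`relevant_ids` are illustrative inputs:

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant article in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of a question's relevant articles retrieved in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```

Each metric is then averaged over the 195 test questions.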
## Technical Details

### Background
We utilized the camembert-base model and fine-tuned it on 9.3K question-article pairs in French. We used a contrastive learning objective: given a short legal question, the model should predict which of a set of sampled legal articles was actually paired with it in the dataset. Formally, we compute the cosine similarity for every possible question-article pair in the batch and then apply the cross-entropy loss with a temperature of 0.05, comparing against the true pairs.
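As a rough illustration of that objective (a minimal sketch, not the actual training code; tensor shapes and the temperature handling are assumptions based on the description above):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(question_embs, article_embs, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss over cosine similarities.

    question_embs, article_embs: (batch_size, dim) tensors where row i of
    each is a true question-article pair; every other article in the batch
    serves as an in-batch negative.
    """
    q = F.normalize(question_embs, dim=-1)
    a = F.normalize(article_embs, dim=-1)
    # (batch_size, batch_size) matrix of cosine similarities for all pairs
    sims = q @ a.T / temperature
    # The true article for question i sits on the diagonal
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, labels)
```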
### Hyperparameters
We trained the model on a single Tesla V100 GPU with 32GB of memory for 20 epochs (i.e., 5.4K steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-5, a weight decay of 0.01, learning-rate warmup over the first 50 steps, and linear decay thereafter. The sequence length was limited to 384 tokens.
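This setup roughly corresponds to the following sentence-transformers training sketch. It is an assumption on our part, not the released training script: the in-batch objective maps to `MultipleNegativesRankingLoss` with `scale = 1/0.05 = 20`, and `train_pairs` is a placeholder for the LLeQA question-article pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('camembert-base')
model.max_seq_length = 384  # truncate inputs to 384 tokens

# train_pairs: list of (question, article) string pairs (placeholder)
train_examples = [InputExample(texts=[q, a]) for q, a in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch contrastive loss; scale=20 corresponds to a 0.05 temperature
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=20,
    warmup_steps=50,                 # linear warmup, then linear decay
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
)
```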
### Data
We used the Long-form Legal Question Answering (LLeQA) dataset to fine-tune the model. LLeQA is a French-native dataset for studying legal information retrieval and question answering. It consists of a knowledge corpus of 27,941 statutory articles collected from Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
## License
This model is licensed under the Apache-2.0 license.
## Citation
```bibtex
@article{louis2023interpretable,
  author     = {Louis, Antoine and van Dijck, Gijs and Spanakis, Gerasimos},
  title      = {Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models},
  journal    = {CoRR},
  volume     = {abs/2309.17050},
  year       = {2023},
  url        = {https://arxiv.org/abs/2309.17050},
  eprinttype = {arXiv},
  eprint     = {2309.17050},
}
```
## Model Information
| Property | Details |
|----------|---------|
| Model Type | sentence-transformers |
| Training Data | LLeQA |
| Metrics | recall |
| Tags | feature-extraction, sentence-similarity |
| Library Name | sentence-transformers |
| Inference | true |