DISTIL-ITA-LEGAL-BERT
A fast and lightweight student model created through knowledge distillation, capable of generating sentence embeddings similar to the more complex ITALIAN-LEGAL-BERT teacher model.

🚀 Quick Start
We used knowledge distillation to build a fast, lightweight student model with only 4 Transformer layers that generates sentence embeddings comparable to those of the larger ITALIAN-LEGAL-BERT teacher model. The student was trained on the ITALIAN-LEGAL-BERT training set (3.7 GB) with the Sentence-BERT (sentence-transformers) library, minimizing the mean squared error (MSE) between its embeddings and those of the teacher.
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.
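To make the distillation objective concrete, here is a minimal sketch that compares teacher and student embeddings on a pair of sentences. It assumes `dlicari/Italian-Legal-BERT` as the teacher checkpoint id and uses invented example sentences; neither is confirmed by this card.

```python
import torch
from sentence_transformers import SentenceTransformer

# Assumed teacher checkpoint id (not stated explicitly in this card).
teacher = SentenceTransformer('dlicari/Italian-Legal-BERT')
student = SentenceTransformer('dlicari/distil-ita-legal-bert')

# Illustrative Italian legal sentences.
sentences = ["Il contratto è nullo.", "La sentenza è definitiva."]

teacher_emb = teacher.encode(sentences, convert_to_tensor=True)
student_emb = student.encode(sentences, convert_to_tensor=True)

# Distillation minimizes exactly this quantity over the training set.
print(torch.nn.functional.mse_loss(student_emb, teacher_emb).item())
```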
✨ Features
- Knowledge Distillation: a fast, lightweight 4-layer student model distilled from ITALIAN-LEGAL-BERT.
- Similar Embeddings: produces sentence embeddings close to those of the ITALIAN-LEGAL-BERT teacher model.
- Versatile Use: suitable for clustering and semantic search tasks.
📦 Installation
Using this model is straightforward once sentence-transformers is installed:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('dlicari/distil-ita-legal-bert')
embeddings = model.encode(sentences)
print(embeddings)
```
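Since the model targets semantic search, the following sketch ranks a small corpus against a query with cosine similarity; the corpus and query are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('dlicari/distil-ita-legal-bert')

# Illustrative corpus and query; substitute your own Italian legal text.
corpus = [
    "La parte soccombente è condannata al pagamento delle spese processuali.",
    "Il lavoratore ha diritto al trattamento di fine rapporto.",
]
query = "Chi paga le spese del processo?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```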
Advanced Usage
Without sentence-transformers, you can use the model as follows. First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average token embeddings, ignoring padding via the attention mask.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds token-level embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('dlicari/distil-ita-legal-bert')
model = AutoModel.from_pretrained('dlicari/distil-ita-legal-bert')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
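For the clustering use case, one option is to run scikit-learn's KMeans over the embeddings; KMeans is an assumption here (any clustering algorithm works on the vectors), and the documents are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('dlicari/distil-ita-legal-bert')

# Illustrative documents; replace with your own.
docs = [
    "Il ricorso è inammissibile.",
    "Il ricorso non può essere accolto.",
    "Il contratto di locazione è risolto.",
    "Il contratto si intende risolto di diritto.",
]

embeddings = model.encode(docs)

# Group the documents into two clusters based on embedding similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for doc, label in zip(docs, labels):
    print(label, doc)
```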
📚 Documentation
For an automated evaluation of this model, visit the Sentence Embeddings Benchmark: https://seb.sbert.net
🔧 Technical Details
Training
The model was trained with the following parameters:
DataLoader:

`torch.utils.data.dataloader.DataLoader` of length 409633 with parameters:

```
{'batch_size': 24, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

Loss:

`sentence_transformers.losses.MSELoss.MSELoss`

Parameters of the fit() method:

```json
{
    "epochs": 4,
    "evaluation_steps": 5000,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "correct_bias": false,
        "eps": 1e-06,
        "lr": 0.0001
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
```
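The logged parameters above can be reassembled into a training script. The sketch below uses the legacy sentence-transformers fit() API (pre-3.0) with `ParallelSentencesDataset`; the teacher checkpoint id, the choice of layers kept in the student, and the placeholder training sentences are assumptions, not the documented setup.

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Assumed teacher checkpoint id.
teacher = SentenceTransformer('dlicari/Italian-Legal-BERT')

# Student: a copy of the teacher reduced to 4 transformer layers.
# Which layers were kept is not documented; [0, 4, 8, 11] is a guess.
student = SentenceTransformer('dlicari/Italian-Legal-BERT')
bert = student._first_module().auto_model
bert.encoder.layer = torch.nn.ModuleList(bert.encoder.layer[i] for i in [0, 4, 8, 11])
bert.config.num_hidden_layers = 4

# Placeholder for the 3.7 GB ITALIAN-LEGAL-BERT training sentences.
train_sentences = ["..."]

# Each sentence is paired with the teacher's embedding as regression target.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.add_dataset([[s] for s in train_sentences], max_sentence_length=512)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=24)

# MSE between student and teacher embeddings, matching the loss above.
train_loss = losses.MSELoss(model=student)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    scheduler='WarmupLinear',
    warmup_steps=1000,
    optimizer_params={'lr': 1e-4, 'eps': 1e-6},
    weight_decay=0.01,
    max_grad_norm=1,
)
```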
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
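Loading the model and printing it should reproduce this stack, which is a quick way to sanity-check a local copy:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dlicari/distil-ita-legal-bert')
print(model)                                     # Transformer + mean-pooling stack
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 512
```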
📄 License
This model is released under the AFL-3.0 (Academic Free License v3.0).