Legal-Longformer-Base: an open-source legal language model that is free to deploy and efficiently processes long legal documents

Legal Longformer Base

Developed by lexlms

A legal language model based on the LexLM (base) RoBERTa model and optimized for processing long legal documents.

Large Language Model

Transformers

English #Legal Text Processing #Long Document Modeling #Multi-Jurisdiction Adaptation

Downloads 221

Release Time: 2/24/2023

Model Overview

This model is designed for long texts in the legal domain. Its positional embeddings are extended to support longer documents, which makes it suitable for legal text analysis, contract review, and similar scenarios.

Model Features

Legal Domain Optimization

Trained specifically on legal texts for better understanding of legal terminology and expressions

Long Text Processing Capability

Supports processing of longer legal documents through extended positional embeddings

Multi-Jurisdiction Adaptability

Training data includes texts from multiple legal jurisdictions, providing good cross-jurisdictional adaptability

Model Capabilities

Legal Text Understanding

Legal Terminology Recognition

Long Document Processing

Legal Text Fill-Mask Prediction

Use Cases

Legal Document Analysis

Contract Clause Analysis

Analyzing key clauses and potential risk points in contract texts

Legal Document Fill-Mask

Predicting missing professional terms or phrases in legal documents

Legal Research

Case Law Analysis

Processing and analyzing lengthy court case documents

🚀 Legal Longformer (base)

This model is derived from the LexLM (base) RoBERTa model and is designed for processing long legal documents.

🚀 Quick Start

You can use the following examples to quickly test the model:

# Example 1
text1 = "The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of police."
# Example 2
text2 = "This <mask> Agreement is between General Motors and John Murray."
# Example 3
text3 = "Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products."
# Example 4
text4 = "Because the Court granted <mask> before judgment, the Court effectively stands in the shoes of the Court of Appeals and reviews the defendants’ appeals."
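
The masked sentences above can be scored with the fill-mask pipeline from the Hugging Face `transformers` library. The following is a minimal sketch, assuming `transformers` and a PyTorch backend are installed and that the checkpoint is available under the `lexlms/legal-longformer-base` identifier listed in this card. The helper names `load_fill_mask` and `top_predictions` are illustrative and not part of the model release.

```python
MODEL_NAME = "lexlms/legal-longformer-base"  # identifier taken from this model card
MASK_TOKEN = "<mask>"  # mask placeholder used in the example sentences


def load_fill_mask(model_name: str = MODEL_NAME):
    """Build a fill-mask pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name)


def top_predictions(fill_mask, text: str, top_k: int = 5):
    """Return (token, score) pairs for a sentence containing exactly one mask."""
    if text.count(MASK_TOKEN) != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN} in the input")
    # For a single mask the pipeline returns a list of dicts that include
    # 'token_str' and 'score' keys, ordered by descending score.
    return [(p["token_str"].strip(), p["score"]) for p in fill_mask(text, top_k=top_k)]
```

After loading the model once with `fill = load_fill_mask()`, a call such as `top_predictions(fill, text2)` returns candidate fillers for the masked contract sentence. Loading the pipeline once and reusing it avoids re-initializing the model for every sentence.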

✨ Features

Derivative Model: Based on the [LexLM (base)](https://huggingface.co/lexlms/legal-roberta-base) RoBERTa model.
Extended Positional Embeddings: The positional embeddings were extended by cloning the original embeddings multiple times following Beltagy et al. (2020).
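
The cloning step can be illustrated with a toy example. The sketch below is not the authors' conversion script. It assumes, as in RoBERTa-style checkpoints, that the first two rows of the position table are reserved offsets and are copied once, while the remaining learned rows are tiled until the target length is reached. The function name, the toy sizes, and the use of plain Python lists instead of tensors are all illustrative assumptions.

```python
def extend_position_embeddings(original, new_num_positions, num_reserved=2):
    """Tile learned position vectors to fill a longer position table.

    original: list of vectors, one per position; the first `num_reserved`
        rows are kept as-is (assumption: RoBERTa-style position offset).
    new_num_positions: number of learned positions wanted in the new table.
    """
    reserved = [row[:] for row in original[:num_reserved]]
    learned = original[num_reserved:]
    if not learned:
        raise ValueError("no learned position vectors to clone")
    extended = reserved
    for pos in range(new_num_positions):
        # Position `pos` of the long table reuses position `pos mod len(learned)`.
        extended.append(learned[pos % len(learned)][:])
    return extended


# Toy table: 2 reserved rows + 4 learned positions, embedding size 3.
toy = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]] + [[float(p)] * 3 for p in range(1, 5)]
long_table = extend_position_embeddings(toy, new_num_positions=12)
print(len(long_table))  # 14 rows: 2 reserved + 12 learned
```

Cloning instead of random initialization means every new position starts from a vector the base model has already learned, which is the motivation Beltagy et al. (2020) give for this initialization.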

📚 Documentation

Model description

LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:

We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples and a 20/30% masking rate (Wettig et al., 2022) for the base/large models, respectively.
We use a sentence sampler with exponential smoothing of the sub-corpora sampling rate, following Conneau et al. (2019), since there is a disparate proportion of tokens across sub-corpora and we aim to preserve per-corpus capacity (avoid overfitting).
We consider mixed-cased models, similar to all recently developed large PLMs.
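
Two of the steps above lend themselves to a small illustration: reusing embeddings for tokens shared between the old and new vocabularies, and smoothing the sub-corpora sampling rates. The sketch below is a toy under stated assumptions, not the training code. The smoothing exponent `alpha=0.5`, the token counts, the vocabularies, and the random initialization of non-overlapping tokens are hypothetical choices made for illustration, and the function names are invented here. The smoothing uses the common form q_i = p_i^alpha / sum_j p_j^alpha, where p_i is a sub-corpus's share of all tokens.

```python
import random


def smoothed_sampling_rates(token_counts, alpha=0.5):
    """Exponentially smooth sub-corpus sampling rates.

    token_counts: dict mapping sub-corpus name -> number of tokens.
    alpha: smoothing exponent in (0, 1]; smaller values up-weight
        small sub-corpora (0.5 is an illustrative choice).
    """
    total = sum(token_counts.values())
    powered = {name: (n / total) ** alpha for name, n in token_counts.items()}
    norm = sum(powered.values())
    return {name: value / norm for name, value in powered.items()}


def warm_start_embeddings(old_vocab, old_embeddings, new_tokens, dim, seed=0):
    """Build an embedding table for a new vocabulary.

    Tokens present in `old_vocab` (token -> row index) copy their old vector;
    the rest get small random vectors (an assumption for this sketch).
    Returns the new table and the number of reused vectors.
    """
    rng = random.Random(seed)
    table, reused = [], 0
    for token in new_tokens:
        if token in old_vocab:
            table.append(old_embeddings[old_vocab[token]][:])
            reused += 1
        else:
            table.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return table, reused


# Hypothetical token counts for two sub-corpora.
rates = smoothed_sampling_rates({"large_corpus": 900, "small_corpus": 100})
print(rates)  # the small corpus is sampled more often than its raw 10% share

# Hypothetical old vocabulary and embeddings.
old_vocab = {"the": 0, "court": 1}
old_embeddings = [[0.1, 0.2], [0.3, 0.4]]
table, reused = warm_start_embeddings(old_vocab, old_embeddings, ["court", "plaintiff"], dim=2)
print(reused)  # 1: only "court" overlaps with the old vocabulary
```

With `alpha=1.0` the sampler reproduces the raw token proportions. Lower values flatten the distribution so that small sub-corpora are seen more often during pre-training.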

Citation

Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.

@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and 
              Garneau*, Nicolas and
              Goanta, Catalina and 
              Katz, Daniel Martin and 
              Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}

📄 License

This model is licensed under cc-by-sa-4.0.

Additional Information

Property	Details
Model Type	Legal Longformer (base)
Training Data	lexlms/lex_files
Pipeline Tag	fill-mask
Tags	legal, long-documents
Model Name	lexlms/legal-longformer-base
Original Model	[LexLM (base)](https://huggingface.co/lexlms/legal-roberta-base)
