NER4Legal_SRB Open-Source Named Entity Recognition Model - Automatically Extract Key Information from Serbian Legal Documents

Developed by kalusev

A named entity recognition model for Serbian legal documents, fine-tuned from a pre-trained BERT model to automatically extract key entities from legal texts.

Sequence Labeling

Transformers

Open Source License: Apache-2.0

#Serbian Legal NER #High-precision Entity Recognition #Court Document Processing

Release Time: 2/14/2025

Model Overview

This model is specifically designed to identify predefined entity categories in Serbian legal documents, supporting automated tasks such as document archiving and retrieval. It is suitable for users such as lawyers, law firms, and government agencies.

Model Features

Legal Domain Optimization

Trained specifically for Serbian legal documents, it can accurately identify specific entity categories in legal texts.

High-precision Performance

Achieves a mean F1 score of 0.96 in cross-validation on the labeled dataset.

Robustness Validation

Tested on adversarial (noisy) text to check that predictions remain stable under noisy inputs.
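
The source does not describe how the noisy inputs were generated. As a minimal sketch of one possible approach, the snippet below simulates typing errors by swapping adjacent letters at a chosen rate. The function name `add_char_noise` and its `rate` and `seed` parameters are hypothetical, and this is not necessarily the procedure used in the paper.

```python
import random


def add_char_noise(text, rate=0.1, seed=0):
    """Swap adjacent letters with probability `rate` (hypothetical noise model)."""
    rng = random.Random(seed)  # seeded so that the same input gives the same noise
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


clean = "Rešenjem Apelacionog suda u Novom Sadu"
noisy = add_char_noise(clean, rate=0.2, seed=42)
print(noisy)
```

Running the model on both `clean` and `noisy` and comparing the extracted entities would give a rough picture of how predictions degrade as `rate` grows.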

Model Capabilities

Legal Text Entity Recognition

Serbian Language Processing

Court Ruling Analysis

Use Cases

Legal Document Processing

Court Ruling Archiving

Automatically identifies key information such as court names and case numbers in ruling documents.

Improves document classification and retrieval efficiency.

Legal Information Extraction

Extracts structured data such as involved parties and judgment results from legal documents.

Supports legal analysis and research.

🚀 NER4Legal_SRB

NER4Legal_SRB is a fine-tuned model for Named Entity Recognition (NER) in Serbian legal documents, leveraging a pre-trained BERT model to automate legal document processing tasks.

🚀 Quick Start

The NER4Legal_SRB model can be run on both CPU and GPU. You can use the following Python code to perform Named Entity Recognition:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer (no access token is passed, assuming the repository is public)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB")
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB").to(device)

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.
    Args:
        text (str): Input text.
    Returns:
        list: List of tokens and predicted labels.
    """
    try:
        # Tokenize the input text and move it to the device the model currently lives on
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)

    except RuntimeError as e:
        if "CUDA out of memory" not in str(e):
            raise  # Re-raise other exceptions unchanged
        print("Switching to CPU due to memory constraints.")
        # Move the model to the CPU; later calls then also run on the CPU
        model.cpu()
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)

    # Pick the highest-scoring label id for each token of the single input sequence
    predictions = torch.argmax(outputs.logits, dim=2).squeeze(0).tolist()

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
    labels = [id_to_label[pred] for pred in predictions]

    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results

# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token             | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")
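
The Quick Start code prints one label per tokenizer token. As a rough sketch of a possible post-processing step, the helper below merges B-/I- tagged tokens into entity spans. The name `group_entities` is hypothetical, the `##` prefix for word-piece continuations is an assumption about the tokenizer, and the sample pairs are written by hand for illustration; they are not actual model output.

```python
def group_entities(pairs):
    """Merge (token, label) pairs with B-/I- labels into (text, entity_type) spans."""
    entities = []      # each item is [text, entity_type]
    open_type = None   # type of the entity currently being extended, if any

    for token, label in pairs:
        is_subword = token.startswith("##")
        piece = token[2:] if is_subword else token

        if is_subword and open_type is not None:
            # Continuation of a word inside an entity: glue without a space
            entities[-1][0] += piece
        elif is_subword:
            # Continuation of a word that is outside any entity
            continue
        elif label.startswith("B-"):
            # A B- label always starts a new entity
            open_type = label[2:]
            entities.append([piece, open_type])
        elif label.startswith("I-") and label[2:] == open_type:
            # An I- label of the same type extends the open entity
            entities[-1][0] += " " + piece
        else:
            # "O" or a stray I- label closes the open entity
            open_type = None

    return [tuple(entity) for entity in entities]


# Hand-written sample pairs (illustrative only)
sample = [
    ("Apelacionog", "B-COURT"), ("suda", "I-COURT"), ("u", "I-COURT"),
    ("Novom", "I-COURT"), ("Sadu", "I-COURT"), (",", "O"),
    ("žalba", "O"), ("tuže", "O"), ("##nog", "O"),
]
print(group_entities(sample))
```

With the Quick Start code loaded, the call would be `group_entities(perform_ner(text))`. The span text is rebuilt by joining tokens with spaces, so punctuation inside an entity will not match the original spacing exactly.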

✨ Features

Legal Document Processing: Designed specifically for Serbian legal documents, including public court rulings, to automate tasks such as document archiving, search, and retrieval.
High Performance: Achieved a mean F1 score of 0.96 during cross-validation tests on the labeled dataset, demonstrating robustness and applicability to real-world scenarios.
CPU and GPU Support: Can be run on both CPU and GPU, providing flexibility for different computing environments.

📦 Installation

The model is loaded through the transformers library, and the Quick Start code also requires PyTorch. You can install both via pip:

pip install transformers torch

📚 Documentation

Model Description

NER4Legal_SRB is a fine-tuned Named Entity Recognition (NER) model for Serbian legal documents. It is based on the pre-trained [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) BERT model. The model was developed as part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", which will be published at the 15th International Conference on Information Society and Technology in 2025.

Abstract

Advancements in NLP and LLMs have led to research on document processing tools. This work presents an LLM-based NER solution for Serbian legal documents. It uses a pre-trained BERT model, develops a novel dataset, and discusses performance metrics. Cross-validation tests with a mean F1 score of 0.96 confirm the solution's applicability and robustness.

Base Model

The model is fine-tuned from the [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) pre-trained BERT model, which is designed for BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages.

Dataset

The model was fine-tuned on a manually labeled dataset of Serbian legal documents, including public court rulings. This dataset enables precise entity identification and classification in Serbian legal texts.
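
The label names in the Quick Start mapping follow a B-/I- tagging scheme. The source does not describe the dataset's file format, so the following is only a sketch, under that assumption, of how a span annotation could be turned into per-word tags and label ids. The function `bio_encode`, the `(start, end, type)` span format, and the annotated example are hypothetical.

```python
# Label ids copied from the Quick Start mapping (spelling kept as published)
ID_TO_LABEL = {
    0: "O", 1: "B-COURT", 2: "B-DATE", 3: "B-DECISION", 4: "B-LAW",
    5: "B-MONEY", 6: "B-OFFICIAL GAZZETE", 7: "B-PERSON", 8: "B-REFERENCE",
    9: "I-COURT", 10: "I-LAW", 11: "I-MONEY", 12: "I-OFFICIAL GAZZETE",
    13: "I-PERSON", 14: "I-REFERENCE",
}
LABEL_TO_ID = {label: idx for idx, label in ID_TO_LABEL.items()}


def bio_encode(words, spans):
    """Tag words with B-/I- labels.

    spans holds (start, end, entity_type) tuples over word indices, end exclusive.
    """
    tags = ["O"] * len(words)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining words of the entity
    return tags


# Hypothetical annotation of a phrase from the Quick Start example text
words = ["presuda", "Opštinskog", "suda", "u", "Novom", "Sadu"]
spans = [(1, 6, "COURT")]

tags = bio_encode(words, spans)
ids = [LABEL_TO_ID[tag] for tag in tags]
print(list(zip(words, tags, ids)))
```

Note that the published mapping has no `I-DATE` or `I-DECISION` label, so a multi-word span of those types would fail the `LABEL_TO_ID` lookup in this sketch.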

Performance Metrics

The model achieved a mean F1 score of 0.96 during cross-validation tests on the labeled dataset. For detailed evaluation information, please refer to the original conference paper.
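
For readers unfamiliar with the metric, here is a generic sketch of how a mean F1 score over cross-validation folds can be computed. It assumes entity-level exact matching, which may differ from the protocol in the paper. The function `f1_score` and the two folds are hypothetical and are not the model's actual results.

```python
from statistics import mean


def f1_score(gold, predicted):
    """Entity-level F1: a prediction counts only if span and type both match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical folds: (gold entities, predicted entities) as (start, end, type)
folds = [
    ({(0, 5, "COURT"), (7, 8, "DATE")}, {(0, 5, "COURT"), (7, 8, "DATE")}),
    ({(2, 4, "PERSON"), (9, 10, "MONEY")}, {(2, 4, "PERSON"), (9, 11, "MONEY")}),
]

fold_scores = [f1_score(gold, predicted) for gold, predicted in folds]
mean_f1 = mean(fold_scores)
print(fold_scores, mean_f1)
```

In the second fold the MONEY span has the wrong end index, so it counts as both a false positive and a false negative, which is what pulls that fold's score down.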

🔧 Technical Details

The model leverages the pre-trained BERT architecture, which is well known for its ability to capture semantic information from text. The pre-trained [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) model was carefully adapted to the specific task of identifying and classifying entities in Serbian legal texts. The model was trained on a manually labeled dataset, which was specifically developed for this task.

📄 License

This model is released under the Apache-2.0 license.

If you would like to use this software, please consider citing the following publication:

Kalušev, V., Brkljač, B. (2025). Named entity recognition for Serbian legal documents: Design, methodology and dataset development. In Proceedings of the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9-12 March, 2025, Vol. -, ISBN -, accepted for publication.

@inproceedings{KalusevNER2025,
    author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
    booktitle = {15th International Conference on Information Society and Technology (ICIST)},
    doi = {-},
    month = mar,
    pages = {1--16},
    title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
    year = {2025}
}

@misc{kalušev2025namedentityrecognitionserbian,
      title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
      author={Vladimir Kalušev and Branko Brkljač},
      year={2025},
      eprint={2502.10582},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10582},
}

Contributors

Vladimir Kalušev https://huggingface.co/kalusev
Branko Brkljač https://huggingface.co/brkljac, https://brkljac.github.io/

![SRB4Legal_NER performance in presence of noisy inputs](SRB4Legal_NER performance in presence of noisy inputs.jpg)

⚠️ Important Note

For detailed information about model evaluation and reported results, please consult the original conference paper.

💡 Usage Tip

If you encounter a "CUDA out of memory" error, the code will automatically switch to CPU mode. However, running on CPU may be slower.
