NER4Legal_SRB: an open-source named entity recognition model that automatically extracts key information from Serbian legal documents

Ner4legal SRB

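Developed by kalusev

A named entity recognition model optimized for Serbian legal documents. It is fine-tuned from a BERT architecture and automatically extracts key entities from legal text.

Token classification

Transformers

License: Apache-2.0 #SerbianLegalNER #HighAccuracyEntityRecognition #CourtDocumentProcessing

Downloads: 54

Released: 2/14/2025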

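Model overview

The model recognizes predefined entity categories in Serbian legal documents and supports automated tasks such as document archiving and retrieval. It is aimed at users such as lawyers, law firms, and government institutions.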

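Model features

Legal-domain optimization

Trained specifically on Serbian legal documents to recognize the entity categories that occur in legal text.

High accuracy

Average F1 score of 0.96 in cross-validation.

Robustness checks

Tests on adversarial (intentionally modified) text inputs were used to check the model's stability on noisy input.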

Model capabilities

Entity recognition in legal text

Serbian language processing

Court ruling analysis

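Use cases

Legal document processing

Court ruling archiving

Automatically identifies key information in rulings, such as court names and case numbers

Improves document classification and retrieval efficiency

Legal information extraction

Extracts structured data from legal documents, such as the parties and the decision

Supports legal analysis and research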
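
As a rough illustration of how extracted entities could support archiving and retrieval, the sketch below builds a small inverted index from entities to document ids. Everything in it is an assumption made for illustration: the function name build_entity_index, the document ids, and the hand-written entity lists are hypothetical and are not output of the model.

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each (entity_type, case-folded text) pair to the sorted ids
    of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity_type, text in entities:
            index[(entity_type, text.casefold())].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Hand-written example data (hypothetical, not model output)
doc_entities = {
    "ruling-001": [("COURT", "Apelacioni sud u Novom Sadu"), ("REFERENCE", "Gž1. 1901/10")],
    "ruling-002": [("COURT", "Apelacioni sud u Novom Sadu"), ("REFERENCE", "P. 5734/04")],
}
index = build_entity_index(doc_entities)
print(index[("COURT", "apelacioni sud u novom sadu")])
```

Exact string matching is the simplest possible lookup. Serbian is an inflected language, so the same court can appear in several grammatical forms, and a practical index would likely need some normalization step that this sketch leaves out.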

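🚀 NER4Legal_SRB

NER4Legal_SRB is a fine-tuned named entity recognition (NER) model built for Serbian legal documents. It helps automate tasks such as archiving, searching, and retrieving legal documents.

🚀 Quick start

Code example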

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
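# Add token=... to both calls only if the repository requires authentication
# (use_auth_token is deprecated in recent transformers releases)
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB")
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB").to(device)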

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.
    Args:
        text (str): Input text.
    Returns:
        list: List of tokens and predicted labels.
    """
    try:
        # Tokenize the input and move it to the device the model is currently on
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)
    except RuntimeError as e:
        if "CUDA out of memory" not in str(e):
            raise  # Re-raise other exceptions unchanged
        print("Switching to CPU due to memory constraints.")
        model.cpu()  # Moves the model in place, so later calls also run on CPU
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)

    # Single input sequence: take row 0 explicitly instead of relying on squeeze()
    predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = [id_to_label[pred] for pred in predictions]

    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results

# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token             | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")
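
The loop above prints one label per subword token. For archiving or retrieval it is often more convenient to work with whole entities. The following is a minimal sketch of one way to merge (token, label) pairs into entity spans. It assumes that the tokenizer marks word continuations with a WordPiece-style "##" prefix (not confirmed by the source) and that labels follow the B-/I- scheme of id_to_label. The function name group_entities is hypothetical and is not part of the model or of the transformers library, and the sample input is hand-written rather than real model output.

```python
def group_entities(results):
    """Merge (token, label) pairs into (entity_type, text) spans.

    Assumes WordPiece-style "##" continuation pieces and BIO labels.
    A continuation piece inherits the label of the first piece of its word.
    """
    entities = []
    current_type, current_words = None, []

    def flush():
        # Close the entity that is currently open, if any
        nonlocal current_type, current_words
        if current_type is not None and current_words:
            entities.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for token, label in results:
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word of an open entity
            if current_words:
                current_words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_words.append(token)
    flush()
    return entities


# Hand-written sample in the same shape as the output of perform_ner
sample = [
    ("Apel", "B-COURT"), ("##acionog", "B-COURT"), ("suda", "I-COURT"),
    (",", "O"),
    ("P", "B-REFERENCE"), (".", "I-REFERENCE"), ("5734", "I-REFERENCE"),
]
print(group_entities(sample))
```

Words are joined with single spaces, so punctuation inside an entity ends up space-separated. Note also that the label set above has no I-DATE or I-DECISION tag, so under this scheme a multi-token date would come out as several consecutive one-token DATE entities; merging adjacent spans of the same type is one possible adjustment.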

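✨ Key features

Task-specific: designed for named entity recognition in Serbian legal documents, where it identifies and classifies entities in legal text.
Strong performance: an average F1 score of 0.96 in cross-validation tests.
Hardware flexibility: runs on both CPU and GPU.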

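📦 Installation

The model card lists no specific installation steps. The standard installation of the transformers library applies (for example, pip install transformers torch).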

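📚 Documentation

Model description

NER4Legal_SRB is a named entity recognition model for Serbian legal documents. It accompanies the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", accepted for publication at the 15th International Conference on Information Society and Technology (9-12 March 2025, Kopaonik, Serbia). The model is intended to automate legal document tasks such as archiving, search, and retrieval. It is fine-tuned from the pretrained classla/bcms-bertic BERT model to recognize and classify predefined categories of word entities in Serbian legal text.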

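Abstract

Recent advances in natural language processing (NLP), and especially in large language models (LLMs) and their many applications, have drawn research attention to the design of document processing tools and to improvements in document archiving, search, and retrieval. The domain of official legal documents generates large amounts of data every day and has a large community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens), so efficient automation of everyday work with legal documents is expected to have a significant impact in different fields.

In this work we present an LLM-based solution for named entity recognition in legal documents written in Serbian. It builds on pretrained bidirectional encoder representations from transformers (BERT), carefully adapted to the task of identifying and classifying specific data points in text. Besides developing a new dataset for Serbian (involving public court rulings) and presenting the system design and the applied methodology, the paper discusses the achieved performance metrics and what they imply for an objective assessment of the proposed solution. Cross-validation tests on the manually labeled dataset, with a mean F1 score of 0.96, together with additional results on intentionally modified text inputs, confirm the applicability of the proposed system design and the robustness of the developed NER solution.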
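
The abstract mentions results on intentionally modified text inputs, but this page does not say how the inputs were modified. Purely as an illustration, and not as the authors' procedure, the sketch below applies random character deletions and adjacent-character swaps to a sentence so that predictions on clean and perturbed text can be compared. The function name perturb_text, the noise rate, and the choice of the two operations are all assumptions.

```python
import random


def perturb_text(text, rate=0.05, seed=0):
    """Return a noisy copy of text.

    Each alphabetic character is, with probability `rate`, either deleted
    or swapped with the character that follows it.
    """
    rng = random.Random(seed)  # Local generator so results are reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < rate:
            if rng.random() < 0.5:
                # Deletion: skip this character
                i += 1
                continue
            if i + 1 < len(chars):
                # Swap: emit the next character first, then this one
                out.extend([chars[i + 1], ch])
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


clean = "Rešenjem Apelacionog suda u Novom Sadu"
noisy = perturb_text(clean, rate=0.2, seed=1)
print(noisy)
```

Passing both the clean and the noisy string through the same NER function and comparing the predicted labels gives a simple, informal robustness check.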

Base model

The model is fine-tuned from classla/bcms-bertic, a pretrained BERT model for the BCMS languages (Bosnian, Croatian, Montenegrin, Serbian).

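Dataset

The model was fine-tuned on a manually annotated dataset of Serbian legal documents that includes public court rulings. The dataset was developed specifically for this task, so that entities in Serbian legal text can be identified and classified precisely.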
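
Fine-tuning a token classification model on manually annotated words usually requires mapping word-level labels onto subword tokens. The source does not describe how this was done for NER4Legal_SRB. The sketch below shows one common convention, assumed here for illustration only: the first subword of each word keeps the word's label id, while later subwords and special tokens receive -100, the index that PyTorch's cross-entropy loss ignores by default. The word_ids list has the shape returned by the word_ids() method of a fast tokenizer's encoding; the function name align_labels and the example values are hypothetical.

```python
IGNORE_INDEX = -100  # Default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_label_ids):
    """Expand word-level label ids to subword level.

    word_ids: one entry per subword token, giving the index of the word it
              belongs to, or None for special tokens.
    word_label_ids: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # Special token
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # First subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # Later subword of the same word
        previous = word_id
    return aligned


# Hand-made example: three words, the first one split into two subwords
word_ids = [None, 0, 0, 1, 2, None]
word_label_ids = [1, 9, 0]  # B-COURT, I-COURT, O in the label map of this page
print(align_labels(word_ids, word_label_ids))
```

Labeling only the first subword keeps the loss at word level; labeling every subword with the word's tag is an equally plausible alternative.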

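Performance metrics

In cross-validation tests on the annotated dataset the model reached an average F1 score of 0.96, which indicates strong performance and suitability for practical use. For details on the evaluation and the reported results, see the original conference paper.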
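
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes a micro-averaged, token-level F1 over non-O labels from two label sequences. This is an illustrative assumption: the page does not state whether the reported 0.96 is token-level or entity-level, or how it was averaged, so the code is not a reproduction of the paper's evaluation. The function name token_f1 and the example sequences are hypothetical.

```python
def token_f1(gold, predicted, outside="O"):
    """Micro-averaged token-level F1 over all labels except `outside`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != outside and p == g:
            tp += 1  # Correctly predicted entity token
        else:
            if p != outside:
                fp += 1  # Predicted an entity label that is wrong
            if g != outside:
                fn += 1  # Missed or mislabeled a gold entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hand-made example: one of three entity tokens is missed
gold = ["B-COURT", "I-COURT", "O", "B-DATE"]
predicted = ["B-COURT", "I-COURT", "O", "O"]
print(round(token_f1(gold, predicted), 3))
```

In this example precision is 1.0 and recall is 2/3, so F1 is 0.8. In a k-fold cross-validation setup, a score like this would be computed on each held-out fold and then averaged.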

📄 License

This project is licensed under Apache-2.0.

Contributors

Vladimir Kalušev
Branko Brkljač

Citation

If you use this software, please consider citing the following publication:

Kalušev, V., Brkljač, B. (2025). Named entity recognition for Serbian legal documents: Design, methodology and dataset development. In Proceedings of the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9-12 March 2025, accepted for publication.

@inproceedings{KalusevNER2025,
    author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
    booktitle = {15th International Conference on Information Society and Technology (ICIST)},
    doi = {-},
    month = mar,
    pages = {1--16},
    title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
    year = {2025}
}

@misc{kalušev2025namedentityrecognitionserbian,
      title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
      author={Vladimir Kalušev and Branko Brkljač},
      year={2025},
      eprint={2502.10582},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10582},
}

![SRB4Legal_NER performance in presence of noisy inputs](SRB4Legal_NER performance in presence of noisy inputs.jpg)