NER4Legal_SRB: Open-Source Named Entity Recognition Model for Automatically Extracting Key Information from Serbian Legal Documents

Ner4legal SRB

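Developed by kalusev

A named entity recognition model optimized for Serbian legal documents. It is fine-tuned from a BERT architecture and automatically extracts key entities from legal text.

Sequence labeling

Transformers

License: Apache-2.0 #Serbian legal NER #High-accuracy entity recognition #Court document processing

Downloads: 54

Released: 2/14/2025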

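Model overview

The model identifies predefined entity categories in Serbian legal documents. It supports automated tasks such as document archiving and retrieval, and is aimed at users such as lawyers, law firms, and government agencies.

Model features

Legal-domain optimization

Trained specifically on Serbian legal documents, so it can accurately identify the entity categories specific to legal text.

High accuracy

The average F1 score in cross-validation is 0.96.

Robustness validation

Tests on adversarial text confirmed that the model remains stable on noisy input.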

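Model capabilities

Entity recognition in legal text

Serbian language processing

Court decision analysis

Use cases

Legal document processing

Court decision archiving

Automatically identifies key information in court decisions, such as court names and case numbers

Improves the efficiency of document classification and retrieval

Legal information extraction

Extracts structured data, such as parties and judgment outcomes, from legal documents

Supports legal analysis and research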

🚀 NER4Legal_SRB

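NER4Legal_SRB is a fine-tuned named entity recognition (NER) model designed for Serbian legal documents. It helps automate tasks such as archiving, searching, and retrieving legal documents.

🚀 Quick start

Code example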

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
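# The repository is assumed to be public, so no access token is passed;
# add token="hf_..." to both calls if authentication turns out to be required.
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB")
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB").to(device)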

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.
    Args:
        text (str): Input text.
    Returns:
        list: List of tokens and predicted labels.
    """
    try:
        # Tokenize the input text
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2).squeeze().tolist()

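    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("Switching to CPU due to memory constraints.")
            torch.cuda.empty_cache()  # Release cached GPU memory from the failed attempt
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
            model.cpu()  # Temporarily move the model to CPU
            with torch.no_grad():
                outputs = model(**inputs)
            model.to(device)  # Move the model back so later calls still match `device`
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=2).squeeze().tolist()
        else:
            raise  # Re-raise other exceptions with the original traceback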

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
    labels = [id_to_label[pred] for pred in predictions]

    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results

# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token             | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")
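
The quick-start code prints one label per sub-word token, while archiving and retrieval usually need whole entities. The following is a minimal sketch of one way to merge those pairs into entity spans. It assumes the tokenizer marks continuation pieces with a "##" prefix (the WordPiece convention) and that a word takes the label of its first piece. The helper name `group_entities` and the sample pairs are hypothetical and are not actual model output.

```python
def group_entities(token_label_pairs):
    """Merge sub-word pieces into words, then group BIO tags into entity spans."""
    # Step 1: glue "##" continuation pieces onto the preceding word.
    # The word keeps the label predicted for its first piece (an assumption).
    words = []
    for token, label in token_label_pairs:
        if token.startswith("##") and words:
            words[-1][0] += token[2:]
        else:
            words.append([token, label])

    # Step 2: a B- tag opens a span; an I- tag of the same type extends it.
    entities = []
    current = None  # [entity_type, list_of_words] for the span being built
    for word, label in words:
        prefix, _, entity_type = label.partition("-")
        if prefix == "B":
            current = [entity_type, [word]]
            entities.append(current)
        elif prefix == "I" and current is not None and current[0] == entity_type:
            current[1].append(word)
        else:
            current = None  # "O", or an I- tag with no matching open span
    return [(entity_type, " ".join(ws)) for entity_type, ws in entities]


# Hypothetical (token, label) pairs in the format returned by perform_ner().
example_pairs = [
    ("Apel", "B-COURT"), ("##acionog", "I-COURT"), ("suda", "I-COURT"),
    ("je", "O"), ("12", "B-DATE"),
]
print(group_entities(example_pairs))
```

Joining words with a single space is a simplification: punctuation inside a span (for example in a case reference) will be separated from its neighbours, so character offsets from the tokenizer would be a better basis if the exact surface form matters.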

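✨ Key features

Purpose-built: designed specifically for named entity recognition in Serbian legal documents, so it can accurately identify and classify entities in legal text.
Strong performance: reaches an average F1 score of 0.96 in cross-validation tests, which indicates that it is suitable for practical use.
Hardware compatibility: runs on both CPU and GPU, so it can be used under different hardware conditions.

📦 Installation

The model card gives no specific installation steps; the general installation instructions for the transformers library apply.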
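
Since the quick-start code imports `transformers` and `torch`, both packages need to be available before running it. The sketch below checks for them with the standard library. The helper name `missing_packages` is hypothetical, and the `pip` command in the comment is the usual way to install these packages rather than a step prescribed by the model card (which pins no versions).

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["transformers", "torch"]
missing = missing_packages(required)
if missing:
    # Typical fix (an assumption, no versions are pinned by the model card):
    #   pip install transformers torch
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```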

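📚 Detailed documentation

Model description

NER4Legal_SRB is a named entity recognition model for Serbian legal documents. The model is part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", which has been accepted for publication at the 15th International Conference on Information Society and Technology (March 9-12, 2025, Kopaonik, Serbia). The model is intended to automate tasks involving legal documents, such as document archiving, search, and retrieval. It is fine-tuned from the pre-trained BERT model classla/bcms-bertic to recognize and classify predefined word entities in Serbian legal text.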

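Abstract

Recent advances in natural language processing (NLP), and especially in large language models (LLMs) and their many applications, have drawn research attention to the design of document processing tools and to improvements in document archiving, search, and retrieval. The domain of official legal documents generates large amounts of data every day and has a large community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens). Efficient automation of everyday work with legal documents is therefore expected to have a significant impact in different fields.

In this work, we present an LLM-based solution for recognizing named entities in legal documents written in Serbian. It builds on pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the task of identifying and classifying specific data points in textual content. Besides the development of a new dataset for Serbian (involving public court rulings), the system design, and the applied methodology, the paper also discusses the achieved performance metrics and their implications for an objective assessment of the proposed solution. Cross-validation tests on the manually labeled dataset, with a mean F1 score of 0.96, together with additional results on intentionally modified text inputs, confirm the applicability of the proposed system design and the robustness of the developed NER solution.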

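Base model

The model is fine-tuned from the classla/bcms-bertic base model, a pre-trained BERT model designed for the BCMS languages (Bosnian, Croatian, Montenegrin, Serbian).

Dataset

The model was fine-tuned on a manually annotated dataset of Serbian legal documents that includes public court rulings. The dataset was developed specifically for this task, to enable precise recognition and classification of entities in Serbian legal text.

Performance metrics

In cross-validation tests on the annotated dataset, the model reached a mean F1 score of 0.96, which indicates strong performance and suitability for practical scenarios. For details of the evaluation and the reported results, see the original conference paper.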
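
To make the reported metric concrete, here is a minimal sketch of how a per-fold F1 score and its mean across folds could be computed. It assumes entity-level exact matching on `(type, start, end)` tuples. This card does not state whether the paper scores tokens or entities, how many folds were used, or how scores were averaged, so the function names and the sample data are purely illustrative.

```python
def precision_recall_f1(gold, predicted):
    """Score one fold by exact match between gold and predicted entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_f1(folds):
    """Average the F1 score over a list of (gold, predicted) folds."""
    scores = [precision_recall_f1(gold, predicted)[2] for gold, predicted in folds]
    return sum(scores) / len(scores)


# Hypothetical fold: four gold entities, two predictions, both correct.
gold_entities = {("COURT", 0, 5), ("DATE", 9, 10), ("REFERENCE", 6, 8), ("PERSON", 12, 13)}
predicted_entities = {("COURT", 0, 5), ("DATE", 9, 10)}
precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example precision is 1.0 and recall is 0.5, so F1 is their harmonic mean, 2/3. A high mean F1 such as the reported 0.96 requires both values to be high in every fold.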

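📄 License

This project is released under the Apache-2.0 license.

Contributors

Vladimir Kalušev Hugging Face
Branko Brkljač Hugging Face

Citation

If you use this software, please consider citing the following publication: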

Kalušev, V., Brkljač, B. (2025). Named entity recognition for Serbian legal documents: Design, methodology and dataset development. In Proceedings of the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9 - 12 March, 2025, Vol. -, ISBN -, accepted for publication.

@inproceedings{KalusevNER2025,
    author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
    booktitle = {15th International Conference on Information Society and Technology (ICIST)},
    doi = {-},
    month = mar,
    pages = {1--16},
    title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
    year = {2025}
}

@misc{kalušev2025namedentityrecognitionserbian,
      title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
      author={Vladimir Kalušev and Branko Brkljač},
      year={2025},
      eprint={2502.10582},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10582},
}

![SRB4Legal_NER performance in presence of noisy inputs](SRB4Legal_NER performance in presence of noisy inputs.jpg)
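
The figure above concerns performance on noisy inputs, but this card does not describe how the inputs were modified. As one hypothetical perturbation for trying out robustness yourself, the sketch below strips Serbian Latin diacritics, which is a common form of noise in informally typed text. The helper name `strip_diacritics` and the choice of perturbation are assumptions and are not taken from the paper.

```python
import unicodedata

# Mapping from Serbian Latin letters with diacritics to plain ASCII.
_DIACRITIC_MAP = str.maketrans({
    "š": "s", "Š": "S",
    "č": "c", "Č": "C",
    "ć": "c", "Ć": "C",
    "ž": "z", "Ž": "Z",
    "đ": "dj", "Đ": "Dj",
})


def strip_diacritics(text):
    """Return the text with Serbian Latin diacritics replaced by ASCII letters."""
    # Normalize first so letters written with combining marks are also matched.
    return unicodedata.normalize("NFC", text).translate(_DIACRITIC_MAP)


noisy = strip_diacritics("žalba tuženog je usvojena")
print(noisy)
```

Running `perform_ner` from the quick-start example on both the original and the perturbed text, and comparing the predicted labels, gives a rough impression of how sensitive the model is to this kind of noise.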