mdeberta-v3-base multilingual POS tagging model: open-source part-of-speech tagging for multiple languages

Mdeberta V3 Base Multilingual Pos Tagger

Developed by jordigonzm

A multilingual part-of-speech (POS) tagging model built on mDeBERTa-v3-base. It supports POS tagging in multiple languages.

Sequence labeling

Safetensors

Other #multilingual POS tagging #high-accuracy tokenization #stopword detection

Downloads: 50

Released: 2/2/2025

Model overview

The model performs multilingual POS tagging: it assigns a part-of-speech category, such as noun or verb, to each word in a text.

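Model features

Multilingual support

Handles POS tagging in multiple languages.

High accuracy

Expected overall accuracy on POS tagging is in the 97% to 99.5% range, depending on the language and dataset.

Based on the DeBERTa architecture

DeBERTa refines the Transformer with disentangled attention, which encodes content and position separately and improves context modeling.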

Model capabilities

POS tagging

Multilingual text processing

Natural language processing

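Use cases

Natural language processing

Text analysis

Tag the parts of speech in a text as input to downstream analysis tasks.

Identifies the POS category of each word in the text.

Information extraction

Extract nouns and other key terms from text.

Pulls out the core vocabulary of the text.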

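🚀 Multilingual POS tagging model

This project provides scripts that use Hugging Face transformers for POS tagging. They extract POS information from text and automatically detect and extract nouns and stopwords. It also outlines an evaluation framework and training configuration for multilingual POS tagging models.

🚀 Quick start

Install transformers as described in the installation guide, then run either script under Usage examples.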

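✨ Key features

Multilingual support: handles POS tagging in multiple languages.
POS classification: identifies the POS category of each word.
Noun and stopword extraction: automatically detects and extracts nouns and stopwords from text.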

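📦 Installation

This project depends on the Hugging Face transformers library. The pipeline also needs a deep learning backend such as PyTorch to be installed. Install transformers with: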

pip install transformers

💻 Usage examples

Basic usage

The following code uses the Hugging Face pipeline for POS tagging:

from transformers import pipeline

# Load model and tokenizer
pos_pipeline = pipeline("token-classification", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

# Input text
text = "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven robotics—successfully passed all 37 rigorous performance tests!"

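# Run POS tagging on each whitespace-separated word independently.
# Note: the model sees one word at a time here, so it has no sentence context.
words = text.split(" ")
tokens = pos_pipeline(words)  # one list of sub-token predictions per word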

# Print tokens and their categories
for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
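
The basic script prints every sub-token of each word. If a single tag per word is needed, one option is to keep the tag of the highest-scoring sub-token. The sketch below assumes the pipeline's usual output keys (`word`, `entity`, `score`). The helper name `word_level_tag` and the sample scores are hypothetical, and this selection rule is an illustration rather than the author's method.

```python
def word_level_tag(sub_tokens):
    """Return one POS tag for a word from its sub-token predictions.

    `sub_tokens` is a list of dicts shaped like the pipeline output
    (keys "word", "entity", "score"). Returns None for an empty list.
    """
    # Skip a bare SentencePiece word-start marker, which carries no text;
    # fall back to all sub-tokens if nothing else is left.
    candidates = [t for t in sub_tokens if t["word"] != "▁"] or sub_tokens
    if not candidates:
        return None
    # Assumption: trust the sub-token the model is most confident about.
    best = max(candidates, key=lambda t: t["score"])
    return best["entity"]


# Hypothetical pipeline output for the single word "robotics"
example = [
    {"word": "▁robot", "entity": "NOUN", "score": 0.98},
    {"word": "ics", "entity": "NOUN", "score": 0.91},
]
print(word_level_tag(example))
```

In the loop above, calling `word_level_tag(group_token)` for each word would give one tag per input word instead of one per sub-token.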

Advanced usage

The following code tags a sentence, rebuilds whole words from sub-tokens, and extracts nouns and stopwords:

from transformers import pipeline

# Load the pre-trained POS tagging model
# ("ner" is an alias for the "token-classification" task)
pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

# Input text
text = "Companies interested in providing the service must take care of signage and information boards."

# Run POS tagging
tokens = pos_pipeline(text)

# Print raw tokens and their POS tags
print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

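# Reconstruct whole words from sub-tokens ("▁" marks the start of a word)
words, buffer, labels = [], [], []
buffer_label = None  # tag of the word being built, taken from its first sub-token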

for token in tokens:
    raw_word = token["word"]

    if raw_word.startswith("▁"):  # New word starts
        if buffer:
            words.append("".join(buffer))  # Add the completed word
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)  # Continue word construction

# Add last word in buffer
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

# Print final POS tagging results
print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Define valid POS tags for extraction
noun_tags = {"NOUN", "PROPN"}  # Nouns & Proper Nouns
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}  # Common stopword POS tags

# Extract nouns and stopwords separately
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

# Print extracted words
print("\nFiltered Nouns and Proper Nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
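
Once `words` and `labels` exist, the text-analysis and information-extraction use cases largely reduce to counting. The following is a minimal sketch, with hypothetical helper names (`tag_distribution`, `top_nouns`) and made-up sample data, that counts tags and ranks the most frequent nouns. It strips surrounding punctuation because the reconstruction above may leave a trailing period attached to a word.

```python
import string
from collections import Counter


def tag_distribution(labels):
    """Count how often each POS tag occurs."""
    return Counter(labels)


def top_nouns(words, labels, n=3, noun_tags=("NOUN", "PROPN")):
    """Return the n most frequent nouns as (word, count) pairs."""
    nouns = []
    for word, tag in zip(words, labels):
        if tag in noun_tags:
            # Normalize case and drop punctuation glued to the word
            cleaned = word.strip(string.punctuation).lower()
            if cleaned:
                nouns.append(cleaned)
    return Counter(nouns).most_common(n)


# Hypothetical word/tag pairs, shaped like the `words` and `labels` lists above
sample_words = ["Companies", "must", "take", "care", "of",
                "signage", "and", "boards", "for", "companies."]
sample_labels = ["NOUN", "AUX", "VERB", "NOUN", "ADP",
                 "NOUN", "CCONJ", "NOUN", "ADP", "NOUN"]

print(tag_distribution(sample_labels))
print(top_nouns(sample_words, sample_labels, n=1))
```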

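📚 Documentation

Multilingual POS tagging overview

This report outlines the evaluation framework and possible training configurations for the multilingual POS tagging model. The model is based on the Transformer architecture and was evaluated after a limited number of training epochs.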

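Expected ranges

Property	Details
Validation loss	Typically `0.02` to `0.1`, depending on dataset complexity and regularization.
Overall precision	Expected `96%` to `99%`, depending on dataset diversity and tokenization quality.
Overall recall	Typically `96%` to `99%`, affected by the same factors as precision.
Overall F1 score	Expected `96%` to `99%`, balancing precision and recall.
Overall accuracy	Possibly `97%` to `99.5%`, depending on language variation and model robustness.
Evaluation speed	Typically `100-150 samples/s` or `25-40 steps/s`, depending on batch size and hardware.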
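
To make the metrics in the table concrete, here is a small sketch that computes token-level accuracy and per-tag precision, recall and F1 from gold and predicted tags. The source does not say how its overall figures are averaged, so the macro average used here is an assumption, and the function names and sample tags are hypothetical.

```python
def per_tag_scores(gold, pred):
    """Token-level precision, recall and F1 for every tag seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = {}
    for tag in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
        fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
        fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_average(scores, key):
    """Unweighted mean of one metric over all tags (an assumed averaging choice)."""
    return sum(s[key] for s in scores.values()) / len(scores)


# Hypothetical gold and predicted tags for five tokens
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
pred = ["DET", "NOUN", "VERB", "DET", "VERB"]

scores = per_tag_scores(gold, pred)
print("accuracy:", accuracy(gold, pred))
print("macro F1:", macro_average(scores, "f1"))
```

Macro averaging weights rare tags as heavily as frequent ones. A frequency-weighted or micro average would give different overall numbers on the same predictions.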

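Training configuration

Property	Details
Model	Transformer-based architecture (e.g. BERT, RoBERTa, XLM-R)
Training epochs	`2` to `5`, depending on convergence and validation performance.
Batch size	`1` to `16`, balancing memory limits and stability.
Learning rate	`1e-6` to `5e-4`, tuned to the optimization dynamics and warmup strategy.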
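
The learning rate row mentions a warmup strategy without naming one. A common choice is linear warmup followed by linear decay, sketched below in plain Python. The schedule, the function name `linear_warmup_decay`, and the example values (peak rate `2e-5`, 1000 steps, 100 warmup steps) are assumptions for illustration, not the configuration used for this model.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_steps):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: grow from 0 to peak_lr over warmup_steps
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: shrink from peak_lr to 0 over the remaining steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Assumed example values, chosen inside the ranges listed in the table
TOTAL_STEPS = 1000
WARMUP_STEPS = 100
PEAK_LR = 2e-5

for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_decay(step, TOTAL_STEPS, PEAK_LR, WARMUP_STEPS)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

The transformers library ships a scheduler of the same shape, `get_linear_schedule_with_warmup`, which would normally be used instead of a hand-written function.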