mdeberta-v3-base multilingual POS tagging model: open-source part-of-speech tagging for multiple languages

Mdeberta V3 Base Multilingual Pos Tagger

Developed by jordigonzm

A multilingual part-of-speech (POS) tagging model based on mDeBERTa-v3-base.

Sequence labeling

Safetensors

Other: #multilingual POS tagging #high-accuracy tokenization #stopword detection

Downloads: 50

Released: 2/2/2025

Model overview

The model performs multilingual POS tagging: it assigns a part-of-speech category, such as noun or verb, to each word in a text.

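Model features

Multilingual support

Handles POS tagging for text in multiple languages.

High accuracy

Reported to perform well on POS tagging; the expected overall accuracy listed in the detailed documentation is 97% to 99.5%.

Based on the DeBERTa architecture

Uses an improved Transformer architecture with stronger contextual understanding.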

Model capabilities

POS tagging

Multilingual text processing

Natural language processing

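Use cases

Natural language processing

Text analysis

Tag each word of a text with its part of speech as input to downstream text analysis.

Identifies the POS category of each word in the text.

Information extraction

Extract key information, such as nouns, from text.

Pulls out the core vocabulary of a text.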

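🚀 Multilingual POS tagging model

This project provides scripts for POS tagging with Hugging Face. They extract POS information from text and automatically detect and extract nouns and stopwords. The project also outlines the evaluation framework and training configuration for the multilingual POS tagging model.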

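✨ Key features

Multilingual support: handles POS tagging for text in multiple languages.
POS classification: assigns each word a POS category.
Noun and stopword extraction: automatically detects and extracts the nouns and stopwords in a text.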

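📦 Installation

This project depends on Hugging Face's transformers library, which can be installed with the command below. The pipeline also needs a deep learning backend such as PyTorch to be installed.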

pip install transformers

💻 Usage examples

Basic usage

The following code shows how to run POS tagging with the Hugging Face pipeline:

from transformers import pipeline

# Load model and tokenizer
pos_pipeline = pipeline("token-classification", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

# Input text
text = "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven robotics—successfully passed all 37 rigorous performance tests!"

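# Run POS tagging on each whitespace-separated word.
# Note: every word is passed to the model on its own, without sentence context.
words = text.split()
tokens = pos_pipeline(words)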

# Print tokens and their categories
for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
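The basic example above sends each word to the model separately, so the tagger never sees the surrounding sentence. One possible alternative is to tag the whole sentence and then map subword tokens back to whitespace-delimited words through character offsets. The sketch below assumes the pipeline output carries `start` and `end` offsets for this model (they can be `None` with some tokenizers); `tag_words_by_offsets` is a hypothetical helper, and the sample tokens are hand-written for illustration, not real model output.

```python
import re

def tag_words_by_offsets(text, tokens):
    """Give each whitespace-delimited word the tag of its first overlapping token."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        tag = None
        for tok in tokens:
            if tok.get("start") is None or tok.get("end") is None:
                continue  # offsets unavailable for this token
            # Overlap test: tolerant of offsets that include a leading space
            if tok["start"] < w_end and tok["end"] > w_start:
                tag = tok["entity"]
                break
        tagged.append((match.group(), tag))
    return tagged

# Hand-written tokens in the pipeline's output shape (illustrative only)
sample_text = "the cats sleep"
sample_tokens = [
    {"word": "▁the", "entity": "DET", "start": 0, "end": 3},
    {"word": "▁cat", "entity": "NOUN", "start": 4, "end": 7},
    {"word": "s", "entity": "NOUN", "start": 7, "end": 8},
    {"word": "▁sleep", "entity": "VERB", "start": 9, "end": 14},
]
print(tag_words_by_offsets(sample_text, sample_tokens))
```

With a real pipeline, `tokens` would come from `pos_pipeline(text)` called on the full sentence rather than on a list of words.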

Advanced usage

The following code runs POS tagging and then extracts nouns and stopwords:

from transformers import pipeline

# Load the pre-trained POS tagging model
pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

# Input text
text = "Companies interested in providing the service must take care of signage and information boards."

# Run POS tagging
tokens = pos_pipeline(text)

# Print raw tokens and their POS tags
print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

# Reconstruct whole words from SentencePiece subword tokens
words, buffer, labels = [], [], []
buffer_label = None  # POS tag of the word being built (tag of its first subword)

for token in tokens:
    raw_word = token["word"]

    if raw_word.startswith("▁"):  # "▁" marks the start of a new word
        if buffer:
            words.append("".join(buffer))  # Store the completed word
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        if not buffer:  # First token has no "▁": start a word anyway
            buffer_label = token["entity"]
        buffer.append(raw_word)  # Continue the current word

# Add the last word left in the buffer
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

# Print final POS tagging results
print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Define valid POS tags for extraction
noun_tags = {"NOUN", "PROPN"}  # Nouns & Proper Nouns
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}  # Common stopword POS tags

# Extract nouns and stopwords separately
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

# Print extracted words
print("\nFiltered Nouns and Proper Nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
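In the script above, a piece that does not start with "▁" is always glued to the preceding word, so trailing punctuation ends up inside the word (for example "boards."). A minimal sketch of one way to keep punctuation separate follows. It assumes the model emits the Universal POS tag `PUNCT` for punctuation, which the source does not confirm; `merge_subwords` and its `split_tags` parameter are hypothetical names, and the sample tokens are hand-written.

```python
def merge_subwords(tokens, split_tags=frozenset({"PUNCT"})):
    """Merge SentencePiece pieces into words; pieces tagged with split_tags stand alone."""
    words, labels = [], []
    new_word = True  # the next non-empty piece begins a word
    for tok in tokens:
        piece, tag = tok["word"], tok["entity"]
        if piece.startswith("▁") or tag in split_tags:
            new_word = True
        text = piece.lstrip("▁")
        if not text:
            continue  # a bare "▁" only marks a word boundary
        if new_word:
            words.append(text)
            labels.append(tag)  # a word keeps the tag of its first piece
        else:
            words[-1] += text
        # Anything that follows a split-tag piece starts a fresh word
        new_word = tag in split_tags
    return words, labels

# Hand-written tokens in the pipeline's output shape (illustrative only)
sample_tokens = [
    {"word": "▁information", "entity": "NOUN"},
    {"word": "▁board", "entity": "NOUN"},
    {"word": "s", "entity": "NOUN"},
    {"word": ".", "entity": "PUNCT"},
]
words, labels = merge_subwords(sample_tokens)
print(list(zip(words, labels)))
```

A side effect of this choice is that hyphenated words are split at the hyphen whenever the hyphen is tagged as punctuation, so it suits noun and stopword extraction better than exact word reconstruction.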

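📚 Detailed documentation

Multilingual POS tagging overview

This report outlines the evaluation framework and possible training configurations for the multilingual POS tagging model. The model is Transformer-based and is evaluated after a limited number of training epochs.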

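Expected ranges

Metric	Details
Validation loss	Typically between `0.02` and `0.1`, depending on dataset complexity and regularization.
Overall precision	Expected between `96%` and `99%`, affected by dataset diversity and tokenization quality.
Overall recall	Typically between `96%` and `99%`, affected by the same factors as precision.
Overall F1 score	Expected between `96%` and `99%`, balancing precision and recall.
Overall accuracy	Likely between `97%` and `99.5%`, depending on language variation and model robustness.
Evaluation speed	Typically `100-150 samples/s` or `25-40 steps/s`, depending on batch size and hardware.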
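The source does not say how the overall precision, recall, F1 and accuracy figures are aggregated. As an illustration only, the sketch below computes token-level accuracy and macro-averaged precision, recall and F1 from gold and predicted tag lists. `tagging_metrics` is a hypothetical helper, the macro averaging is an assumption, and the tag sequences are made up.

```python
from collections import Counter

def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def tagging_metrics(gold, pred):
    """Token accuracy plus macro-averaged precision, recall and F1 over tags."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it was wrong
            fn[g] += 1  # missed the gold tag g
    per_tag = []
    for tag in sorted(set(gold) | set(pred)):
        precision = _ratio(tp[tag], tp[tag] + fp[tag])
        recall = _ratio(tp[tag], tp[tag] + fn[tag])
        f1 = _ratio(2 * precision * recall, precision + recall)
        per_tag.append((precision, recall, f1))
    n = len(per_tag)
    return {
        "accuracy": _ratio(sum(tp.values()), len(gold)),
        "precision": _ratio(sum(p for p, _, _ in per_tag), n),
        "recall": _ratio(sum(r for _, r, _ in per_tag), n),
        "f1": _ratio(sum(f for _, _, f in per_tag), n),
    }

# Made-up tag sequences with one error (NOUN predicted as VERB)
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
pred = ["DET", "NOUN", "VERB", "DET", "VERB"]
metrics = tagging_metrics(gold, pred)
print(metrics)
```

Macro averaging weights every tag equally, so rare tags affect the score as much as frequent ones. A micro average over single-label token predictions would instead equal the accuracy.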

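Training configuration

Setting	Details
Model	Transformer-based architecture (e.g. BERT, RoBERTa, XLM-R)
Training epochs	`2` to `5`, depending on convergence and validation performance.
Batch size	`1` to `16`, balancing memory limits against stability.
Learning rate	`1e-6` to `5e-4`, tuned to the optimization dynamics and warmup strategy.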
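To make these ranges concrete, the sketch below picks one value from each range and derives the number of optimizer steps and warmup steps. The specific values, the dataset size and the `warmup_fraction` setting are all hypothetical; the source gives only ranges and no warmup value. The first three keys borrow parameter names from `transformers.TrainingArguments`, and the arithmetic assumes a single device with no gradient accumulation.

```python
import math

# Hypothetical values chosen from inside the ranges listed above
config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "warmup_fraction": 0.1,  # assumption: not specified in the source
}

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a full run (single device, no gradient accumulation)."""
    return math.ceil(num_examples / batch_size) * epochs

num_examples = 10_000  # hypothetical training-set size
steps = total_steps(
    num_examples,
    config["per_device_train_batch_size"],
    config["num_train_epochs"],
)
warmup_steps = math.ceil(steps * config["warmup_fraction"])
print(f"total steps: {steps}, warmup steps: {warmup_steps}")
```

Smaller batch sizes increase the step count proportionally, which is one reason the learning rate and warmup length are usually tuned together with the batch size.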