mdeberta-v3-base multilingual POS tagging model: an open-source model supporting part-of-speech tagging in multiple languages

Mdeberta V3 Base Multilingual Pos Tagger

Developed by jordigonzm

A multilingual part-of-speech (POS) tagging model built on mDeBERTa-v3-base.

Sequence labeling

Safetensors

Other: #multilingual POS tagging #high-accuracy word segmentation #stopword identification

Downloads: 50

Release date: 2/2/2025

Model Overview

This model performs multilingual part-of-speech tagging: it identifies the POS category (noun, verb, and so on) of each word in a text.

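Model Features

Multilingual support

Supports part-of-speech tagging in multiple languages

High accuracy

Performs well on part-of-speech tagging tasks and achieves high accuracy

DeBERTa architecture

Uses an improved Transformer architecture with stronger contextual understanding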

Model Capabilities

Part-of-speech tagging

Multilingual text processing

Natural language processing

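Use Cases

Natural language processing

Text analysis

Tag the parts of speech in a text and feed the result into downstream text analysis tasks

Accurately identifies the POS category of each word in the text

Information extraction

Extract key information, such as nouns, from text

Effectively extracts the core vocabulary of the text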
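
As a rough illustration of the two use cases above, the sketch below counts tag frequencies and pulls out nouns from a list of (word, tag) pairs. The pairs are hand-written for illustration and are not actual model output; the helper names `tag_counts` and `words_with_tags` are hypothetical. The tag names (`NOUN`, `PROPN`, `AUX`, `ADP`) are the ones used in the scripts further down this page.

```python
from collections import Counter

# Hand-written (word, tag) pairs for illustration; NOT actual model output.
tagged = [
    ("Companies", "NOUN"),
    ("must", "AUX"),
    ("take", "VERB"),
    ("care", "NOUN"),
    ("of", "ADP"),
    ("signage", "NOUN"),
]


def tag_counts(pairs):
    """Count how often each POS tag occurs (text analysis)."""
    return Counter(tag for _, tag in pairs)


def words_with_tags(pairs, wanted):
    """Return the words whose tag is in `wanted`, in original order."""
    return [word for word, tag in pairs if tag in wanted]


counts = tag_counts(tagged)
nouns = words_with_tags(tagged, {"NOUN", "PROPN"})

print(counts.most_common())
print(nouns)
```

In practice the `tagged` list would come from the model's output after subword tokens have been merged back into words.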

🚀 POS Tagging - Token Segmentation & Categories

This project provides a simple script that uses Hugging Face to extract tokens and their part-of-speech (POS) categories. It can also automatically detect and extract nouns and stopwords from a sentence.

Model Information

Attribute	Details
Model type	Token Classification
Base model	microsoft/mdeberta-v3-base
Tags	pos-tagging, multilingual, deberta, nlp

🚀 Quick Start

POS tagging: token segmentation and categories

from transformers import pipeline

# Load model and tokenizer
pos_pipeline = pipeline("token-classification", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

# Input text
text = "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven robotics—successfully passed all 37 rigorous performance tests!"

# Run POS tagging
words = text.split(" ")
tokens = pos_pipeline(words)

# Print tokens and their categories
for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
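
The script above passes a list of words to the pipeline, so each word is tagged as a separate input and the result is one list of subword tokens per word. When a single label per word is needed, one possible approach is to keep the label of the highest-scoring subword. The sketch below shows this on hand-written sample data; the helper name `best_label_per_word` and the sample scores are assumptions, while the `word`, `entity` and `score` keys follow the token-classification pipeline's output dictionaries.

```python
def best_label_per_word(words, grouped_tokens):
    """Pair each word with the label of its highest-scoring subword token.

    `grouped_tokens` mirrors the nested list returned when the pipeline is
    called with a list of strings: one list of token dicts per input word.
    A word whose token list is empty gets the label None (defensive guard).
    """
    result = []
    for word, group in zip(words, grouped_tokens):
        if group:
            best = max(group, key=lambda t: t["score"])
            result.append((word, best["entity"]))
        else:
            result.append((word, None))
    return result


# Hand-written stand-in for pipeline output; labels and scores are made up.
sample_words = ["robotics", "passed", ""]
sample_output = [
    [
        {"word": "▁robot", "entity": "NOUN", "score": 0.98},
        {"word": "ics", "entity": "NOUN", "score": 0.91},
    ],
    [{"word": "▁passed", "entity": "VERB", "score": 0.99}],
    [],
]

pairs = best_label_per_word(sample_words, sample_output)
print(pairs)
```

With real output, `best_label_per_word(words, tokens)` could be called on the `words` and `tokens` variables from the script above. Taking the first subword's label instead is an equally reasonable choice.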

POS tagging with stopword extraction

from transformers import pipeline

# Load the pre-trained POS tagging model
pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

# Input text
text = "Companies interested in providing the service must take care of signage and information boards."

# Run POS tagging
tokens = pos_pipeline(text)

# Print raw tokens and their POS tags
print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

# Reconstruct words correctly
words, buffer, labels = [], [], []
buffer_label = None  # Label of the word being built; set when a word starts

for token in tokens:
    raw_word = token["word"]

    if raw_word.startswith("▁"):  # New word starts
        if buffer:
            words.append("".join(buffer))  # Add the completed word
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)  # Continue word construction

# Add last word in buffer
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

# Print final POS tagging results
print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Define valid POS tags for extraction
noun_tags = {"NOUN", "PROPN"}  # Nouns & Proper Nouns
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}  # Common stopword POS tags

# Extract nouns and stopwords separately
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

# Print extracted words
print("\nFiltered Nouns and Proper Nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
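
The word-reconstruction loop above can also be packaged as a function so it can be reused and checked without loading the model. This is a minimal sketch, assuming tokens are dictionaries with `word` and `entity` keys and that `▁` marks the start of a word, as in the script above; the function name `merge_pieces` and the sample tokens are hypothetical.

```python
def merge_pieces(tokens, marker="▁"):
    """Merge subword tokens into (word, label) pairs.

    A token starting with `marker` opens a new word; any other token is
    appended to the current word. Each word keeps the label of its first
    piece. A word that ends up empty is skipped.
    """
    merged = []
    current, label = None, None
    for token in tokens:
        piece = token["word"]
        starts_word = piece.startswith(marker)
        if starts_word or current is None:
            if current:
                merged.append((current, label))
            current = piece[len(marker):] if starts_word else piece
            label = token["entity"]
        else:
            current += piece
    if current:
        merged.append((current, label))
    return merged


# Hand-written sample tokens; NOT actual model output.
sample_tokens = [
    {"word": "▁sign", "entity": "NOUN"},
    {"word": "age", "entity": "NOUN"},
    {"word": "▁and", "entity": "CCONJ"},
    {"word": "▁boards", "entity": "NOUN"},
    {"word": ".", "entity": "PUNCT"},
]

merged_words = merge_pieces(sample_tokens)
print(merged_words)
```

As in the original loop, a punctuation piece without the marker is glued onto the preceding word and inherits that word's label, so trailing punctuation may need to be stripped before filtering by tag.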