🚀 Multilingual POS Tagging Model
This project provides scripts for POS tagging and stopword extraction using a multilingual model, and outlines evaluation and training configurations.
🚀 Quick Start
The project offers two main functionalities: basic POS tagging and POS tagging with stopword extraction. You can use the following scripts to perform these tasks.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

pos_pipeline = pipeline("token-classification", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

text = "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven robotics—successfully passed all 37 rigorous performance tests!"
words = text.split(" ")
tokens = pos_pipeline(words)  # one list of sub-token predictions per input word

for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
```
Advanced Usage
```python
from transformers import pipeline

pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

text = "Companies interested in providing the service must take care of signage and information boards."
tokens = pos_pipeline(text)

print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

# Merge sub-tokens into words: the SentencePiece marker "▁" flags the start of a new word.
words, buffer, labels = [], [], []
buffer_label = None
for token in tokens:
    raw_word = token["word"]
    if raw_word.startswith("▁"):
        if buffer:
            words.append("".join(buffer))
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Filter words by their Universal POS tags.
noun_tags = {"NOUN", "PROPN"}
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

print("\nFiltered Nouns and Proper Nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
```
📚 Documentation
Overview
This section outlines the evaluation framework and typical training configurations for the multilingual POS tagging model. The model is based on a Transformer architecture and is evaluated after a limited number of training epochs.
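The exact evaluation script is not reproduced here. As a minimal sketch, assuming word-level gold and predicted tag sequences are available (and using scikit-learn, which is not a stated dependency of this project), the overall metrics listed below can be computed like this:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold and predicted UPOS tags, flattened to one tag per word.
# The real evaluation data is not shown in this document.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
pred = ["DET", "NOUN", "VERB", "NOUN", "NOUN", "PUNCT"]

precision, recall, f1, _ = precision_recall_fscore_support(gold, pred, average="weighted")
accuracy = accuracy_score(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```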
Expected Ranges
| Property | Details |
|---|---|
| Validation Loss | Typically between 0.02 and 0.1, depending on dataset complexity and regularization. |
| Overall Precision | Expected to range from 96% to 99%, influenced by dataset diversity and tokenization quality. |
| Overall Recall | Generally between 96% and 99%, subject to similar factors as precision. |
| Overall F1-score | Expected range: 96% to 99%, balancing precision and recall. |
| Overall Accuracy | Can vary between 97% and 99.5%, contingent on language variations and model robustness. |
| Evaluation Speed | Typically 100-150 samples/sec (25-40 steps/sec), depending on batch size and hardware. |
Training Configurations
| Property | Details |
|---|---|
| Model | Transformer-based architecture (e.g., BERT, RoBERTa, XLM-R) |
| Training Epochs | 2 to 5, depending on convergence and validation performance. |
| Batch Size | 1 to 16, balancing memory constraints and stability. |
| Learning Rate | 1e-6 to 5e-4, adjusted based on optimization dynamics and warm-up strategies. |
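These ranges translate naturally into the `TrainingArguments` of the transformers library. The sketch below uses placeholder values drawn from the table; the dataset preparation and the exact configuration used to train the published checkpoint are not part of this document:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments

model_name = "microsoft/mdeberta-v3-base"
num_labels = 17  # assumption: the 17 Universal Dependencies UPOS tags

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="mdeberta-pos-tagger",
    num_train_epochs=3,              # within the 2-5 range above
    per_device_train_batch_size=8,   # within the 1-16 range above
    learning_rate=2e-5,              # within the 1e-6 to 5e-4 range above
    warmup_ratio=0.1,                # one possible warm-up strategy
)

# A Trainer would then be built with tokenized train/eval datasets, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=..., tokenizer=tokenizer)
# trainer.train()
```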
Model Information
| Property | Details |
|---|---|
| Base Model | microsoft/mdeberta-v3-base |
| Pipeline Tag | token-classification |
| Tags | pos-tagging, multilingual, deberta, nlp |
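For users who prefer the lower-level API over `pipeline`, the checkpoint can also be loaded with the Auto classes. This is a sketch assuming the checkpoint stores its tag names in `config.id2label`, which is standard for token-classification models but not verified here:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "jordigonzm/mdeberta-v3-base-multilingual-pos-tagger"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Companies interested in providing the service must take care of signage."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred_id in zip(tokens, predicted_ids):
    if token in tokenizer.all_special_tokens:
        continue  # skip special tokens such as [CLS]/[SEP]
    print(f"{token:<12} → {model.config.id2label[pred_id.item()]}")
```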