🚀 Multilingual POS Tagging Model
This project provides scripts for POS tagging and stopword extraction using a multilingual model, and outlines evaluation and training configurations.
🚀 Quick Start
The project offers two main functionalities: basic POS tagging and POS tagging with stopword extraction. You can use the following scripts to perform these tasks.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

pos_pipeline = pipeline("token-classification", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

text = "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven robotics—successfully passed all 37 rigorous performance tests!"
words = text.split(" ")
tokens = pos_pipeline(words)  # one list of sub-token predictions per input word

for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
```
Advanced Usage
```python
from transformers import pipeline

pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")

text = "Companies interested in providing the service must take care of signage and information boards."
tokens = pos_pipeline(text)

print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

# Merge sub-tokens into words: the SentencePiece marker "▁" flags the start of a new word.
words, buffer, labels = [], [], []
buffer_label = None
for token in tokens:
    raw_word = token["word"]
    if raw_word.startswith("▁"):
        if buffer:
            words.append("".join(buffer))
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Filter words by their Universal POS tags.
noun_tags = {"NOUN", "PROPN"}
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

print("\nFiltered Nouns and Proper Nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
```
📚 Documentation
Overview
This section outlines the evaluation framework and typical training configurations for the multilingual POS tagging model. The model is based on a Transformer architecture and is evaluated after a limited number of training epochs.
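The exact evaluation script is not reproduced here. As a minimal sketch, assuming word-level gold and predicted tag sequences are available (and using scikit-learn, which is not a stated dependency of this project), the overall metrics listed below can be computed like this:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold and predicted UPOS tags, flattened to one tag per word.
# The real evaluation data is not shown in this document.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
pred = ["DET", "NOUN", "VERB", "NOUN", "NOUN", "PUNCT"]

precision, recall, f1, _ = precision_recall_fscore_support(gold, pred, average="weighted")
accuracy = accuracy_score(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```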
Expected Ranges
| Property | Details |
|---|---|
| Validation Loss | Typically between 0.02 and 0.1, depending on dataset complexity and regularization. |
| Overall Precision | Expected to range from 96% to 99%, influenced by dataset diversity and tokenization quality. |
| Overall Recall | Generally between 96% and 99%, subject to similar factors as precision. |
| Overall F1-score | Expected range: 96% to 99%, balancing precision and recall. |
| Overall Accuracy | Can vary between 97% and 99.5%, contingent on language variations and model robustness. |
| Evaluation Speed | Typically 100-150 samples/sec (25-40 steps/sec), depending on batch size and hardware. |
Training Configurations
| Property | Details |
|---|---|
| Model | Transformer-based architecture (e.g., BERT, RoBERTa, XLM-R) |
| Training Epochs | 2 to 5, depending on convergence and validation performance. |
| Batch Size | 1 to 16, balancing memory constraints and stability. |
| Learning Rate | 1e-6 to 5e-4, adjusted based on optimization dynamics and warm-up strategies. |
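These ranges translate naturally into the `TrainingArguments` of the transformers library. The sketch below uses placeholder values drawn from the table; the dataset preparation and the exact configuration used to train the published checkpoint are not part of this document:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments

model_name = "microsoft/mdeberta-v3-base"
num_labels = 17  # assumption: the 17 Universal Dependencies UPOS tags

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="mdeberta-pos-tagger",
    num_train_epochs=3,              # within the 2-5 range above
    per_device_train_batch_size=8,   # within the 1-16 range above
    learning_rate=2e-5,              # within the 1e-6 to 5e-4 range above
    warmup_ratio=0.1,                # one possible warm-up strategy
)

# A Trainer would then be built with tokenized train/eval datasets, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=..., tokenizer=tokenizer)
# trainer.train()
```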
Model Information
| Property | Details |
|---|---|
| Base Model | microsoft/mdeberta-v3-base |
| Pipeline Tag | token-classification |
| Tags | pos-tagging, multilingual, deberta, nlp |
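For users who prefer the lower-level API over `pipeline`, the checkpoint can also be loaded with the Auto classes. This is a sketch assuming the checkpoint stores its tag names in `config.id2label`, which is standard for token-classification models but not verified here:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "jordigonzm/mdeberta-v3-base-multilingual-pos-tagger"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Companies interested in providing the service must take care of signage."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred_id in zip(tokens, predicted_ids):
    if token in tokenizer.all_special_tokens:
        continue  # skip special tokens such as [CLS]/[SEP]
    print(f"{token:<12} → {model.config.id2label[pred_id.item()]}")
```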