🚀 UmBERTo Wikipedia Uncased
UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches, SentencePiece and Whole Word Masking, and is now available on Hugging Face.
*Marco Lodola, Monument to Umberto Eco, Alessandria 2019*
🚀 Quick Start
UmBERTo loads directly with 🤗 Transformers (github.com/huggingface/transformers); see the usage examples below.
📦 Dataset
The UmBERTo-Wikipedia-Uncased model was trained on a relatively small corpus (~7 GB) extracted from the Italian Wikipedia (Wikipedia-ITA).
📚 Pre-trained model
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ----- | --- | ----- | --------- | ---------- | ----------- | -------- |
| `umberto-wikipedia-uncased-v1` | YES | NO | SPM | 32K | 100k | Link |
This model was trained with SentencePiece and Whole Word Masking.
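As a quick way to see the SentencePiece vocabulary in action, the sketch below (an illustrative example, not from the original card) prints the subword pieces the tokenizer produces; the `▁` prefix marks the beginning of a word:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# SentencePiece splits the sentence into subword pieces; "▁" marks word starts
print(tokenizer.tokenize("umberto eco è stato un grande scrittore"))
```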
💪 Downstream Tasks
These results refer to the umberto-wikipedia-uncased model; full details are available on the official UmBERTo page. A rough fine-tuning sketch follows the tables below.
Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------- | -- | --------- | ------ | -------- |
| ICAB-EvalITA07 | 86.240 | 85.939 | 86.544 | 98.534 |
| WikiNER-ITA | 90.483 | 90.328 | 90.638 | 98.661 |
Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------- | -- | --------- | ------ | -------- |
| UD_Italian-ISDT | 98.563 | 98.508 | 98.618 | 98.717 |
| UD_Italian-ParTUT | 97.810 | 97.835 | 97.784 | 98.060 |
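The scores above come from fine-tuning the base checkpoint with a token-classification head. The sketch below shows how such a head can be attached with the standard `transformers` API; the label set is a hypothetical placeholder, not the scheme used for these benchmarks:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical NER label set, for illustration only
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "Musixmatch/umberto-wikipedia-uncased-v1",
    num_labels=len(labels),
)

# From here, fine-tune on the dataset-specific annotations,
# e.g. with the Trainer API or a custom training loop.
```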
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# Encode a sentence and add a batch dimension
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # shape: (1, seq_len)

# Forward pass; the first output is the last layer's hidden states
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # shape: (1, seq_len, hidden_size)
```
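If a single vector per sentence is needed rather than per-token states, one common (generic, not UmBERTo-specific) recipe is to mean-pool the last hidden states. Continuing the example above:

```python
# Average the token states into one sentence embedding
sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, hidden_size)
```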
Advanced Usage
```python
from transformers import pipeline

# Fill-mask pipeline backed by the UmBERTo checkpoint
fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-wikipedia-uncased-v1",
    tokenizer="Musixmatch/umberto-wikipedia-uncased-v1",
)

# Predict candidates for the masked token
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
```
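The pipeline returns a list of candidates, each a dict with the filled-in sequence, a score, and the predicted token. Continuing the example:

```python
# Print each candidate completion with its probability
for candidate in result:
    print(f"{candidate['score']:.3f}  {candidate['sequence']}")
```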
📖 Citation
All of the original datasets are publicly available or were released with the owners' permission, under a CC0 or CC-BY license.
- UD Italian-ISDT Dataset: Github
- UD Italian-ParTUT Dataset: Github
- I-CAB (Italian Content Annotation Bank): EvalITA Page
- WikiNER: Page, Paper
```bibtex
@inproceedings{magnini2006annotazione,
  title={Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
  author={Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, Valentina and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
  booktitle={Proc. of SILFI 2006},
  year={2006}
}

@inproceedings{magnini2006cab,
  title={I-CAB: the Italian Content Annotation Bank},
  author={Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
  booktitle={LREC},
  pages={963--968},
  year={2006},
  organization={Citeseer}
}
```
👥 Authors
- Loreto Parisi: loreto at musixmatch dot com, loretoparisi
- Simone Francia: simone.francia at musixmatch dot com, simonefrancia
- Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
👀 About Musixmatch AI
We do Machine Learning and Artificial Intelligence at @musixmatch.
Follow us on Twitter and Github.