🚀 UmBERTo Commoncrawl Cased
UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is available on Hugging Face.
*Marco Lodola, Monument to Umberto Eco, Alessandria 2019*
🚀 Quick Start
UmBERTo works out of the box with the Hugging Face `transformers` library; the usage examples below show how to get started.
✨ Features
- Training Data: Utilizes the Italian subcorpus of OSCAR as the training set.
- Innovative Approaches: Trained with SentencePiece and Whole Word Masking.
📦 Installation
UmBERTo is distributed through the Hugging Face `transformers` library; install it together with PyTorch, e.g. `pip install transformers torch`.
💻 Usage Examples
Basic Usage
Load UmBERTo with `AutoModel` and `AutoTokenizer`:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# Encode a sentence and add a batch dimension
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # batch size 1

# Forward pass; the first element of the output holds the token-level hidden states
outputs = umberto(input_ids)
last_hidden_states = outputs[0]
```
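If you need a single vector per sentence, one common option is to mean-pool the token-level hidden states. A minimal sketch, assuming the snippet above has already run (an illustrative choice, not one prescribed by the model card):

```python
# Mean-pool the token-level hidden states into a single sentence vector
sentence_embedding = last_hidden_states.mean(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size), e.g. torch.Size([1, 768]) for a base-size model
```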
Advanced Usage
Predict a masked token:
```python
from transformers import pipeline

# Build a fill-mask pipeline backed by UmBERTo
fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-commoncrawl-cased-v1",
    tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
)

# <mask> marks the position for the model to fill in
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
```
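The pipeline returns a list of candidate completions, each with the filled-in sequence, the predicted token, and its score. A quick way to inspect them:

```python
# Each prediction is a dict with "sequence", "score", "token" and "token_str"
for prediction in result:
    print(prediction["token_str"], round(prediction["score"], 4))
```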
📚 Documentation
Dataset
UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of OSCAR as the training set. The deduplicated version of the Italian corpus consists of 70 GB of plain text, about 210M sentences and 11B words. The sentences were filtered and shuffled at the line level to make the corpus suitable for NLP research.
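For reference, the deduplicated Italian split of OSCAR can be streamed with the 🤗 `datasets` library. A sketch; note that the Hub version you get this way is not necessarily the exact snapshot UmBERTo was trained on:

```python
from datasets import load_dataset

# Stream the deduplicated Italian OSCAR corpus without downloading all 70 GB
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)
print(next(iter(oscar_it))["text"][:200])
```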
Pre-trained model
| Property | Details |
|----------|---------|
| Model | umberto-commoncrawl-cased-v1 |
| WWM | YES |
| Cased | YES |
| Tokenizer | SPM |
| Vocab Size | 32K |
| Train Steps | 125k |
| Download | Link |
This model was trained with SentencePiece and Whole Word Masking.
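You can see the SentencePiece tokenizer at work directly; under Whole Word Masking, all subword pieces belonging to one word are masked together during pretraining. A small sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# SentencePiece segments the text into subword pieces from the 32K vocabulary;
# rare words are split into several pieces
print(tokenizer.tokenize("Umberto Eco è stato un grande scrittore"))
```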
Downstream Tasks
These results refer to the umberto-commoncrawl-cased model. All details are on the Umberto official page.
Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
|---------|-----|-----------|--------|----------|
| ICAB-EvalITA07 | 87.565 | 86.596 | 88.556 | 98.690 |
| WikiNER-ITA | 92.531 | 92.509 | 92.553 | 99.136 |
Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
|---------|-----|-----------|--------|----------|
| UD_Italian-ISDT | 98.870 | 98.861 | 98.879 | 98.977 |
| UD_Italian-ParTUT | 98.786 | 98.812 | 98.760 | 98.903 |
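Note that the published checkpoint is the pretrained language model, not a fine-tuned tagger: to reproduce numbers like those above, you would first fine-tune it on the corresponding dataset. A minimal sketch of the starting point (the `num_labels` value is hypothetical and depends on your tagset):

```python
from transformers import AutoModelForTokenClassification

# Loading the pretrained LM with a token-classification head: the head is
# randomly initialized and must be fine-tuned (e.g. on ICAB-EvalITA07 or
# WikiNER-ITA) before it produces meaningful predictions
model = AutoModelForTokenClassification.from_pretrained(
    "Musixmatch/umberto-commoncrawl-cased-v1",
    num_labels=9,  # hypothetical: set to the size of your NER/POS label set
)
```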
Citation
All of the original datasets are publicly available or were released with the owners' consent. The datasets are all released under a CC0 or CC-BY license.
- UD Italian-ISDT Dataset Github
- UD Italian-ParTUT Dataset Github
- I-CAB (Italian Content Annotation Bank), EvalITA Page
- WikiNER Page, Paper
```bibtex
@inproceedings{magnini2006annotazione,
  title={Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
  author={Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
  booktitle={Proc. of SILFI 2006},
  year={2006}
}

@inproceedings{magnini2006cab,
  title={I-CAB: the Italian Content Annotation Bank.},
  author={Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
  booktitle={LREC},
  pages={963--968},
  year={2006},
  organization={Citeseer}
}
```
👥 Authors
- Loreto Parisi: loreto at musixmatch dot com, loretoparisi
- Simone Francia: simone.francia at musixmatch dot com, simonefrancia
- Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
🌟 About Musixmatch AI
We do Machine Learning and Artificial Intelligence @musixmatch.
Follow us on Twitter and GitHub.