🚀 UmBERTo Commoncrawl Cased
UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is available on Hugging Face.
*Marco Lodola, Monument to Umberto Eco, Alessandria 2019*
🚀 Quick Start
UmBERTo works out of the box with the Hugging Face `transformers` library; the usage examples below show how to get started.
✨ Features
- Training Data: Utilizes the Italian subcorpus of OSCAR as the training set.
- Innovative Approaches: Trained with SentencePiece and Whole Word Masking.
📦 Installation
UmBERTo is distributed through the Hugging Face `transformers` library; install it together with PyTorch, e.g. `pip install transformers torch`.
💻 Usage Examples
Basic Usage
Load UmBERTo with `AutoModel` and `AutoTokenizer`:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# Encode a sentence and add a batch dimension
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # batch size 1

# Forward pass; the first element of the output holds the token-level hidden states
outputs = umberto(input_ids)
last_hidden_states = outputs[0]
```
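If you need a single vector per sentence, one common option is to mean-pool the token-level hidden states. A minimal sketch, assuming the snippet above has already run (an illustrative choice, not one prescribed by the model card):

```python
# Mean-pool the token-level hidden states into a single sentence vector
sentence_embedding = last_hidden_states.mean(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size), e.g. torch.Size([1, 768]) for a base-size model
```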
Advanced Usage
Predict a masked token:
```python
from transformers import pipeline

# Build a fill-mask pipeline backed by UmBERTo
fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-commoncrawl-cased-v1",
    tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
)

# <mask> marks the position for the model to fill in
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
```
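The pipeline returns a list of candidate completions, each with the filled-in sequence, the predicted token, and its score. A quick way to inspect them:

```python
# Each prediction is a dict with "sequence", "score", "token" and "token_str"
for prediction in result:
    print(prediction["token_str"], round(prediction["score"], 4))
```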
📚 Documentation
Dataset
UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of OSCAR as the training set. The deduplicated version of the Italian corpus consists of 70 GB of plain text, about 210M sentences and 11B words. The sentences were filtered and shuffled at the line level to make the corpus suitable for NLP research.
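For reference, the deduplicated Italian split of OSCAR can be streamed with the 🤗 `datasets` library. A sketch; note that the Hub version you get this way is not necessarily the exact snapshot UmBERTo was trained on:

```python
from datasets import load_dataset

# Stream the deduplicated Italian OSCAR corpus without downloading all 70 GB
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)
print(next(iter(oscar_it))["text"][:200])
```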
Pre-trained model
| Property | Details |
|----------|---------|
| Model | umberto-commoncrawl-cased-v1 |
| WWM | YES |
| Cased | YES |
| Tokenizer | SPM |
| Vocab Size | 32K |
| Train Steps | 125k |
| Download | Link |
This model was trained with SentencePiece and Whole Word Masking.
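You can see the SentencePiece tokenizer at work directly; under Whole Word Masking, all subword pieces belonging to one word are masked together during pretraining. A small sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# SentencePiece segments the text into subword pieces from the 32K vocabulary;
# rare words are split into several pieces
print(tokenizer.tokenize("Umberto Eco è stato un grande scrittore"))
```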
Downstream Tasks
These results refer to the umberto-commoncrawl-cased model. All details are on the Umberto official page.
Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
|---------|-----|-----------|--------|----------|
| ICAB-EvalITA07 | 87.565 | 86.596 | 88.556 | 98.690 |
| WikiNER-ITA | 92.531 | 92.509 | 92.553 | 99.136 |
Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
|---------|-----|-----------|--------|----------|
| UD_Italian-ISDT | 98.870 | 98.861 | 98.879 | 98.977 |
| UD_Italian-ParTUT | 98.786 | 98.812 | 98.760 | 98.903 |
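Note that the published checkpoint is the pretrained language model, not a fine-tuned tagger: to reproduce numbers like those above, you would first fine-tune it on the corresponding dataset. A minimal sketch of the starting point (the `num_labels` value is hypothetical and depends on your tagset):

```python
from transformers import AutoModelForTokenClassification

# Loading the pretrained LM with a token-classification head: the head is
# randomly initialized and must be fine-tuned (e.g. on ICAB-EvalITA07 or
# WikiNER-ITA) before it produces meaningful predictions
model = AutoModelForTokenClassification.from_pretrained(
    "Musixmatch/umberto-commoncrawl-cased-v1",
    num_labels=9,  # hypothetical: set to the size of your NER/POS label set
)
```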
Citation
All of the original datasets are publicly available or were released with the owners' consent. The datasets are all released under a CC0 or CC-BY license.
- UD Italian-ISDT Dataset Github
- UD Italian-ParTUT Dataset Github
- I-CAB (Italian Content Annotation Bank), EvalITA Page
- WikiNER Page, Paper
```bibtex
@inproceedings{magnini2006annotazione,
  title={Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
  author={Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
  booktitle={Proc. of SILFI 2006},
  year={2006}
}

@inproceedings{magnini2006cab,
  title={I-CAB: the Italian Content Annotation Bank.},
  author={Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
  booktitle={LREC},
  pages={963--968},
  year={2006},
  organization={Citeseer}
}
```
👥 Authors
- Loreto Parisi: loreto at musixmatch dot com, loretoparisi
- Simone Francia: simone.francia at musixmatch dot com, simonefrancia
- Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
🌟 About Musixmatch AI
We do Machine Learning and Artificial Intelligence @musixmatch.
Follow us on Twitter and GitHub.