🚀 UmBERTo Wikipedia Uncased
UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches, SentencePiece and Whole Word Masking, and is now available on Hugging Face.
*Marco Lodola, Monument to Umberto Eco, Alessandria 2019*
🚀 Quick Start
UmBERTo loads directly with 🤗 Transformers (github.com/huggingface/transformers); see the usage examples below.
📦 Dataset
The UmBERTo-Wikipedia-Uncased model was trained on a relatively small corpus (~7 GB) extracted from the Italian Wikipedia (Wikipedia-ITA).
📚 Pre-trained model
| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ----- | --- | ----- | --------- | ---------- | ----------- | -------- |
| `umberto-wikipedia-uncased-v1` | YES | NO | SPM | 32K | 100k | Link |
This model was trained with SentencePiece and Whole Word Masking.
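As a quick way to see the SentencePiece vocabulary in action, the sketch below (an illustrative example, not from the original card) prints the subword pieces the tokenizer produces; the `▁` prefix marks the beginning of a word:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# SentencePiece splits the sentence into subword pieces; "▁" marks word starts
print(tokenizer.tokenize("umberto eco è stato un grande scrittore"))
```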
💪 Downstream Tasks
These results refer to the umberto-wikipedia-uncased model; full details are available on the official UmBERTo page. A rough fine-tuning sketch follows the tables below.
Named Entity Recognition (NER)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------- | -- | --------- | ------ | -------- |
| ICAB-EvalITA07 | 86.240 | 85.939 | 86.544 | 98.534 |
| WikiNER-ITA | 90.483 | 90.328 | 90.638 | 98.661 |
Part of Speech (POS)
| Dataset | F1 | Precision | Recall | Accuracy |
| ------- | -- | --------- | ------ | -------- |
| UD_Italian-ISDT | 98.563 | 98.508 | 98.618 | 98.717 |
| UD_Italian-ParTUT | 97.810 | 97.835 | 97.784 | 98.060 |
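The scores above come from fine-tuning the base checkpoint with a token-classification head. The sketch below shows how such a head can be attached with the standard `transformers` API; the label set is a hypothetical placeholder, not the scheme used for these benchmarks:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical NER label set, for illustration only
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "Musixmatch/umberto-wikipedia-uncased-v1",
    num_labels=len(labels),
)

# From here, fine-tune on the dataset-specific annotations,
# e.g. with the Trainer API or a custom training loop.
```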
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# Encode a sentence and add a batch dimension
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # shape: (1, seq_len)

# Forward pass; the first output is the last layer's hidden states
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # shape: (1, seq_len, hidden_size)
```
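If a single vector per sentence is needed rather than per-token states, one common (generic, not UmBERTo-specific) recipe is to mean-pool the last hidden states. Continuing the example above:

```python
# Average the token states into one sentence embedding
sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, hidden_size)
```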
Advanced Usage
```python
from transformers import pipeline

# Fill-mask pipeline backed by the UmBERTo checkpoint
fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-wikipedia-uncased-v1",
    tokenizer="Musixmatch/umberto-wikipedia-uncased-v1",
)

# Predict candidates for the masked token
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
```
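The pipeline returns a list of candidates, each a dict with the filled-in sequence, a score, and the predicted token. Continuing the example:

```python
# Print each candidate completion with its probability
for candidate in result:
    print(f"{candidate['score']:.3f}  {candidate['sequence']}")
```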
📖 Citation
All of the original datasets are publicly available or were released with the owners' permission, under a CC0 or CC-BY license.
- UD Italian-ISDT Dataset: Github
- UD Italian-ParTUT Dataset: Github
- I-CAB (Italian Content Annotation Bank): EvalITA Page
- WikiNER: Page, Paper
```bibtex
@inproceedings{magnini2006annotazione,
  title={Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
  author={Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, Valentina and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
  booktitle={Proc. of SILFI 2006},
  year={2006}
}

@inproceedings{magnini2006cab,
  title={I-CAB: the Italian Content Annotation Bank},
  author={Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
  booktitle={LREC},
  pages={963--968},
  year={2006},
  organization={Citeseer}
}
```
👥 Authors
- Loreto Parisi: loreto at musixmatch dot com, loretoparisi
- Simone Francia: simone.francia at musixmatch dot com, simonefrancia
- Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
👀 About Musixmatch AI
We do Machine Learning and Artificial Intelligence at @musixmatch.
Follow us on Twitter and Github.