Umberto Commoncrawl Cased V1

Developed by Musixmatch
Italian language model based on the RoBERTa architecture, trained with SentencePiece tokenization and Whole Word Masking
Downloads 13.19k
Release Time: 3/2/2022

Model Overview

UmBERTo is a RoBERTa-based language model trained on a large-scale Italian corpus and intended for Italian natural language processing tasks.
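As a rough illustration of how a model like this is typically queried, the sketch below wraps the Hugging Face `transformers` fill-mask pipeline. The Hub identifier `Musixmatch/umberto-commoncrawl-cased-v1` and the helper name `fill_mask` are assumptions, since this card does not state how the model is distributed. Running it also assumes that `transformers`, a backend such as PyTorch, and the `sentencepiece` package are installed.

```python
# Assumed Hugging Face Hub identifier; this card does not state it.
MODEL_ID = "Musixmatch/umberto-commoncrawl-cased-v1"


def fill_mask(text, top_k=5):
    """Return the top_k candidate fillers for the <mask> token in `text`.

    Hypothetical helper: it downloads the model on first use.
    """
    # Imported lazily so that defining the helper does not require
    # transformers to be installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    # Each prediction is a dict with "token_str", "score" and "sequence".
    return unmasker(text, top_k=top_k)
```

Calling `fill_mask("Umberto Eco è <mask> un grande scrittore")` should return a ranked list of candidate words for the masked position; the exact predictions depend on the model weights, so none are shown here.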

Model Features

Whole Word Masking
Masks all subword pieces of a word together during pretraining, so the model learns to predict complete words rather than isolated fragments
SentencePiece Tokenization
Uses the SentencePiece tokenizer, which splits raw text into subword pieces and can therefore represent accented characters and rare Italian words
Large-scale Training Data
Trained on the Italian subcorpus of OSCAR, comprising 70 GB of plain text and 11 billion words
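The interaction between the first two features can be sketched in a few lines. SentencePiece marks word-initial pieces with the "▁" character, so pieces can be grouped back into words and each selected word masked as a unit. This is a minimal sketch, not the training code used for UmBERTo: the piece list is hand-written rather than real tokenizer output, the 15% masking rate is the common BERT default and is assumed here, and the usual random-replacement and keep-as-is variants are omitted.

```python
import random

WORD_START = "\u2581"  # SentencePiece prefixes word-initial pieces with "▁"
MASK = "<mask>"


def group_into_words(pieces):
    """Group piece indices so that each group covers one whole word."""
    groups = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not groups:
            groups.append([i])      # a new word starts here
        else:
            groups[-1].append(i)    # continuation of the current word
    return groups


def whole_word_mask(pieces, mask_prob=0.15, seed=0):
    """Mask whole words: every piece of a chosen word becomes MASK."""
    groups = group_into_words(pieces)
    if not groups:
        return []
    rng = random.Random(seed)
    # Mask at least one word; mask_prob = 0.15 is an assumed default.
    n_to_mask = min(len(groups), max(1, round(len(groups) * mask_prob)))
    masked = list(pieces)
    for group in rng.sample(groups, n_to_mask):
        for i in group:
            masked[i] = MASK
    return masked


# Hand-written pieces for illustration, not actual tokenizer output.
pieces = ["▁Umberto", "▁Eco", "▁è", "▁uno", "▁scritt", "ore"]
masked = whole_word_mask(pieces)
```

With plain token-level masking, "▁scritt" could be hidden while "ore" stays visible, which makes the prediction much easier; grouping by word removes that shortcut.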

Model Capabilities

Named Entity Recognition
Part-of-Speech Tagging
Italian Text Understanding

Use Cases

Text Analysis
Named Entity Recognition
Identify entities such as person names, locations, and organizations in Italian text
Achieved an F1 score of 87.565 on the ICAB-EvalITA07 dataset and 92.531 on the WikiNER-ITA dataset
Part-of-Speech Tagging
Annotate parts of speech for words in Italian text
Achieved an accuracy of 98.977% on the UD_Italian-ISDT dataset
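For readers unfamiliar with the two metrics reported above, the following sketch shows one common way to compute them: token-level accuracy for part-of-speech tags and exact-match span F1 for named entities in BIO format. The card does not say which scoring script produced its figures, so the exact-match span convention is an assumption, and the function names and tag sequences are made up for illustration.

```python
def accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, kind = set(), None, None
    # The trailing "O" is a sentinel that closes a span ending at the last tag.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match F1: a span counts only if boundaries and type both match."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction finds the person but misses the location.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
f1 = span_f1(gold, pred)
acc = accuracy(gold, pred)
```

In this toy case precision is 1.0 and recall is 0.5, so F1 is about 0.667, while token accuracy is 0.75. The gap shows why the two metrics are not directly comparable across tasks.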