🚀 Italian T5 Large 🇮🇹
The IT5 model family is the first effort to pretrain large-scale sequence-to-sequence transformer models for the Italian language, following the approach of the original T5 model.
This model is part of the project "IT5: Text-to-Text Pretraining for Italian Language Understanding and Generation" by Gabriele Sarti and Malvina Nissim, supported by Hugging Face, with TPU usage sponsored by Google's TPU Research Cloud. All training was done on a single TPU v3-8 VM on Google Cloud. Check the Tensorboard tab of the repository for an overview of the training process.
The inference widget is deactivated because the model requires task-specific seq2seq fine-tuning on a downstream task to be useful in practice.
✨ Features
Model Variants
This repository holds the checkpoints for the large version of the model. The model was trained for one epoch (2.1M steps) on the Thoroughly Cleaned Italian mC4 Corpus (~41B words, ~275GB) using 🤗 Datasets and the google/t5-v1_1-large improved configuration. The training procedure is available [on GitHub](https://github.com/gsarti/t5-flax-gcp).
The following table summarizes the parameters for all available models:
| Property | it5-small | it5-base | it5-large (this one) | it5-base-oscar |
|----------|-----------|----------|----------------------|----------------|
| dataset | gsarti/clean_mc4_it | gsarti/clean_mc4_it | gsarti/clean_mc4_it | oscar/unshuffled_deduplicated_it |
| architecture | google/t5-v1_1-small | google/t5-v1_1-base | google/t5-v1_1-large | t5-base |
| learning rate | 5e-3 | 5e-3 | 5e-3 | 1e-2 |
| steps | 1'050'000 | 1'050'000 | 2'100'000 | 258'000 |
| training time | 36 hours | 101 hours | 370 hours | 98 hours |
| ff projection | gated-gelu | gated-gelu | gated-gelu | relu |
| tie embeds | false | false | false | true |
| optimizer | adafactor | adafactor | adafactor | adafactor |
| max seq. length | 512 | 512 | 512 | 512 |
| per-device batch size | 16 | 16 | 8 | 16 |
| tot. batch size | 128 | 128 | 64 | 128 |
| weight decay | 1e-3 | 1e-3 | 1e-2 | 1e-3 |
| validation split size | 15K examples | 15K examples | 15K examples | 15K examples |
The high training time of it5-base-oscar was due to a bug in the training script. For individual model parameters, refer to the config.json file in the respective repositories.
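As a quick way to inspect these parameters programmatically, the configuration can be loaded without downloading the weights; this is a generic 🤗 Transformers sketch, not part of the IT5 training code.

```python
from transformers import AutoConfig

# Load only the model configuration (no weights) to inspect architecture parameters.
config = AutoConfig.from_pretrained("gsarti/it5-large")
print(config.d_model, config.num_layers, config.num_heads, config.feed_forward_proj)
print(config.tie_word_embeddings)
```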
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-large")
```
Note: You will need to fine-tune the model on your downstream seq2seq task to use it.
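As an illustration of what such fine-tuning could look like, here is a minimal sketch using the generic 🤗 Transformers Seq2SeqTrainer; the toy summarization pair, the `riassumi:` prefix, and all hyperparameters are placeholders rather than the settings used for the IT5 experiments.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-large")

# Toy in-memory dataset; replace with your own task data and (optional) prefix.
raw = Dataset.from_dict({
    "source": ["riassumi: Il gatto dorme sul divano tutto il giorno e ignora i suoi padroni."],
    "target": ["Il gatto dorme e ignora i padroni."],
})

def preprocess(batch):
    # text_target requires a reasonably recent version of 🤗 Transformers (>= 4.22).
    model_inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="it5-large-finetuned",  # placeholder output directory
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```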
Advanced Usage
Flax and TensorFlow versions of the model are also available:
```python
from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-large")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-large")
```
📚 Documentation
Limitations
⚠️ Important Note
Due to the nature of the web-scraped corpus on which IT5 models were trained, it is likely that their usage could reproduce and amplify pre-existing biases in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. For this reason, the study of such biases is explicitly encouraged, and model usage should ideally be restricted to research-oriented and non-user-facing endeavors.
Model Curators
For problems or updates on this model, please contact gabriele.sarti996@gmail.com.
Citation Information
```bibtex
@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
    abstract = "We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.",
}
```