Italian T5 Base (Oscar) 🇮🇹
This repository contains the model formerly known as `gsarti/t5-base-it`.
The IT5 model family is the first attempt to pretrain large-scale sequence-to-sequence transformer models for the Italian language, following the approach of the original T5 model.
This model is part of the project "IT5: Large-Scale Text-to-Text Pretraining for Italian Language Understanding and Generation" (to be released), by Gabriele Sarti, with the support of Hugging Face and with TPU usage sponsored by Google's TPU Research Cloud. All training was done on a single TPU v3-8 VM on Google Cloud. Check the Tensorboard tab of the repository for an overview of the training process.
The inference widget is deactivated because the model needs task-specific seq2seq fine-tuning on a downstream task to be practical. The model `gsarti/it5-base-nli` shows an example of this model fine-tuned on a downstream NLI task.
✨ Features
Model variants
This repository contains the checkpoints for a base version of the model trained on the OSCAR corpus using 🤗 Datasets. The original `t5-base` model configuration was used, except for the `dropout_rate` parameter, which was set to 0 instead of 0.1 during pre-training, following the implementation of `t5-v1.1`. The tokenizer is a `SentencePieceUnigramTokenizer` trained on the first 2M sentences of the Italian portion of the `mC4` corpus. An improved version of the model trained on the Thoroughly Cleaned Italian mC4 Corpus (~41B words, ~275GB) is also available as `gsarti/it5-base`. The training procedure is available on GitHub.
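As a quick illustration of the tokenizer described above, the snippet below (not part of the original card) loads it through 🤗 Transformers and shows how it segments an Italian sentence; the example sentence is arbitrary.

```python
# Illustrative sketch: load the SentencePiece Unigram tokenizer shipped with
# this checkpoint and inspect how it segments an Italian sentence.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("gsarti/it5-base-oscar")
print(tokenizer.vocab_size)  # size of the SentencePiece vocabulary
print(tokenizer.tokenize(
    "Il modello è stato pre-addestrato sulla porzione italiana del corpus OSCAR."
))
```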
The following table summarizes the parameters for all available models:
| Property | it5-small | it5-base | it5-large | it5-base-oscar (this one) |
|---|---|---|---|---|
| dataset | gsarti/clean_mc4_it | gsarti/clean_mc4_it | gsarti/clean_mc4_it | oscar/unshuffled_deduplicated_it |
| architecture | google/t5-v1_1-small | google/t5-v1_1-base | google/t5-v1_1-large | t5-base |
| learning rate | 5e-3 | 5e-3 | 5e-3 | 1e-2 |
| steps | 1,050,000 | 1,050,000 | 2,100,000 | 258,000 |
| training time | 36 hours | 101 hours | 370 hours | 98 hours |
| ff projection | gated-gelu | gated-gelu | gated-gelu | relu |
| tie embeds | false | false | false | true |
| optimizer | adafactor | adafactor | adafactor | adafactor |
| max seq. length | 512 | 512 | 512 | 512 |
| per-device batch size | 16 | 16 | 8 | 16 |
| tot. batch size | 128 | 128 | 64 | 128 |
| weight decay | 1e-3 | 1e-3 | 1e-2 | 1e-3 |
| validation split size | 15K examples | 15K examples | 15K examples | 15K examples |
The high training time of `it5-base-oscar` was due to a bug in the training script.
For a list of individual model parameters, refer to the `config.json` file in the respective repositories.
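For example, the values summarized in the table can also be read programmatically from the hosted configuration. This is a minimal sketch (attribute names follow the standard 🤗 Transformers `T5Config`); note that the values stored in the released `config.json` may differ from the pre-training settings, e.g. dropout is commonly re-enabled for fine-tuning.

```python
# Minimal sketch: inspect the released config.json of this checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gsarti/it5-base-oscar")
print(config.feed_forward_proj)    # "relu" for this checkpoint (see table above)
print(config.tie_word_embeddings)  # True for this checkpoint
print(config.dropout_rate)         # 0 was used during pre-training; the released
                                   # config may re-enable dropout for fine-tuning
```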
📦 Installation
No dedicated package is needed: the checkpoints can be loaded with the 🤗 Transformers library and the SentencePiece backend used by the T5 tokenizer, e.g. `pip install transformers sentencepiece`, together with your preferred framework (PyTorch, Flax, or TensorFlow).
💻 Usage Examples
Basic Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the SentencePiece tokenizer and the PyTorch checkpoint.
tokenizer = T5Tokenizer.from_pretrained("gsarti/it5-base-oscar")
model = T5ForConditionalGeneration.from_pretrained("gsarti/it5-base-oscar")
```
Note: you will need to fine-tune the model on your downstream seq2seq task to use it; see `gsarti/it5-base-nli` (mentioned above) for an example.
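As a purely illustrative sketch of what such fine-tuning involves, the snippet below runs a single seq2seq training step on a made-up input/target pair; the task prefix, example texts, and optimizer settings are placeholders, not part of the original training setup.

```python
# Hypothetical single fine-tuning step (placeholder data and hyperparameters).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("gsarti/it5-base-oscar")
model = T5ForConditionalGeneration.from_pretrained("gsarti/it5-base-oscar")

# Placeholder input/target pair; replace with your downstream task data.
inputs = tokenizer(
    "riassumi: IT5 è una famiglia di modelli sequence-to-sequence pre-addestrati su testi italiani.",
    return_tensors="pt",
)
labels = tokenizer("IT5 è pre-addestrato sull'italiano.", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs = model(**inputs, labels=labels)  # seq2seq cross-entropy loss
outputs.loss.backward()
optimizer.step()
```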
Advanced Usage
```python
from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

# Flax and TensorFlow checkpoints are also available in this repository.
model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-base-oscar")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-base-oscar")
```
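Until it is fine-tuned, the pretrained checkpoint only performs the T5-style span-corruption objective, i.e. filling sentinel placeholders such as `<extra_id_0>`. The sketch below (assuming the standard T5 sentinel-token setup) makes this concrete.

```python
# Sketch of the raw pre-training objective: the un-fine-tuned model can only
# fill sentinel spans, not solve downstream tasks directly.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("gsarti/it5-base-oscar")
model = T5ForConditionalGeneration.from_pretrained("gsarti/it5-base-oscar")

inputs = tokenizer("Roma è la <extra_id_0> d'Italia.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```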
📚 Documentation
Limitations
Due to the nature of the web-scraped corpus on which IT5 models were trained, it is likely that their usage could reproduce and amplify pre-existing biases in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. For this reason, the study of such biases is explicitly encouraged, and model usage should ideally be restricted to research-oriented and non-user-facing endeavors.
Model curators
For problems or updates on this model, please contact gabriele.sarti996@gmail.com.
📄 License
This model is released under the Apache 2.0 license.
📖 Citation Information
```bibtex
@article{sarti-nissim-2022-it5,
  title   = {IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation},
  author  = {Sarti, Gabriele and Nissim, Malvina},
  journal = {ArXiv preprint 2203.03759},
  url     = {https://arxiv.org/abs/2203.03759},
  year    = {2022},
  month   = {mar}
}
```