it5-large open-source model - free sequence-to-sequence modeling for Italian language processing

It5 Large

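Developed by gsarti

IT5 is the first family of sequence-to-sequence Transformer models pretrained at scale on Italian, following the approach of the T5 model.

Large language model | Other | License: Apache-2.0 | #Italian generation #Sequence-to-sequence #Large-scale pretraining

Downloads: 37

Released: 3/2/2022

Model Overview

The IT5 family consists of sequence-to-sequence Transformer models built specifically for Italian. After task-specific fine-tuning, they can be applied to a range of natural language understanding and generation tasks.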

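Model Features

Italian-specific pretraining

The first sequence-to-sequence Transformer models pretrained at scale specifically on Italian

Improved T5 architecture

Based on the improved google/t5-v1_1-large configuration, which uses a gated-GELU feed-forward activation

Large-scale training data

Trained on a cleaned version of the Italian mC4 corpus (about 41 billion words)

Multi-framework support

Available for PyTorch, Flax, and TensorFlow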
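To make the gated-GELU feed-forward activation mentioned above more concrete, here is a minimal pure-Python sketch that contrasts it with a plain ReLU feed-forward block. It is an illustration only, not the model's actual implementation: the weight names `wi_0`, `wi_1`, and `wo` are borrowed from the Hugging Face T5 layer naming, the toy matrices are made up, the tanh approximation of GELU is assumed, and dropout is left out.

```python
import math


def gelu(x):
    # Tanh approximation of GELU (assumed here for illustration).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def relu(x):
    return max(0.0, x)


def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector; T5 layers use no bias terms.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu_ffn(x, wi, wo):
    # Plain feed-forward block: project up, apply ReLU, project back down.
    hidden = [relu(h) for h in matvec(wi, x)]
    return matvec(wo, hidden)


def gated_gelu_ffn(x, wi_0, wi_1, wo):
    # Gated variant: a GELU-activated projection is multiplied element-wise
    # with a second, purely linear projection before projecting back down.
    gate = [gelu(h) for h in matvec(wi_0, x)]
    linear = matvec(wi_1, x)
    hidden = [g * l for g, l in zip(gate, linear)]
    return matvec(wo, hidden)


# Toy dimensions (hypothetical): d_model = 2, d_ff = 3.
x = [1.0, -2.0]
wi_0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
wi_1 = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
wo = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]

print(relu_ffn(x, wi_0, wo))
print(gated_gelu_ffn(x, wi_0, wi_1, wo))
```

The gated block carries two input projections instead of one, which is why the gated-GELU configurations have a different parameter layout from the ReLU-based `t5-base` configuration.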

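Model Capabilities

Italian text understanding

Italian text generation

Sequence-to-sequence task handling

Use Cases

Natural language processing

Italian text summarization

Generate concise summaries of Italian text after fine-tuning on a summarization dataset

Italian machine translation

Can be fine-tuned for translation tasks between Italian and other languages

Italian question answering

Build Italian question-answering applications on top of a fine-tuned checkpoint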

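🚀 Italian T5 Large 🇮🇹

The IT5 model family is the first effort to pretrain large-scale sequence-to-sequence Transformer models for Italian, following the same approach as the original T5 model. Once fine-tuned, the models can be used for Italian natural language understanding and generation tasks.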
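The original T5 approach pretrains the model with a span-corruption objective: spans of the input are replaced by sentinel tokens, and the model learns to generate the missing spans. The sketch below illustrates the idea on a toy Italian sentence. It is a simplification under stated assumptions: the spans are chosen by hand rather than sampled at random, whitespace splitting stands in for the real tokenizer, and the `<extra_id_N>` sentinel names follow the Hugging Face T5 vocabulary.

```python
def span_corrupt(tokens, spans):
    """Replace each non-overlapping (start, end) span with a sentinel token.

    Returns the corrupted input and the target the model should generate.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mark where the span was removed
        targets.append(sentinel)             # announce which span follows
        targets.extend(tokens[start:end])    # the removed tokens themselves
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets


tokens = "La torre di Pisa è famosa in tutto il mondo".split()
inputs, targets = span_corrupt(tokens, [(1, 3), (6, 8)])
print(" ".join(inputs))   # La <extra_id_0> Pisa è famosa <extra_id_1> il mondo
print(" ".join(targets))  # <extra_id_0> torre di <extra_id_1> in tutto <extra_id_2>
```

Because the pretrained model has only seen this fill-in-the-gaps objective, it needs task-specific fine-tuning before it produces useful summaries, answers, or other outputs.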

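🚀 Quick Start

This project was started by Gabriele Sarti and Malvina Nissim with the support of Huggingface, and training was made possible by TPU resources from Google's TPU Research Cloud. All training was conducted on a single TPU3v8-VM machine on Google Cloud. See the Tensorboard tab of the repository for an overview of the training process.

The inference widget is deactivated because the model needs task-specific sequence-to-sequence fine-tuning on a downstream task to be useful in practice.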

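✨ Key Features

Model variants

This repository contains the checkpoints for the large version of the model. The model was trained for one epoch (1.05M steps) on the thoroughly cleaned Italian mC4 corpus (about 41 billion words, about 275 GB) using 🤗 Datasets and the improved google/t5-v1_1-large configuration. The training procedure is available on [Github](https://github.com/gsarti/t5-flax-gcp).

The following table summarizes the parameters of all available models: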

Property	`it5-small`	`it5-base`	`it5-large` (this model)	`it5-base-oscar`
`dataset`	`gsarti/clean_mc4_it`	`gsarti/clean_mc4_it`	`gsarti/clean_mc4_it`	`oscar/unshuffled_deduplicated_it`
`architecture`	`google/t5-v1_1-small`	`google/t5-v1_1-base`	`google/t5-v1_1-large`	`t5-base`
`learning rate`	5e-3	5e-3	5e-3	1e-2
`steps`	1050000	1050000	2100000	258000
`training time`	36 hours	101 hours	370 hours	98 hours
`feed-forward projection`	`gated-gelu`	`gated-gelu`	`gated-gelu`	`relu`
`tie embeddings`	`false`	`false`	`false`	`true`
`optimizer`	adafactor	adafactor	adafactor	adafactor
`max sequence length`	512	512	512	512
`per-device batch size`	16	16	8	16
`total batch size`	128	128	64	128
`weight decay`	1e-3	1e-3	1e-2	1e-3
`validation split size`	15K examples	15K examples	15K examples	15K examples

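The high training time of it5-base-oscar is due to a bug in the training script.

For the full list of parameters of an individual model, refer to the config.json file in the respective repository.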
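The step counts and batch sizes in the table can be compared more easily once they are multiplied out. The short sketch below does that arithmetic with values copied from the table. The helper names are made up for illustration, and the device count is only what the two batch-size rows imply if no gradient accumulation was used, which the source does not state.

```python
# Steps and batch sizes copied from the table above.
runs = {
    "it5-small": {"steps": 1_050_000, "per_device_batch": 16, "total_batch": 128},
    "it5-base": {"steps": 1_050_000, "per_device_batch": 16, "total_batch": 128},
    "it5-large": {"steps": 2_100_000, "per_device_batch": 8, "total_batch": 64},
    "it5-base-oscar": {"steps": 258_000, "per_device_batch": 16, "total_batch": 128},
}


def sequences_seen(run):
    # Each optimizer step consumes one batch of `total_batch` sequences.
    return run["steps"] * run["total_batch"]


def implied_devices(run):
    # Assumption: total batch = per-device batch x number of devices.
    return run["total_batch"] // run["per_device_batch"]


for name, run in runs.items():
    print(f"{name}: {sequences_seen(run):,} sequences, {implied_devices(run)} devices implied")
```

Under these assumptions, the 2,100,000 steps of `it5-large` at a total batch size of 64 cover the same 134,400,000 sequences as the 1,050,000 steps of `it5-small` and `it5-base` at a total batch size of 128.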

💻 Usage Examples

Basic usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-large")

Note: you need to fine-tune the model on a downstream sequence-to-sequence task before using it.
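As a rough idea of what such fine-tuning involves, the sketch below shows a single PyTorch training step on a batch of source and target texts. It is a minimal sketch, not the authors' training recipe: the function names are hypothetical, the optimizer and data are left to the caller, and it assumes a `transformers` version whose tokenizers accept the `text_target` argument. PyTorch is imported inside the function so that the padding helper can be read and run on its own.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def mask_padding(label_ids, pad_token_id):
    # Replace padding positions in the labels so they do not contribute to the loss.
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


def training_step(model, tokenizer, optimizer, sources, targets, max_length=512):
    """Run one gradient update on a batch of (source, target) text pairs."""
    import torch  # lazy import: only needed when a step is actually run

    # Tokenize inputs and targets together; targets end up under "labels".
    batch = tokenizer(
        sources,
        text_target=targets,
        max_length=max_length,
        truncation=True,
        padding=True,
    )
    input_ids = torch.tensor(batch["input_ids"])
    attention_mask = torch.tensor(batch["attention_mask"])
    labels = torch.tensor(mask_padding(batch["labels"], tokenizer.pad_token_id))

    model.train()
    # Passing labels makes the model build decoder inputs and return the loss.
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

With the tokenizer and model loaded as shown earlier and any PyTorch optimizer built over `model.parameters()`, a call could look like `training_step(model, tokenizer, optimizer, ["testo di partenza"], ["testo di arrivo"])`. Real fine-tuning would loop over a task dataset and add evaluation, checkpointing, and a learning-rate schedule.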

Flax and TensorFlow versions of the model are also available:

from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-large")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-large")

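📚 Detailed Documentation

Limitations

Because IT5 models were trained on a web-scraped corpus, using them may reproduce and amplify biases already present in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. Studying these biases is therefore encouraged, and model usage should ideally be restricted to research-oriented and non-user-facing projects.

Model maintainer

If you run into problems or need updates for this model, please contact gabriele.sarti996@gmail.com.

Citation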

@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
    abstract = "We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.",
}