it5-large open-source model: free sequence-to-sequence processing for Italian

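IT5 Large

Developed by gsarti

IT5 is the first family of sequence-to-sequence Transformer models pretrained at scale specifically for Italian, following the approach of the T5 model.

Large language model · Other · License: Apache-2.0 · Tags: #ItalianGeneration #SequenceToSequence #LargeScalePretraining

Downloads: 37

Released: 3/2/2022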

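Model Overview

The IT5 family consists of sequence-to-sequence Transformer models designed specifically for Italian and suitable, after fine-tuning, for a range of natural language understanding and generation tasks.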

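Model Features

Italian-specific pretraining

The first sequence-to-sequence Transformer models pretrained at scale specifically for Italian

Improved T5 architecture

Based on the improved google/t5-v1_1-large configuration, which uses a gated-GELU activation in the feed-forward layers

Large-scale training data

Trained on a cleaned version of the Italian mC4 corpus (about 41 billion words)

Multi-framework support

Available for PyTorch, Flax, and TensorFlow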
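To make the gated-GELU feed-forward layer mentioned above concrete, here is a minimal pure-Python sketch of the computation as it is usually described for T5 v1.1: one projection passes through GELU, a second projection stays linear, the two are multiplied element-wise, and the result is projected back to the model dimension. The toy weights, the dimensions, and the helper names (`gelu_tanh`, `matvec`, `gated_gelu_ffn`) are illustrative assumptions, not values or code from the IT5 checkpoints, and the tanh approximation of GELU is assumed.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def matvec(weights, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_gelu_ffn(x, w_gate, w_linear, w_out):
    """Gated-GELU feed-forward block without biases."""
    gate = [gelu_tanh(h) for h in matvec(w_gate, x)]   # activated branch
    linear = matvec(w_linear, x)                       # linear branch
    hidden = [g * l for g, l in zip(gate, linear)]     # element-wise gating
    return matvec(w_out, hidden)                       # back to d_model


# Toy example with d_model = 2 and d_ff = 3 (hypothetical weights).
x = [1.0, -0.5]
w_gate = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
w_linear = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_out = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.4]]

y = gated_gelu_ffn(x, w_gate, w_linear, w_out)
print(y)
```

Compared with a plain ReLU feed-forward layer (the `relu` setting used by `it5-base-oscar`), the gated variant needs two input projections instead of one.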

Model Capabilities

Italian text understanding

Italian text generation

Sequence-to-sequence task processing

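Use Cases

Natural Language Processing

Italian text summarization

Fine-tune the model to generate concise summaries of Italian text

Italian machine translation

Fine-tune the model for translation tasks between Italian and other languages

Italian question answering

Fine-tune the model to build Italian question-answering applications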

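🚀 Italian T5 Large 🇮🇹

The IT5 model family is the first effort to pretrain large-scale sequence-to-sequence Transformer models for Italian, following the approach adopted by the original T5 model. Once fine-tuned, the models can be applied to Italian natural language processing tasks such as text generation and understanding.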

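🚀 Quick Start

This project was started by Gabriele Sarti and Malvina Nissim with the support of Huggingface, and the models were trained on TPUs provided by Google's TPU Research Cloud. All training was conducted on a single TPU v3-8 VM on Google Cloud. See the Tensorboard tab of the repository for an overview of the training process.

The inference widget is deactivated because the model needs task-specific sequence-to-sequence fine-tuning on a downstream task before it is useful in practice.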
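The need for fine-tuning follows from how T5-style models are pretrained. Assuming IT5 uses the standard T5 span-corruption objective (the source only states that it follows the T5 approach), the model learns to fill in masked spans rather than to summarize or answer questions. The sketch below shows the input/target format of that objective on a hand-picked example. The function name `span_corrupt` is hypothetical, the spans are chosen manually rather than sampled at random as in real pretraining, and whitespace splitting stands in for the actual tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # the tokens the model must predict
        cursor = end
    inputs.extend(tokens[cursor:])           # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Il gatto dorme sul divano rosso".split()
corrupted, target = span_corrupt(tokens, [(1, 2), (4, 5)])

print(" ".join(corrupted))  # Il <extra_id_0> dorme sul <extra_id_1> rosso
print(" ".join(target))     # <extra_id_0> gatto <extra_id_1> divano <extra_id_2>
```

A model trained only on pairs like these produces sentinel-delimited fragments, which is why a downstream task needs its own supervised input/output pairs.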

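✨ Main Features

Model Variants

This repository contains the checkpoints for the large version of the model. The model was trained for one epoch (2.1 million steps at a total batch size of 64, as listed in the table below) on the thoroughly cleaned Italian mC4 corpus (about 41 billion words, about 275 GB) using 🤗 Datasets and the improved google/t5-v1_1-large configuration. The training procedure is available on [Github](https://github.com/gsarti/t5-flax-gcp).

The following table summarizes the parameters of all available models: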

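| Attribute | `it5-small` | `it5-base` | `it5-large` (this model) | `it5-base-oscar` |
|---|---|---|---|---|
| Dataset | `gsarti/clean_mc4_it` | `gsarti/clean_mc4_it` | `gsarti/clean_mc4_it` | `oscar/unshuffled_deduplicated_it` |
| Architecture | `google/t5-v1_1-small` | `google/t5-v1_1-base` | `google/t5-v1_1-large` | `t5-base` |
| Learning rate | 5e-3 | 5e-3 | 5e-3 | 1e-2 |
| Steps | 1,050,000 | 1,050,000 | 2,100,000 | 258,000 |
| Training time | 36 hours | 101 hours | 370 hours | 98 hours |
| Feed-forward projection | `gated-gelu` | `gated-gelu` | `gated-gelu` | `relu` |
| Tied embeddings | `false` | `false` | `false` | `true` |
| Optimizer | adafactor | adafactor | adafactor | adafactor |
| Max sequence length | 512 | 512 | 512 | 512 |
| Per-device batch size | 16 | 16 | 8 | 16 |
| Total batch size | 128 | 128 | 64 | 128 |
| Weight decay | 1e-3 | 1e-3 | 1e-2 | 1e-3 |
| Validation split size | 15K examples | 15K examples | 15K examples | 15K examples |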

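The long training time of `it5-base-oscar` was caused by a bug in the training script.

For the full list of parameters of an individual model, refer to the config.json file in its repository.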
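A few quantities implied by the table can be derived with simple arithmetic: the number of devices (total batch size divided by per-device batch size), the number of training sequences processed (steps times total batch size), and the average throughput in steps per hour. The sketch below uses only the figures from the table; it assumes every step processes one full batch with no gradient accumulation, which the source does not state explicitly.

```python
# Figures copied from the table of model variants.
runs = {
    "it5-small": {"steps": 1_050_000, "per_device_batch": 16, "total_batch": 128, "hours": 36},
    "it5-base": {"steps": 1_050_000, "per_device_batch": 16, "total_batch": 128, "hours": 101},
    "it5-large": {"steps": 2_100_000, "per_device_batch": 8, "total_batch": 64, "hours": 370},
    "it5-base-oscar": {"steps": 258_000, "per_device_batch": 16, "total_batch": 128, "hours": 98},
}


def summarize(run):
    """Return (devices, sequences processed, average steps per hour)."""
    devices = run["total_batch"] // run["per_device_batch"]
    sequences = run["steps"] * run["total_batch"]
    steps_per_hour = run["steps"] / run["hours"]
    return devices, sequences, steps_per_hour


summary = {name: summarize(run) for name, run in runs.items()}

for name, (devices, sequences, rate) in summary.items():
    print(f"{name}: {devices} devices, {sequences:,} sequences, {rate:,.0f} steps/hour")
```

Under these assumptions, every run uses 8 devices, matching the 8 cores of a TPU v3-8. `it5-large` halves the batch size and doubles the step count, so it processes the same number of sequences as `it5-small` and `it5-base`.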

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-large")

Note: You need to fine-tune the model on a downstream sequence-to-sequence task before using it.

Flax and Tensorflow versions of the model are also available:
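As a rough illustration of what a single fine-tuning step looks like, the sketch below runs one supervised update on a source/target pair. To keep it small and runnable offline, it uses a tiny randomly initialized T5 with hypothetical sizes instead of the real checkpoint, made-up token ids instead of tokenizer output, and AdamW as an arbitrary optimizer choice. In practice you would load `gsarti/it5-large` and its tokenizer as shown above and iterate over a real dataset.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

torch.manual_seed(0)

# Tiny stand-in configuration (hypothetical sizes). It mirrors the
# gated-GELU and untied-embedding settings listed for it5-large.
config = T5Config(
    vocab_size=128,
    d_model=32,
    d_kv=8,
    d_ff=64,
    num_layers=2,
    num_heads=4,
    feed_forward_proj="gated-gelu",
    tie_word_embeddings=False,
    decoder_start_token_id=0,
)
model = T5ForConditionalGeneration(config)

# AdamW is an assumption made for brevity, not the pretraining optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Made-up token ids standing in for a tokenized (source, target) pair.
input_ids = torch.tensor([[5, 17, 42, 8, 1]])
labels = torch.tensor([[23, 9, 1]])

model.train()
outputs = model(input_ids=input_ids, labels=labels)  # labels are shifted internally
loss = outputs.loss                                  # cross-entropy over target tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"loss: {loss.item():.4f}")
```

Passing `labels` lets the model build the decoder inputs itself and return the loss directly, so the training loop only has to supply tokenized source and target sequences.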

from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-large")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-large")

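📚 Documentation

Limitations

Because IT5 models were trained on a web-scraped corpus, using them may reproduce and amplify biases already present in the data, producing potentially harmful content such as racial or gender stereotypes and conspiracist views. Studying these biases is therefore encouraged, and model usage should ideally be restricted to research-oriented projects that are not user-facing.

Model Curators

If you run into problems with the model or need an update, contact gabriele.sarti996@gmail.com.

Citation Information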

@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
    abstract = "We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.",
}