it5-base: an open-source Italian language model based on the T5 architecture

It5 Base

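Developed by gsarti

IT5 is the first attempt at large-scale pretraining of a sequence-to-sequence Transformer model for Italian, based on the T5 architecture.

Category: Large language model. License: Apache-2.0. Tags: #ItalianGeneration #SequenceToSequence #LargeScalePretraining

Downloads: 389

Released: 3/2/2022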

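Model overview

This is the base version of the Italian text-to-text model. It targets Italian language understanding and generation tasks and must be fine-tuned on a downstream task before it can be used.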

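Model features

Italian-specific pretraining

The first sequence-to-sequence Transformer model pretrained at scale specifically for Italian.

Based on the improved T5 architecture

Uses the improved google/t5-v1_1-base configuration, with a gated-GELU activation in the feed-forward layers.

Large-scale training data

Trained on a cleaned version of the Italian mC4 corpus (about 41 billion words).

Multi-framework support

Available in PyTorch, Flax, and TensorFlow versions.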
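
The gated-GELU feed-forward block mentioned above can be sketched in plain Python as follows. This is a minimal illustration, not the model's actual code: it assumes the tanh approximation of GELU and bias-free projections, and the names `wi_0`, `wi_1`, and `wo` are chosen here for illustration. A ReLU variant is included for comparison.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of the GELU activation (assumed variant)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_gelu_ffn(x, wi_0, wi_1, wo):
    """Gated-GELU feed-forward: wo @ (gelu(wi_0 @ x) * (wi_1 @ x))."""
    gate = [gelu_tanh(v) for v in matvec(wi_0, x)]   # non-linear branch
    linear = matvec(wi_1, x)                         # linear branch
    hidden = [g * l for g, l in zip(gate, linear)]   # element-wise gating
    return matvec(wo, hidden)


def relu_ffn(x, wi, wo):
    """Plain ReLU feed-forward for comparison: wo @ relu(wi @ x)."""
    hidden = [max(0.0, v) for v in matvec(wi, x)]
    return matvec(wo, hidden)
```

The gated variant uses two input projections instead of one, so it carries more parameters in each feed-forward layer than the ReLU variant at the same hidden size.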

Model capabilities

Italian text understanding

Italian text generation

Sequence-to-sequence transformation

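Use cases

Text generation

News summarization

Automatic summarization of Italian news articles

Requires fine-tuning before use

Text transformation

Paraphrasing

Rewriting and simplification of Italian text

Requires fine-tuning before use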

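🚀 Italian T5 Base 🇮🇹

The Italian T5 (IT5) model family is the first attempt at large-scale pretraining of sequence-to-sequence Transformer models for Italian, following the approach of the original T5 model. The models are intended as a starting point for Italian natural language processing tasks such as text generation and understanding.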

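🚀 Quick start

Model variants

This repository contains the checkpoint for the base version of the model. It was trained for one epoch (1.05M steps) on the thoroughly cleaned Italian mC4 corpus (about 41 billion words, about 275 GB) using 🤗 Datasets and the improved google/t5-v1_1-base configuration. Another version trained on the OSCAR corpus is available under the name gsarti/it5-base-oscar. The training procedure is available on GitHub.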
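
To make the pretraining setup more concrete, here is a minimal sketch of T5-style span corruption, the denoising objective used by the original T5. It assumes that IT5 follows the same objective and uses the same `<extra_id_N>` sentinel naming, neither of which is stated on this page. It also works on whitespace-separated words for readability, whereas a real pipeline operates on tokenizer output. The function name `span_corrupt` is hypothetical.

```python
def span_corrupt(words, spans):
    """Build an (input, target) pair by masking word spans with sentinels.

    `spans` is a sorted list of non-overlapping (start, end) word-index
    ranges, with `end` exclusive. Each masked span is replaced by one
    sentinel in the input; the target lists each sentinel followed by the
    words it hides, and ends with one final sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(words[cursor:start])   # keep the unmasked words
        inputs.append(sentinel)              # mark the masked span
        targets.append(sentinel)
        targets.extend(words[start:end])     # the words to reconstruct
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


source, target = span_corrupt("Il gatto dorme sul divano rosso".split(), [(1, 2), (4, 6)])
# source == "Il <extra_id_0> dorme sul <extra_id_1>"
# target == "<extra_id_0> gatto <extra_id_1> divano rosso <extra_id_2>"
```

A checkpoint trained only on an objective of this kind learns to fill in masked spans, which is why it has to be fine-tuned before it can summarize or paraphrase.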

The following table summarizes the parameters of all available models:

Attribute	`it5-small`	`it5-base` (this model)	`it5-large`	`it5-base-oscar`
Dataset	`gsarti/clean_mc4_it`	`gsarti/clean_mc4_it`	`gsarti/clean_mc4_it`	`oscar/unshuffled_deduplicated_it`
Architecture	`google/t5-v1_1-small`	`google/t5-v1_1-base`	`google/t5-v1_1-large`	`t5-base`
Learning rate	5e-3	5e-3	5e-3	1e-2
Steps	1,050,000	1,050,000	2,100,000	258,000
Training time	36 hours	101 hours	370 hours	98 hours
Feed-forward projection	`gated-gelu`	`gated-gelu`	`gated-gelu`	`relu`
Tied embeddings	`false`	`false`	`false`	`true`
Optimizer	adafactor	adafactor	adafactor	adafactor
Max sequence length	512	512	512	512
Per-device batch size	16	16	8	16
Total batch size	128	128	64	128
Weight decay	1e-3	1e-3	1e-2	1e-3
Validation split size	15,000 examples	15,000 examples	15,000 examples	15,000 examples

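The comparatively long training time of it5-base-oscar was caused by a bug in the training script. For the full parameter list of an individual model, refer to the config.json file in its repository.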
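
The figures in the table can be cross-checked with some simple arithmetic. The sketch below derives the number of sequences processed, an upper bound on tokens seen, and the average throughput for each run. The `Run` class and its property names are hypothetical. The token count is only an upper bound, because it assumes every sequence fills the 512-token maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Training figures copied from the table above."""
    steps: int
    per_device_batch: int
    total_batch: int
    hours: int
    max_seq_len: int = 512

    @property
    def batch_ratio(self) -> int:
        # Total / per-device batch size. This equals the device count only
        # if no gradient accumulation was used (an assumption).
        return self.total_batch // self.per_device_batch

    @property
    def sequences_seen(self) -> int:
        return self.steps * self.total_batch

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: assumes every sequence is exactly max_seq_len long.
        return self.sequences_seen * self.max_seq_len

    @property
    def steps_per_hour(self) -> float:
        return self.steps / self.hours


RUNS = {
    "it5-small": Run(1_050_000, 16, 128, 36),
    "it5-base": Run(1_050_000, 16, 128, 101),
    "it5-large": Run(2_100_000, 8, 64, 370),
    "it5-base-oscar": Run(258_000, 16, 128, 98),
}

for name, run in RUNS.items():
    print(f"{name}: {run.sequences_seen:,} sequences, {run.steps_per_hour:,.0f} steps/hour")
```

The three mC4 models each process 134,400,000 sequences: it5-large doubles the step count but halves the total batch size. The it5-base-oscar run averages roughly 2,600 steps per hour against roughly 10,400 for it5-base, which is consistent with the note about the training-script bug.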

Using the model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-base")

⚠️ Important

The model must be fine-tuned on a downstream sequence-to-sequence task before it can be used. See the examples here.
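
As a rough illustration of the first step of such fine-tuning, the sketch below turns a batch of source/target text pairs into model inputs with labels. It is not the author's training code. The column names `source` and `target`, the function name `preprocess_batch`, and the 128-token target limit are assumptions; the 512-token source limit matches the maximum sequence length in the table above.

```python
def preprocess_batch(batch, tokenizer, max_source_length=512, max_target_length=128):
    """Tokenize source texts as inputs and target texts as labels.

    `batch` is a dict with two lists of strings under the hypothetical
    keys "source" and "target". `tokenizer` is expected to behave like a
    Hugging Face tokenizer, e.g. AutoTokenizer.from_pretrained("gsarti/it5-base").
    """
    model_inputs = tokenizer(
        batch["source"], max_length=max_source_length, truncation=True
    )
    # text_target tokenizes the strings as decoder targets.
    labels = tokenizer(
        text_target=batch["target"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With 🤗 Datasets, a function of this shape would typically be applied through `dataset.map(preprocess_batch, batched=True, fn_kwargs={"tokenizer": tokenizer})`, and the result passed to a sequence-to-sequence trainer together with the model.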

Flax and TensorFlow versions of the model are also available:

from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-base")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-base")

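🔧 Limitations

Because the IT5 models were trained on web-scraped corpora, using them may reproduce and amplify biases already present in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. For this reason, research into these biases is encouraged, and use of the models should ideally be restricted to research-oriented projects that are not directly user-facing.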

📄 License

This model is released under the Apache 2.0 license.

🛠️ Model maintainer

If you run into issues with this model or need an update, contact gabriele.sarti996@gmail.com.

📚 Citation

@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
    abstract = "We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.",
}