T0pp Open-Source Model: Zero-Shot Task Generalization on English Prompts, Outperforming GPT-3 on Many Tasks at a Smaller Size

T0pp

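Developed by bigscience

T0pp is an 11-billion-parameter encoder-decoder model based on the T5 architecture. It shows strong zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks while being smaller.

Large Language Model

Transformers

Language: English | License: Apache-2.0 | #zero-shot prompt learning #multitask NLP generalization #English natural language inference

Downloads: 7,426

Release date: 3/2/2022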

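Model Overview

The T0 model family achieves strong zero-shot task generalization through multitask mixture training, and can carry out a wide range of NLP tasks directly from natural language descriptions.

Model Features

Zero-shot task generalization

Performs unseen tasks directly from natural language prompts, with no task-specific fine-tuning

Multitask training

Trained on 60+ NLP datasets covering question answering, classification, generation, and other task types

Efficient architecture

Achieves performance comparable to GPT-3 with a model 16x smaller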

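Model Capabilities

Text classification

Question answering

Text generation

Coreference resolution

Logical reasoning

Sentiment analysis

Paraphrase identification

Semantic similarity judgment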

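Use Cases

Customer service

Review sentiment analysis

Automatically determines the sentiment of a user review

Example input: 'this is the best cast iron skillet you will ever buy' → output: 'Positive'

Education

Logic puzzle solving

Solves logical ordering problems described in text

Example input: constraints on how books are arranged on a shelf → output: the correct order

Content analysis

Coreference resolution

Identifies what a pronoun in the text refers to

Example input: 'Obama nominated Hillary... He chose her...' → output: 'Hillary Clinton'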

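🚀 T0 Model Family

T0* models show zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks while being 16x smaller. T0* is a family of encoder-decoder models trained on a large set of different tasks specified in natural language prompts, and it can handle many unseen tasks specified in natural language.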

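🚀 Quick Start

You can run inference on a task by specifying your query in natural language, and the model will generate a prediction. For example, you can ask “Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy”, and the model will hopefully generate “Positive”.

Here are some other examples you can try: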

“A is the son of B's uncle. What is the family relationship between A and B?”
“Question A: How is air traffic controlled? Question B: How do you become an air traffic controller? Pick one: these questions are duplicates or not duplicates.”
“Is the word 'table' used in the same meaning in the two following sentences? Sentence A: you can leave the books on the table over there. Sentence B: the tables in this book are very hard to read.”
“Max: Know any good websites to buy clothes from? Payton: Sure :) LINK 1, LINK 2, LINK 3 Max: That's a lot of them! Payton: Yeah, but they have different things so I usually buy things from 2 or 3 of them. Max: I'll check them out. Thanks. Who or what are Payton and Max referring to when they say 'them'?”
“On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book. The red book is to the right of the gray book. The black book is to the left of the blue book. The blue book is to the left of the gray book. The purple book is the second from the right. Which book is the leftmost book?”
“Reorder the words in this sentence: justin and name bieber years is my am I 27 old.”

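✨ Key Features

Zero-shot task generalization: given English natural language prompts, T0* models can run inference on completely unseen tasks, outperforming GPT-3 on many of them while being 16x smaller.
Multitask training: the models are trained on a large set of different tasks specified in natural language prompts, covering many kinds of NLP tasks.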

📦 Installation

The documentation does not give specific installation steps. See the official repository, bigscience-workshop/t-zero, for setup instructions.

💻 Usage Examples

Basic usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the T0pp checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

# Encode the natural language prompt, generate a prediction, and decode it to text
inputs = tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

To use a different checkpoint, replace the path passed to AutoTokenizer and AutoModelForSeq2SeqLM.

⚠️ Important Note

The model was trained with bf16 activations, so running inference in fp16 is strongly discouraged. Prefer fp32 or bf16.
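
The card does not say why fp16 is discouraged. One commonly cited reason, offered here as an assumption rather than the authors' explanation, is dynamic range: bf16 keeps the 8 exponent bits of fp32, while fp16 has only 5, so activation values that were representable during bf16 training may overflow in fp16. The sketch below derives the largest finite value of each format from its bit layout; the helper name `max_finite` is hypothetical.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent pattern is reserved for inf/NaN, hence the "- 2".
    max_exponent = (2 ** exponent_bits - 2) - bias
    max_mantissa = 2 - 2 ** (-mantissa_bits)
    return max_mantissa * 2.0 ** max_exponent


FP16_MAX = max_finite(exponent_bits=5, mantissa_bits=10)
BF16_MAX = max_finite(exponent_bits=8, mantissa_bits=7)
FP32_MAX = max_finite(exponent_bits=8, mantissa_bits=23)

print(f"fp16 max: {FP16_MAX:.6g}")
print(f"bf16 max: {BF16_MAX:.6g}")
print(f"fp32 max: {FP32_MAX:.6g}")
```

bf16 trades mantissa precision for range, so its largest value is close to that of fp32, whereas fp16 tops out at 65504.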

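📚 Detailed Documentation

Model Description

T0* models are a family of encoder-decoder models based on the [T5](https://huggingface.co/google/t5-v1_1-large) pre-trained language model and fine-tuned on a large set of different tasks specified in natural language prompts. The input text is fed to the encoder and the target text is produced by the decoder. The model is fine-tuned with standard maximum likelihood training to generate the target autoregressively.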
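
Written out, the maximum likelihood objective described above takes the usual sequence-to-sequence form. The symbols are notation introduced here for illustration and do not come from the card: $x$ is the prompted input, $y = (y_1, \dots, y_T)$ are the target tokens, and $\theta$ are the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```

The encoder reads $x$ once, and the decoder predicts each $y_t$ conditioned on the encoder output and the previously generated target tokens.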

Model Parameters

Model	Number of parameters
T0	11 billion
T0p	11 billion
T0pp	11 billion
T0_single_prompt	11 billion
T0_original_task_only	11 billion
T0_3B	3 billion

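Training Procedure

Pre-trained model: T0* models are based on [T5](https://huggingface.co/google/t5-v1_1-large), a language model pre-trained on C4 with a masked language modeling objective. Training starts from the publicly available [language-model-adapted T5 checkpoints](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k), which were produced by training T5 for an additional 100,000 steps with a standard language modeling objective.
Fine-tuning details:
- Fine-tuning steps: 12,200
- Input sequence length: 1024
- Target sequence length: 256
- Batch size: 1,024 sequences
- Optimizer: Adafactor
- Learning rate: 1e-3
- Dropout: 0.1
- Sampling strategy: proportional to the number of examples in each dataset (any dataset with more than 500,000 examples is treated as having 500,000/num_templates examples)
- Example grouping: packing is used to combine multiple training examples into a single sequence to reach the maximum sequence length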
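
The last two items can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' training code: the function names, the toy dataset sizes, and the greedy packing order are all hypothetical, and a real pipeline would also need separators or attention masks between packed examples.

```python
from typing import Dict, List, Tuple

CAP = 500_000  # size threshold stated in the fine-tuning details


def sampling_weights(datasets: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map dataset name -> sampling probability.

    `datasets` maps each name to (num_examples, num_templates).
    """
    effective = {}
    for name, (num_examples, num_templates) in datasets.items():
        if num_examples > CAP:
            # Large datasets are treated as having CAP / num_templates examples.
            effective[name] = CAP / num_templates
        else:
            effective[name] = float(num_examples)
    total = sum(effective.values())
    return {name: size / total for name, size in effective.items()}


def pack_examples(examples: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily concatenate token-id lists into sequences of at most max_len."""
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        tokens = tokens[:max_len]  # truncate anything longer than one sequence
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)  # current sequence is full, start a new one
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed


# Toy inputs, made up for illustration.
weights = sampling_weights({"small_set": (100_000, 5), "large_set": (2_000_000, 10)})
packed = pack_examples([[1, 2], [3, 4, 5], [6], [7, 8, 9, 10]], max_len=5)
print(weights)
print(packed)
```

In the toy example the larger dataset ends up with the smaller sampling weight, because the cap divides 500,000 by its number of templates.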

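Training Data

The T0 variants were trained on different mixtures of datasets:

Model	Training datasets
T0	Multiple-choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop; Extractive QA: Adversarial QA, Quoref, DuoRC, ROPES; Closed-book QA: Hotpot QA*, Wiki QA; Structure-to-text: Common Gen, Wiki Bio; Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp; Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum; Topic classification: AG News, DBPedia, TREC; Paraphrase identification: MRPC, PAWS, QQP
T0p	Same as T0, plus datasets from the GPT-3 evaluation suite: Multiple-choice QA: ARC, OpenBook QA, PiQA, RACE, HellaSwag; Extractive QA: SQuAD v2; Closed-book QA: Trivia QA, Web Questions
T0pp	Same as T0p, plus some datasets from SuperGLUE (excluding the NLI sets): BoolQ, COPA, MultiRC, ReCoRD, WiC, WSC
T0_single_prompt	Same as T0, but with only one prompt per training dataset
T0_original_task_only	Same as T0, but with only the original-task templates
T0_3B	Same as T0, but starting from the T5-LM XL (3 billion parameters) pre-trained model

For reproducibility, we release the data used for training (and evaluation) in the P3 dataset. Prompt examples can be found on the dataset page.

*: We recast Hotpot QA as a closed-book QA task because of its long input sequence length.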

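Evaluation Data

We evaluate the models on a set of held-out tasks:

Task category	Datasets
Natural language inference	ANLI, CB, RTE
Coreference resolution	WSC, Winogrande
Word sense disambiguation	WiC
Sentence completion	COPA, HellaSwag, Story Cloze

We also evaluate T0, T0p, and T0pp on a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):

Code description task
Conceptual combinations
Hindu knowledge json
Known unknowns
Language identification
Logic grid puzzle task
Logical deduction
Common misconceptions
Movie dialog same or different
Novel concepts
Strategyqa
Formal fallacies syllogisms negation
VitaminC
Winowhy multiple choice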

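Limitations

High compute requirements: T0* models are large (3 billion or 11 billion parameters), so loading them and running inference takes considerable compute. With multiple GPUs, you can use .parallelize().
Prompt sensitivity: different prompts can lead to different performance, and further research is needed on how effective different prompts are for a language model.
Limited task coverage: because of design choices in the tokenization, the models cannot run inference on tasks involving code or non-English text.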
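
To give a rough sense of the compute involved, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below uses the rounded parameter counts from the model table and ignores activations, the generation cache, and framework overhead, so treat the results as a lower bound. The helper name `weight_memory_gb` is hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("T0_3B", 3e9), ("T0pp", 11e9)]:
    for dtype in ("fp32", "bf16"):
        print(f"{name} in {dtype}: ~{weight_memory_gb(params, dtype):.0f} GB")
```

By this estimate the 11-billion-parameter models need roughly 44 GB in fp32 and 22 GB in bf16 for weights alone, which is why multi-GPU loading comes up.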

Bias and Fairness

Although datasets likely to contain harmful content were deliberately excluded from fine-tuning, the trained models are not free of bias. Based on a few experiments, T0++ can generate answers that could be categorized as conspiracist, biased, offensive, or over-emphasizing sexual topics.

We evaluate the models in two ways: first, their ability to recognize or label gender bias, and second, the extent to which they reproduce those biases.

Recognizing gender bias: we use WinoGender Schemas (also called AX-g under SuperGLUE) and CrowS-Pairs to evaluate the models' ability to recognize gender bias.

CrowS-Pairs:

| Model | Mean accuracy | Median accuracy |
| ---- | ---- | ---- |
| T0 | 59.2 | 83.8 |
| T0p | 57.6 | 83.8 |
| T0pp | 62.7 | 64.4 |
| T0_single_prompt | 57.6 | 69.5 |
| T0_original_task_only | 47.1 | 37.8 |
| T0_3B | 56.9 | 82.6 |

WinoGender:

| Model | Mean accuracy | Median accuracy |
| ---- | ---- | ---- |
| T0 | 84.2 | 84.3 |
| T0p | 80.1 | 80.6 |
| T0pp | 89.2 | 90.0 |
| T0_single_prompt | 81.6 | 84.6 |
| T0_original_task_only | 83.7 | 83.8 |
| T0_3B | 69.7 | 69.4 |

Reproducing gender bias: we use WinoBias Schemas to evaluate the extent to which the models reproduce gender bias. WinoBias Schemas come in two types (type1 and type2), each split into a pro-stereotype and an anti-stereotype subset.

| Model | Subset | Mean accuracy (pro-stereotype) | Mean accuracy (anti-stereotype) | Difference | Median accuracy (pro-stereotype) | Median accuracy (anti-stereotype) | Difference |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| T0 | Type 1 | 68.0 | 61.9 | 6.0 | 71.7 | 61.9 | 9.8 |
| T0 | Type 2 | 79.3 | 76.4 | 2.8 | 79.3 | 75.0 | 4.3 |
| T0p | Type 1 | 66.6 | 57.2 | 9.4 | 71.5 | 62.6 | 8.8 |
| T0p | Type 2 | 77.7 | 73.4 | 4.3 | 86.1 | 81.3 | 4.8 |
| T0pp | Type 1 | 63.8 | 55.9 | 7.9 | 72.7 | 63.4 | 9.3 |
| T0pp | Type 2 | 66.8 | 63.0 | 3.9 | 79.3 | 74.0 | 5.3 |
| T0_single_prompt | Type 1 | 73.7 | 60.5 | 13.2 | 79.3 | 60.6 | 18.7 |
| T0_single_prompt | Type 2 | 77.7 | 69.6 | 8.0 | 80.8 | 69.7 | 11.1 |
| T0_original_task_only | Type 1 | 78.1 | 67.7 | 10.4 | 81.8 | 67.2 | 14.6 |
| T0_original_task_only | Type 2 | 85.2 | 82.3 | 2.9 | 89.6 | 85.4 | 4.3 |
| T0_3B | Type 1 | 82.3 | 70.1 | 12.2 | 83.6 | 62.9 | 20.7 |
| T0_3B | Type 2 | 83.8 | 76.5 | 7.3 | 85.9 | 75 | 10.9 |

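🔧 Technical Details

Model architecture: Transformer-based encoder-decoder.
Training objective: standard maximum likelihood training, generating the target text autoregressively.
Data processing: a large set of English supervised datasets is converted into prompts, with multiple differently worded templates per dataset.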
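
The data-processing step can be pictured as applying several differently worded templates to the same labeled example. The sketch below is illustrative only: the template strings, the label words, and the field names are made up for this example and are not the actual P3 templates (the first template mirrors the review prompt from the quick start).

```python
from typing import Any, Dict, List, Tuple

# Hypothetical templates for a sentiment dataset; not the real P3 prompts.
TEMPLATES: List[str] = [
    "Is this review positive or negative? Review: {text}",
    "Review: {text}\nWhat is the sentiment of this review, positive or negative?",
    "{text}\nDid the reviewer like the product? Answer positive or negative.",
]
LABEL_WORDS = {1: "Positive", 0: "Negative"}


def to_prompted_pairs(example: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Turn one labeled example into one (input, target) pair per template."""
    target = LABEL_WORDS[int(example["label"])]
    return [(template.format(text=example["text"]), target) for template in TEMPLATES]


pairs = to_prompted_pairs(
    {"text": "this is the best cast iron skillet you will ever buy", "label": 1}
)
for source, target in pairs:
    print(repr(source), "->", target)
```

Each labeled example yields several training pairs that share a target but differ in wording, which is the variety the multi-template setup is meant to provide.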

📄 License

This project is released under the Apache 2.0 license.

📖 BibTeX Citation

@misc{sanh2021multitask,
      title={Multitask Prompted Training Enables Zero-Shot Task Generalization},
      author={Victor Sanh and Albert Webson and Colin Raffel and Stephen H. Bach and Lintang Sutawika and Zaid Alyafeai and Antoine Chaffin and Arnaud Stiegler and Teven Le Scao and Arun Raja and Manan Dey and M Saiful Bari and Canwen Xu and Urmish Thakker and Shanya Sharma Sharma and Eliza Szczechla and Taewoon Kim and Gunjan Chhablani and Nihal Nayak and Debajyoti Datta and Jonathan Chang and Mike Tian-Jian Jiang and Han Wang and Matteo Manica and Sheng Shen and Zheng Xin Yong and Harshit Pandey and Rachel Bawden and Thomas Wang and Trishala Neeraj and Jos Rozen and Abheesht Sharma and Andrea Santilli and Thibault Fevry and Jason Alan Fries and Ryan Teehan and Stella Biderman and Leo Gao and Tali Bers and Thomas Wolf and Alexander M. Rush},
      year={2021},
      eprint={2110.08207},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}