T0pp open-source model: English zero-shot task generalization that outperforms GPT-3 on many tasks at a sixteenth of the size

T0pp

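Developed by bigscience

T0pp is an 11-billion-parameter encoder-decoder model built on the T5 architecture. It shows strong zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks while being 16x smaller.

Large language model

Transformers

Language: English · License: Apache-2.0 · #zero-shot prompt learning #multitask NLP generalization #English natural language inference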

Released: 3/2/2022

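Model overview

The T0 model family achieves strong zero-shot task generalization by training on a multitask mixture, and can carry out a wide range of NLP tasks directly from a natural language description.

Model features

Zero-shot task generalization

Performs unseen tasks directly from natural language prompts, with no task-specific fine-tuning

Multitask training

Trained on dozens of NLP datasets covering question answering, classification, generation, and other task types

Efficient architecture

Matches or exceeds GPT-3 on many tasks while being 16x smaller

Model capabilities

Text classification

Question answering

Text generation

Coreference resolution

Logical reasoning

Sentiment analysis

Paraphrase identification

Semantic similarity judgment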

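Use cases

Customer service

Review sentiment analysis

Automatically determines the sentiment of a user review

Example input: 'Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy' → Output: 'Positive'

Education

Logic puzzle solving

Solves logical ordering problems described in text

Example input: ordering constraints for books on a shelf → Output: the correct order

Content analysis

Coreference resolution

Identifies what a pronoun in the text refers to

Example input: 'Obama nominated Hillary ... he chose her ...' → Output: 'Hillary Clinton'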

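🚀 T0 model family

T0* models show zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks while being 16x smaller. They are a series of encoder-decoder models trained on a large set of different tasks specified in natural language prompts, and they can handle many unseen tasks specified in natural language.

🚀 Quick start

You can run inference with these models by specifying your query in natural language; the model then generates a prediction. For example, you can ask “Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy”, and the model should generate “Positive”.

Here are some examples you can try: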

“A is the son of B's uncle. What is the family relationship between A and B?”
“Question A: How is air traffic controlled? Question B: How do you become an air traffic controller? Pick one: these questions are duplicates or not duplicates.”
“Is the word 'table' used in the same meaning in the two following sentences? Sentence A: you can leave the books on the table over there. Sentence B: the tables in this book are very hard to read.”
“Max: Know any good websites to buy clothes from? Payton: Sure :) LINK 1, LINK 2, LINK 3 Max: That's a lot of them! Payton: Yeah, but they have different things so I usually buy things from 2 or 3 of them. Max: I'll check them out. Thanks. Who or what are Payton and Max referring to when they say 'them'?”
“On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book. The red book is to the right of the gray book. The black book is to the left of the blue book. The blue book is to the left of the gray book. The purple book is the second from the right. Which book is the leftmost book?”
“Reorder the words in this sentence: justin and name bieber years is my am I 27 old.”
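The example prompts above are plain strings, so they can be built from small templates and sent to the model in one batch. The following is a minimal sketch, not the authors' tooling: the helper names `sentiment_prompt`, `duplicate_prompt`, and `generate_answers` are hypothetical, and the default checkpoint and `max_new_tokens` value are assumptions chosen for illustration.

```python
def sentiment_prompt(review: str) -> str:
    """Wrap a review in the sentiment question used in the example above."""
    return f"Is this review positive or negative? Review: {review}"


def duplicate_prompt(question_a: str, question_b: str) -> str:
    """Build the duplicate-question prompt from the example list."""
    return (
        f"Question A: {question_a} Question B: {question_b} "
        "Pick one: these questions are duplicates or not duplicates."
    )


def generate_answers(prompts, checkpoint="bigscience/T0_3B", max_new_tokens=20):
    """Run a batch of prompts through a T0* checkpoint (downloads the weights).

    The checkpoint default and max_new_tokens are illustrative assumptions.
    """
    # Imported lazily so the prompt helpers work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    outputs = model.generate(**batch, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    print(sentiment_prompt("this is the best cast iron skillet you will ever buy"))
```

Calling `generate_answers([...])` with a list of such prompts returns one decoded string per prompt. Different phrasings of the same task can give different results, so it may be worth trying more than one template.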

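✨ Key features

Zero-shot task generalization: given English natural language prompts, T0* models can run inference on completely unseen tasks, outperform GPT-3 on many of them, and are much smaller.
Multitask training: trained on a large set of different tasks specified in natural language prompts, covering many kinds of NLP tasks.

📦 Installation

The documentation gives no specific installation steps; see the official repository bigscience-workshop/t-zero for setup instructions.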

💻 Usage examples

Basic usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the 11B-parameter T0pp checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

# Encode the natural language prompt, generate a prediction, and decode it
inputs = tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

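To use another checkpoint, replace the path passed to AutoTokenizer and AutoModelForSeq2SeqLM.

⚠️ Important note

The model was trained with bf16 activations, so running inference in fp16 is strongly discouraged; use fp32 or bf16 instead.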
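One way to follow this advice is to pick the precision explicitly when loading the model. The sketch below is an illustration under stated assumptions: `check_inference_dtype` and `load_t0` are hypothetical helper names, and it relies on the `torch_dtype` argument of `from_pretrained`, which newer transformers releases may expose under a different name.

```python
ALLOWED_DTYPES = ("float32", "bfloat16")


def check_inference_dtype(dtype_name: str) -> str:
    """Return dtype_name if it is one of the recommended precisions, else raise."""
    if dtype_name == "float16":
        raise ValueError("fp16 inference is discouraged for T0*; use float32 or bfloat16")
    if dtype_name not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype_name!r}")
    return dtype_name


def load_t0(checkpoint: str = "bigscience/T0pp", dtype_name: str = "bfloat16"):
    """Load tokenizer and model in the requested precision (downloads the weights)."""
    # Imported lazily so the dtype check above works without torch installed.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    dtype = getattr(torch, check_inference_dtype(dtype_name))
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, torch_dtype=dtype)
    return tokenizer, model
```

For example, `load_t0("bigscience/T0_3B", "bfloat16")` would load the 3-billion-parameter variant in bf16, while passing `"float16"` fails early instead of producing unreliable outputs.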

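📚 Detailed documentation

Model description

T0* models are a series of encoder-decoder models based on the [T5](https://huggingface.co/google/t5-v1_1-large) pretrained language model and fine-tuned on a large set of different tasks specified in natural language prompts. The input text is fed to the encoder and the target text is produced by the decoder. Fine-tuning uses standard maximum likelihood training, with the model generating the target text autoregressively.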
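Written out, standard maximum likelihood training for a sequence-to-sequence model minimizes the negative log-likelihood of the target tokens. The symbols below are notation introduced here for illustration, not taken from the original documentation: $x$ is the prompted input text given to the encoder, $y_1, \dots, y_T$ are the target tokens, and $\theta$ are the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```

Each term conditions on the input and on the target tokens generated so far, which is what "autoregressive" means in this context.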

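Model parameters

| Model | Number of parameters |
| ---- | ---- |
| T0 | 11 billion |
| T0p | 11 billion |
| T0pp | 11 billion |
| T0_single_prompt | 11 billion |
| T0_original_task_only | 11 billion |
| T0_3B | 3 billion |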
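As a rough guide to what these parameter counts mean in practice, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below is back-of-the-envelope arithmetic using the rounded counts from the table; the function name is hypothetical, and the estimate ignores activations and framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Rounded parameter counts from the table above.
PARAM_COUNTS = {"T0pp": 11_000_000_000, "T0_3B": 3_000_000_000}


def weight_memory_gb(num_params: int, dtype_name: str) -> float:
    """Approximate size of the weights in GB (10**9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1e9


for name, count in PARAM_COUNTS.items():
    for dtype_name in BYTES_PER_PARAM:
        print(f"{name} in {dtype_name}: ~{weight_memory_gb(count, dtype_name):.0f} GB")
```

Under these assumptions an 11-billion-parameter model needs roughly 44 GB in fp32 or 22 GB in bf16 for the weights alone, and the 3-billion-parameter variant roughly 12 GB or 6 GB.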

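Training procedure

Pretrained model: T0* models are based on the [T5](https://huggingface.co/google/t5-v1_1-large) pretrained language model, which was pretrained on C4 with a masked language modeling-style objective. Training starts from the publicly available [language-model-adapted T5 checkpoints](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k), which were produced by training T5 for an additional 100,000 steps with a standard language modeling objective.
Fine-tuning details:
- Fine-tuning steps: 12,200
- Input sequence length: 1024
- Target sequence length: 256
- Batch size: 1,024 sequences
- Optimizer: Adafactor
- Learning rate: 1e-3
- Dropout: 0.1
- Sampling strategy: proportional to the number of examples in each dataset (any dataset with more than 500,000 examples is treated as having 500,000/num_templates examples)
- Example grouping: packing is used to combine multiple training examples into a single sequence to reach the maximum sequence length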
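The sampling and packing rules in the list above can be sketched in a few lines. This is an illustrative reading of the description, not the authors' training code: the function names, the dataset names and sizes in the example, and the greedy in-order packing strategy are all assumptions.

```python
def effective_size(num_examples: int, num_templates: int, cap: int = 500_000) -> float:
    """Dataset size used for sampling: datasets above the cap count as cap / num_templates."""
    if num_examples > cap:
        return cap / num_templates
    return float(num_examples)


def sampling_probabilities(datasets: dict) -> dict:
    """Map dataset name -> sampling probability, proportional to effective size.

    `datasets` maps a name to a (num_examples, num_templates) pair.
    """
    sizes = {name: effective_size(n, t) for name, (n, t) in datasets.items()}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


def pack_examples(lengths, max_len: int = 1024):
    """Greedily group example indices, in order, so each pack holds at most max_len tokens.

    Examples longer than max_len are counted as max_len (assumed to be truncated).
    """
    packs, current, used = [], [], 0
    for index, length in enumerate(lengths):
        length = min(length, max_len)
        if current and used + length > max_len:
            packs.append(current)
            current, used = [], 0
        current.append(index)
        used += length
    if current:
        packs.append(current)
    return packs


if __name__ == "__main__":
    # Hypothetical datasets: (number of examples, number of prompt templates).
    print(sampling_probabilities({"small": (100_000, 5), "large": (1_000_000, 10)}))
    print(pack_examples([600, 500, 400, 1024, 10]))
```

In the hypothetical example, the dataset with 1,000,000 examples and 10 templates is treated as having 50,000 examples, so it is sampled half as often as the 100,000-example dataset despite being ten times larger.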

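Training data

The T0 variants were trained on different mixtures of datasets:

| Model | Training datasets |
| ---- | ---- |
| T0 | Multiple-choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop; Extractive QA: Adversarial QA, Quoref, DuoRC, ROPES; Closed-book QA: Hotpot QA*, Wiki QA; Structure-to-text: Common Gen, Wiki Bio; Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp; Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum; Topic classification: AG News, DBPedia, TREC; Paraphrase identification: MRPC, PAWS, QQP |
| T0p | Same as T0, plus datasets from GPT-3's evaluation suite. Multiple-choice QA: ARC, OpenBook QA, PiQA, RACE, HellaSwag; Extractive QA: SQuAD v2; Closed-book QA: Trivia QA, Web Questions |
| T0pp | Same as T0p, plus some datasets from SuperGLUE (excluding the NLI sets): BoolQ, COPA, MultiRC, ReCoRD, WiC, WSC |
| T0_single_prompt | Same as T0, but with only one prompt per training dataset |
| T0_original_task_only | Same as T0, but with only the original-task templates |
| T0_3B | Same as T0, but starting from a T5-LM XL (3 billion parameters) pretrained model |

For reproducibility, we release the data used for training (and evaluation) in the P3 dataset. Prompt examples can be found on the dataset page.

*: We recast Hotpot QA as a closed-book QA task because of its long input sequences.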

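Evaluation data

We evaluate the models on a set of held-out tasks:

| Task category | Datasets |
| ---- | ---- |
| Natural language inference | ANLI, CB, RTE |
| Coreference resolution | WSC, Winogrande |
| Word sense disambiguation | WiC |
| Sentence completion | COPA, HellaSwag, Story Cloze |

We also evaluate T0, T0p, and T0pp on a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):

- Code description task
- Conceptual combinations
- Hindu knowledge json
- Known unknowns
- Language identification
- Logic grid puzzle task
- Logical deduction
- Common misconceptions
- Movie dialog same or different
- Novel concepts
- Strategyqa
- Formal fallacies syllogisms negation
- VitaminC
- Winowhy multiple choice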

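Limitations

High compute requirements: T0* models are large (3 billion or 11 billion parameters), so loading them and running inference takes substantial compute. With multiple GPUs, you can use .parallelize().
Prompt sensitivity: different prompts can lead to different performance, and more research is needed on how effective different prompts are for a language model.
Limited task coverage: because of design choices in the tokenization, the models cannot run inference on tasks involving code or non-English text.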
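For the multi-GPU case, the T5 implementation in transformers accepts an optional device map for `.parallelize()`: a dictionary from GPU index to the list of Transformer block indices placed on that GPU. The sketch below builds an even split; `even_device_map` and `parallelize_t0` are hypothetical helper names, and reading the block count from `model.config.num_layers` is an assumption about the loaded T5 config. Recent transformers releases mark `.parallelize()` as deprecated, so treat this as an illustration rather than a recommended setup.

```python
def even_device_map(num_blocks: int, num_devices: int) -> dict:
    """Split block indices 0..num_blocks-1 into contiguous, near-equal chunks per device."""
    if num_devices < 1 or num_blocks < num_devices:
        raise ValueError("need at least one device and at least one block per device")
    base, extra = divmod(num_blocks, num_devices)
    device_map, start = {}, 0
    for device in range(num_devices):
        # The first `extra` devices take one additional block.
        count = base + (1 if device < extra else 0)
        device_map[device] = list(range(start, start + count))
        start += count
    return device_map


def parallelize_t0(model, num_devices: int):
    """Spread a loaded T5-style model over num_devices GPUs with an even device map."""
    # Assumption: config.num_layers is the number of blocks in each stack.
    model.parallelize(even_device_map(model.config.num_layers, num_devices))
    return model


if __name__ == "__main__":
    print(even_device_map(24, 4))
```

Calling `.parallelize()` with no argument lets the library choose its own split; an explicit map is mainly useful when the GPUs have unequal free memory and the split needs adjusting by hand.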

Bias and fairness

Although datasets likely to contain harmful content were deliberately excluded from fine-tuning, the trained models are not bias-free. Based on a few experiments, T0++ can generate answers that could be categorized as conspiracist, biased, offensive, or over-emphasizing sexual topics.

We evaluate the models in two ways: first, their ability to recognize or label gender biases, and second, the extent to which they reproduce those biases.

Recognizing gender bias: we use the WinoGender Schemas (also called AX-g under SuperGLUE) and CrowS-Pairs to measure the models' ability to recognize gender biases.

CrowS-Pairs:

| Model | Mean accuracy | Median accuracy |
| ---- | ---- | ---- |
| T0 | 59.2 | 83.8 |
| T0p | 57.6 | 83.8 |
| T0pp | 62.7 | 64.4 |
| T0_single_prompt | 57.6 | 69.5 |
| T0_original_task_only | 47.1 | 37.8 |
| T0_3B | 56.9 | 82.6 |

WinoGender:

| Model | Mean accuracy | Median accuracy |
| ---- | ---- | ---- |
| T0 | 84.2 | 84.3 |
| T0p | 80.1 | 80.6 |
| T0pp | 89.2 | 90.0 |
| T0_single_prompt | 81.6 | 84.6 |
| T0_original_task_only | 83.7 | 83.8 |
| T0_3B | 69.7 | 69.4 |

Reproducing gender bias: we use the WinoBias Schemas to measure the extent to which the models reproduce gender biases. WinoBias Schemas come in two types (type1 and type2), each split into a pro-stereotype and an anti-stereotype subset.

| Model | Subset | Mean acc. (pro-stereotype) | Mean acc. (anti-stereotype) | Mean difference (pro - anti) | Median acc. (pro-stereotype) | Median acc. (anti-stereotype) | Median difference (pro - anti) |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| T0 | Type 1 | 68.0 | 61.9 | 6.0 | 71.7 | 61.9 | 9.8 |
| T0 | Type 2 | 79.3 | 76.4 | 2.8 | 79.3 | 75.0 | 4.3 |
| T0p | Type 1 | 66.6 | 57.2 | 9.4 | 71.5 | 62.6 | 8.8 |
| T0p | Type 2 | 77.7 | 73.4 | 4.3 | 86.1 | 81.3 | 4.8 |
| T0pp | Type 1 | 63.8 | 55.9 | 7.9 | 72.7 | 63.4 | 9.3 |
| T0pp | Type 2 | 66.8 | 63.0 | 3.9 | 79.3 | 74.0 | 5.3 |
| T0_single_prompt | Type 1 | 73.7 | 60.5 | 13.2 | 79.3 | 60.6 | 18.7 |
| T0_single_prompt | Type 2 | 77.7 | 69.6 | 8.0 | 80.8 | 69.7 | 11.1 |
| T0_original_task_only | Type 1 | 78.1 | 67.7 | 10.4 | 81.8 | 67.2 | 14.6 |
| T0_original_task_only | Type 2 | 85.2 | 82.3 | 2.9 | 89.6 | 85.4 | 4.3 |
| T0_3B | Type 1 | 82.3 | 70.1 | 12.2 | 83.6 | 62.9 | 20.7 |
| T0_3B | Type 2 | 83.8 | 76.5 | 7.3 | 85.9 | 75 | 10.9 |
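The summary statistics in these tables are simple to recompute. The sketch below assumes the mean and median are taken over the accuracies of several prompts for the same dataset, which the text above does not state explicitly; the function names and the per-prompt accuracies in the example are hypothetical.

```python
from statistics import mean, median


def summarize(accuracies):
    """Mean and median of per-prompt accuracies, rounded to one decimal place."""
    return {"mean": round(mean(accuracies), 1), "median": round(median(accuracies), 1)}


def stereotype_gap(pro_accuracy: float, anti_accuracy: float) -> float:
    """Pro-stereotype minus anti-stereotype accuracy; larger means more reproduced bias."""
    return round(pro_accuracy - anti_accuracy, 1)


if __name__ == "__main__":
    # Hypothetical accuracies for three prompts on one dataset.
    print(summarize([50.0, 60.0, 90.0]))
    # T0pp, Type 1, mean accuracies from the WinoBias table.
    print(stereotype_gap(63.8, 55.9))
```

Recomputing a gap from the rounded table entries can differ from the listed difference by 0.1 (for T0, Type 1, 68.0 - 61.9 gives 6.1 while the table lists 6.0), presumably because the published differences were computed before rounding.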

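🔧 Technical details

Model architecture: Transformer-based encoder-decoder.
Training objective: standard maximum likelihood training, generating the target text autoregressively.
Data processing: a large set of English supervised datasets is converted into prompts, with multiple differently worded templates per dataset.

📄 License

This project is released under the Apache 2.0 license.

📖 BibTeX citation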

@misc{sanh2021multitask,
      title={Multitask Prompted Training Enables Zero-Shot Task Generalization},
      author={Victor Sanh and Albert Webson and Colin Raffel and Stephen H. Bach and Lintang Sutawika and Zaid Alyafeai and Antoine Chaffin and Arnaud Stiegler and Teven Le Scao and Arun Raja and Manan Dey and M Saiful Bari and Canwen Xu and Urmish Thakker and Shanya Sharma Sharma and Eliza Szczechla and Taewoon Kim and Gunjan Chhablani and Nihal Nayak and Debajyoti Datta and Jonathan Chang and Mike Tian-Jian Jiang and Han Wang and Matteo Manica and Sheng Shen and Zheng Xin Yong and Harshit Pandey and Rachel Bawden and Thomas Wang and Trishala Neeraj and Jos Rozen and Abheesht Sharma and Andrea Santilli and Thibault Fevry and Jason Alan Fries and Ryan Teehan and Stella Biderman and Leo Gao and Tali Bers and Thomas Wolf and Alexander M. Rush},
      year={2021},
      eprint={2110.08207},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}