switch-c-2048 open-source model: 1.6 trillion parameters for efficient language task processing

Switch C 2048

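Developed by Google

A Mixture-of-Experts (MoE) model trained on a masked language modeling task, with 1.6 trillion parameters. Its architecture resembles T5, but the feed-forward layers are replaced by sparse MLP layers.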

Large language model

Transformers

Language: English | License: Apache-2.0 | #trillion-parameter scale #mixture-of-experts architecture #masked language modeling

Release date: 11/4/2022

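Model overview

Switch Transformers is a text generation model scaled up with a Mixture-of-Experts architecture. On the pre-training task it scales better and trains more efficiently than a standard T5 model.

Model features

Mixture-of-Experts architecture

The feed-forward layers are replaced by sparse layers containing 2048 expert MLPs, which lets the parameter count grow efficiently.

Efficient training

Achieves a 4x training speedup over the T5-XXL model.

Large parameter count

The model has 1.6 trillion parameters and requires 3.1 TB of storage.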
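
The storage figure is roughly what 1.6 trillion parameters occupy at two bytes each. The sketch below spells out that arithmetic. The 16-bit storage assumption and the helper name `checkpoint_size_tb` are illustrative; the source states only the parameter count and the 3.1 TB total.

```python
# Back-of-the-envelope checkpoint size: parameters x bytes per parameter.
# Assumption: weights are stored in a 16-bit format (2 bytes each); the
# source does not state the storage precision.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def checkpoint_size_tb(num_params: float, dtype: str) -> float:
    """Return the approximate size in terabytes (1 TB = 1e12 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e12


if __name__ == "__main__":
    n = 1.6e12  # rounded parameter count quoted above
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {checkpoint_size_tb(n, dtype):.1f} TB")
```

With the rounded 1.6T count this gives about 3.2 TB at 16 bits, close to the quoted 3.1 TB. The small gap would be consistent with the parameter count being rounded.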

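Model capabilities

Text generation

Masked language modeling

Use cases

Text completion

Masked text generation

Generates the complete text from an input that contains mask tokens.

Example inputs and outputs show that the model fills in the missing content plausibly.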

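🚀 Switch Transformers C: 2048-expert model (3.1 TB for 1.6T parameters)

Switch Transformers is a language model built on a Mixture of Experts (MoE) architecture and trained on a Masked Language Modeling (MLM) task. It trains faster than the classic T5 model and performs better on fine-tuned tasks, which makes it practical to push language models toward the trillion-parameter scale.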

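🚀 Quick Start

Switch Transformers is a Mixture of Experts (MoE) model trained on a Masked Language Modeling (MLM) task. The architecture is similar to the classic T5, but the feed-forward layers are replaced by sparse MLP layers that contain "expert" multi-layer perceptrons (MLPs). According to the original paper, the model performs better than T5 on fine-tuned tasks while also training faster, and this advantage holds as the model is scaled up.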
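
To make the sparse MLP layer concrete, here is a minimal pure-Python sketch of top-1 ("switch") routing: a router scores every expert for each token, only the highest-scoring expert runs, and its output is scaled by the router probability. It is an illustration under simplifying assumptions, not the t5x or transformers implementation. The class names, the tiny dimensions, and the random initialization are all hypothetical, and expert capacity limits and the auxiliary load-balancing loss are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


class Expert:
    """A tiny two-layer MLP standing in for one expert feed-forward block."""

    def __init__(self, d_model, d_ff, rng):
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]

    def __call__(self, x):
        return matvec(self.w_out, relu(matvec(self.w_in, x)))


class SwitchLayer:
    """Sparse MLP layer: each token is sent to exactly one expert (top-1)."""

    def __init__(self, d_model, d_ff, num_experts, seed=0):
        rng = random.Random(seed)
        self.router = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(num_experts)]
        self.experts = [Expert(d_model, d_ff, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return (expert index, gate probability) for one token vector."""
        probs = softmax(matvec(self.router, x))
        index = max(range(len(probs)), key=probs.__getitem__)
        return index, probs[index]

    def __call__(self, tokens):
        outputs = []
        for x in tokens:
            index, gate = self.route(x)
            # Only the selected expert runs; its output is scaled by the gate.
            outputs.append([gate * v for v in self.experts[index](x)])
        return outputs


if __name__ == "__main__":
    layer = SwitchLayer(d_model=4, d_ff=8, num_experts=4)
    rng = random.Random(1)
    tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    for token, out in zip(tokens, layer(tokens)):
        print(layer.route(token), [round(v, 4) for v in out])
```

Because each token activates a single expert, the compute per token stays close to that of one dense feed-forward block even as the number of experts, and with it the parameter count, grows.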

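As stated in the first few lines of the abstract:

we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.

Disclaimer: The content of this model card was written by the Hugging Face team, and parts of it were copied and pasted from the original paper.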

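✨ Key Features

Model type: Language model
Language(s) (NLP): English
License: Apache 2.0
Related models: All FLAN-T5 checkpoints
Original checkpoints: All original FLAN-T5 checkpoints
Resources for more information: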

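📦 Installation

Note that these checkpoints were trained on a Masked Language Modeling (MLM) task, so they cannot be used directly for downstream tasks. You can look at FLAN-T5 for fine-tuned weights, or fine-tune your own MoE model by following the notebook linked from the original model card.

Below are some example scripts for using the model in transformers. Keep in mind that the model is extremely large, so you may want to use accelerate for disk offloading:

Running the model on a CPU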

# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: set this to a writable directory used for disk offloading.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048", device_map="auto", offload_folder=offload_folder
)

# Each <extra_id_N> sentinel marks a span for the model to fill in.
input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
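
The decoded string lists each sentinel followed by the span the model generated for it. As a sketch of how that output maps back onto the masked input, the helper below parses the sentinels and substitutes the spans. The function names `parse_spans` and `fill_masks` are hypothetical and are not part of transformers.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_spans(decoded: str) -> dict:
    """Map each sentinel index to the text generated after it."""
    # Drop the special tokens that wrap the generated sequence.
    body = decoded.replace("<pad>", "").replace("</s>", "")
    parts = SENTINEL.split(body)
    # re.split with one capture group yields [prefix, id, text, id, text, ...].
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def fill_masks(masked_input: str, decoded: str) -> str:
    """Substitute each sentinel in the input with its generated span."""
    spans = parse_spans(decoded)
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked_input)


if __name__ == "__main__":
    masked = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
    decoded = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"
    print(fill_masks(masked, decoded))
    # prints: A man walks into a bar a orders a beer with a pinch of salt.
```

Sentinels that appear in the output but not in the input, such as the closing `<extra_id_4>` here, are simply ignored during substitution.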

Running the model on a GPU

# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: set this to a writable directory used for disk offloading.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048", device_map="auto", offload_folder=offload_folder
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the input IDs to the first GPU (device 0).
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU with different precisions

BF16

# pip install accelerate
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: set this to a writable directory used for disk offloading.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
# Load the weights in bfloat16 to halve memory use relative to float32.
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    offload_folder=offload_folder,
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

INT8

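# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: set this to a writable directory used for disk offloading.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
# load_in_8bit=True asks bitsandbytes to quantize the weights to INT8; the
# original snippet omitted it. Newer transformers releases prefer passing
# quantization_config=BitsAndBytesConfig(load_in_8bit=True) instead.
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048",
    device_map="auto",
    load_in_8bit=True,
    offload_folder=offload_folder,
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>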
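
All of the examples above pass an offload folder, and with a checkpoint this large it can be worth checking free disk space before starting a load. The following is a minimal preflight sketch using only the standard library. The helper name `prepare_offload_folder`, the example path, and the 3.1 TB budget are assumptions for illustration; how much space accelerate actually needs depends on how many weights end up offloaded.

```python
import os
import shutil


def prepare_offload_folder(path: str, required_bytes: int) -> int:
    """Create `path` if needed and return its free space in bytes.

    Raises RuntimeError when less than `required_bytes` is available.
    """
    os.makedirs(path, exist_ok=True)
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise RuntimeError(
            f"{path!r} has {free / 1e12:.2f} TB free, "
            f"but {required_bytes / 1e12:.2f} TB was requested"
        )
    return free


if __name__ == "__main__":
    # Hypothetical values: adjust the path and the budget to your setup.
    folder = "./switch_offload"
    budget = int(3.1e12)  # the 3.1 TB checkpoint size quoted for this model
    try:
        prepare_offload_folder(folder, budget)
        print("enough space for offloading")
    except RuntimeError as err:
        print(f"not enough space: {err}")
```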

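📚 Documentation

Direct use and downstream use

See the research paper for more details.

Out-of-scope use

More information needed.

Bias, risks, and limitations

More information needed.

Ethical considerations and risks

More information needed.

Known limitations

More information needed.

Sensitive use

More information needed.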

Training details

Training data

The model was trained on a masked language modeling task using the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as T5.
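
In T5, this masked language modeling objective takes the form of span corruption: selected spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the tokens it hides. The sketch below builds one such pair from whitespace tokens. It is a simplified illustration, not the actual preprocessing code; the function name `span_corrupt` is hypothetical, and the spans are chosen by hand, whereas real preprocessing samples them randomly over subword tokens.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair from whitespace tokens.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    end exclusive. Each span is replaced by one sentinel in the input, and
    the target lists every sentinel followed by the tokens it hides, closed
    by one final sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


if __name__ == "__main__":
    words = "the quick brown fox jumps".split()
    # Hand-picked span for illustration: hide "quick brown".
    source, target = span_corrupt(words, [(1, 3)])
    print(source)  # the <extra_id_0> fox jumps
    print(target)  # <extra_id_0> quick brown <extra_id_1>
```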

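Training procedure

According to the model card in the original paper, the model was trained on TPU v3 or TPU v4 pods, using the t5x codebase together with jax.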

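Evaluation

Testing data, factors, and metrics

The authors evaluated the model on a variety of tasks and compared the results against T5. A table of quantitative results accompanies the original model card; for full details, see the research paper.

Results

For the full Switch Transformers results, see Table 5 of the research paper.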

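Environmental impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware type: Google Cloud TPU Pods, TPU v3 or TPU v4 | Number of chips ≥ 4.
Hours used: More information needed
Cloud provider: GCP
Compute region: More information needed
Carbon emitted: More information needed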
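
Since the hours and region are not reported, no figure can be computed for this model. As a rough illustration of the kind of estimate involved, the sketch below multiplies the energy drawn by a grid carbon intensity. This is a common back-of-the-envelope approach and not necessarily the calculator's exact method; the function name and every input value are hypothetical placeholders.

```python
def estimate_co2_kg(num_chips, chip_power_watts, hours, pue, grid_kg_co2_per_kwh):
    """Estimate emissions as energy drawn times grid carbon intensity.

    `pue` is the data-center power usage effectiveness multiplier.
    """
    energy_kwh = num_chips * chip_power_watts * hours / 1000.0
    return energy_kwh * pue * grid_kg_co2_per_kwh


if __name__ == "__main__":
    # Every number here is a made-up placeholder, not a measurement of this model.
    estimate = estimate_co2_kg(
        num_chips=4,
        chip_power_watts=200.0,
        hours=10.0,
        pue=1.1,
        grid_kg_co2_per_kwh=0.4,
    )
    print(f"{estimate:.2f} kg CO2eq")
```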

Citation

BibTeX:

@misc{https://doi.org/10.48550/arxiv.2101.03961,
  doi = {10.48550/ARXIV.2101.03961},
  url = {https://arxiv.org/abs/2101.03961},
  author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}