switch-c-2048 open-source model: 1.6 trillion parameters for efficient language tasks


Switch C 2048

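Developed by Google

A Mixture-of-Experts (MoE) model trained on a masked language modeling task. It has 1.6 trillion parameters and uses a T5-like architecture in which the feed-forward layers are replaced by sparse MLP layers.

Large language model

Transformers

Language: English | License: Apache-2.0 | #trillion-parameter-scale #mixture-of-experts #masked-language-modeling

Downloads: 73

Released: 11/4/2022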

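Model overview

Switch Transformers is a text generation model scaled up with a Mixture-of-Experts architecture. On the pre-training task it scales better and trains more efficiently than the standard T5 model.

Model features

Mixture-of-Experts architecture

The feed-forward layers are replaced by sparse layers containing 2048 expert MLPs, which scales the parameter count efficiently.

Efficient training

Achieves a 4x training speedup over the T5-XXL model.

Large parameter count

The model has 1.6 trillion parameters and requires 3.1 TB of storage.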
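As a rough sanity check on the storage figure, the arithmetic below multiplies the rounded parameter count by the bytes each parameter takes in a few common numeric formats. The 1.6e12 count is the rounded number quoted above, and the idea that the published checkpoint uses about two bytes per parameter is an assumption made here, not something the model card states.

```python
# Back-of-the-envelope size of the raw weights for a given numeric format.
# Assumption: 1.6e12 parameters (the rounded figure quoted in the text).
NUM_PARAMS = 1.6e12

# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def checkpoint_size_tb(num_params: float, dtype: str) -> float:
    """Return the raw weight size in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e12


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {checkpoint_size_tb(NUM_PARAMS, dtype):.1f} TB")
```

Under these assumptions the 16-bit estimate (about 3.2 TB) lands close to the quoted 3.1 TB, which is why disk offloading matters so much for this model.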

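Model capabilities

Text generation

Masked language modeling

Use cases

Text completion

Masked text generation

Generates the missing content for input text that contains mask tokens.

The example inputs and outputs show that the model fills in the missing content plausibly.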
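To make the masked-generation format concrete, here is a minimal sketch that merges a decoded model output back into its masked input. It uses the example input and output strings from this model card; the helper names `parse_sentinel_output` and `fill_masks` are hypothetical and are not part of transformers.

```python
import re

# T5-style sentinel tokens look like <extra_id_0>, <extra_id_1>, ...
SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_sentinel_output(decoded: str) -> dict[int, str]:
    """Map each sentinel index to the text generated after it."""
    # Drop the special tokens that are not sentinels.
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # Splitting on a pattern with one capture group yields
    # [prefix, id0, text0, id1, text1, ...].
    parts = SENTINEL.split(cleaned)
    return {int(idx): text.strip() for idx, text in zip(parts[1::2], parts[2::2])}


def fill_masks(masked_input: str, decoded: str) -> str:
    """Replace each sentinel in the input with its predicted span."""
    spans = parse_sentinel_output(decoded)
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked_input)


masked = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
decoded = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"
print(fill_masks(masked, decoded))
# A man walks into a bar a orders a beer with a pinch of salt.
```

A sentinel that the model did not predict is left in place, so partially filled outputs remain easy to spot.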

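🚀 Switch Transformers C - 2048 experts (3.1 TB for 1.6T parameters)

Switch Transformers is a language model built on a Mixture of Experts (MoE) architecture and trained on a Masked Language Modeling (MLM) task. It trains faster than the classic T5 model and performs better on fine-tuned tasks, which helps push language models toward the trillion-parameter scale.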


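🚀 Quick Start

Switch Transformers is a Mixture of Experts (MoE) model trained on a Masked Language Modeling (MLM) task. The architecture is similar to the classic T5, but the feed-forward layers are replaced by sparse MLP layers that contain "expert" MLPs. According to the original paper, the model outperforms T5 on fine-tuned tasks and also trains faster (better scaling properties).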
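The sparse layer described above can be illustrated with a toy top-1 ("switch") routing sketch in plain Python: a router scores every expert for a token, only the highest-scoring expert runs, and its output is scaled by the router probability. The class name `SwitchFFN`, the dimensions, and the random initialization are illustrative assumptions. The sketch leaves out batching, expert capacity limits, and the load-balancing loss used in real training, and it is not the transformers implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SwitchFFN:
    """Toy sparse feed-forward layer: one router plus several expert MLPs."""

    def __init__(self, d_model, d_ff, num_experts, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.router = rand(num_experts, d_model)
        # Each expert is a two-layer MLP: d_model -> d_ff -> d_model.
        self.experts = [(rand(d_ff, d_model), rand(d_model, d_ff)) for _ in range(num_experts)]

    def forward(self, token):
        # Router probabilities over experts for this token.
        probs = softmax(matvec(self.router, token))
        # Top-1 routing: only the most probable expert is evaluated.
        expert_idx = max(range(len(probs)), key=probs.__getitem__)
        w_in, w_out = self.experts[expert_idx]
        hidden = [max(0.0, h) for h in matvec(w_in, token)]  # ReLU
        out = matvec(w_out, hidden)
        # Scale the expert output by its router probability.
        return [probs[expert_idx] * o for o in out], expert_idx


layer = SwitchFFN(d_model=4, d_ff=8, num_experts=4)
rng = random.Random(1)
for t in range(3):
    token = [rng.gauss(0, 1) for _ in range(4)]
    out, idx = layer.forward(token)
    print(f"token {t} -> expert {idx}, output dim {len(out)}")
```

Because each token touches a single expert, the compute per token stays roughly constant while the total parameter count grows with the number of experts.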

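As the paper's abstract puts it:

We advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus", and achieve a 4x speedup over the T5-XXL model.

Disclaimer: the content of this model card was written by the Hugging Face team, and parts of it were copied from the original paper.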

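✨ Key Features

Model type: language model
Language(s) (NLP): English
License: Apache 2.0
Related models: all FLAN-T5 checkpoints
Original checkpoints: all original FLAN-T5 checkpoints
Resources for more information: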

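📦 Installation Guide

Note that these checkpoints were trained on a Masked Language Modeling (MLM) task, so they are not ready to use for downstream tasks as-is. You can check FLAN-T5 for fine-tuned weights, or follow this notebook to fine-tune your own MoE model.

Below are some example scripts for using the model in transformers. Keep in mind that the model is extremely large, so you may want to use accelerate with disk offloading:

Running the model on a CPU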


# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: replace with a directory that has room for the offloaded weights.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048", device_map="auto", offload_folder=offload_folder
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Expected output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU


# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: replace with a directory that has room for the offloaded weights.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048", device_map="auto", offload_folder=offload_folder
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the input ids to the first GPU.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Expected output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU using different precisions

BF16


# pip install accelerate
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Placeholder: replace with a directory that has room for the offloaded weights.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # load the weights in bfloat16
    offload_folder=offload_folder,
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Expected output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

INT8


# pip install bitsandbytes accelerate
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    SwitchTransformersForConditionalGeneration,
)

# Placeholder: replace with a directory that has room for the offloaded weights.
offload_folder = "<OFFLOAD_FOLDER>"

tokenizer = AutoTokenizer.from_pretrained("google/switch-c-2048")
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-c-2048",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    offload_folder=offload_folder,
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Expected output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

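📚 Detailed Documentation

Direct use and downstream use

See the research paper for more details.

Out-of-scope use

More information needed.

Bias, risks, and limitations

More information needed.

Ethical considerations and risks

More information needed.

Known limitations

More information needed.

Sensitive use

More information needed.

Training details

Training data

The model was trained on a masked language modeling task on the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as T5.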
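For readers unfamiliar with the T5-style masked language modeling objective mentioned above, the sketch below builds one (input, target) training pair by replacing chosen spans with sentinel tokens. The function name `span_corrupt` is hypothetical, the spans are picked by hand, and whitespace tokens stand in for subword tokens. A real pre-training pipeline samples spans randomly, so treat this as an illustration of the data format only.

```python
def span_corrupt(tokens, spans):
    """Build an (input, target) pair in the T5 span-corruption format.

    tokens: list of string tokens.
    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the unmasked text, then mark the dropped span with a sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target repeats the sentinel followed by the dropped tokens.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week".split()
source, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(source)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

This is the same sentinel format that appears in the inference examples, where the model emits each sentinel followed by its predicted span.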

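Training procedure

According to the model card in the original paper, the model was trained on TPU v3 or TPU v4 pods, using the t5x codebase together with jax.

Evaluation

Testing data, factors, and metrics

The authors evaluated the model on a variety of tasks and compared the results against T5. For the full quantitative results, see the research paper.

Results

For the full results of Switch Transformers, see Table 5 of the research paper.

Environmental impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
Hours used: more information needed
Cloud provider: GCP
Compute region: more information needed
Carbon emitted: more information needed

Citation

BibTeX: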

@misc{https://doi.org/10.48550/arxiv.2101.03961,
  doi = {10.48550/ARXIV.2101.03961},
  url = {https://arxiv.org/abs/2101.03961},
  author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}