
Switch-C 2048

Developed by Google
A Mixture of Experts (MoE) model with 1.6 trillion parameters, trained on a masked language modeling task. It uses an architecture similar to T5 but replaces the dense feed-forward layers with sparse MoE layers built from expert MLPs.
Release Time: 11/4/2022

Model Overview

Switch Transformers is a text generation model that extends T5 with a Mixture of Experts architecture, showing better scalability and training efficiency on pre-training tasks than the standard dense T5 model.

Model Features

Mixture of Experts architecture
The feed-forward layer is replaced with a sparse layer containing 2048 expert MLPs, enabling efficient parameter expansion (see the routing sketch after this list).
Efficient training
Achieves a 4x training speedup compared to the T5-XXL model.
Large-scale parameters
The model has 1.6 trillion parameters and requires about 3.1 TB of storage.
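As a rough illustration of how this routing works, the PyTorch sketch below implements top-1 ("switch") routing over a handful of expert MLPs. It is a minimal sketch under simplifying assumptions, not the released implementation: the class name SwitchFFN, the argument names, and the 8-expert configuration are illustrative choices, and the real Switch-C layer uses 2048 experts plus capacity limits and a load-balancing loss that are omitted here.

```python
# Minimal sketch of top-1 ("switch") routing over expert MLPs.
# Names and sizes are illustrative, not the production implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary two-layer MLP, like T5's dense feed-forward block.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))                    # (n_tokens, d_model)
        gate_probs = F.softmax(self.router(tokens), dim=-1)   # (n_tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)            # pick a single expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Expert output is scaled by its gate probability.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

# Example with 8 experts: each token only runs through one expert MLP.
layer = SwitchFFN(d_model=512, d_ff=2048, num_experts=8)
y = layer(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

Because each token is processed by only one expert, the total parameter count grows with the number of experts while per-token compute stays close to that of a dense T5 feed-forward layer, which is what enables the efficient parameter expansion described above.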

Model Capabilities

Text generation
Masked language modeling

Use Cases

Text completion
Masked text generation
Generates complete content from input text containing masked (sentinel) tokens; see the usage sketch below.
The example input and output show that the model can plausibly fill in the missing spans.
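Since the checkpoint follows the T5 text-to-text format, masked-span filling can be tried through the Hugging Face transformers library. The snippet below is a usage sketch, not an official example: the smaller google/switch-base-8 checkpoint is substituted because loading google/switch-c-2048 needs roughly 3.1 TB of memory, and the generation settings are assumptions.

```python
# Usage sketch: masked span filling with a Switch Transformers checkpoint.
# "google/switch-base-8" stands in for "google/switch-c-2048", which needs ~3.1 TB
# of memory to load; both expose the same T5-style interface.
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

model_id = "google/switch-base-8"  # swap in "google/switch-c-2048" if the hardware allows
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SwitchTransformersForConditionalGeneration.from_pretrained(model_id)

# T5-style sentinel tokens mark the spans the model should fill in.
text = "A <extra_id_0> walks into a bar and orders a <extra_id_1> with a pinch of <extra_id_2>."
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```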