🚀 T5-Efficient-MINI (Deep-Narrow version)
T5-Efficient-MINI is a variation of Google's original T5 that follows the T5 model architecture. It is a pretrained-only checkpoint released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. The paper shows that a Deep-Narrow architecture can offer better downstream performance than other architectures with a similar parameter count.
✨ Features
- Deep-Narrow Architecture: A Deep-Narrow model architecture is more favorable for downstream performance compared to other architectures of similar parameter count.
- Model Depth Definition: Model depth is defined as the number of transformer blocks stacked sequentially; a sequence of word embeddings is therefore processed by one transformer block after another.
📦 Installation
No installation steps are provided in the original document.
💻 Usage Examples
No code examples are provided in the original document.
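Since the original card ships without a usage snippet, the following is a minimal sketch only, assuming the checkpoint is hosted on the Hugging Face Hub as `google/t5-efficient-mini` and loaded with the `transformers` library (install via `pip install transformers torch sentencepiece`). Because this is a pretrained-only checkpoint, the sketch merely probes the span-infilling behaviour learned during pre-training; for practical tasks the model must be fine-tuned first (see below).

```python
# Minimal sketch; the Hub id "google/t5-efficient-mini" is an assumption.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Probe the span-infilling (masked span) behaviour learned during pre-training.
text = "The <extra_id_0> walks in <extra_id_1> park."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```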
📚 Documentation
Details model architecture
This model checkpoint - t5-efficient-mini - is of model type Mini with no variations. It has 31.23 million parameters, requiring ca. 124.92 MB of memory in full precision (fp32) or 62.46 MB in half precision (fp16 or bf16).
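The memory figures follow directly from the parameter count at 4 bytes per parameter in fp32 and 2 bytes per parameter in fp16/bf16; a quick back-of-the-envelope check (weights only, excluding activations and optimizer state):

```python
# Weight-memory estimate only; activations and optimizer state are excluded.
params = 31.23e6              # 31.23 million parameters
fp32_mb = params * 4 / 1e6    # 4 bytes/param -> ~124.92 MB
fp16_mb = params * 2 / 1e6    # 2 bytes/param -> ~62.46 MB
print(f"fp32: {fp32_mb:.2f} MB, fp16/bf16: {fp16_mb:.2f} MB")
```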
The following table shows a summary of the original T5 model architectures:
| Model | nl (el/dl) | ff    | dm   | kv  | nh  | #Params |
| ----- | ---------- | ----- | ---- | --- | --- | ------- |
| Tiny  | 4/4        | 1024  | 256  | 32  | 4   | 16M     |
| Mini  | 4/4        | 1536  | 384  | 32  | 8   | 31M     |
| Small | 6/6        | 2048  | 512  | 32  | 8   | 60M     |
| Base  | 12/12      | 3072  | 768  | 64  | 12  | 220M    |
| Large | 24/24      | 4096  | 1024 | 64  | 16  | 738M    |
| Xl    | 24/24      | 16384 | 1024 | 128 | 32  | 3B      |
| XXl   | 24/24      | 65536 | 1024 | 128 | 128 | 11B     |
The following table explains the abbreviations used:
| Abbreviation | Definition |
| ------------ | ---------- |
| nl  | Number of transformer blocks (depth) |
| dm  | Dimension of embedding vector (output vector of a transformer block) |
| kv  | Dimension of key/value projection matrix |
| nh  | Number of attention heads |
| ff  | Dimension of intermediate vector within a transformer block (size of feed-forward projection matrix) |
| el  | Number of transformer blocks in the encoder (encoder depth) |
| dl  | Number of transformer blocks in the decoder (decoder depth) |
| sh  | Signifies that attention heads are shared |
| skv | Signifies that key/value projection matrices are tied |
If a model checkpoint has no specific el or dl value, both the encoder depth and the decoder depth correspond to nl.
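As a hedged illustration of how these abbreviations map onto the checkpoint's configuration in the `transformers` library (attribute names taken from `T5Config`; the Hub id `google/t5-efficient-mini` is assumed), the values expected from the Mini row of the table above are shown in the comments:

```python
# Inspect the architecture hyperparameters; expected values per the Mini row above.
from transformers import T5Config

config = T5Config.from_pretrained("google/t5-efficient-mini")  # assumed Hub id
print("el (encoder depth):", config.num_layers)            # 4
print("dl (decoder depth):", config.num_decoder_layers)    # 4
print("dm (model dim):", config.d_model)                   # 384
print("kv (key/value dim):", config.d_kv)                  # 32
print("nh (attention heads):", config.num_heads)           # 8
print("ff (feed-forward dim):", config.d_ff)               # 1536
```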
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
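To make the objective concrete, here is an illustrative sketch of T5-style span corruption (not the exact pre-training pipeline): random spans of the input are replaced by sentinel tokens `<extra_id_0>`, `<extra_id_1>`, ..., and the target reconstructs the dropped spans in order.

```python
# Illustration of span-based MLM (span corruption) as used for T5 pre-training;
# the sentence and the chosen spans are illustrative only.
original = "Thank you for inviting me to your party last week ."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
print("input: ", corrupted_input)
print("target:", target)
```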
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks.
You can follow one of the following framework-specific examples to fine-tune the model (a minimal sketch is also given after this list):
- PyTorch:
- TensorFlow:
- JAX/Flax:
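As a hedged placeholder for those examples, the following minimal PyTorch fine-tuning sketch uses the `transformers` library on a toy summarization-style pair; the Hub id, data, and hyperparameters are illustrative assumptions rather than the authors' recipe.

```python
# Minimal PyTorch fine-tuning sketch (illustrative only).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-mini"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Toy (source, target) pairs; replace with a real English dataset.
pairs = [
    ("summarize: The quick brown fox jumps over the lazy dog near the river bank.",
     "A fox jumps over a dog."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for source, target in pairs:
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    loss = model(**enc, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.4f}")

model.save_pretrained("./t5-efficient-mini-finetuned")
tokenizer.save_pretrained("./t5-efficient-mini-finetuned")
```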
Downstream Performance
TODO: Add table if available
Computational Complexity
TODO: Add table if available
More information
We strongly recommend reading the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers carefully to get a more nuanced understanding of this model checkpoint. As explained in the following issue, checkpoints that include the sh or skv architecture variations have not been ported to Transformers, as they are probably of limited practical use and lack a more detailed description. Those checkpoints are kept here as they might be ported in the future.
📄 License
This model is licensed under the Apache 2.0 (apache-2.0) license.