🚀 T5-Efficient-BASE-FF6000 (Deep-Narrow version)
T5-Efficient-BASE-FF6000 is a variant of Google's original T5 that adheres to the T5 model architecture. It's a pretrained-only checkpoint, released alongside the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler.
In essence, the paper shows that a Deep-Narrow model architecture outperforms other architectures with a similar parameter count in downstream tasks.
🚀 Quick Start
This model is a pretrained-only checkpoint. To use it for practical tasks, you need to fine-tune it. It was pretrained in English, so it's suitable for English NLP tasks. You can refer to the following fine-tuning examples:
- PyTorch: summarization, question answering, and text classification example scripts
- Tensorflow: summarization and text classification example scripts
- JAX/Flax: summarization and text classification example scripts
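As a starting point, the checkpoint can be loaded with the 🤗 Transformers library. The sketch below assumes the checkpoint is published on the Hugging Face Hub under the ID google/t5-efficient-base-ff6000; because this is a pretrained-only model, its outputs are not meaningful for downstream tasks until it has been fine-tuned.

```python
# Minimal loading sketch (assumes the Hub ID google/t5-efficient-base-ff6000).
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model_id = "google/t5-efficient-base-ff6000"

tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Count parameters to sanity-check the ~336M figure reported in this card.
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.2f}M")
```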
✨ Features
The paper recommends a Deep-Narrow strategy, where increasing the model's depth first is beneficial for downstream performance. A deep and narrow model is generally more efficient than other architectures with a similar parameter count.
As the paper states:
> We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally be more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.
📚 Documentation
🔧 Technical Details
Model Architecture Details
The model checkpoint t5-efficient-base-ff6000 is of model type Base with the following variation:
- ff is 6000 (instead of the default 3072 for a Base model)
It has 336.18 million parameters, requiring ca. 1344.71 MB of memory in full precision (fp32) or 672.36 MB in half precision (fp16 or bf16).
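The memory figures follow directly from the parameter count: each fp32 parameter occupies 4 bytes and each fp16/bf16 parameter 2 bytes (using 1 MB = 10^6 bytes). A quick back-of-the-envelope check:

```python
# Rough memory estimate from the parameter count reported above.
params = 336.18e6

fp32_mb = params * 4 / 1e6  # 4 bytes per parameter in full precision
fp16_mb = params * 2 / 1e6  # 2 bytes per parameter in half precision

print(f"fp32: {fp32_mb:.2f} MB")       # ~1344.7 MB
print(f"fp16/bf16: {fp16_mb:.2f} MB")  # ~672.4 MB
```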
The following table summarizes the original T5 model architectures:
| Model | nl (el/dl) | ff | dm | kv | nh | #Params |
|-------|------------|-------|------|-----|-----|---------|
| Tiny  | 4/4        | 1024  | 256  | 32  | 4   | 16M     |
| Mini  | 4/4        | 1536  | 384  | 32  | 8   | 31M     |
| Small | 6/6        | 2048  | 512  | 32  | 8   | 60M     |
| Base  | 12/12      | 3072  | 768  | 64  | 12  | 220M    |
| Large | 24/24      | 4096  | 1024 | 64  | 16  | 738M    |
| Xl    | 24/24      | 16384 | 1024 | 128 | 32  | 3B      |
| XXl   | 24/24      | 65536 | 1024 | 128 | 128 | 11B     |
The following abbreviations are used:
| Abbreviation | Definition |
|--------------|------------|
| nl  | Number of transformer blocks (depth) |
| dm  | Dimension of embedding vector (output vector of transformer block) |
| kv  | Dimension of key/value projection matrix |
| nh  | Number of attention heads |
| ff  | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el  | Number of transformer blocks in the encoder (encoder depth) |
| dl  | Number of transformer blocks in the decoder (decoder depth) |
| sh  | Signifies that attention heads are shared |
| skv | Signifies that key-value projection matrices are tied |
If a model checkpoint has no specific el or dl, the number of encoder and decoder layers equals nl.
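These dimensions correspond to attributes of T5Config in 🤗 Transformers, so they can be read directly from the checkpoint's configuration. A small sketch (again assuming the Hub ID google/t5-efficient-base-ff6000):

```python
from transformers import T5Config

config = T5Config.from_pretrained("google/t5-efficient-base-ff6000")

# Mapping: nl/el -> num_layers, dl -> num_decoder_layers, ff -> d_ff,
# dm -> d_model, kv -> d_kv, nh -> num_heads
print("encoder layers (el):", config.num_layers)
print("decoder layers (dl):", config.num_decoder_layers)
print("feed-forward dim (ff):", config.d_ff)      # 6000 for this checkpoint
print("model dim (dm):", config.d_model)
print("key/value dim (kv):", config.d_kv)
print("attention heads (nh):", config.num_heads)
```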
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524,288 steps using the span-based masked language modeling (MLM) objective.
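To illustrate the span-based MLM (span-corruption) objective: dropped spans in the input are replaced with sentinel tokens (<extra_id_0>, <extra_id_1>, ...), and the target reconstructs the dropped spans after their sentinels. The toy example below is hand-constructed for illustration; the actual pre-training pipeline samples spans automatically over C4. The Hub ID is an assumption, as above.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model_id = "google/t5-efficient-base-ff6000"  # assumed Hub ID
tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Corrupted input: dropped spans are replaced by sentinel tokens.
inputs = tokenizer(
    "The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt"
)
# Target: each sentinel is followed by the span it replaced.
labels = tokenizer(
    "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt"
).input_ids

# The model is trained to minimize cross-entropy on the target spans.
loss = model(input_ids=inputs.input_ids, labels=labels).loss
print(f"span-corruption loss: {loss.item():.3f}")
```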
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and needs to be fine-tuned for practical use. It was pretrained in English, so it's only useful for English NLP tasks.
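As a rough orientation, the sketch below runs a single fine-tuning step on a toy summarization-style pair with plain PyTorch. The dataset, learning rate, and task prefix are placeholders for illustration, not recommendations from the paper or this card; in practice you would use one of the example scripts referenced in the Quick Start section.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

model_id = "google/t5-efficient-base-ff6000"  # assumed Hub ID
tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder LR

# Toy input/target pair; replace with a real English dataset and task prefix.
batch = tokenizer(
    ["summarize: The quick brown fox jumped over the lazy dog near the river."],
    return_tensors="pt", padding=True, truncation=True,
)
labels = tokenizer(
    ["A fox jumped over a dog."],
    return_tensors="pt", padding=True, truncation=True,
).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```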
More Information
We highly recommend reading the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers for a more in-depth understanding of this model checkpoint. As mentioned in this issue, checkpoints with sh or skv model architecture variations haven't been ported to Transformers due to limited practical use and lack of detailed descriptions. These checkpoints are stored here and may be ported in the future.
📄 License
This project is licensed under the Apache 2.0 license.