🚀 T5-Efficient-SMALL-NL22 (Deep-Narrow version)
T5-Efficient-SMALL-NL22 is a variant of Google's original T5, following the T5 model architecture. It's a pretrained-only checkpoint, released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. This model shows that a Deep-Narrow architecture can offer better downstream performance compared to other architectures with a similar number of parameters.
✨ Features
The paper suggests that a Deep-Narrow model architecture is more favorable for downstream performance than other architectures with a similar parameter count. Specifically, increasing the model's depth before other forms of scaling can lead to better Pareto-efficiency, with the relative gain diminishing after 32 to 36 layers.
⚠️ Important Note
Efficiency here refers to one of three compute dimensions: parameter count, FLOPs, or throughput (speed). The paper reports all three efficiency metrics, leaving the decision of which compute dimension to optimize for to the practitioner.
📚 Documentation
Model architecture details
This model checkpoint - t5-efficient-small-nl22 - is of the Small model type with the variation that nl is 22. It has 178.04 million parameters, requiring ca. 712.16 MB of memory in full precision (fp32) or 356.08 MB in half precision (fp16 or bf16).
| Property | Details |
|---|---|
| Model Type | Small |
| Number of Parameters | 178.04 million |
| Memory Requirement (fp32) | ca. 712.16 MB |
| Memory Requirement (fp16 or bf16) | ca. 356.08 MB |
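As a quick sanity check, the memory figures above follow directly from the parameter count. A minimal sketch, assuming 4 bytes per parameter in fp32, 2 bytes in fp16/bf16, and 1 MB = 10^6 bytes (framework overhead not included):

```python
# Back-of-the-envelope check of the memory figures in the table above.
n_params = 178.04e6           # parameter count of t5-efficient-small-nl22

fp32_mb = n_params * 4 / 1e6  # 4 bytes per parameter in full precision
fp16_mb = n_params * 2 / 1e6  # 2 bytes per parameter in half precision

print(f"fp32: {fp32_mb:.2f} MB")       # -> 712.16 MB
print(f"fp16/bf16: {fp16_mb:.2f} MB")  # -> 356.08 MB
```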
A summary of the original T5 model architectures:
| Model | nl (el/dl) | ff | dm | kv | nh | #Params |
|---|---|---|---|---|---|---|
| Tiny | 4/4 | 1024 | 256 | 32 | 4 | 16M |
| Mini | 4/4 | 1536 | 384 | 32 | 8 | 31M |
| Small | 6/6 | 2048 | 512 | 32 | 8 | 60M |
| Base | 12/12 | 3072 | 768 | 64 | 12 | 220M |
| Large | 24/24 | 4096 | 1024 | 64 | 16 | 738M |
| Xl | 24/24 | 16384 | 1024 | 128 | 32 | 3B |
| XXl | 24/24 | 65536 | 1024 | 128 | 128 | 11B |
Abbreviations used:
| Abbreviation | Definition |
|---|---|
| nl | Number of transformer blocks (depth) |
| dm | Dimension of embedding vector (output vector of transformer block) |
| kv | Dimension of key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key-value projection matrices are tied |
If a model checkpoint lists no specific el or dl, both the encoder and decoder depth correspond to nl.
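For reference, these abbreviations map directly onto fields of `T5Config` in the `transformers` library. The sketch below fills in the values for a Small model with nl = 22, taken from the tables above; it is an illustration of the mapping, not the recommended way to load the released checkpoint (use `from_pretrained` for that):

```python
from transformers import T5Config

# Illustrative mapping of the abbreviations above onto T5Config fields
# for a Small model with nl = 22.
config = T5Config(
    num_layers=22,          # nl / el: encoder depth
    num_decoder_layers=22,  # dl: decoder depth (equals nl when not stated otherwise)
    d_model=512,            # dm: embedding / hidden dimension
    d_ff=2048,              # ff: feed-forward (intermediate) dimension
    d_kv=32,                # kv: key/value projection dimension per head
    num_heads=8,            # nh: number of attention heads
)
```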
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524,288 steps using the span-based masked language modeling (MLM) objective.
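In the span-based MLM (span-corruption) objective, contiguous spans of the input are replaced by sentinel tokens and the model learns to reconstruct the dropped spans. A minimal sketch of what an input/target pair looks like, assuming the checkpoint is hosted on the Hugging Face Hub as `google/t5-efficient-small-nl22` with the standard T5 SentencePiece tokenizer:

```python
from transformers import AutoTokenizer

# Assumed Hub id for this checkpoint; adjust if it is hosted under a different name.
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-small-nl22")

# Span corruption: masked spans in the input are replaced by sentinel tokens,
# and the target lists the dropped spans behind the same sentinels.
input_text = "The <extra_id_0> walks in <extra_id_1> park"
target_text = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
labels = tokenizer(target_text, return_tensors="pt").input_ids
```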
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and needs to be fine-tuned for practical use. It was pretrained on English text and is therefore only useful for English NLP tasks.
You can follow these examples to fine-tune the model:
- PyTorch:
- TensorFlow:
- JAX/Flax:
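Independent of those example scripts, loading the checkpoint for fine-tuning follows the standard `transformers` seq2seq pattern. A minimal PyTorch sketch, assuming the checkpoint is hosted on the Hugging Face Hub as `google/t5-efficient-small-nl22`; the task prefix and texts are placeholders for your own dataset:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed Hub id for this checkpoint; adjust if it is hosted under a different name.
model_id = "google/t5-efficient-small-nl22"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Placeholder seq2seq pair; in practice this comes from your fine-tuning dataset.
inputs = tokenizer(
    "summarize: The quick brown fox jumps over the lazy dog.",
    return_tensors="pt",
)
labels = tokenizer("A fox jumps over a dog.", return_tensors="pt").input_ids

# One training step of a standard seq2seq fine-tuning loop.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```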
📄 License
This project is licensed under the Apache-2.0 license.