🚀 T5-Efficient-SMALL-DM768 (Deep-Narrow version)
T5-Efficient-SMALL-DM768 is a variation of Google's original T5, following the T5 model architecture. It's a pretrained-only checkpoint, released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. This model shows that a Deep-Narrow architecture can achieve better downstream performance than other architectures with a similar number of parameters.
✨ Features
The paper suggests that a Deep-Narrow model architecture is more favorable for downstream performance than other architectures with a similar parameter count. It recommends increasing the model's depth before considering other forms of uniform scaling. A deep and narrow model is generally more efficient, although the relative gain in Pareto efficiency diminishes as the number of layers increases, converging at 32 to 36 layers.
📦 Installation
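The checkpoint is loaded through the 🤗 Transformers library. A typical setup is `pip install transformers sentencepiece` (sentencepiece is needed for the T5 tokenizer), together with one of PyTorch, TensorFlow, or JAX/Flax as the modeling backend.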
💻 Usage Examples
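As a starting point, here is a minimal inference sketch. It assumes the checkpoint is available on the Hugging Face Hub under the identifier `google/t5-efficient-small-dm768` (an assumption; check the model page) and that a PyTorch backend is installed. Because the checkpoint is pretrained-only, generation is mainly meaningful for span in-filling with sentinel tokens.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-small-dm768"  # assumed Hub identifier for this checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Pretrained-only checkpoint: it was trained with the span-based MLM objective,
# so raw generation is only meaningful for filling in sentinel-marked spans.
text = "The <extra_id_0> walks in <extra_id_1> park."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

For any practical task, fine-tune the checkpoint first (see the Fine-Tuning section below).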
📚 Documentation
Model architecture details
This model checkpoint - t5-efficient-small-dm768 - is of model type Small with the following variation: dm is 768.
It has 90.77 million parameters and requires ca. 363.1 MB of memory in full precision (fp32) or 181.55 MB of memory in half precision (fp16 or bf16).
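As a quick sanity check, these memory figures follow directly from the parameter count, counting model weights only (no optimizer state or activations):

```python
# Rough estimate of checkpoint memory from the parameter count,
# using 4 bytes per weight in fp32 and 2 bytes in fp16/bf16 (1 MB = 10**6 bytes).
num_params = 90.77e6
print(f"fp32: {num_params * 4 / 1e6:.2f} MB")       # ~363.1 MB
print(f"fp16/bf16: {num_params * 2 / 1e6:.2f} MB")  # ~181.5 MB
```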
The checkpoint is summarized in the following table:
| Property | Details |
|----------|---------|
| Model Type | Small with dm = 768 |
| Training Data | Colossal, Cleaned version of Common Crawl (C4) |
| Number of Parameters | 90.77 million |
| Memory Requirement (fp32) | ca. 363.1 MB |
| Memory Requirement (fp16 or bf16) | 181.55 MB |
The following table shows the abbreviations used in the model architecture description:
| Abbreviation | Definition |
|--------------|------------|
| nl | Number of transformer blocks (depth) |
| dm | Dimension of the embedding vector (output vector of a transformer block) |
| kv | Dimension of the key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of the intermediate vector within a transformer block (size of the feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key/value projection matrices are tied |
If a model checkpoint has no specific el or dl, the number of encoder and decoder layers both correspond to nl.
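To make the abbreviations concrete, the sketch below maps them onto the fields of a Transformers T5Config. The specific values are assumptions based on the standard Small layout (nl = 6, kv = 64, nh = 8, ff = 2048) with dm raised to 768 for this checkpoint; the checkpoint's own config remains the authoritative source.

```python
from transformers import T5Config

# Illustrative mapping of the abbreviations onto T5Config fields.
# Values assume the standard Small layout (nl=6, kv=64, nh=8, ff=2048)
# with dm raised to 768; check the checkpoint's config.json for the
# authoritative numbers.
config = T5Config(
    d_model=768,            # dm: dimension of the embedding vector
    d_kv=64,                # kv: dimension of the key/value projection matrix
    d_ff=2048,              # ff: dimension of the feed-forward projection
    num_layers=6,           # nl / el: encoder depth
    num_decoder_layers=6,   # dl: decoder depth (defaults to num_layers if unset)
    num_heads=8,            # nh: number of attention heads
)
print(config)
```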
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
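For illustration, the span-based MLM objective (span corruption) replaces contiguous spans of the input with sentinel tokens and trains the model to reconstruct the dropped spans. A hypothetical example pair, not taken from the actual training data, might look like this:

```python
# Illustrative span-corruption pair: spans are replaced by sentinel tokens in
# the input, and the target lists the dropped spans, each preceded by its sentinel.
original        = "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week."
target          = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```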
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks.
You can follow one of the following examples on how to fine-tune the model (a minimal sketch also follows this list):
- PyTorch
- TensorFlow
- JAX/Flax
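Below is a minimal PyTorch fine-tuning sketch as referenced above: a toy single-step update on one hypothetical summarization-style pair, not a full training recipe. It again assumes the Hub identifier `google/t5-efficient-small-dm768`.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-small-dm768"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy summarization-style pair; a real run would iterate over a full dataset.
source = "summarize: The quick brown fox jumped over the lazy dog near the river bank."
target = "A fox jumped over a dog."

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

model.train()
loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.3f}")
```

For real fine-tuning you would iterate over a tokenized dataset with padded batches; the Seq2SeqTrainer utilities in Transformers are a more common route than a hand-written loop.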
Downstream Performance
No downstream performance results are reported for this checkpoint.
Computational Complexity
No computational complexity figures are reported for this checkpoint.
More information
💡 Usage Tip
We strongly recommend reading the original paper, Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, carefully to get a more nuanced understanding of this model checkpoint.
As explained in the following issue, checkpoints including the sh or skv model architecture variations have not been ported to Transformers, as they are probably of limited practical use and lack a more detailed description. Those checkpoints are kept here as they might be ported in the future.
📄 License
The license of this model is apache-2.0.