🚀 T5-Efficient-SMALL-DM768 (Deep-Narrow version)
T5-Efficient-SMALL-DM768 is a variation of Google's original T5, following the T5 model architecture. It's a pretrained-only checkpoint, released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. This model shows that a Deep-Narrow architecture can achieve better downstream performance than other architectures with a similar number of parameters.
✨ Features
The paper suggests that a Deep-Narrow model architecture is more favorable for downstream performance than other architectures with a similar parameter count. It recommends increasing the model's depth before considering other forms of uniform scaling. A deep and narrow model is generally more efficient, although the relative gain in Pareto efficiency diminishes as the number of layers increases, converging at 32 to 36 layers.
📦 Installation
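The checkpoint is loaded through the 🤗 Transformers library. A typical setup is `pip install transformers sentencepiece` (sentencepiece is needed for the T5 tokenizer), together with one of PyTorch, TensorFlow, or JAX/Flax as the modeling backend.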
💻 Usage Examples
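As a starting point, here is a minimal inference sketch. It assumes the checkpoint is available on the Hugging Face Hub under the identifier `google/t5-efficient-small-dm768` (an assumption; check the model page) and that a PyTorch backend is installed. Because the checkpoint is pretrained-only, generation is mainly meaningful for span in-filling with sentinel tokens.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-small-dm768"  # assumed Hub identifier for this checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Pretrained-only checkpoint: it was trained with the span-based MLM objective,
# so raw generation is only meaningful for filling in sentinel-marked spans.
text = "The <extra_id_0> walks in <extra_id_1> park."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

For any practical task, fine-tune the checkpoint first (see the Fine-Tuning section below).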
📚 Documentation
Model architecture details
This model checkpoint - t5-efficient-small-dm768 - is of model type Small with the following variation: dm is 768.
It has 90.77 million parameters and requires ca. 363.1 MB of memory in full precision (fp32) or 181.55 MB of memory in half precision (fp16 or bf16).
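As a quick sanity check, these memory figures follow directly from the parameter count, counting model weights only (no optimizer state or activations):

```python
# Rough estimate of checkpoint memory from the parameter count,
# using 4 bytes per weight in fp32 and 2 bytes in fp16/bf16 (1 MB = 10**6 bytes).
num_params = 90.77e6
print(f"fp32: {num_params * 4 / 1e6:.2f} MB")       # ~363.1 MB
print(f"fp16/bf16: {num_params * 2 / 1e6:.2f} MB")  # ~181.5 MB
```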
The checkpoint is summarized in the following table:
| Property | Details |
|----------|---------|
| Model Type | Small with dm = 768 |
| Training Data | Colossal, Cleaned version of Common Crawl (C4) |
| Number of Parameters | 90.77 million |
| Memory Requirement (fp32) | ca. 363.1 MB |
| Memory Requirement (fp16 or bf16) | 181.55 MB |
The following table shows the abbreviations used in the model architecture description:
| Abbreviation | Definition |
|--------------|------------|
| nl | Number of transformer blocks (depth) |
| dm | Dimension of the embedding vector (output vector of a transformer block) |
| kv | Dimension of the key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of the intermediate vector within a transformer block (size of the feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key/value projection matrices are tied |
If a model checkpoint has no specific el or dl, the number of encoder and decoder layers both correspond to nl.
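To make the abbreviations concrete, the sketch below maps them onto the fields of a Transformers T5Config. The specific values are assumptions based on the standard Small layout (nl = 6, kv = 64, nh = 8, ff = 2048) with dm raised to 768 for this checkpoint; the checkpoint's own config remains the authoritative source.

```python
from transformers import T5Config

# Illustrative mapping of the abbreviations onto T5Config fields.
# Values assume the standard Small layout (nl=6, kv=64, nh=8, ff=2048)
# with dm raised to 768; check the checkpoint's config.json for the
# authoritative numbers.
config = T5Config(
    d_model=768,            # dm: dimension of the embedding vector
    d_kv=64,                # kv: dimension of the key/value projection matrix
    d_ff=2048,              # ff: dimension of the feed-forward projection
    num_layers=6,           # nl / el: encoder depth
    num_decoder_layers=6,   # dl: decoder depth (defaults to num_layers if unset)
    num_heads=8,            # nh: number of attention heads
)
print(config)
```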
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
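For illustration, the span-based MLM objective (span corruption) replaces contiguous spans of the input with sentinel tokens and trains the model to reconstruct the dropped spans. A hypothetical example pair, not taken from the actual training data, might look like this:

```python
# Illustrative span-corruption pair: spans are replaced by sentinel tokens in
# the input, and the target lists the dropped spans, each preceded by its sentinel.
original        = "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week."
target          = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```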
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks.
You can follow one of the following examples on how to fine-tune the model (a minimal sketch also follows this list):
- PyTorch
- TensorFlow
- JAX/Flax
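Below is a minimal PyTorch fine-tuning sketch as referenced above: a toy single-step update on one hypothetical summarization-style pair, not a full training recipe. It again assumes the Hub identifier `google/t5-efficient-small-dm768`.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-small-dm768"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy summarization-style pair; a real run would iterate over a full dataset.
source = "summarize: The quick brown fox jumped over the lazy dog near the river bank."
target = "A fox jumped over a dog."

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

model.train()
loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.3f}")
```

For real fine-tuning you would iterate over a tokenized dataset with padded batches; the Seq2SeqTrainer utilities in Transformers are a more common route than a hand-written loop.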
Downstream Performance
No downstream performance results are reported for this checkpoint.
Computational Complexity
No computational complexity figures are reported for this checkpoint.
More information
💡 Usage Tip
We strongly recommend reading the original paper, Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, carefully to get a more nuanced understanding of this model checkpoint.
As explained in the following issue, checkpoints including the sh or skv model architecture variations have not been ported to Transformers, as they are probably of limited practical use and lack a more detailed description. Those checkpoints are kept here as they might be ported in the future.
📄 License
The license of this model is apache-2.0.