T5-Efficient-TINY (Deep-Narrow version)
T5-Efficient-TINY is a variant of Google's original T5, adhering to the T5 model architecture. It's a pretrained-only checkpoint, released alongside the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler.
In essence, the paper suggests that a Deep-Narrow model architecture outperforms other architectures with a similar parameter count in downstream tasks.
Here's a quote from the paper:
We generally recommend a DeepNarrow strategy where the model's depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally be more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.
More precisely, model depth is defined as the number of sequentially stacked transformer blocks. A sequence of word embeddings is processed sequentially by each transformer block.
🚀 Quick Start
✨ Features
- A Deep-Narrow variant of the T5 model architecture, which outperforms other architectures with a similar parameter count on downstream tasks.
- Pretrained on a large-scale English corpus (C4) and suitable for English NLP tasks after fine-tuning.
📦 Installation
As this is a model checkpoint in the Hugging Face ecosystem, you can install the necessary dependencies using pip:

```bash
pip install transformers
```
💻 Usage Examples
Since this is a pretrained-only checkpoint, it has to be fine-tuned before use. Fine-tuning examples exist for several frameworks:
- PyTorch
- TensorFlow
- JAX/Flax

A minimal PyTorch fine-tuning sketch is shown below.
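The following is a minimal sketch of a single fine-tuning step in PyTorch, assuming the checkpoint is available on the Hugging Face Hub as `google/t5-efficient-tiny` and using one toy summarization pair in place of a real dataset; it illustrates the text-to-text API rather than a complete training script.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/t5-efficient-tiny"  # assumed Hub name for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5 frames every task as text-to-text: a toy summarization pair.
inputs = tokenizer(
    "summarize: The T5 paper studies how to scale Transformers efficiently.",
    return_tensors="pt",
)
labels = tokenizer("T5 scaling study.", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model.train()
outputs = model(**inputs, labels=labels)  # passing labels yields the seq2seq LM loss
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.3f}")
```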
📚 Documentation
Detailed model architecture
This model checkpoint - t5-efficient-tiny - is of model type Tiny with no variations. It has 15.58 million parameters and requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16).
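The figures above can be double-checked with a short snippet. This is a sketch that assumes the Hub name `google/t5-efficient-tiny` and estimates memory simply as number of parameters times bytes per value.

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-tiny")
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f}M")        # ~15.58M
print(f"fp32 size:  {n_params * 4 / 1e6:.2f} MB")  # 4 bytes per parameter
print(f"fp16 size:  {n_params * 2 / 1e6:.2f} MB")  # 2 bytes per parameter
```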
Here's a summary of the original T5 model architectures:
| Model | nl (el/dl) | ff    | dm   | kv  | nh  | #Params |
|-------|------------|-------|------|-----|-----|---------|
| Tiny  | 4/4        | 1024  | 256  | 32  | 4   | 16M     |
| Mini  | 4/4        | 1536  | 384  | 32  | 8   | 31M     |
| Small | 6/6        | 2048  | 512  | 32  | 8   | 60M     |
| Base  | 12/12      | 3072  | 768  | 64  | 12  | 220M    |
| Large | 24/24      | 4096  | 1024 | 64  | 16  | 738M    |
| Xl    | 24/24      | 16384 | 1024 | 128 | 32  | 3B      |
| XXl   | 24/24      | 65536 | 1024 | 128 | 128 | 11B     |
Here are the definitions of the abbreviations:
| Property | Details |
|----------|---------|
| nl  | Number of transformer blocks (depth) |
| dm  | Dimension of embedding vector (output vector of transformer block) |
| kv  | Dimension of key/value projection matrix |
| nh  | Number of attention heads |
| ff  | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el  | Number of transformer blocks in the encoder (encoder depth) |
| dl  | Number of transformer blocks in the decoder (decoder depth) |
| sh  | Signifies that attention heads are shared |
| skv | Signifies that key-value projection matrices are tied |
If a model checkpoint does not specify a separate el or dl, the number of encoder layers and the number of decoder layers both equal nl.
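For this checkpoint, the abbreviations above map onto Hugging Face config fields as sketched below (field names follow transformers' `T5Config`; the Hub name `google/t5-efficient-tiny` is assumed):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/t5-efficient-tiny")
print("nl (el/dl):", config.num_layers, "/", config.num_decoder_layers)  # 4 / 4
print("ff:", config.d_ff)       # 1024
print("dm:", config.d_model)    # 256
print("kv:", config.d_kv)       # 32
print("nh:", config.num_heads)  # 4
```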
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
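To illustrate the span-based MLM objective, the sketch below shows the input/target format T5 uses for span corruption, where dropped spans are replaced by sentinel tokens such as `<extra_id_0>`. This is a hand-written illustration, not the original pre-training code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-tiny")

# Corrupted spans in the input are replaced by sentinels; the target
# reconstructs the dropped spans after the matching sentinels.
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week."
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

print(tokenizer(corrupted_input).input_ids)
print(tokenizer(target).input_ids)
```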
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and must be fine-tuned for practical use. It was pretrained in English, so it's only suitable for English NLP tasks.
🔧 Technical Details
The concept of model depth is crucial in this model. It refers to the number of sequentially stacked transformer blocks, which process word embeddings sequentially. The paper suggests that increasing model depth can lead to better Pareto-efficiency, especially in the range of 32 to 36 layers.
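As a concrete illustration, the stacked blocks can be inspected directly on the loaded model. This is a sketch; the `encoder.block` / `decoder.block` attribute names follow the transformers T5 implementation, and the Hub name `google/t5-efficient-tiny` is assumed.

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-tiny")
# Each stack holds its transformer blocks in a ModuleList.
print("encoder blocks:", len(model.encoder.block))  # 4
print("decoder blocks:", len(model.decoder.block))  # 4
```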
📄 License
This model is released under the Apache 2.0 license.