T5-Efficient-BASE-FF9000 Open-Source Model: Deep and Narrow Architecture Delivers Superior Performance in Downstream Tasks!


Developed by Google

T5-Efficient-BASE-FF9000 is a variant of Google's original T5. It was released with a study showing that a deep-narrow architecture delivers better downstream performance than other architectures with a similar parameter count.

Large Language Model, English
Open Source License: Apache-2.0
#Deep Narrow Architecture #English Pre-training #Efficient Scaling


Release Time: 3/2/2022

Model Overview

This is a pre-trained model based on the T5 architecture, utilizing a deep narrow design strategy that prioritizes increasing model depth for enhanced efficiency. The model is pre-trained on the English C4 dataset and is suitable for various English NLP tasks.

Model Features

Deep Narrow Architecture

Follows the paper's deep-narrow (tall and thin) scaling strategy, under which a tall small model is generally more efficient than the base model. Efficiency is reported along three metrics: parameter count, FLOPs, and speed.

Efficient Pre-training

Pre-trained for 524,288 steps on the Colossal Clean Crawled Corpus (C4), a large-scale cleaned version of Common Crawl, using a span-based masked language modeling (MLM) objective.

Flexible Fine-tuning

Can serve as a base model for fine-tuning on various downstream tasks such as summarization, question answering, and text classification.

Model Capabilities

Text Generation

Text Summarization

Question Answering

Text Classification

Use Cases

Text Generation
Automatic Summarization: automatically generate concise summaries from long documents.

Question Answering
Open-domain Question Answering: answer user questions based on given text.

Text Classification

🚀 T5-Efficient-BASE-FF9000 (Deep-Narrow version)

T5-Efficient-BASE-FF9000 is a variant of Google's original T5, adhering to the T5 model architecture. It's a pretrained-only checkpoint, released alongside the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. This model showcases that a Deep-Narrow architecture can enhance downstream performance compared to other models with similar parameter counts.

🚀 Quick Start

This model is a pretrained-only checkpoint. For practical use, it needs to be fine-tuned. It was pretrained in English, so it's suitable for English NLP tasks. You can refer to the following examples for fine-tuning:

PyTorch

Summarization
Question Answering
Text Classification (Note: adapt the training example for encoder-decoder models.)

TensorFlow

Summarization
Text Classification (Note: adapt the training example for encoder-decoder models.)

JAX/Flax

Summarization
Text Classification (Note: adapt the training example for encoder-decoder models.)
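
As a starting point for any of the fine-tuning examples listed above, the sketch below loads the checkpoint with the Transformers library and computes a sequence-to-sequence loss for one (source, target) pair. It assumes the checkpoint is published on the Hugging Face Hub as google/t5-efficient-base-ff9000 and that transformers, torch, and sentencepiece are installed. The names MODEL_ID, load_checkpoint, and seq2seq_loss are hypothetical helpers introduced for illustration only.

```python
# Assumed Hub identifier for this checkpoint; adjust if it is hosted elsewhere.
MODEL_ID = "google/t5-efficient-base-ff9000"


def load_checkpoint(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and pretrained weights."""
    # Imported lazily so this file can be imported without the dependency.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = T5ForConditionalGeneration.from_pretrained(model_id)
    return tokenizer, model


def seq2seq_loss(tokenizer, model, source, target):
    """Return the cross-entropy loss for a single (source, target) text pair."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=target, return_tensors="pt", truncation=True).input_ids
    # Passing labels makes the model build decoder inputs and compute the loss.
    return model(**inputs, labels=labels).loss
```

A fine-tuning loop would call load_checkpoint() once, then call seq2seq_loss(...) on each training example (or a batched variant of it) and backpropagate the result with an optimizer of your choice.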

✨ Features

The paper suggests that a Deep-Narrow model architecture is more beneficial for downstream performance than other architectures with similar parameter counts. To quote the paper:

"We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider."

📚 Documentation

Model architecture details

This model checkpoint, t5-efficient-base-ff9000, is of the Base model type with the following variation:

ff is 9000

It has 449.42 million parameters, requiring ca. 1797.7 MB of memory in full precision (fp32) or 898.85 MB of memory in half precision (fp16 or bf16).
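
The memory figures above follow from multiplying the parameter count by the number of bytes per parameter. The short sketch below reproduces that arithmetic, assuming 1 MB = 10^6 bytes. The helper name weight_memory_mb is hypothetical, and the result covers the weights only, not activations, gradients, or optimizer state.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_mb(num_params, dtype):
    """Memory for the model weights alone, in MB (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


NUM_PARAMS = 449.42e6  # rounded parameter count quoted above

for dtype in ("fp32", "fp16", "bf16"):
    print(f"{dtype}: {weight_memory_mb(NUM_PARAMS, dtype):.2f} MB")
```

Because the script starts from the rounded count of 449.42 million, its results can differ from the quoted figures in the last digit.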

Model Architecture Table

Model	nl (el/dl)	ff	dm	kv	nh	#Params
Tiny	4/4	1024	256	32	4	16M
Mini	4/4	1536	384	32	8	31M
Small	6/6	2048	512	32	8	60M
Base	12/12	3072	768	64	12	220M
Large	24/24	4096	1024	64	16	738M
Xl	24/24	16384	1024	128	32	3B
XXl	24/24	65536	1024	128	128	11B

Abbreviation Definitions

Abbreviation	Definition
nl	Number of transformer blocks (depth)
dm	Dimension of embedding vector (output vector of a transformer block)
kv	Dimension of key/value projection matrix
nh	Number of attention heads
ff	Dimension of intermediate vector within transformer block (size of feed-forward projection matrix)
el	Number of transformer blocks in the encoder (encoder depth)
dl	Number of transformer blocks in the decoder (decoder depth)
sh	Signifies that attention heads are shared
skv	Signifies that key-value projection matrices are tied

If a model checkpoint has no specific el or dl, the number of encoder layers and the number of decoder layers both correspond to nl.
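
To see how the quantities in the table combine into a parameter count, the following rough estimator may help. It is a sketch under stated assumptions, not the paper's or the library's own counting method: it assumes a vocabulary of 32,128 entries, input and output embeddings that share one matrix, a feed-forward block with two bias-free projection matrices, and it ignores layer-norm weights and relative position biases. The function name estimate_t5_params is hypothetical.

```python
def estimate_t5_params(el, dl, ff, dm, kv, nh, vocab_size=32128):
    """Rough encoder-decoder parameter count from the table's abbreviations."""
    inner = kv * nh                   # total width of all attention heads
    attention = 4 * dm * inner        # query, key, value and output projections
    feed_forward = 2 * dm * ff        # up-projection and down-projection
    encoder = el * (attention + feed_forward)
    # Each decoder block has self-attention plus cross-attention.
    decoder = dl * (2 * attention + feed_forward)
    embeddings = vocab_size * dm      # assumed shared between input and output
    return embeddings + encoder + decoder


base = estimate_t5_params(el=12, dl=12, ff=3072, dm=768, kv=64, nh=12)
large = estimate_t5_params(el=24, dl=24, ff=4096, dm=1024, kv=64, nh=16)
base_ff9000 = estimate_t5_params(el=12, dl=12, ff=9000, dm=768, kv=64, nh=12)

for name, count in [("Base", base), ("Large", large), ("Base, ff=9000", base_ff9000)]:
    print(f"{name}: {count / 1e6:.1f}M parameters (estimate)")
```

Under these assumptions the Base and Large rows come out within about 2% of the table, and the ff = 9000 variant lands within about 2% of the 449.42 million figure quoted above. Treat the result as a ballpark rather than an exact count.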

Pre-Training

The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
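
To make the span-based objective concrete, here is a minimal sketch of how one training example could be built: selected spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the text it hides. The sentinel names of the form <extra_id_N> follow the T5 vocabulary convention. The span positions are hand-picked for illustration (real pre-training samples them randomly), whitespace splitting stands in for the real tokenizer, and span_corrupt is a hypothetical helper.

```python
def span_corrupt(tokens, spans):
    """Build (input, target) token lists for span-based masking.

    spans must be sorted, non-overlapping (start, length) pairs.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel)               # replace the span by one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])  # the hidden text
        cursor = start + length
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, [(2, 2), (8, 1)])
print(" ".join(corrupted))
print(" ".join(target))
```

The model reads the corrupted sequence in the encoder and is trained to generate the target sequence in the decoder.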

Fine-Tuning

Note: This model is a pretrained checkpoint and must be fine-tuned for practical use. It was pretrained in English, so it's only useful for English NLP tasks.

More information

We highly recommend reading the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers for a more in-depth understanding of this model checkpoint. As explained in the issue, checkpoints with sh or skv model architecture variations haven't been ported to Transformers due to limited practical use and lack of detailed descriptions. These checkpoints are stored here and might be ported in the future.

📄 License

This model is released under the Apache-2.0 license.
