# T5-Efficient-LARGE-NH32 (Deep-Narrow version)
T5-Efficient-LARGE-NH32 is a variation of Google's original T5 that follows the T5 model architecture. It is a pretrained-only checkpoint released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. This model demonstrates that a Deep-Narrow architecture can offer better downstream performance compared to other architectures with a similar parameter count.
## Quick Start
This is a pretrained-only model. To use it for practical NLP tasks, you need to fine-tune it. You can follow the fine-tuning examples provided in different frameworks below.
## Features
- Deep-Narrow Architecture: The paper suggests that increasing model depth before other forms of scaling can lead to better Pareto-efficiency in terms of params, FLOPs, or throughput.
- Pretrained on C4: The model was pretrained on the Colossal, Cleaned version of Common Crawl (C4) using span-based masked language modeling.
## Installation
No specific installation steps are provided in the original README.
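In practice, loading the checkpoint is typically done with the Hugging Face `transformers` library, together with `sentencepiece` for the T5 tokenizer and a backend such as PyTorch. The package list below is an assumption based on standard T5 usage, not a requirement stated in the original README:

```python
# Assumed prerequisites for standard T5 usage (not listed in the original README):
#   pip install transformers sentencepiece torch
import torch
import transformers

# Quick sanity check that the libraries are importable.
print("transformers", transformers.__version__)
print("torch", torch.__version__)
```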
## Usage Examples
Since this is a pretrained-only model, you need to fine-tune it for specific tasks. Fine-tuning examples are available for the following frameworks (a minimal PyTorch loading sketch follows the list):

- PyTorch
- TensorFlow
- JAX/Flax
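As a minimal PyTorch sketch, the checkpoint can be loaded like any other T5 model from the Hugging Face Hub. The hub ID `google/t5-efficient-large-nh32` is assumed from the checkpoint name; the loaded weights are pretrained-only and still need fine-tuning before they are useful on a downstream task:

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Hub ID assumed from the checkpoint name; adjust if your copy lives elsewhere.
model_id = "google/t5-efficient-large-nh32"

tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Sanity check: the parameter count should be close to the ~1039.72M reported below.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f}M")
```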
## Documentation
### Model architecture details
This model checkpoint - t5-efficient-large-nh32 - is of model type Large with the following variation: nh is 32 (the baseline Large configuration uses 16 attention heads; see the table below).
It has 1039.72 million parameters and thus requires ca. 4158.86 MB of memory in full precision (fp32) or 2079.43 MB of memory in half precision (fp16 or bf16).
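These memory figures follow directly from the parameter count: 4 bytes per parameter in fp32 and 2 bytes in fp16/bf16, counting weights only (activations, gradients, and optimizer state come on top). A quick back-of-the-envelope check:

```python
# Weights-only memory estimate; excludes activations, gradients and optimizer state.
n_params = 1039.72e6              # reported parameter count for t5-efficient-large-nh32

fp32_mb = n_params * 4 / 1e6      # 4 bytes/param -> ~4158.9 MB
fp16_mb = n_params * 2 / 1e6      # 2 bytes/param -> ~2079.4 MB
print(f"fp32: {fp32_mb:.2f} MB, fp16/bf16: {fp16_mb:.2f} MB")
```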
A summary of the original T5 model architectures can be seen here:
| Model | nl (el/dl) | ff    | dm   | kv  | nh  | #Params |
|-------|------------|-------|------|-----|-----|---------|
| Tiny  | 4/4        | 1024  | 256  | 32  | 4   | 16M     |
| Mini  | 4/4        | 1536  | 384  | 32  | 8   | 31M     |
| Small | 6/6        | 2048  | 512  | 32  | 8   | 60M     |
| Base  | 12/12      | 3072  | 768  | 64  | 12  | 220M    |
| Large | 24/24      | 4096  | 1024 | 64  | 16  | 738M    |
| Xl    | 24/24      | 16384 | 1024 | 128 | 32  | 3B      |
| XXl   | 24/24      | 65536 | 1024 | 128 | 128 | 11B     |
The following abbreviations are used:
| Property | Details |
|----------|---------|
| nl  | Number of transformer blocks (depth) |
| dm  | Dimension of embedding vector (output vector of transformer block) |
| kv  | Dimension of key/value projection matrix |
| nh  | Number of attention heads |
| ff  | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el  | Number of transformer blocks in the encoder (encoder depth) |
| dl  | Number of transformer blocks in the decoder (decoder depth) |
| sh  | Signifies that attention heads are shared |
| skv | Signifies that key-value projection matrices are tied |
If a model checkpoint has no specific el or dl, then both the number of encoder and decoder layers correspond to nl.
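These abbreviations correspond to fields of the Hugging Face `T5Config`, so the values for this checkpoint can be inspected programmatically (hub ID again assumed from the checkpoint name):

```python
from transformers import T5Config

config = T5Config.from_pretrained("google/t5-efficient-large-nh32")

# Mapping between the table's abbreviations and T5Config attributes.
print("el (encoder depth):   ", config.num_layers)
print("dl (decoder depth):   ", config.num_decoder_layers)
print("dm (embedding dim):   ", config.d_model)
print("kv (key/value dim):   ", config.d_kv)
print("nh (attention heads): ", config.num_heads)
print("ff (feed-forward dim):", config.d_ff)
```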
### Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
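To illustrate the span-based MLM objective: contiguous spans of the input are dropped and replaced by sentinel tokens, and the model is trained to generate the dropped spans, each preceded by its sentinel. A minimal sketch of this input/target format (the example sentence is arbitrary; the hub ID is assumed from the checkpoint name):

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/t5-efficient-large-nh32")

# Original text:  "The cute dog walks in the green park."
# Dropped spans:  "cute dog" and "green", replaced by sentinel tokens.
corrupted_input = "The <extra_id_0> walks in the <extra_id_1> park."
target = "<extra_id_0> cute dog <extra_id_1> green <extra_id_2>"

print(tokenizer(corrupted_input).input_ids)
print(tokenizer(target).input_ids)
```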
### Fine-Tuning
Note: This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks.
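A minimal PyTorch fine-tuning sketch on a single toy summarization-style example; the task prefix, learning rate, and data are placeholders for illustration, not recommendations from the paper:

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

model_id = "google/t5-efficient-large-nh32"   # hub ID assumed from the checkpoint name
tokenizer = T5TokenizerFast.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy example; in practice, loop over a real English dataset for your task.
inputs = tokenizer("summarize: The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
labels = tokenizer("A fox jumps over a dog.", return_tensors="pt").input_ids

outputs = model(**inputs, labels=labels)   # cross-entropy loss over the target tokens
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {outputs.loss.item():.3f}")
```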
### More information
We strongly recommend that the reader go carefully through the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers to get a more nuanced understanding of this model checkpoint. As explained in the following issue, checkpoints that include the sh or skv architecture variations have not been ported to Transformers, as they are probably of limited practical use and lack a more detailed description. Those checkpoints are kept here as they might be ported in the future.
## Technical Details
Model depth is defined here as the number of transformer blocks that are stacked sequentially; a sequence of word embeddings is therefore processed sequentially by each transformer block.
The paper indicates that a Deep-Narrow model architecture is favorable for downstream performance compared to other model architectures of similar parameter count. To quote the paper:
> We generally recommend a DeepNarrow strategy where the model's depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally be more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.
## License
This model is released under the apache-2.0 license.