🚀 T5-Efficient-TINY-NL2 (Deep-Narrow version)
T5-Efficient-TINY-NL2 is a variant of Google's original T5, following the T5 model architecture. It's a pretrained-only checkpoint, released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. This model shows that a Deep-Narrow architecture can offer better downstream performance compared to other models with a similar parameter count.
✨ Features
The paper suggests that increasing the model depth (number of stacked transformer blocks) before other dimensions can lead to better Pareto-efficiency. Specifically, a deep and narrow model is generally more efficient than a base or large model. However, the relative gain in efficiency diminishes as more layers are added, converging at 32-36 layers.
📚 Documentation
Model architecture details
The t5-efficient-tiny-nl2 checkpoint is a Tiny model with nl set to 2. It has 11.9 million parameters, requiring ca. 47.61 MB of memory in full precision (fp32) or 23.81 MB in half precision (fp16 or bf16).
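As a quick sanity check on these numbers, here is a minimal sketch (assuming the checkpoint is available on the Hugging Face Hub under the id `google/t5-efficient-tiny-nl2`) that loads the model and derives the parameter count and memory footprint:

```python
# Sketch: load the checkpoint and estimate its memory footprint.
# The Hub id "google/t5-efficient-tiny-nl2" is assumed; adjust it if the checkpoint lives elsewhere.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-tiny-nl2")

num_params = sum(p.numel() for p in model.parameters())
fp32_mb = num_params * 4 / 1024**2  # 4 bytes per parameter in full precision
fp16_mb = num_params * 2 / 1024**2  # 2 bytes per parameter in half precision

print(f"Parameters: {num_params / 1e6:.1f}M")
print(f"Memory: {fp32_mb:.2f} MB (fp32), {fp16_mb:.2f} MB (fp16/bf16)")
```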
Here's a summary of the original T5 model architectures:
| Model | nl (el/dl) | ff    | dm   | kv  | nh  | #Params |
|-------|------------|-------|------|-----|-----|---------|
| Tiny  | 4/4        | 1024  | 256  | 32  | 4   | 16M     |
| Mini  | 4/4        | 1536  | 384  | 32  | 8   | 31M     |
| Small | 6/6        | 2048  | 512  | 32  | 8   | 60M     |
| Base  | 12/12      | 3072  | 768  | 64  | 12  | 220M    |
| Large | 24/24      | 4096  | 1024 | 64  | 16  | 738M    |
| Xl    | 24/24      | 16384 | 1024 | 128 | 32  | 3B      |
| XXl   | 24/24      | 65536 | 1024 | 128 | 128 | 11B     |
Abbreviations used:
| Property | Details |
|----------|---------|
| nl  | Number of transformer blocks (depth) |
| dm  | Dimension of embedding vector (output vector of transformer block) |
| kv  | Dimension of key/value projection matrix |
| nh  | Number of attention heads |
| ff  | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el  | Number of transformer blocks in the encoder (encoder depth) |
| dl  | Number of transformer blocks in the decoder (decoder depth) |
| sh  | Signifies that attention heads are shared |
| skv | Signifies that key/value projection matrices are tied |
If a model checkpoint has no specific el or dl, both encoder and decoder layers equal nl.
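In Transformers, these properties correspond to fields on the model's `T5Config`. A short sketch (again assuming the `google/t5-efficient-tiny-nl2` Hub id) to inspect them:

```python
# Sketch: map the table's abbreviations onto T5Config attributes.
from transformers import T5Config

config = T5Config.from_pretrained("google/t5-efficient-tiny-nl2")

print("el (num_layers):        ", config.num_layers)          # encoder depth
print("dl (num_decoder_layers):", config.num_decoder_layers)  # decoder depth
print("dm (d_model):           ", config.d_model)             # embedding dimension
print("kv (d_kv):              ", config.d_kv)                # key/value projection size
print("nh (num_heads):         ", config.num_heads)           # attention heads
print("ff (d_ff):              ", config.d_ff)                # feed-forward dimension
```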
Pre-Training
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524,288 steps using the span-based masked language modeling (MLM) objective.
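For intuition, span-based MLM replaces contiguous spans of the input with sentinel tokens and trains the model to generate the dropped-out spans. The hand-crafted example below illustrates the input/target format only; the actual pretraining pipeline samples spans randomly, and the Hub id is an assumption:

```python
# Sketch: T5-style span-corruption formatting (spans chosen by hand for illustration).
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-tiny-nl2"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Original sentence: "The cute dog walks in the park"
# Corrupted input: masked spans are replaced by sentinel tokens <extra_id_0>, <extra_id_1>, ...
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
# Target: each sentinel token is followed by the span it replaced.
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

loss = model(input_ids=input_ids, labels=labels).loss  # denoising loss for this pair
print(f"loss: {loss.item():.3f}")
```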
Fine-Tuning
⚠️ Important Note
This model is a pretrained checkpoint and needs to be fine-tuned for practical use. It was pretrained in English, so it's only suitable for English NLP tasks.
You can follow these examples for fine-tuning (a minimal PyTorch sketch is also included after the lists below):
PyTorch:
- Summarization
- [Question Answering](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_seq2seq_qa.py)
- [Text Classification](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification) - Note: You'll need to slightly adapt the training example for an encoder-decoder model.
TensorFlow:
- Summarization
- [Text Classification](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification) - Note: Adapt the training example for an encoder-decoder model.
JAX/Flax:
- Summarization
- [Text Classification](https://github.com/huggingface/transformers/tree/master/examples/flax/text-classification) - Note: Adapt the training example for an encoder-decoder model.
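As a complement to the linked scripts, here is a minimal, self-contained PyTorch fine-tuning sketch on a toy summarization pair. The data, learning rate, and Hub id are illustrative assumptions; the example scripts above remain the recommended starting point for real training:

```python
# Sketch: one seq2seq fine-tuning step on a toy document/summary pair.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/t5-efficient-tiny-nl2"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Toy example purely for illustration.
documents = ["summarize: The quick brown fox jumped over the lazy dog near the river bank."]
summaries = ["A fox jumped over a dog."]

inputs = tokenizer(documents, return_tensors="pt", padding=True, truncation=True)
labels = tokenizer(summaries, return_tensors="pt", padding=True, truncation=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```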
More information
We highly recommend reading the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers for a deeper understanding. As mentioned in this issue, checkpoints with sh or skv architecture variations haven't been ported to Transformers due to limited practical use and lack of detailed description. These checkpoints are stored here and might be ported in the future.
📄 License
This project is licensed under the Apache-2.0 license.