# T5ForSequenceClassification
T5ForSequenceClassification adapts the original T5 architecture for sequence classification tasks. T5 was initially designed for text-to-text tasks and can handle any NLP task once it is cast into a text-to-text format, including sequence classification. By removing the decoder, this model halves the original parameter count and is efficiently optimized for sequence classification.
## Quick Start
T5ForSequenceClassification supports zero-shot classification tasks. It can be used directly for:
- Topic classification
- Intent recognition
- Boolean question answering
- Sentiment analysis
- Any other text classification task
Since the T5ForClassification class is not currently supported by the transformers library, you cannot use this model directly from the Hub. To use T5ForSequenceClassification, you need to install additional packages and model weights; instructions are available [here](https://github.com/AntoineBlanot/zero-nlp).
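Entailment-based zero-shot classifiers are typically fed a premise/hypothesis pair, one pair per candidate label. The sketch below shows this input framing; the exact template used by the zero-nlp package is an assumption, so treat `build_nli_input` as a hypothetical helper, not the package's API.

```python
# Hypothetical sketch: turning a zero-shot topic classification query into
# NLI-style premise/hypothesis pairs, one per candidate label. The template
# string is an illustrative assumption, not the zero-nlp package's format.
def build_nli_input(text: str, candidate_label: str) -> str:
    premise = text
    hypothesis = f"This example is about {candidate_label}."
    return f"Premise: {premise} Hypothesis: {hypothesis}"

labels = ["sports", "politics", "technology"]
inputs = [build_nli_input("The match went to extra time.", lbl) for lbl in labels]
# The model would then score each pair for entailment and pick the best label.
```

The label whose hypothesis receives the highest entailment score is returned as the prediction.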
## Features
### Why use T5ForSequenceClassification?
Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture, such as [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge), perform well on sequence classification tasks but top out at roughly 1.5B parameters. In contrast, models based on the T5 architecture scale up to ~11B parameters, and the architecture continues to benefit from recent innovations.
### T5ForClassification vs T5
T5ForClassification Architecture:
- Encoder: same as the original T5
- Decoder: only the first layer (used for pooling)
- Classification head: a simple Linear layer on top of the decoder
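The architecture above can be sketched numerically: encoder states are pooled by a single decoder layer, then mapped to class logits by one linear layer. The following NumPy toy stands in for that forward pass; the attention-style pooling stub, shapes, and random weights are illustrative assumptions, not the real implementation.

```python
# Minimal numpy sketch of the forward pass: encoder output -> pooling by a
# single decoder query (the role of the kept first decoder layer) -> linear
# classification head. All weights here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_classes, seq_len = 8, 3, 5

encoder_states = rng.normal(size=(seq_len, d_model))  # stand-in for T5 encoder output
query = rng.normal(size=(d_model,))                   # single decoder query token

# Attention-style pooling over the encoder states
scores = encoder_states @ query / np.sqrt(d_model)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
pooled = weights @ encoder_states                     # shape: (d_model,)

# Classification head: one linear layer on top of the pooled representation
W = rng.normal(size=(d_model, num_classes))
b = np.zeros(num_classes)
logits = pooled @ W + b                               # shape: (num_classes,)
```

Because the output is a fixed-size logit vector rather than generated text, predictions are directly interpretable as class scores.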
Benefits and Drawbacks:
- (+) Retains T5's encoding strength
- (+) Halves the parameter size
- (+) Provides interpretable outputs (class logits)
- (+) Avoids generation mistakes and has faster prediction (no generation latency)
- (-) Loses text-to-text ability
## Documentation
Table of Contents
- Usage
- Why use T5ForSequenceClassification?
- T5ForClassification vs T5
- Results
## Technical Details
T5 was originally built for text-to-text tasks and excels at them. It can handle any NLP task that has been converted to a text-to-text format, including sequence classification. You can see [here](https://huggingface.co/google/flan-t5-base?text=Premise%3A++At+my+age+you+will+probably+have+learnt+one+lesson.+Hypothesis%3A++It%27s+not+certain+how+many+lessons+you%27ll+learn+by+your+thirties.+Does+the+premise+entail+the+hypothesis%3F) how the original T5 is used for a sequence classification task.
Our motivation for building T5ForSequenceClassification is that the full original T5 architecture is not needed for most NLU tasks. NLU tasks generally do not require text generation, so a large decoder is unnecessary. By removing the decoder, we can halve the original number of parameters (and thus the computation cost) and efficiently optimize the network for the given task.
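A back-of-the-envelope check of the halving claim: T5's encoder and decoder are roughly symmetric stacks, so keeping the encoder plus one decoder layer retains close to half the network. The function below is a simplified sketch; real counts differ somewhat because decoder layers also carry cross-attention, and the 24-layer figure assumes the public t5-xxl config.

```python
# Rough fraction of transformer layers kept when the decoder is cut down to
# one layer. Treats encoder and decoder layers as equal-sized, which is a
# simplification (decoder layers also contain cross-attention).
def approx_kept_fraction(num_layers: int, kept_decoder_layers: int = 1) -> float:
    total = 2 * num_layers                  # full encoder + full decoder
    kept = num_layers + kept_decoder_layers  # encoder + first decoder layer
    return kept / total

# t5-xxl has 24 encoder and 24 decoder layers, so roughly half survives:
print(round(approx_kept_fraction(24), 3))  # -> 0.521
```

This is why the stripped-down model runs at roughly half the compute of the full text-to-text T5.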
## Results
Results on the validation data of the training tasks:

| Dataset   | Accuracy | F1    |
|-----------|----------|-------|
| MNLI (m)  | 0.923    | 0.923 |
| MNLI (mm) | 0.922    | 0.922 |
| SNLI      | 0.942    | 0.942 |
| SciTail   | 0.966    | 0.647 |
Results on the validation data of unseen tasks (zero-shot):

| Dataset | Accuracy | F1 |
|---------|----------|----|
| ?       | ?        | ?  |
## Acknowledgments
Special thanks to philschmid for providing a Flan-T5-xxl [checkpoint](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16) in fp16.
## Dataset and Metrics
| Property     | Details                                |
|--------------|----------------------------------------|
| Datasets     | multi_nli, snli, scitail               |
| Metrics      | accuracy, f1                           |
| Pipeline Tag | zero-shot-classification               |
| Language     | en                                     |
| Model Index  | AntoineBlanot/flan-t5-xxl-classif-3way |