🚀 T5-Spanish-Efficient-TINY (NEW Deep-Narrow Spanish Version - March 2024)
T5-Efficient-TINY is a variation of Google's original T5 that follows the T5 model architecture. This Spanish version has been trained by Javier Albarracín from Quantico AI. The architecture was originally shared in the paper *Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers* by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler.
This version of the model has been trained from scratch on a Spanish dataset. It NEEDS FINE-TUNING, as it has not been trained on any specific task. Its advantage is that it is in Spanish and can be fine-tuned for simple tasks. Due to its relatively low complexity and a weight of under 29 MB, it is well suited to CPU usage.
It comes with its own Spanish tokenizer (lowercase letters only) with a vocabulary of 5,000 tokens.
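As a quick, hypothetical sketch of how the tokenizer can be inspected with the Hugging Face transformers library (the repository id `jalbarracin/T5-spanish-efficient-tiny` below is an assumption; replace it with the actual id of this model on the Hub):

```python
from transformers import AutoTokenizer

# Assumed repository id -- replace with the actual Hub id of this model.
MODEL_ID = "jalbarracin/T5-spanish-efficient-tiny"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(tokenizer.vocab_size)  # expected to be around 5000

# The vocabulary is lowercase-only, so lowercase your text before tokenizing.
print(tokenizer.tokenize("este modelo necesita fine-tuning".lower()))
```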
✨ Features
Model Architecture Details
This model - T5-spanish-efficient-tiny - is of the Tiny type, with variations in the embedding and feed-forward dimensions and in the decoder depth (see the tables below). It has roughly 7 million parameters and requires about 29 MB of memory in full precision (fp32) or 15 MB in half precision (fp16 or bf16).
This Spanish model has been created with a lighter configuration than the original Tiny variant:
| Model | nl (el/dl) | ff  | dm  | kv | nh | #Params |
|-------|------------|-----|-----|----|----|---------|
| This  | 4/3        | 512 | 320 | 64 | 4  | 7M      |
A summary of the original T5 model can be seen below:
| Model | nl (el/dl) | ff    | dm   | kv  | nh  | #Params |
|-------|------------|-------|------|-----|-----|---------|
| Tiny  | 4/4        | 1024  | 256  | 32  | 4   | 16M     |
| Mini  | 4/4        | 1536  | 384  | 32  | 8   | 31M     |
| Small | 6/6        | 2048  | 512  | 32  | 8   | 60M     |
| Base  | 12/12      | 3072  | 768  | 64  | 12  | 220M    |
| Large | 24/24      | 4096  | 1024 | 64  | 16  | 738M    |
| Xl    | 24/24      | 16384 | 1024 | 128 | 32  | 3B      |
| XXl   | 24/24      | 65536 | 1024 | 128 | 128 | 11B     |
The abbreviations used:
| Property | Details |
|----------|---------|
| nl  | Number of transformer blocks (depth) |
| dm  | Dimension of embedding vector (output vector of the transformer block) |
| kv  | Dimension of key/value projection matrix |
| nh  | Number of attention heads |
| ff  | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el  | Number of transformer blocks in the encoder (encoder depth) |
| dl  | Number of transformer blocks in the decoder (decoder depth) |
| sh  | Signifies that attention heads are shared |
| skv | Signifies that key-value projection matrices are tied |
If a model checkpoint has no specific el or dl value, both the number of encoder layers and the number of decoder layers correspond to nl.
Pre-Training
It has been pre-trained on 2 million randomly sampled records from the Spanish version of the MSMARCO dataset.
Fine-Tuning
⚠️ Important Note
This model requires fine-tuning to work. Here are some examples of how to do it:
PyTorch:
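A minimal fine-tuning sketch with PyTorch and the Hugging Face transformers library is shown below. The repository id, the toy sentence pair, and the hyperparameters are illustrative assumptions; in practice you would iterate over a full dataset with a DataLoader for several epochs.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed repository id -- replace with the actual Hub id of this model.
MODEL_ID = "jalbarracin/T5-spanish-efficient-tiny"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
model.train()

# Toy task: the prefix and the sentence pair are made up for this example.
sources = ["resumir: el modelo es pequeño y funciona bien en cpu"]
targets = ["modelo pequeño para cpu"]

inputs = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
labels = tokenizer(targets, return_tensors="pt", padding=True, truncation=True).input_ids
# Padding tokens in the labels should not contribute to the loss.
labels[labels == tokenizer.pad_token_id] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

for step in range(10):  # a real run would iterate over a DataLoader for several epochs
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("t5-spanish-tiny-finetuned")
tokenizer.save_pretrained("t5-spanish-tiny-finetuned")
```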
TensorFlow:
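A comparable sketch in TensorFlow, under the same assumptions. `from_pt=True` converts the PyTorch weights on the fly if no native TensorFlow checkpoint is published.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

MODEL_ID = "jalbarracin/T5-spanish-efficient-tiny"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = TFT5ForConditionalGeneration.from_pretrained(MODEL_ID, from_pt=True)

# Toy sentence pair, for illustration only.
sources = ["resumir: el modelo es pequeño y funciona bien en cpu"]
targets = ["modelo pequeño para cpu"]

inputs = tokenizer(sources, return_tensors="tf", padding=True, truncation=True)
labels = tokenizer(targets, return_tensors="tf", padding=True, truncation=True).input_ids

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)

for step in range(10):  # a real run would iterate over a tf.data.Dataset
    with tf.GradientTape() as tape:
        outputs = model(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=labels,
        )
        loss = tf.reduce_mean(outputs.loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

model.save_pretrained("t5-spanish-tiny-finetuned-tf")
```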
JAX/Flax:
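And a JAX/Flax sketch of a single training step with optax, again under the same assumptions. The token-shifting helper and the toy data are simplified for illustration; a real training loop would jit the step function and iterate over batches.

```python
import jax
import jax.numpy as jnp
import optax
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

MODEL_ID = "jalbarracin/T5-spanish-efficient-tiny"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# from_pt=True converts PyTorch weights if no native Flax checkpoint exists.
model = FlaxT5ForConditionalGeneration.from_pretrained(MODEL_ID, from_pt=True)

# Toy sentence pair, for illustration only.
sources = ["resumir: el modelo es pequeño y funciona bien en cpu"]
targets = ["modelo pequeño para cpu"]

inputs = tokenizer(sources, return_tensors="np", padding=True, truncation=True)
labels = jnp.array(tokenizer(targets, return_tensors="np", padding=True, truncation=True).input_ids)

def shift_right(labels, decoder_start_token_id):
    # T5 decoder inputs are the labels shifted one position to the right.
    shifted = jnp.zeros_like(labels).at[:, 0].set(decoder_start_token_id)
    return shifted.at[:, 1:].set(labels[:, :-1])

decoder_input_ids = shift_right(labels, model.config.decoder_start_token_id)

def loss_fn(params):
    logits = model(
        input_ids=jnp.array(inputs.input_ids),
        attention_mask=jnp.array(inputs.attention_mask),
        decoder_input_ids=decoder_input_ids,
        params=params,
    ).logits
    return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()

tx = optax.adamw(learning_rate=5e-4)
opt_state = tx.init(model.params)

# One illustrative gradient step; a real loop would jax.jit this and iterate over batches.
grads = jax.grad(loss_fn)(model.params)
updates, opt_state = tx.update(grads, opt_state, model.params)
model.params = optax.apply_updates(model.params, updates)
```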
📄 License
This project is licensed under the Apache-2.0 license.