# 🌍 HattoFlanT5-Large: A Multilingual Model
HattoFlanT5-Large is a multilingual model that extends the vocabulary of Flan-T5 and conducts continual pretraining on a large-scale dataset. It supports multiple languages including Vietnamese, English, and Chinese, and can be applied to various NLP tasks such as summarization, translation, and question answering.
## 🚀 Quick Start
### Installation
This model is based on the `transformers` library. You can install it using the following command:

```bash
pip install transformers
```
### Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")

# Move the model to GPU (requires a CUDA-capable device).
model.cuda()
```
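Once loaded, the model can be queried through `generate`. A minimal inference sketch is shown below; the Vietnamese summarization-style prompt and the decoding settings are illustrative assumptions, not part of the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Illustrative Vietnamese summarization-style prompt (assumed format).
text = "Tóm tắt: Hà Nội là thủ đô của Việt Nam, nằm ở phía bắc đất nước."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```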
## ✨ Features
- **Extended Vocabulary**: Used SentencePiece to retrain a tokenizer for Vietnamese, English, and Chinese, and merged it with Flan-T5's original vocabulary, resulting in a vocabulary of 106,611 tokens.
- **Continual Pretraining**: Conducted single-epoch continual pretraining on a dataset of more than 100 GB, including news, Wikipedia, books, and legal documents in multiple languages.
- **Multilingual Support**: Supports Vietnamese, English, and Chinese, and can be used for various NLP tasks such as summarization, translation, and question answering.
## 📚 Documentation
### Extend Vocabulary and Pretrain
We used SentencePiece to retrain a tokenizer for Vietnamese, English, and Chinese. The vocabulary of this newly trained tokenizer was then merged with Flan-T5's original vocabulary, removing duplicate tokens. The final merged vocabulary has 106,611 tokens.
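The merging script itself is not published here; the snippet below is only a rough sketch of the idea, where `vi_en_zh.model` is a hypothetical path to the retrained SentencePiece model and new pieces are added to the Flan-T5 tokenizer via `add_tokens` rather than a true SentencePiece-level merge:

```python
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the original Flan-T5 tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Load a SentencePiece model retrained on Vietnamese/English/Chinese text
# ("vi_en_zh.model" is a hypothetical file name).
sp = spm.SentencePieceProcessor(model_file="vi_en_zh.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Keep only pieces Flan-T5 does not already know, then add them.
existing = set(tokenizer.get_vocab().keys())
to_add = [p for p in new_pieces if p not in existing]
tokenizer.add_tokens(to_add)

# Grow the embedding matrix so the new token IDs have embedding rows.
model.resize_token_embeddings(len(tokenizer))
```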
For single-epoch continual pretraining (also known as incremental pretraining), we used the Flan-T5-Large model. This pretraining was carried out on a diverse dataset of more than 100 GB, which includes the sources listed below (a simplified illustration of the pretraining objective follows the list):
- NewsCorpus
- Vietnamese Wikipedia
- Vietnamese books
- Vietnamese legal documents
- Vietnamese legal text
- English Wikipedia
- Chinese Text
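The exact pretraining recipe is not published in this card. The following is only a minimal sketch of single-epoch continual pretraining with a heavily simplified T5-style span-corruption objective; `corpus.txt` is a hypothetical stand-in for the >100 GB multilingual corpus, and the masking logic and hyperparameters are assumptions:

```python
import random
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()

def span_corrupt(text, span_len=3):
    """Very simplified T5-style span corruption: hide one word span behind a
    sentinel token and train the model to reconstruct it."""
    words = text.split()
    start = random.randrange(max(1, len(words) - span_len))
    masked = words[:start] + ["<extra_id_0>"] + words[start + span_len:]
    target = "<extra_id_0> " + " ".join(words[start:start + span_len]) + " <extra_id_1>"
    return " ".join(masked), target

# Single pass over the corpus (single-epoch continual pretraining).
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        src, tgt = span_corrupt(line.strip())
        batch = tokenizer(src, return_tensors="pt", truncation=True, max_length=512)
        labels = tokenizer(tgt, return_tensors="pt", truncation=True, max_length=128).input_ids
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```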
### Finetune and Benchmark
The model was fine-tuned and benchmarked on the following datasets (a fine-tuning sketch follows the list):
- Wikilingua
- Vietnews
- Pho_NER
- .....
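The fine-tuning setup is not detailed in this card. Below is a hedged sketch of summarization fine-tuning with `Seq2SeqTrainer`; the tiny in-memory dataset, column names, output directory, and hyperparameters are placeholders rather than the actual Wikilingua or Vietnews configuration:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")

# Toy in-memory dataset; in practice this would be Wikilingua/Vietnews loaded
# via the `datasets` library with their real column names.
train = Dataset.from_dict({
    "document": ["Hà Nội là thủ đô của Việt Nam, nằm ở phía bắc đất nước ..."],
    "summary": ["Hà Nội là thủ đô của Việt Nam."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = train.map(preprocess, batched=True, remove_columns=train.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="hatto-flant5-summarization",  # placeholder output path
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=3e-5,
        predict_with_generate=True,
    ),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```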
## 📄 License
This project is licensed under the Apache-2.0 license.
## 📋 Model Information
| Property | Details |
|----------|---------|
| Model Type | Flan-T5-Large with extended vocabulary |
| Training Data | NewsCorpus, Vietnamese Wikipedia, Vietnamese books, Vietnamese legal documents, Vietnamese legal text, English Wikipedia, Chinese text |
| Supported Languages | Vietnamese, English, Chinese |
| Library Name | transformers |
| Tags | t5, flant5, summarization, translation, question-answering |
| Pipeline Tag | fill-mask |