# 🌍 HattoFlanT5-Large: A Multilingual Model
HattoFlanT5-Large is a multilingual model that extends the vocabulary of Flan-T5 and conducts continual pretraining on a large-scale dataset. It supports multiple languages including Vietnamese, English, and Chinese, and can be applied to various NLP tasks such as summarization, translation, and question answering.
## 🚀 Quick Start
### Installation
This model is based on the `transformers` library. You can install it using the following command:

```bash
pip install transformers
```
### Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")

# Move the model to GPU (requires a CUDA-capable device).
model.cuda()
```
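Once loaded, the model can be queried through `generate`. A minimal inference sketch is shown below; the Vietnamese summarization-style prompt and the decoding settings are illustrative assumptions, not part of the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Illustrative Vietnamese summarization-style prompt (assumed format).
text = "Tóm tắt: Hà Nội là thủ đô của Việt Nam, nằm ở phía bắc đất nước."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```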
## ✨ Features
- **Extended Vocabulary**: Used SentencePiece to retrain a tokenizer for Vietnamese, English, and Chinese, and merged it with Flan-T5's original vocabulary, resulting in a vocabulary of 106,611 tokens.
- **Continual Pretraining**: Conducted single-epoch continual pretraining on a dataset of more than 100 GB, including news, Wikipedia, books, and legal documents in multiple languages.
- **Multilingual Support**: Supports Vietnamese, English, and Chinese, and can be used for various NLP tasks such as summarization, translation, and question answering.
## 📚 Documentation
### Extend Vocabulary and Pretrain
We used SentencePiece to retrain a tokenizer for Vietnamese, English, and Chinese. The vocabulary of this newly trained tokenizer was then merged with Flan-T5's original vocabulary, removing duplicate tokens. The final merged vocabulary has 106,611 tokens.
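The merging script itself is not published here; the snippet below is only a rough sketch of the idea, where `vi_en_zh.model` is a hypothetical path to the retrained SentencePiece model and new pieces are added to the Flan-T5 tokenizer via `add_tokens` rather than a true SentencePiece-level merge:

```python
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the original Flan-T5 tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Load a SentencePiece model retrained on Vietnamese/English/Chinese text
# ("vi_en_zh.model" is a hypothetical file name).
sp = spm.SentencePieceProcessor(model_file="vi_en_zh.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Keep only pieces Flan-T5 does not already know, then add them.
existing = set(tokenizer.get_vocab().keys())
to_add = [p for p in new_pieces if p not in existing]
tokenizer.add_tokens(to_add)

# Grow the embedding matrix so the new token IDs have embedding rows.
model.resize_token_embeddings(len(tokenizer))
```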
For single-epoch continual pretraining (also known as incremental pretraining), we used the Flan-T5-Large model. This pretraining was carried out on a diverse dataset of more than 100 GB, which includes the sources listed below (a simplified illustration of the pretraining objective follows the list):
- NewsCorpus
- Vietnamese Wikipedia
- Vietnamese books
- Vietnamese legal documents
- Vietnamese legal text
- English Wikipedia
- Chinese Text
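The exact pretraining recipe is not published in this card. The following is only a minimal sketch of single-epoch continual pretraining with a heavily simplified T5-style span-corruption objective; `corpus.txt` is a hypothetical stand-in for the >100 GB multilingual corpus, and the masking logic and hyperparameters are assumptions:

```python
import random
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()

def span_corrupt(text, span_len=3):
    """Very simplified T5-style span corruption: hide one word span behind a
    sentinel token and train the model to reconstruct it."""
    words = text.split()
    start = random.randrange(max(1, len(words) - span_len))
    masked = words[:start] + ["<extra_id_0>"] + words[start + span_len:]
    target = "<extra_id_0> " + " ".join(words[start:start + span_len]) + " <extra_id_1>"
    return " ".join(masked), target

# Single pass over the corpus (single-epoch continual pretraining).
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        src, tgt = span_corrupt(line.strip())
        batch = tokenizer(src, return_tensors="pt", truncation=True, max_length=512)
        labels = tokenizer(tgt, return_tensors="pt", truncation=True, max_length=128).input_ids
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```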
### Finetune and Benchmark
The model was fine-tuned and benchmarked on the following datasets (a fine-tuning sketch follows the list):
- Wikilingua
- Vietnews
- Pho_NER
- .....
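The fine-tuning setup is not detailed in this card. Below is a hedged sketch of summarization fine-tuning with `Seq2SeqTrainer`; the tiny in-memory dataset, column names, output directory, and hyperparameters are placeholders rather than the actual Wikilingua or Vietnews configuration:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")

# Toy in-memory dataset; in practice this would be Wikilingua/Vietnews loaded
# via the `datasets` library with their real column names.
train = Dataset.from_dict({
    "document": ["Hà Nội là thủ đô của Việt Nam, nằm ở phía bắc đất nước ..."],
    "summary": ["Hà Nội là thủ đô của Việt Nam."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = train.map(preprocess, batched=True, remove_columns=train.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="hatto-flant5-summarization",  # placeholder output path
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=3e-5,
        predict_with_generate=True,
    ),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```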
## 📄 License
This project is licensed under the Apache-2.0 license.
## 📋 Model Information
| Property | Details |
|----------|---------|
| Model Type | Flan-T5-Large with extended vocabulary |
| Training Data | NewsCorpus, Vietnamese Wikipedia, Vietnamese books, Vietnamese legal documents, Vietnamese legal text, English Wikipedia, Chinese text |
| Supported Languages | Vietnamese, English, Chinese |
| Library Name | transformers |
| Tags | t5, flant5, summarization, translation, question-answering |
| Pipeline Tag | fill-mask |