Bengali T5 Base
A T5 base model pretrained on the Bengali portion of mC4 (the multilingual corpus used to train mT5), developed by the Hugging Face community
Downloads: 57
Release date: 3/2/2022
Model Overview
This is a T5 base model trained specifically for the Bengali language. It was pretrained with a denoising (span-corruption) objective and is intended as a foundation for fine-tuning on downstream tasks.
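To make the denoising objective concrete, below is a minimal pure-Python sketch of T5-style span corruption: contiguous spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it hid. The function name and parameters are illustrative, not part of this model's released code.

```python
import random

def span_corrupt(tokens, mask_ratio=0.15, mean_span=3, seed=0):
    """Corrupt roughly mask_ratio of the tokens in contiguous spans,
    T5-style: each masked span becomes one sentinel (<extra_id_N>) in
    the input, and the target pairs each sentinel with the original
    tokens it replaced."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    inp, tgt, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if n_mask > 0 and rng.random() < mask_ratio:
            span = min(mean_span, n_mask, len(tokens) - i)
            marker = f"<extra_id_{sentinel}>"
            inp.append(marker)          # input keeps only the sentinel
            tgt.append(marker)          # target: sentinel + hidden span
            tgt.extend(tokens[i:i + span])
            i += span
            n_mask -= span
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt
```

During pretraining the model reads the corrupted input and learns to emit the target sequence, reconstructing the masked spans.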
Model Features
Bengali-specific
A pretrained model optimized specifically for the Bengali language
Large-scale training
Trained on approximately 11 billion tokens of Bengali text
TPU-accelerated training
Trained efficiently using Google's TPU accelerators
Model Capabilities
Text denoising
Language model pretraining
Bengali text processing
Use Cases
Natural Language Processing
Bengali text generation
Can serve as a base model for Bengali text generation; because it was pretrained only with a denoising objective, prefix language-model fine-tuning is required before it can generate text
Downstream task fine-tuning
Can be used as a base model for various Bengali NLP tasks
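As a rough illustration of how downstream fine-tuning works, T5 casts every task into a text-to-text format: the input is a task prefix plus the source text, and the model is trained to emit the target text. The helper below is a hypothetical sketch; the prefix strings and example pairs are placeholders, and real fine-tuning would use a labelled Bengali dataset with the tokenizer and model from the Hub.

```python
def to_text_to_text(task_prefix, source, target):
    """Cast one supervised example into T5's text-to-text format.
    The model is fine-tuned to map the prefixed input to the target."""
    return {"input": f"{task_prefix}: {source}", "target": target}

# Hypothetical placeholder examples (not from the model's training data).
pairs = [
    to_text_to_text("summarize", "a long Bengali article ...", "a short summary ..."),
    to_text_to_text("translate Bengali to English", "a Bengali sentence ...", "an English sentence ..."),
]
```

The same base checkpoint can be fine-tuned for many tasks this way simply by changing the task prefix and the target texts.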