GPT-2-Tamil Open-Source Pre-trained Language Model - Free Deployment for Tamil Text Generation

GPT-2 Tamil

Developed by abinayam

Tamil pretrained language model based on GPT-2 architecture, supporting text generation tasks

Large Language Model

Transformers

Other

Tags: Tamil text generation, Causal language modeling, Multi-dataset pretraining

Downloads 292

Release Time: 3/2/2022

Model Overview

This is a GPT-2 model trained specifically for the Tamil language, suitable for text generation and as a base for fine-tuning. The model is trained with the Flax/JAX framework and can be converted to PyTorch format for use.

Model Features

Tamil Optimization

Trained specifically on Tamil text and aimed at Tamil text generation tasks

Multi-framework Support

Supports both Flax/JAX and PyTorch, which eases deployment in different environments

Open-source Datasets

Trained on publicly available Tamil datasets such as OSCAR and IndicNLP

Model Capabilities

Tamil text generation

Sentence continuation

Language model fine-tuning

Use Cases

Text Generation

Story Continuation

Generate coherent story continuations based on given Tamil text prompts

Can generate several alternative continuations for the same prompt

Content Creation Assistance

Assist Tamil content creators in generating creative text

Educational Applications

Language Learning

Generate Tamil practice text for language learners

🚀 GPT2-Tamil

This project is part of the Flax/JAX community week organized by Hugging Face. Its goal is to pretrain a GPT-2 language model specifically for the Tamil language, for use in Tamil language processing tasks.

🚀 Quick Start

To set up the project, run the following command:

pip install -r requirements.txt

✨ Features

Pretrained for Tamil: The model is pretrained on the Tamil language using a causal language modeling (CLM) objective.
Flexible Usage: Can be used in its raw form to continue a text prompt (next-token prediction) and is suitable for fine-tuning on downstream tasks.

📦 Installation

To perform training, follow these steps:

Export the model directory (where you want to store model artifacts such as the config and tokenizer):

export MODEL_DIR=<model_dir>

Create the config.json by running the following command:

python src/create_config.py

Create the tokenizer by running the following command:

python src/train_tokenizer.py

Once the config and tokenizer are created, run the following shell script to start training the Flax model:

bash scripts/train_gpt2-oscar-tamil.sh
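The contents of src/create_config.py are not shown here, so the following is only a sketch of what a config-creation step of this kind might write. The values are the defaults of the original GPT-2 small configuration and are an assumption, not the project's actual settings; in particular, vocab_size would have to match the trained tokenizer. The write_config helper is hypothetical, and a temporary directory stands in for MODEL_DIR so that no real artifacts are overwritten.

```python
import json
import os
import tempfile

# Assumed values: defaults of the original GPT-2 small configuration.
# The project's real config may differ (vocab_size depends on the tokenizer).
config = {
    "model_type": "gpt2",
    "vocab_size": 50257,
    "n_positions": 1024,
    "n_embd": 768,
    "n_layer": 12,
    "n_head": 12,
}


def write_config(model_dir, config):
    """Write config.json into model_dir and return the file path."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


# A temporary directory stands in for MODEL_DIR in this sketch.
model_dir = tempfile.mkdtemp()
config_path = write_config(model_dir, config)
print(config_path)
```

In a real run, the directory passed to the helper would be the one exported as MODEL_DIR above.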

💻 Usage Examples

Basic Usage

To perform language generation using the model, a pipeline can be used directly.

First, convert the Flax model to PyTorch using the following command:

python src/convert_flax_to_pytorch.py
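The conversion script itself is not reproduced here. As a rough sketch of how such a conversion is commonly done, the function below loads Flax weights through the `from_flax=True` option of `from_pretrained` and saves them back in PyTorch format. This assumes a transformers release that still includes Flax support, with both flax and torch installed; the function name and the fail-fast checkpoint check are illustrative additions, not the repository's code.

```python
import os
import tempfile


def convert_flax_to_pytorch(model_dir):
    """Load Flax weights from model_dir and save PyTorch weights alongside them."""
    flax_weights = os.path.join(model_dir, "flax_model.msgpack")
    if not os.path.isfile(flax_weights):
        raise FileNotFoundError(f"no Flax checkpoint at {flax_weights}")
    # Imported lazily so the path check above works without transformers installed.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_dir, from_flax=True)
    model.save_pretrained(model_dir)


# Demonstration on an empty directory: the checkpoint check fails fast.
try:
    convert_flax_to_pytorch(tempfile.mkdtemp())
    status = "converted"
except FileNotFoundError:
    status = "missing checkpoint"
print(status)
```

Pointing the function at a directory that contains config.json and flax_model.msgpack would perform the actual conversion.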

Use the following snippet to perform language generation:

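>>> from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
>>> model_name = 'abinayam/gpt-2-tamil'
>>> # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
>>> model = AutoModelForCausalLM.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> set_seed(42)  # fix the random seed so sampled outputs are reproducible
>>> input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"  # Tamil prompt to continue
>>> max_len = 300  # maximum total length in tokens (prompt plus continuation)
>>> no_seq = 5  # number of alternative continuations to return
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> # do_sample=True enables sampling, which is needed to get several distinct sequences
>>> sequence = generator(input_text, max_length=max_len, num_return_sequences=no_seq, do_sample=True)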
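The text-generation pipeline returns a list of dictionaries, one per returned sequence, each with a "generated_text" key. The sketch below shows one way to pull the strings out of such a result. The extract_texts helper is hypothetical, and example_result is a stand-in with placeholder strings rather than real model output, so the snippet runs without downloading the model.

```python
def extract_texts(sequences):
    """Pull the generated strings out of a text-generation pipeline result."""
    return [item["generated_text"] for item in sequences]


# Stand-in for the `sequence` list returned by the pipeline call above;
# the strings are placeholders, not real model output.
example_result = [
    {"generated_text": "ஒரு ஊரிலே ஒரு காக்கைக்கு ... (continuation 1)"},
    {"generated_text": "ஒரு ஊரிலே ஒரு காக்கைக்கு ... (continuation 2)"},
]

texts = extract_texts(example_result)
for i, text in enumerate(texts, start=1):
    print(f"[{i}] {text}")
```

With the real pipeline, passing `sequence` to the helper would yield the five continuations requested by `num_return_sequences`.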

📚 Documentation

Model

The model is a language model pretrained on Tamil text with a causal language modeling (CLM) objective.
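To make the CLM objective concrete, here is a minimal, framework-free sketch: each token is predicted from the tokens that precede it, and the loss is the average negative log-likelihood of the correct next tokens. The token ids and probabilities are made-up illustrative values, and the two helper functions are hypothetical; the actual training code works on batched tensors in Flax/JAX.

```python
import math


def causal_lm_pairs(token_ids):
    """Return (context, target) pairs: each token is predicted from the tokens before it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def clm_loss(target_probs):
    """Average negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


# Hypothetical token ids for a short tokenized sentence (illustrative values only).
tokens = [12, 7, 93, 4]
pairs = causal_lm_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# Hypothetical probabilities a model assigned to each correct next token.
loss = clm_loss([0.5, 0.25, 0.125])
print(round(loss, 4))
```

A sequence of four tokens yields three prediction steps, so the loss averages over three terms.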

Dataset Used

The GPT-2 model is trained on the Tamil (ta) subsets of the OSCAR and IndicNLP datasets.

Intended uses & limitations

You can use the raw model for text generation (continuing a prompt), but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you.

