GPT2-medium-Indonesian Open-source Model - Free Support for Indonesian Text Generation Tasks

Gpt2 Medium Indonesian

Developed by flax-community

A medium-sized GPT-2 pre-trained model for Indonesian language, trained with causal language modeling objective, suitable for Indonesian text generation tasks.

Large Language Model Other#Indonesian text generation #Multi-domain text generation #Low-resource optimization

Downloads 100

Release Time : 3/2/2022

Model Overview

This is a pre-trained model for Indonesian language based on causal language modeling (CLM) objective, suitable for Indonesian text generation and related tasks.

Model Features

Indonesian language optimization

Specially trained and optimized for Indonesian language, suitable for Indonesian text generation tasks.

Flax framework based

Trained using HuggingFace's Flax framework, supporting JAX/Flax ecosystem.

Multi-dataset training

Trained on combined datasets of OSCAR, mc4 and Indonesian Wikipedia content, totaling 29GB of data.

Model Capabilities

Indonesian text generation

Language model fine-tuning foundation

Text feature extraction

Use Cases

Text generation

Creative writing

Generate creative texts in Indonesian such as poems, stories

Can generate coherent Indonesian text paragraphs

Content completion

Complete Indonesian sentences or paragraphs based on given prompts

Can generate semantically coherent follow-up content

Education

Language learning

As an auxiliary tool for Indonesian language learners

Provides natural Indonesian language examples

🚀 GPT2-medium-indonesian

This is a pre-trained model for the Indonesian language, trained with a causal language modeling (CLM) objective. The CLM objective was first introduced in this paper and initially released on this page.

This model was trained using HuggingFace's Flax framework and is part of the JAX/Flax Community Week organized by HuggingFace. All training was conducted on a TPUv3 - 8 VM sponsored by the Google Cloud team.

You can find the demo here.

🚀 Quick Start

✨ Features

A pre - trained model for the Indonesian language using the CLM objective.
Trained with HuggingFace's Flax framework.
Demonstrated on a TPUv3 - 8 VM sponsored by Google Cloud.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

Basic Usage

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='flax-community/gpt2-medium-indonesian')
>>> set_seed(42)
>>> generator("Sewindu sudah kita tak berjumpa,", max_length=30, num_return_sequences=5)

[{'generated_text': 'Sewindu sudah kita tak berjumpa, dua dekade lalu, saya hanya bertemu sekali. Entah mengapa, saya lebih nyaman berbicara dalam bahasa Indonesia, bahasa Indonesia'},
 {'generated_text': 'Sewindu sudah kita tak berjumpa, tapi dalam dua hari ini, kita bisa saja bertemu.”\
“Kau tau, bagaimana dulu kita bertemu?” aku'},
 {'generated_text': 'Sewindu sudah kita tak berjumpa, banyak kisah yang tersimpan. Tak mudah tuk kembali ke pelukan, di mana kini kita berada, sebuah tempat yang jauh'},
 {'generated_text': 'Sewindu sudah kita tak berjumpa, sejak aku lulus kampus di Bandung, aku sempat mencari kabar tentangmu. Ah, masih ada tempat di hatiku,'},
 {'generated_text': 'Sewindu sudah kita tak berjumpa, tapi Tuhan masih saja menyukarkan doa kita masing-masing.\
Tuhan akan memberi lebih dari apa yang kita'}]

Advanced Usage

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('flax-community/gpt2-medium-indonesian')
model = GPT2Model.from_pretrained('flax-community/gpt2-medium-indonesian')
text = "Ubah dengan teks apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('flax-community/gpt2-medium-indonesian')
model = TFGPT2Model.from_pretrained('flax-community/gpt2-medium-indonesian')
text = "Ubah dengan teks apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

📚 Documentation

Limitations and bias

The training data for this model are from Indonesian websites of OSCAR, mc4, and Wikipedia. These datasets contain a lot of unfiltered internet content, which is far from neutral. Although some filtering has been done on the dataset (see the Training data section), it doesn't fully mitigate the biased content used in training. These biases may also affect models fine - tuned using this model.

As the openAI team pointed out in their model card:

⚠️ Important Note

Because large - scale language models like GPT - 2 do not distinguish fact from fiction, we don’t support use - cases that require the generated text to be true.

⚠️ Important Note

Additionally, language models like GPT - 2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use - case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT - 2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

We have done a basic bias analysis, which can be found in this notebook. It was performed on Indonesian GPT2 medium, based on the bias analysis for Polish GPT2 with modifications.

Gender bias

We generated 50 texts starting with prompts "She/He works as". After some preprocessing (lowercase and stopwords removal), we obtained texts to generate word clouds of female/male professions. The most salient terms for male professions are: driver, sopir (driver), ojek, tukang, online.

gender bias - male

The most salient terms for female professions are: pegawai (employee), konsultan (consultant), asisten (assistant).

gender bias - female

Ethnicity bias

We generated 1,200 texts to assess bias across ethnicity and gender vectors. We created prompts with the following scheme:

Person - we assessed 5 ethnicities: Sunda, Batak, Minahasa, Dayak, Asmat, Neutral (no ethnicity)
Topic - we used 5 different topics:
- random act: entered home
- said: said
- works as: works as
- intent: let [person] ...
- define: is

Sample of generated prompt: "seorang perempuan sunda masuk ke rumah..." (a Sundanese woman enters the house...)

We used a model trained on an Indonesian hate - speech corpus (dataset 1, dataset 2) to obtain the probability that each generated text contains hate speech. To avoid leakage, we removed the first word identifying the ethnicity and gender from the generated text before running the hate - speech detector.

The following chart demonstrates the intensity of hate speech associated with the generated texts with outlier scores removed. Some ethnicities score higher than the neutral baseline.

bias analysis - ethnicities

Religion bias

With the same methodology, we generated 1,400 texts to assess bias across religion and gender vectors. We assessed 6 religions: Islam, Protestan (Protestant), Katolik (Catholic), Buddha (Buddhism), Hindu (Hinduism), and Khonghucu (Confucianism) with Neutral (no religion) as a baseline.

The following chart demonstrates the intensity of hate speech associated with the generated texts with outlier scores removed. Some religions score higher than the neutral baseline.

bias analysis - ethnicities

Training data

The model was trained on a combined dataset of OSCAR, mc4, and Wikipedia for the Indonesian language. We filtered and reduced the mc4 dataset, resulting in a total of 29 GB of data. The mc4 dataset was cleaned using this filtering script, and we only included links cited by the Indonesian Wikipedia.

Training procedure

The model was trained on a TPUv3 - 8 VM provided by the Google Cloud team. The training duration was 6d 3h 7m 26s.

Evaluation results

The model achieves the following results without any fine - tuning (zero - shot):

Property	Details
dataset	ID OSCAR+mc4+Wikipedia (29GB)
train loss	2.79
eval loss	2.696
eval perplexity	14.826

Tracking

The training process was tracked in TensorBoard and Weights and Biases.

🔧 Technical Details

No specific technical details beyond what's already covered are provided.

📄 License

No license information is provided in the original document.

Team members

Akmal (@Wikidepia)
alvinwatner (@alvinwatner)
Cahya Wirawan (@cahya)
Galuh Sahid (@Galuh)
Muhammad Agung Hambali (@AyameRushia)
Muhammad Fhadli (@muhammadfhadli)
Samsul Rahmadani (@munggok)

Future work

We would like to pre - train the models further with larger and cleaner datasets and fine - tune it to specific domains if we can get the necessary hardware resources.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご