IndoBERT-SQuAD Open-source Indonesian Question-answering Model


IndoBERT SQuAD

Developed by esakrissa

IndoBERT-based Indonesian Q&A model fine-tuned on the SQuAD2.0 dataset

Question Answering System

Transformers

Open Source License: MIT #Indonesian Q&A #Bali tourism #SQuAD fine-tuning

Downloads 14

Release Time: 12/18/2022

Model Overview

This model is a BERT model optimized for Indonesian Q&A tasks. It extracts answers from a given text and determines whether a question is answerable.

Model Features

Indonesian Optimization

Built on IndoBERT, which was pre-trained on over 220 million words of Indonesian text

Dual-mode Q&A

Capable of answering answerable questions and identifying unanswerable questions

Efficient Fine-tuning

Fine-tuned on the SQuAD2.0 dataset for 3 epochs, ending with a validation loss of 1.8025

Model Capabilities

Text comprehension

Answer extraction

Question answerability judgment

Use Cases

Customer Support

Tourism Information Q&A

Answering queries about Indonesian tourist attractions

The usage example extracts when Ubud became well known among Western tourists

Education

Learning Assistance

Helping students quickly find answers from textbook content

🚀 IndoBERT SQuAD

This is a fine-tuned model based on IndoBERT, designed for question-answering tasks in Indonesian. It reaches 52.17 EM and 69.22 F1 on the evaluation set.

🚀 Quick Start

This model is a fine-tuned version of [indolem/indobert-base-uncased](https://huggingface.co/indolem/indobert-base-uncased) on the SQuAD2.0 dataset. It achieves the following results on the evaluation set:

Loss: 1.8025

✨ Features

IndoBERT

[IndoBERT](https://huggingface.co/indolem/indobert-base-uncased) is the Indonesian version of the BERT model. We train the model using over 220M words, aggregated from three main sources:

Indonesian Wikipedia (74M words)
news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words).

We trained the model for 2.4M steps (180 epochs) with the final perplexity over the development set being 3.97 (similar to English BERT-base).

This IndoBERT was used to examine IndoLEM, an Indonesian benchmark that comprises seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.


💻 Usage Examples

Basic Usage

from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
qa_pipeline = pipeline(
    "question-answering",
    model="esakrissa/IndoBERT-SQuAD",
    tokenizer="esakrissa/IndoBERT-SQuAD"
)

# Ask a question about a context passage; the result is a dict describing the answer span.
result = qa_pipeline({
    'context': """Sudah sejak tahun 1920-an, Ubud terkenal di antara wisatawan barat. Kala itu pelukis Jerman; Walter Spies dan pelukis Belanda; Rudolf Bonnet menetap di sana. Mereka dibantu oleh Tjokorda Gde Agung Sukawati, dari Puri Agung Ubud. Sekarang karya mereka bisa dilihat di Museum Puri Lukisan, Ubud.""",
    'question': "Sejak kapan Ubud terkenal di antara wisatawan barat?"
})
print(result)

output:

{
'answer': '1920-an',
'start': 18, 
'end': 25,
'score': 0.8675463795661926
}
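
The `start` and `end` values in the output are character offsets into the context string, so slicing the context should reproduce the answer. A minimal check using the values shown above, assuming the context is written with unspaced hyphens ("1920-an"), which is what offsets 18 and 25 imply; the context is shortened to its first sentence here:

```python
# Context (first sentence only) and result copied from the example above.
context = "Sudah sejak tahun 1920-an, Ubud terkenal di antara wisatawan barat."
result = {"answer": "1920-an", "start": 18, "end": 25, "score": 0.8675463795661926}

# `start` is inclusive and `end` is exclusive, matching Python slice semantics.
span = context[result["start"]:result["end"]]
print(span)  # 1920-an
```

This is handy for highlighting the answer inside the original passage instead of showing the extracted string on its own.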

📚 Documentation

Training and evaluation data

SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
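
Abstaining can be sketched on the application side as follows. This is a sketch under stated assumptions, not the author's method: `answer_or_abstain`, the `min_score` threshold of 0.5, and the `fake_qa` stand-in are all hypothetical. The only library feature relied on is the `handle_impossible_answer` argument of the Transformers question-answering pipeline, which allows an empty answer to be returned.

```python
from typing import Any, Callable, Dict, Optional


def answer_or_abstain(
    qa: Callable[..., Dict[str, Any]],
    question: str,
    context: str,
    min_score: float = 0.5,  # hypothetical threshold, not from the model card
) -> Optional[str]:
    """Return the extracted answer, or None when the question looks unanswerable."""
    # handle_impossible_answer=True lets the pipeline return an empty answer
    # when the "no answer" option scores highest.
    result = qa(question=question, context=context, handle_impossible_answer=True)
    answer = result["answer"].strip()
    if not answer or result["score"] < min_score:
        return None
    return answer


# Stand-in for the real pipeline so the sketch runs without downloading the model.
def fake_qa(question: str, context: str, **kwargs: Any) -> Dict[str, Any]:
    if "kapan" in question.lower():
        return {"answer": "1920-an", "start": 18, "end": 25, "score": 0.87}
    return {"answer": "", "start": 0, "end": 0, "score": 0.99}


ctx = "Sudah sejak tahun 1920-an, Ubud terkenal di antara wisatawan barat."
print(answer_or_abstain(fake_qa, "Sejak kapan Ubud terkenal?", ctx))
print(answer_or_abstain(fake_qa, "Siapa gubernur Bali?", ctx))
```

In real use, `fake_qa` would be replaced by the pipeline object created in the usage example, and the threshold would need tuning on held-out data.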

| Dataset | Split | # samples |
| --- | --- | --- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |

Training procedure

The model was trained on a Tesla T4 GPU with 12 GB of RAM.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3
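
With `lr_scheduler_type: linear`, the learning rate decays linearly from its initial value to zero over the course of training. A minimal sketch of that schedule, assuming zero warmup steps (the card does not list any) and a total of 24606 optimizer steps, the step count this card reports at the end of epoch 3; `linear_lr` is a hypothetical helper, not a Transformers function:

```python
BASE_LR = 2e-5        # learning_rate from the list above
TOTAL_STEPS = 24606   # step count reported at epoch 3.0 in this card
WARMUP_STEPS = 0      # assumption: the card does not list warmup steps


def linear_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


# Learning rate at the start of training and at each epoch boundary.
for step in (0, 8202, 16404, 24606):
    print(step, linear_lr(step))
```

Under these assumptions the rate is two thirds of its initial value after the first epoch and reaches zero at the final step.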

Training results

| Training Loss | Epoch | Step | Validation Loss |
| --- | --- | --- | --- |
| 1.4098 | 1.0 | 8202 | 1.3860 |
| 1.1716 | 2.0 | 16404 | 1.8555 |
| 1.2909 | 3.0 | 24606 | 1.8025 |
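
The step counts can be cross-checked against the batch size. The arithmetic below assumes a single device and no gradient accumulation (neither is stated in the card), so that one optimizer step consumes one batch of 16 training features:

```python
import math

STEPS_PER_EPOCH = 8202   # step count at epoch 1.0 in the table above
TRAIN_BATCH_SIZE = 16    # train_batch_size from the hyperparameters

# Assumption: one device and no gradient accumulation, so one step = one batch.
max_features = STEPS_PER_EPOCH * TRAIN_BATCH_SIZE
min_features = (STEPS_PER_EPOCH - 1) * TRAIN_BATCH_SIZE + 1
print(min_features, max_features)  # 131217 131232

# Any feature count in that range yields exactly 8202 steps per epoch.
assert math.ceil(min_features / TRAIN_BATCH_SIZE) == STEPS_PER_EPOCH
assert math.ceil(max_features / TRAIN_BATCH_SIZE) == STEPS_PER_EPOCH
```

Roughly 131k training features is consistent with the 130k training samples listed for SQuAD2.0. The small excess may come from long contexts being split into several features during tokenization, though the card does not say so.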

| Metric | Value |
| --- | --- |
| EM | 52.17 |
| F1 | 69.22 |
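
EM (exact match) and F1 are the standard SQuAD metrics: EM checks whether the normalized prediction equals the reference, and F1 measures token overlap between the two. A simplified per-example sketch follows. The card does not state which evaluation script produced the numbers above, and the normalization here (lowercasing, stripping punctuation, collapsing whitespace) is an assumption; the official SQuAD script additionally removes English articles, which is omitted here.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # Unanswerable questions have an empty reference: score 1 only if both are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("1920-an", "1920-an"))
print(f1_score("tahun 1920-an", "1920-an"))
```

Dataset-level scores such as those in the table are averages of these per-example values, expressed as percentages.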

Reference

[1] Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020. IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. Proceedings of the 28th COLING.
[2] rifkybujana/IndoBERT-QA

Framework versions

Transformers 4.25.1
PyTorch 1.13.0+cu116
Datasets 2.7.1
Tokenizers 0.13.2


📄 License

The model is released under the MIT license.

🔗 Links

[GitHub](https://github.com/esakrissa/question-answering)
[IndoBERT SQuAD Demo](https://huggingface.co/spaces/esakrissa/IndoBERT-SQuAD)
