Indobert-base-p1 Open-Source Indonesian Language Model - Supports Text Understanding and Prediction Tasks

Indobert Base P1

Developed by indobenchmark

IndoBERT is an advanced Indonesian language model based on BERT, trained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives.

Tags: Large Language Model, Other, Open Source. License: MIT. #Indonesian Pretraining #Multi-stage Training #Masked Language Modeling

Downloads 261.95k

Release Time: 3/2/2022

Model Overview

IndoBERT is a pre-trained language model optimized for Indonesian, based on the BERT architecture, suitable for various natural language processing tasks.

Model Features

Indonesian Optimization

Specially trained and optimized for Indonesian, suitable for Indonesian natural language processing tasks.

BERT-based Architecture

Uses the BERT encoder architecture, which reads context in both directions and provides strong language understanding.

Large-scale Training Data

Trained on the Indo4B dataset (23.43 GB of text), covering a wide range of Indonesian content.

Model Capabilities

Text Understanding

Masked Token Prediction

Language Model Pretraining

Sentence Relation Prediction

Use Cases

Natural Language Processing

Text Classification

Classifying Indonesian text

Question Answering System

Building Indonesian question answering systems

Masked Token Prediction

Filling in masked words in Indonesian text

🚀 IndoBERT Base Model (phase1 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. It was pretrained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective, which lets it produce contextual representations of Indonesian text for downstream tasks.
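To make the MLM objective concrete, here is a minimal sketch of how training inputs are typically corrupted. It follows the recipe from the original BERT paper (select about 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). This page does not state the exact rates used for IndoBERT, so the numbers, the function name `mask_tokens`, and the toy vocabulary are assumptions for illustration only. Real pipelines also skip special tokens such as `[CLS]` and `[SEP]`, which this sketch omits.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Corrupt a token list in the style of BERT's MLM objective (illustrative).

    Returns (inputs, labels). labels[i] holds the original token when
    position i was selected for prediction, otherwise None.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)         # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)  # position is not scored by the MLM loss
    return inputs, labels


# Hypothetical toy data; a real setup would use the model's tokenizer and vocabulary.
sentence = ["aku", "adalah", "anak", "sekolah"]
toy_vocab = ["buku", "rumah", "makan"]
print(mask_tokens(sentence, toy_vocab, seed=0))
```

During pretraining the loss is computed only at positions whose label is not `None`, so the model learns to reconstruct the selected tokens from their surrounding context.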

✨ Features

IndoBERT offers a variety of pre-trained models with different parameter sizes and architectures, suitable for diverse Indonesian language processing tasks.

📚 Documentation

All Pre-trained Models

Property	Details
Model Type	`indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, `indobenchmark/indobert-lite-large-p2`
Training Data	Indo4B (23.43 GB of text)
#params	124.5M (Base models), 335.2M (Large models), 11.7M (Lite Base models), 17.7M (Lite Large models)
Arch.	Base, Large

Model	#params	Arch.	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
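As a rough sanity check on the Base figure in the table, the sketch below counts the parameters of a standard BERT encoder from its dimensions. The BERT-Base dimensions (12 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 segment types) and the vocabulary size of 50,000 are assumptions that this page does not state; the helper name `bert_param_count` is also hypothetical. The count excludes any pretraining heads.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count the weights and biases of a standard BERT encoder with pooler."""
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (up- and down-projection), followed by one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


# Assumed vocabulary size of 50,000 (not taken from this page).
print(f"{bert_param_count(50_000) / 1e6:.1f}M")  # prints 124.4M
```

Under these assumptions the estimate lands close to the 124.5M listed for the Base models. The Lite variants are far smaller than this formula would predict, so they presumably use a different parameterization that this sketch does not cover.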

💻 Usage Examples

Basic Usage

# Load model and tokenizer
from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")

Advanced Usage

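# Extract contextual representations
import torch
# Encode the sentence to token ids and add a batch dimension: shape (1, seq_len)
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)
# model(x)[0] is the last hidden state: shape (1, seq_len, hidden_size)
print(x, model(x)[0].sum())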
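For tasks such as text classification, the per-token representations usually need to be reduced to one fixed-size vector per sentence. The sketch below uses attention-mask-aware mean pooling, which is one common choice and not necessarily what the IndoBERT authors used. The helper names `mean_pool` and `embed` are hypothetical; the `transformers` and `torch` calls are the same ones used in the examples above.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over hidden_size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against division by zero
    return summed / counts


def embed(sentences, model_name="indobenchmark/indobert-base-p1"):
    """Return one vector per sentence, shape (len(sentences), hidden_size)."""
    # Imported here so mean_pool can be used without transformers installed.
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic outputs

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["aku adalah anak sekolah", "saya suka membaca buku"])` downloads the checkpoint on first use and returns one row per input sentence. These vectors can then be fed to a small classifier, although fine-tuning the whole model on labeled data generally works better than using frozen embeddings.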

📄 License

This project is licensed under the MIT license.

👥 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti.

📚 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}
