indobert-lite-base-p2 is an open-source Indonesian language model that can be used for a range of Indonesian text-processing tasks.

IndoBERT Lite Base P2

Developed by indobenchmark

IndoBERT is a top-tier language model developed for Indonesian, based on the BERT architecture, trained using masked language modeling and next sentence prediction objectives.

Large Language Model

Transformers

Open Source License: MIT
Tags: #Indonesian-specific #Lightweight BERT #Case-insensitive

Downloads 2,498

Release Time : 3/2/2022

Model Overview

IndoBERT-Lite is a lightweight version of IndoBERT, specifically optimized for Indonesian, suitable for natural language understanding tasks.

Model Features

Lightweight Design

About 11.7M parameters, compared with 124.5M for the standard IndoBERT base model, which makes it suitable for resource-constrained environments.

Indonesian Optimization

Pre-trained specifically for Indonesian, excelling in Indonesian language tasks.

Case-insensitive

The model is uncased, so it treats uppercase and lowercase text the same way and tolerates inconsistent capitalization in the input.

Model Capabilities

Text feature extraction

Masked language modeling

Next sentence prediction

Use Cases

Natural Language Processing

Text Classification

Can be used for sentiment analysis or topic classification of Indonesian text.

Question Answering System

Suitable for building Indonesian question answering systems.

🚀 IndoBERT-Lite Base Model (phase2 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. It is trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives.

🚀 Quick Start

IndoBERT is a cutting-edge Indonesian language model built on the BERT architecture. The pre-trained model is optimized through masked language modeling (MLM) and next sentence prediction (NSP) tasks.

✨ Features

All Pre-trained Models

Property	Details
Model Type	Eight pre-trained models are available: `indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, `indobenchmark/indobert-lite-large-p2`.
Training Data	All models are trained on the Indo4B dataset, which contains 23.43 GB of text.

Model	#params	Arch.	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
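
To put the parameter counts in the table into perspective, the following is a rough back-of-the-envelope sketch, not a measurement. It assumes weights are stored as 32-bit floats (4 bytes each) and counts weights only, ignoring activations and framework overhead. The helper names `weight_size_mb` and `reduction_factor` are hypothetical and exist only for this illustration.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "indobert-base": 124.5,
    "indobert-large": 335.2,
    "indobert-lite-base": 11.7,
    "indobert-lite-large": 17.7,
}

BYTES_PER_FP32 = 4  # assumption: weights stored as 32-bit floats


def weight_size_mb(params_millions: float) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return params_millions * 1e6 * BYTES_PER_FP32 / 1e6


def reduction_factor(full: str, lite: str) -> float:
    """How many times fewer parameters the lite model has than the full one."""
    return PARAMS_M[full] / PARAMS_M[lite]


for name, params in PARAMS_M.items():
    print(f"{name}: {params}M params, ~{weight_size_mb(params):.1f} MB in fp32")

ratio = reduction_factor("indobert-base", "indobert-lite-base")
print(f"lite-base has ~{ratio:.1f}x fewer parameters than base")
```

Under these assumptions the lite base model's weights come to under 50 MB, roughly a tenth of the standard base model.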

💻 Usage Examples

Basic Usage

# Load model and tokenizer
from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-lite-base-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-lite-base-p2")

Advanced Usage

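# Extract contextual representation (uses tokenizer and model from Basic Usage)
import torch
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)  # shape: (1, seq_len)
print(x, model(x)[0].sum())  # token IDs and the sum of the last hidden states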
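
The snippet above reduces the hidden states to a single number. For feature extraction, a fixed-size vector per sentence is usually more useful. Below is a minimal sketch that assumes mean pooling over non-padding tokens, a common choice but not one prescribed by the IndoBERT authors. `mean_pool` and `embed` are hypothetical helper names; `embed` expects the `tokenizer` and `model` objects loaded under Basic Usage.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


def embed(texts, tokenizer, model) -> torch.Tensor:
    """Encode a list of sentences into one vector each (assumes a Hugging Face tokenizer/model pair)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**batch)
    return mean_pool(output[0], batch["attention_mask"])
```

A call such as `embed(["aku adalah anak sekolah", "cuaca hari ini cerah"], tokenizer, model)` would return one row per input sentence, which can then be fed to a classifier or a similarity measure.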

📚 Documentation

Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti.

Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

📄 License

This project is licensed under the MIT license.
