IndoBERT-lite-base-p1: Open-Source Indonesian Language Model with a Lightweight Design for Resource-Constrained Environments

Indobert Lite Base P1

Developed by indobenchmark

IndoBERT is a BERT model variant tailored for the Indonesian language, trained using masked language modeling and next sentence prediction objectives. The Lite version is a lightweight model suitable for resource-constrained environments.

Large Language Model

Transformers

Other

Open Source License: MIT

#Indonesian Pretraining #Lightweight BERT #Case Insensitive

Downloads 723

Release Time: 3/2/2022

Model Overview

An Indonesian pretrained language model optimized based on the BERT architecture, focusing on natural language understanding tasks, offering both base and lightweight versions.

Model Features

Indonesian Language Optimization

Specially pretrained and optimized for Indonesian language characteristics

Lightweight Design

The Lite version significantly reduces parameters, making it suitable for resource-limited scenarios

Two-Phase Training

Offers P1 (phase 1) and P2 (phase 2) checkpoints, one from each pretraining phase. The P1 checkpoint described here is uncased (case insensitive).

Model Capabilities

Indonesian Text Understanding

Contextual Feature Extraction

Masked Word Prediction

Use Cases

Natural Language Processing

Text Classification

Classification of Indonesian news/articles

Named Entity Recognition

Entity recognition in Indonesian text

🚀 IndoBERT-Lite Base Model (phase1 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. It offers advanced language processing capabilities for the Indonesian language.

🚀 Quick Start

The pretrained model is trained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.

✨ Features

All Pre-trained Models

Model	#params	Arch.	Training Data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
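
To make the size difference concrete, the following is a rough back-of-the-envelope sketch based only on the parameter counts listed above. It assumes the weights are stored as 32-bit floats (4 bytes per parameter) and counts the weights alone, so activations, optimizer state, and framework overhead are not included. The helper names `weight_size_mb` and `reduction_factor` are illustrative and not part of any library.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "base": 124.5,
    "large": 335.2,
    "lite-base": 11.7,
    "lite-large": 17.7,
}

# Assumption: weights are stored as 32-bit floats.
BYTES_PER_PARAM_FP32 = 4


def weight_size_mb(params_millions, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    # (millions of parameters) * (bytes per parameter) = millions of bytes
    return params_millions * bytes_per_param


def reduction_factor(full, lite):
    """How many times fewer parameters the lite variant has."""
    return PARAMS_M[full] / PARAMS_M[lite]


for name, count in PARAMS_M.items():
    print(f"{name:>10}: {count:6.1f}M params, ~{weight_size_mb(count):7.1f} MB of fp32 weights")

print(f"base  vs lite-base : {reduction_factor('base', 'lite-base'):.1f}x fewer parameters")
print(f"large vs lite-large: {reduction_factor('large', 'lite-large'):.1f}x fewer parameters")
```

Under these assumptions the Lite Base weights come to roughly 47 MB, compared with roughly 498 MB for the Base model.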

💻 Usage Examples

Basic Usage

Load model and tokenizer

from transformers import BertTokenizer, AutoModel

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-lite-base-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-lite-base-p1")

Advanced Usage

Extract contextual representation

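import torch
# Encode the sentence as a batch of one, then print the token IDs and the sum of the last hidden states
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)
print(x, model(x)[0].sum())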
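
The snippet above yields one vector per token. A common way to reduce these to a single vector per sentence is masked mean pooling, sketched below. This is one possible approach and is not described by the IndoBERT authors. The function names `mean_pool` and `embed` are hypothetical, and `embed` downloads the checkpoint from the Hugging Face Hub when it is called.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    # (batch, tokens) -> (batch, tokens, 1) so it broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid division by zero
    return summed / counts


def embed(sentences, model_name="indobenchmark/indobert-lite-base-p1"):
    """Return one pooled vector per sentence (hypothetical helper)."""
    # Imported lazily so mean_pool can be used without transformers installed.
    from transformers import BertTokenizer, AutoModel

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    # Pad to the longest sentence; the attention mask marks the real tokens.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        hidden = model(**batch)[0]  # last hidden state: (batch, tokens, hidden)
    return mean_pool(hidden, batch["attention_mask"])
```

Calling `embed(["aku adalah anak sekolah", "dia suka membaca buku"])` would return a tensor with one row per input sentence. Running the forward pass under `torch.no_grad()` and in `eval()` mode keeps memory use low and disables dropout, which fits the resource-constrained setting this checkpoint targets.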

📚 Documentation

Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti.

Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

📄 License

This project is licensed under the MIT license.
