IndoBERT Large P1

Developed by indobenchmark

IndoBERT is an Indonesian language model based on BERT, trained with masked language modeling and next sentence prediction objectives.

Category: Large Language Model, Other | Open Source License: MIT | Tags: #Indonesian pre-training #Large-scale language model #Contextual representation extraction

Downloads 1,686

Release Time: 3/2/2022

Model Overview

IndoBERT is a pre-trained language model optimized for the Indonesian language, suitable for various natural language processing tasks.

Model Features

Large-scale pre-training

Pre-trained on the Indo4B dataset (23.43 GB of text)

Case-insensitive

The model is uncased: it does not distinguish between uppercase and lowercase letters when processing text

Two-phase training

Pre-training runs in two phases (P1 and P2); this checkpoint is the phase-one (P1) model
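
The case-insensitive behaviour listed above can be illustrated without loading the model. The sketch below approximates the clean-up that standard uncased BERT tokenizers apply before splitting text into word pieces: lowercasing, then stripping accent marks. It assumes IndoBERT's tokenizer follows that standard behaviour, which this page does not confirm beyond "case-insensitive", and `uncased_normalize` is a hypothetical helper, not part of any library.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Approximate uncased BERT pre-processing: lowercase, then strip accents.

    Assumption: this mirrors standard uncased BERT tokenizers; it is not
    taken from IndoBERT's own tokenizer code.
    """
    lowered = text.lower()
    # Decompose accented characters (NFD) so accents become separate marks
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn")
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Aku Adalah Anak"))  # aku adalah anak
print(uncased_normalize("AKU ADALAH ANAK") == uncased_normalize("aku adalah anak"))  # True
```

In practice this means differently cased spellings of the same sentence reach the model as the same token sequence.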

Model Capabilities

Text representation learning

Language understanding

Text classification

Question-answering system

Named entity recognition

Use Cases

Natural language processing

Text classification

Classify Indonesian texts

Question-answering system

Build an Indonesian question-answering system

🚀 IndoBERT Large Model (phase1 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian, built upon the BERT model. It is pre-trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives.
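
To make the MLM objective concrete, here is a minimal sketch of how training inputs are typically corrupted for masked language modeling. It follows the recipe from the original BERT work (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). Whether IndoBERT used exactly these ratios is an assumption, since this page does not state them, and `mask_tokens` is a hypothetical helper written for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the BERT masking recipe.

    labels[i] holds the original token where a prediction is required,
    and None at positions the model is not asked to predict.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)      # position not selected for prediction
            labels.append(None)
            continue
        labels.append(tok)             # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: keep the token unchanged
    return corrupted, labels


tokens = "aku adalah anak".split()
# mask_prob is raised to 0.5 here only so a three-word demo shows some masking
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=0)
print(corrupted)
print(labels)
```

During pre-training the model sees the corrupted sequence and is scored only on the positions where a label is present.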

🚀 Quick Start

The following sections show how to use the IndoBERT model: loading the model and tokenizer, and extracting contextual representations.

✨ Features

IndoBERT offers a range of pre-trained models with different architectures and parameter sizes, all trained on the Indo4B dataset. These models can be used for various Indonesian natural language processing tasks.

📦 Installation

The original model card gives no installation steps. The usage examples below import the `transformers` and `torch` Python packages, so both need to be installed first.

💻 Usage Examples

Basic Usage

from transformers import BertTokenizer, AutoModel

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p1")

Advanced Usage

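import torch

# Encode one sentence into token ids and add a batch dimension: shape (1, seq_len)
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
# model(x)[0] is the last hidden state; print the ids and the sum of its values
print(x, model(x)[0].sum())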
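
The snippet above only prints the sum of `model(x)[0]`. A common next step, sketched here as one option rather than the authors' method, is to average the per-token vectors into a single sentence vector while ignoring padding. The helper `mean_pool` is hypothetical, and the example uses a random tensor in place of real model output so it runs without downloading weights. The hidden size of 1024 is the standard BERT-large value and is assumed to apply to this checkpoint.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Random stand-in for model(x)[0]: 2 sentences, 5 token positions, hidden size 1024 (assumed)
hidden = torch.randn(2, 5, 1024)
# 1 marks a real token, 0 marks padding; the second sentence has 3 real tokens
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 1024])
```

With the real model, `last_hidden_state` would be `model(x)[0]` and the mask would come from the tokenizer's `attention_mask` output.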

📚 Documentation

All Pre-trained Models

Model	#params	Arch.	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
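
When choosing among these checkpoints under a memory or latency budget, the parameter counts listed above can be treated as plain data. The sketch below is illustrative only: `checkpoints_within` is a hypothetical helper, and parameter count is just a rough proxy for resource use.

```python
# Parameter counts in millions, copied from the table above
PARAMS_M = {
    "indobenchmark/indobert-base-p1": 124.5,
    "indobenchmark/indobert-base-p2": 124.5,
    "indobenchmark/indobert-large-p1": 335.2,
    "indobenchmark/indobert-large-p2": 335.2,
    "indobenchmark/indobert-lite-base-p1": 11.7,
    "indobenchmark/indobert-lite-base-p2": 11.7,
    "indobenchmark/indobert-lite-large-p1": 17.7,
    "indobenchmark/indobert-lite-large-p2": 17.7,
}


def checkpoints_within(budget_m: float) -> list:
    """List checkpoints with at most budget_m million parameters, largest first."""
    fits = [(size, name) for name, size in PARAMS_M.items() if size <= budget_m]
    # Sort by size descending, then by name so ties have a stable order
    return [name for size, name in sorted(fits, key=lambda p: (-p[0], p[1]))]


for name in checkpoints_within(20):
    print(name)
# indobenchmark/indobert-lite-large-p1
# indobenchmark/indobert-lite-large-p2
# indobenchmark/indobert-lite-base-p1
# indobenchmark/indobert-lite-base-p2
```

Any name returned can be passed to `from_pretrained` in place of `"indobenchmark/indobert-large-p1"` in the usage examples.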

📄 License

This project is licensed under the MIT license.

👥 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti.

📚 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}
