# IndoBERTweet
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. It is trained by extending a monolingually trained Indonesian BERT model with an additive domain-specific vocabulary, which effectively adapts the model to the characteristics of Twitter language.
## Quick Start
### Loading the Model and Tokenizer
```python
# Load model and tokenizer (tested with transformers==3.5.1)
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```
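After applying the preprocessing described below, tweets can be encoded with the tokenizer and passed to the model as usual. A minimal usage sketch (the example tweet is illustrative, not from the original data):

```python
import torch

# Hypothetical, already-preprocessed tweet (see Preprocessing Steps below)
text = "selamat pagi @USER semoga sehat selalu HTTPURL"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Older transformers releases (e.g. 3.5.1) return a tuple; index 0 is the
    # last hidden state of shape (batch_size, sequence_length, hidden_size)
    last_hidden_state = model(**inputs)[0]

print(last_hidden_state.shape)
```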
### Preprocessing Steps
**Important note:** before feeding tweets to IndoBERTweet, apply the following preprocessing (a minimal sketch follows the list):

- lowercase all words
- convert user mentions and URLs into @USER and HTTPURL, respectively
- translate emoji into text using the emoji package
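A sketch of these steps, assuming the `emoji` package is installed (the function name and regular expressions are illustrative, not the authors' exact pipeline):

```python
import re

import emoji  # pip install emoji


def preprocess_tweet(text: str) -> str:
    # Lowercase all words
    text = text.lower()
    # Convert user mentions and URLs into @USER and HTTPURL, respectively
    text = re.sub(r"@\w+", "@USER", text)
    text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)
    # Translate emoji into text using the emoji package
    return emoji.demojize(text)


print(preprocess_tweet("Halo @andi, cek https://example.com 😊"))
# -> "halo @USER, cek HTTPURL :smiling_face_with_smiling_eyes:"
# (the exact emoji alias depends on the installed emoji package version)
```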
⨠Features
### Model Innovation
IndoBERTweet initializes domain-specific vocabulary with average-pooling of BERT subword embeddings. This method is more efficient than pretraining from scratch and more effective than initializing based on word2vec projections.
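A minimal sketch of this idea with the Hugging Face API (not the authors' exact pretraining code; the base checkpoint and the example words are assumptions for illustration): each new Twitter-specific word is tokenized into existing subwords, and its new embedding row is initialized with the average of those subword embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: start from a general-domain Indonesian BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

new_words = ["wkwk", "gaes"]  # hypothetical Twitter-specific vocabulary
old_embeddings = model.get_input_embeddings().weight.detach().clone()

# Average-pool the existing subword embeddings of each new word
init_vectors = {}
for word in new_words:
    subword_ids = tokenizer.encode(word, add_special_tokens=False)
    init_vectors[word] = old_embeddings[subword_ids].mean(dim=0)

# Add the new tokens and resize the embedding matrix accordingly
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# Overwrite the randomly initialized rows with the average-pooled vectors
with torch.no_grad():
    for word, vector in init_vectors.items():
        new_id = tokenizer.convert_tokens_to_ids(word)
        model.get_input_embeddings().weight[new_id] = vector
```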
### Pretraining Data
We crawl Indonesian tweets over a one-year period, from December 2019 to December 2020, using the official Twitter API, with 60 keywords covering four main topics: economy, health, education, and government. In total, we obtain 409M word tokens, twice the size of the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to the public.
## Documentation
### Paper
Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).
| Property | Details |
| --- | --- |
| Model type | IndoBERTweet |
| Training data | Indonesian tweets crawled from December 2019 to December 2020 using the official Twitter API, with 60 keywords covering economy, health, education, and government, totaling 409M word tokens |

### Results over 7 Indonesian Twitter Datasets
| Models | Sentiment (IndoLEM) | Sentiment (SmSA) | Emotion (EmoT) | Hate Speech (HS1) | Hate Speech (HS2) | NER (Formal) | NER (Informal) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 76.6 | 84.7 | 67.5 | 85.1 | 75.1 | 85.2 | 83.2 | 79.6 |
| malayBERT | 82.0 | 84.1 | 74.2 | 85.0 | 81.9 | 81.9 | 81.3 | 81.5 |
| IndoBERT (Wilie et al., 2020) | 84.1 | 88.7 | 73.3 | 86.8 | 80.4 | 86.3 | 84.3 | 83.4 |
| IndoBERT (Koto et al., 2020) | 84.1 | 87.9 | 71.0 | 86.4 | 79.3 | 88.0 | 86.9 | 83.4 |
| IndoBERTweet (1M steps from scratch) | 86.2 | 90.4 | 76.0 | 88.8 | 87.5 | 88.1 | 85.4 | 86.1 |
| IndoBERT + Voc adaptation + 200k steps | 86.6 | 92.7 | 79.0 | 88.4 | 84.0 | 87.7 | 86.9 | 86.5 |
## License
This project is licensed under the Apache-2.0 license.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{koto2021indobertweet,
  title={IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization},
  author={Fajri Koto and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)},
  year={2021}
}
```