# IndoBERTweet
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. It is trained by extending a monolingually trained Indonesian BERT model with an additive domain-specific vocabulary, which effectively adapts the model to the characteristics of Twitter language.
## Quick Start
### Loading the Model and Tokenizer
```python
# Load model and tokenizer (tested with transformers==3.5.1)
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```
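After applying the preprocessing described below, tweets can be encoded with the tokenizer and passed to the model as usual. A minimal usage sketch (the example tweet is illustrative, not from the original data):

```python
import torch

# Hypothetical, already-preprocessed tweet (see Preprocessing Steps below)
text = "selamat pagi @USER semoga sehat selalu HTTPURL"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Older transformers releases (e.g. 3.5.1) return a tuple; index 0 is the
    # last hidden state of shape (batch_size, sequence_length, hidden_size)
    last_hidden_state = model(**inputs)[0]

print(last_hidden_state.shape)
```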
### Preprocessing Steps
**Important note:** before feeding tweets to IndoBERTweet, apply the following preprocessing (a minimal sketch follows the list):

- lowercase all words
- convert user mentions and URLs into @USER and HTTPURL, respectively
- translate emoji into text using the emoji package
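A sketch of these steps, assuming the `emoji` package is installed (the function name and regular expressions are illustrative, not the authors' exact pipeline):

```python
import re

import emoji  # pip install emoji


def preprocess_tweet(text: str) -> str:
    # Lowercase all words
    text = text.lower()
    # Convert user mentions and URLs into @USER and HTTPURL, respectively
    text = re.sub(r"@\w+", "@USER", text)
    text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)
    # Translate emoji into text using the emoji package
    return emoji.demojize(text)


print(preprocess_tweet("Halo @andi, cek https://example.com 😊"))
# -> "halo @USER, cek HTTPURL :smiling_face_with_smiling_eyes:"
# (the exact emoji alias depends on the installed emoji package version)
```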
⨠Features
### Model Innovation
IndoBERTweet initializes domain-specific vocabulary with average-pooling of BERT subword embeddings. This method is more efficient than pretraining from scratch and more effective than initializing based on word2vec projections.
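A minimal sketch of this idea with the Hugging Face API (not the authors' exact pretraining code; the base checkpoint and the example words are assumptions for illustration): each new Twitter-specific word is tokenized into existing subwords, and its new embedding row is initialized with the average of those subword embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: start from a general-domain Indonesian BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

new_words = ["wkwk", "gaes"]  # hypothetical Twitter-specific vocabulary
old_embeddings = model.get_input_embeddings().weight.detach().clone()

# Average-pool the existing subword embeddings of each new word
init_vectors = {}
for word in new_words:
    subword_ids = tokenizer.encode(word, add_special_tokens=False)
    init_vectors[word] = old_embeddings[subword_ids].mean(dim=0)

# Add the new tokens and resize the embedding matrix accordingly
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# Overwrite the randomly initialized rows with the average-pooled vectors
with torch.no_grad():
    for word, vector in init_vectors.items():
        new_id = tokenizer.convert_tokens_to_ids(word)
        model.get_input_embeddings().weight[new_id] = vector
```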
### Pretraining Data
We crawl Indonesian tweets over a one-year period, from December 2019 to December 2020, using the official Twitter API, with 60 keywords covering four main topics: economy, health, education, and government. In total, we obtain 409M word tokens, twice the size of the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to the public.
## Documentation
### Paper
Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).
| Property | Details |
| --- | --- |
| Model type | IndoBERTweet |
| Training data | Indonesian tweets crawled from December 2019 to December 2020 using the official Twitter API, with 60 keywords covering economy, health, education, and government, totaling 409M word tokens |

### Results over 7 Indonesian Twitter Datasets
| Models | Sentiment (IndoLEM) | Sentiment (SmSA) | Emotion (EmoT) | Hate Speech (HS1) | Hate Speech (HS2) | NER (Formal) | NER (Informal) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 76.6 | 84.7 | 67.5 | 85.1 | 75.1 | 85.2 | 83.2 | 79.6 |
| malayBERT | 82.0 | 84.1 | 74.2 | 85.0 | 81.9 | 81.9 | 81.3 | 81.5 |
| IndoBERT (Wilie et al., 2020) | 84.1 | 88.7 | 73.3 | 86.8 | 80.4 | 86.3 | 84.3 | 83.4 |
| IndoBERT (Koto et al., 2020) | 84.1 | 87.9 | 71.0 | 86.4 | 79.3 | 88.0 | 86.9 | 83.4 |
| IndoBERTweet (1M steps from scratch) | 86.2 | 90.4 | 76.0 | 88.8 | 87.5 | 88.1 | 85.4 | 86.1 |
| IndoBERT + Voc adaptation + 200k steps | 86.6 | 92.7 | 79.0 | 88.4 | 84.0 | 87.7 | 86.9 | 86.5 |
## License
This project is licensed under the Apache-2.0 license.
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{koto2021indobertweet,
  title={IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization},
  author={Fajri Koto and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)},
  year={2021}
}
```