🚀 Korean-based ELECTRA (KR-ELECTRA)
KR-ELECTRA is a Korean-specific ELECTRA model developed by the Computational Linguistics Lab at Seoul National University. It shows remarkable performance on tasks involving informal text, such as review documents, while remaining comparable to other Korean models on other kinds of tasks.
✨ Features
- High Performance: Demonstrates excellent performance on Korean language tasks, especially those related to informal texts.
- Balanced Training: Trained on a balanced dataset of written and spoken Korean data.
- Multiple Model Formats: Available in TensorFlow-v1 and PyTorch formats.
📦 Installation
The TensorFlow-v1 checkpoint is available via the link in the Download Link section below. The PyTorch model can be loaded directly from the Hugging Face Hub:

```python
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("snunlp/KR-ELECTRA-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")
```
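For reference, here is a minimal sketch of how the tokenizer and discriminator might be used together to obtain contextual embeddings. The example sentence and tensor handling are illustrative and not taken from the original documentation.

```python
import torch
from transformers import ElectraModel, ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")
model = ElectraModel.from_pretrained("snunlp/KR-ELECTRA-discriminator")

# Encode an arbitrary Korean sentence (illustrative example).
inputs = tokenizer("배송이 정말 빠르고 제품도 좋아요!", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last hidden states: (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```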
📚 Documentation
Released Model
We pre-trained our KR-ELECTRA model following the base-scale configuration of ELECTRA. The model was trained with TensorFlow-v1 on a v3-8 TPU on Google Cloud Platform.
Model Details
We followed the training parameters of the base-scale model of ELECTRA.
Hyperparameters
| Property | Details |
|---|---|
| Model Type | KR-ELECTRA |
| # of layers (Discriminator) | 12 |
| embedding size (Discriminator) | 768 |
| hidden size (Discriminator) | 768 |
| # of heads (Discriminator) | 12 |
| # of layers (Generator) | 12 |
| embedding size (Generator) | 768 |
| hidden size (Generator) | 256 |
| # of heads (Generator) | 4 |
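For orientation, the dimensions above roughly correspond to the following `transformers` configurations. This is a sketch for illustration only; the released checkpoint ships its own `config.json`, which is authoritative.

```python
from transformers import ElectraConfig

# Discriminator architecture as listed in the table above (illustrative sketch).
discriminator_config = ElectraConfig(
    vocab_size=30000,          # see the Vocabulary section below
    embedding_size=768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)

# Generator: same depth and embedding size, but smaller hidden size and fewer heads.
generator_config = ElectraConfig(
    vocab_size=30000,
    embedding_size=768,
    hidden_size=256,
    num_hidden_layers=12,
    num_attention_heads=4,
)
```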
Pretraining
| batch size | train steps | learning rate | max sequence length | generator size |
|---|---|---|---|---|
| 256 | 700000 | 2e-4 | 128 | 0.33333 |
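As a rough sketch, these settings could be collected into an hparams override for the original ELECTRA pretraining code along the lines shown below. The key names follow the conventions of google-research/electra's `configure_pretraining.py` as best recalled and are assumptions that should be verified against that repository; there, `generator_hidden_size` is a fraction of the discriminator hidden size, which matches the 0.33333 generator size above.

```python
# Hedged sketch: pretraining settings from the table, expressed as an hparams
# override in the style of google-research/electra's configure_pretraining.py.
# Key names are assumptions and should be checked against that repository.
kr_electra_hparams = {
    "train_batch_size": 256,
    "num_train_steps": 700000,
    "learning_rate": 2e-4,
    "max_seq_length": 128,
    "generator_hidden_size": 0.33333,  # generator size as a fraction of the discriminator
    "vocab_size": 30000,
}
```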
Training Dataset
34GB of Korean text, including Wikipedia documents, news articles, legal texts, news comments, product reviews, and so on. The corpus is balanced, with equal proportions of written and spoken data.
Vocabulary
| Property | Details |
|---|---|
| Vocab size | 30,000 |
| Vocab unit | Morpheme-based unit tokens using the Mecab-Ko morpheme analyzer |
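Because the vocabulary is built from morpheme-based units, tokenization can differ noticeably from purely WordPiece-based Korean models. Below is an illustrative sketch of inspecting the tokenizer's output; the exact tokens depend on the released vocabulary and are not reproduced from the original documentation.

```python
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")

# Inspect how an arbitrary Korean sentence is split into vocabulary units.
tokens = tokenizer.tokenize("자연어 처리는 재미있다.")
print(tokens)                                   # morpheme-based subword tokens
print(tokenizer.convert_tokens_to_ids(tokens))  # corresponding vocabulary ids
print(tokenizer.vocab_size)                     # 30,000 per the table above
```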
Download Link
- Tensorflow-v1 model (download)
- PyTorch models on HuggingFace: `snunlp/KR-ELECTRA-discriminator` (see the usage example in the Installation section above)
Finetuning
We used slightly modified versions of the finetuning code from KoELECTRA, with additionally adjusted hyperparameters. The code and config files used for our model are available in our GitHub repository.
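The authors' actual scripts are the adapted KoELECTRA ones mentioned above. As a rough illustration of the general approach, a sentence-classification head can be attached with `transformers` as sketched below; the labels, batch, and optimizer settings are placeholders, not the hyperparameters used for the reported results.

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Illustrative sketch only: binary sentiment classification (e.g., NSMC-style),
# not the authors' KoELECTRA-based finetuning scripts or hyperparameters.
tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "snunlp/KR-ELECTRA-discriminator", num_labels=2
)

texts = ["이 영화 정말 최고였어요", "시간 낭비였다"]  # placeholder examples
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One illustrative training step.
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```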
Experimental Results
| | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
|---|---|---|---|---|---|---|---|---|
| KoBERT | 89.59 | 87.92 | 81.25 | 79.62 | 81.59 | 94.85 | 51.75 / 79.15 | 66.21 |
| XLM-Roberta-Base | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
| HanBERT | 90.06 | 87.70 | 82.95 | 80.32 | 82.73 | 94.72 | 78.74 / 92.02 | 68.32 |
| KoELECTRA-Base | 90.33 | 87.18 | 81.70 | 80.64 | 82.00 | 93.54 | 60.86 / 89.28 | 66.09 |
| KoELECTRA-Base-v2 | 89.56 | 87.16 | 80.70 | 80.72 | 82.30 | 94.85 | 84.01 / 92.40 | 67.45 |
| KoELECTRA-Base-v3 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 67.61 |
| KR-ELECTRA (ours) | 91.168 | 87.90 | 82.05 | 82.51 | 85.41 | 95.51 | 84.93 / 93.04 | 74.50 |
The baseline results are taken from the KoELECTRA repository.
Citation
```bibtex
@misc{kr-electra,
  author       = {Lee, Sangah and Hyopil Shin},
  title        = {KR-ELECTRA: a KoRean-based ELECTRA model},
  year         = {2022},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-ELECTRA}}
}
```