🚀 KoELECTRA-small-v3-modu-ner
This model is a fine-tuned version of monologg/koelectra-small-v3-discriminator for token-classification tasks, achieving strong performance in Korean named entity recognition (see Training and evaluation data below for the corpus used).
✨ Features
- Tagging System: It uses the BIO tagging system, which is effective in identifying named entities.
- Diverse Tag Sets: Follows the Korea Information and Communications Technology Association (TTA) classification criteria, with 15 tag sets for comprehensive entity recognition.
📦 Installation
Installation only requires the necessary Python libraries, which you can install with pip:

```shell
pip install transformers datasets torch tokenizers
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Leo97/KoELECTRA-small-v3-modu-ner")
model = AutoModelForTokenClassification.from_pretrained("Leo97/KoELECTRA-small-v3-modu-ner")

ner = pipeline("ner", model=model, tokenizer=tokenizer)
example = "서울역으로 안내해줘."  # "Guide me to Seoul Station."
ner_results = ner(example)
print(ner_results)
```
📚 Documentation
Model description
Tagging System: BIO System
- B (begin): Indicates the start of a named entity.
- I (inside): Indicates that the token is inside a named entity.
- O (outside): Indicates that the token is not part of a named entity.
It follows 15 tag sets based on the classification criteria of the Korea Information and Communications Technology Association (TTA).
| Property | Details |
|----------|---------|
| ARTIFACTS | AF. Man-made objects created by humans, including cultural relics, buildings, musical instruments, roads, weapons, means of transportation, work names, and industrial product names. |
| ANIMAL | AM. Animals other than humans. |
| CIVILIZATION | CV. Civilization/culture. |
| DATE | DT. Periods, seasons, and times/eras. |
| EVENT | EV. Names of specific events, incidents, and occasions. |
| STUDY_FIELD | FD. Academic fields, schools of thought, and sects. |
| LOCATION | LC. All geographical locations, including regions, places, and geographical features. |
| MATERIAL | MT. Elements, metals, rocks/gems, and chemicals. |
| ORGANIZATION | OG. Organization and group names. |
| PERSON | PS. Personal names and aliases (including similar personal names). |
| PLANT | PT. Flowers/trees, land plants, seaweeds, mushrooms, and mosses. |
| QUANTITY | QT. Quantities, orders, and expressions made up of numbers. |
| TIME | TI. Clock-based times and time ranges. |
| TERM | TM. Named entities not covered by the other categories. |
| THEORY | TR. Specific theories, laws, and principles. |
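Under this scheme, each tag combines a BIO prefix with a TTA category code, e.g. B-LC / I-LC for a location span. The sketch below shows how such tags can be merged back into entity spans; the tokens and tags are illustrative, not actual model output:

```python
def merge_bio(tokens, tags):
    """Merge BIO-tagged WordPiece tokens into (entity_text, category) spans."""
    entities, current, category = [], [], None
    for token, tag in zip(tokens, tags):
        piece = token.replace("##", "")  # strip WordPiece continuation markers
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), category))
            current, category = [piece], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(piece)
        else:  # "O" (or a dangling I- tag) ends the current entity
            if current:
                entities.append(("".join(current), category))
            current, category = [], None
    if current:
        entities.append(("".join(current), category))
    return entities

# Illustrative tagging of "서울역으로 안내해줘." ("Guide me to Seoul Station.")
tokens = ["서울", "##역", "으로", "안내", "해줘", "."]
tags = ["B-LC", "I-LC", "O", "O", "O", "O"]
print(merge_bio(tokens, tags))  # [('서울역', 'LC')]
```

In practice the transformers pipeline can perform similar grouping itself, but a manual pass like this makes the BIO semantics explicit.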
Intended uses & limitations
This model is mainly intended for named entity recognition tasks in Korean text. However, its performance may be affected by the domain of the text and the quality of the training data.
Training and evaluation data
The model is trained on the named entity recognition (NER) dataset from the following source:
- Ministry of Culture, Sports and Tourism > National Institute of the Korean Language > Everyone's Corpus > Named Entity Analysis Corpus 2021
- https://corpus.korean.go.kr/request/reausetMain.do
🔧 Technical Details
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 15151
- num_epochs: 20
- mixed_precision_training: Native AMP
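As a side note, the warmup schedule appears to cover roughly the first fifth of training: with 3,788 optimizer steps per epoch (the step count at epoch 1.0 in the results below) and 20 epochs, the reported 15,151 warmup steps amount to almost exactly a 0.2 warmup ratio. A quick arithmetic check:

```python
# Relate the reported warmup_steps to the total number of optimizer steps.
steps_per_epoch = 3788   # step count at epoch 1.0 in the training results
num_epochs = 20
total_steps = steps_per_epoch * num_epochs
warmup_steps = 15151     # reported lr_scheduler_warmup_steps

print(total_steps)                           # 75760, the final step in the table
print(round(warmup_steps / total_steps, 3))  # 0.2
```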
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---------------|-------|------|-----------------|-----------|--------|----|----------|
| No log | 1.0 | 3788 | 0.3978 | 0.5986 | 0.5471 | 0.5717 | 0.9087 |
| No log | 2.0 | 7576 | 0.2319 | 0.6986 | 0.6953 | 0.6969 | 0.9345 |
| No log | 3.0 | 11364 | 0.1838 | 0.7363 | 0.7612 | 0.7486 | 0.9444 |
| No log | 4.0 | 15152 | 0.1610 | 0.7762 | 0.7745 | 0.7754 | 0.9509 |
| No log | 5.0 | 18940 | 0.1475 | 0.7862 | 0.8011 | 0.7936 | 0.9545 |
| No log | 6.0 | 22728 | 0.1417 | 0.7857 | 0.8181 | 0.8016 | 0.9563 |
| No log | 7.0 | 26516 | 0.1366 | 0.8022 | 0.8196 | 0.8108 | 0.9584 |
| No log | 8.0 | 30304 | 0.1346 | 0.8093 | 0.8236 | 0.8164 | 0.9596 |
| No log | 9.0 | 34092 | 0.1328 | 0.8085 | 0.8299 | 0.8190 | 0.9602 |
| No log | 10.0 | 37880 | 0.1332 | 0.8110 | 0.8368 | 0.8237 | 0.9608 |
| No log | 11.0 | 41668 | 0.1323 | 0.8157 | 0.8347 | 0.8251 | 0.9612 |
| No log | 12.0 | 45456 | 0.1353 | 0.8118 | 0.8402 | 0.8258 | 0.9611 |
| No log | 13.0 | 49244 | 0.1370 | 0.8152 | 0.8416 | 0.8282 | 0.9616 |
| No log | 14.0 | 53032 | 0.1368 | 0.8164 | 0.8415 | 0.8287 | 0.9616 |
| No log | 15.0 | 56820 | 0.1378 | 0.8187 | 0.8438 | 0.8310 | 0.9621 |
| No log | 16.0 | 60608 | 0.1389 | 0.8217 | 0.8438 | 0.8326 | 0.9626 |
| No log | 17.0 | 64396 | 0.1380 | 0.8266 | 0.8426 | 0.8345 | 0.9631 |
| No log | 18.0 | 68184 | 0.1428 | 0.8216 | 0.8445 | 0.8329 | 0.9625 |
| No log | 19.0 | 71972 | 0.1431 | 0.8232 | 0.8455 | 0.8342 | 0.9628 |
| 0.1712 | 20.0 | 75760 | 0.1431 | 0.8232 | 0.8449 | 0.8339 | 0.9628 |
Framework versions
- Transformers 4.27.4
- Pytorch 2.0.0+cu118
- Datasets 2.11.0
- Tokenizers 0.13.3