🚀 KoELECTRA-small-v3-modu-ner
This model is a fine-tuned version of monologg/koelectra-small-v3-discriminator for token-classification tasks, achieving strong performance in Korean named entity recognition (see Training and evaluation data below for the corpus used).
✨ Features
- Tagging System: It uses the BIO tagging system, which is effective in identifying named entities.
- Diverse Tag Sets: Follows the Korea Information and Communications Technology Association (TTA) classification criteria, with 15 tag sets for comprehensive entity recognition.
📦 Installation
Installation only requires the necessary Python libraries, which you can install with pip:

```shell
pip install transformers datasets torch tokenizers
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Leo97/KoELECTRA-small-v3-modu-ner")
model = AutoModelForTokenClassification.from_pretrained("Leo97/KoELECTRA-small-v3-modu-ner")

ner = pipeline("ner", model=model, tokenizer=tokenizer)
example = "서울역으로 안내해줘."  # "Guide me to Seoul Station."
ner_results = ner(example)
print(ner_results)
```
📚 Documentation
Model description
Tagging System: BIO System
- B (begin): Indicates the start of a named entity.
- I (inside): Indicates that the token is inside a named entity.
- O (outside): Indicates that the token is not part of a named entity.
It follows 15 tag sets based on the classification criteria of the Korea Information and Communications Technology Association (TTA).
| Property | Details |
|----------|---------|
| ARTIFACTS | AF. Man-made objects created by humans, including cultural relics, buildings, musical instruments, roads, weapons, means of transportation, work names, and industrial product names. |
| ANIMAL | AM. Animals other than humans. |
| CIVILIZATION | CV. Civilization/culture. |
| DATE | DT. Periods, seasons, and times/eras. |
| EVENT | EV. Names of specific events, incidents, and occasions. |
| STUDY_FIELD | FD. Academic fields, schools of thought, and sects. |
| LOCATION | LC. All geographical locations, including regions, places, and geographical features. |
| MATERIAL | MT. Elements, metals, rocks/gems, and chemicals. |
| ORGANIZATION | OG. Organization and group names. |
| PERSON | PS. Personal names and aliases (including similar personal names). |
| PLANT | PT. Flowers/trees, land plants, seaweeds, mushrooms, and mosses. |
| QUANTITY | QT. Quantities, orders, and expressions made up of numbers. |
| TIME | TI. Clock-based times and time ranges. |
| TERM | TM. Named entities not covered by the other categories. |
| THEORY | TR. Specific theories, laws, and principles. |
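Under this scheme, each tag combines a BIO prefix with a TTA category code, e.g. B-LC / I-LC for a location span. The sketch below shows how such tags can be merged back into entity spans; the tokens and tags are illustrative, not actual model output:

```python
def merge_bio(tokens, tags):
    """Merge BIO-tagged WordPiece tokens into (entity_text, category) spans."""
    entities, current, category = [], [], None
    for token, tag in zip(tokens, tags):
        piece = token.replace("##", "")  # strip WordPiece continuation markers
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), category))
            current, category = [piece], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(piece)
        else:  # "O" (or a dangling I- tag) ends the current entity
            if current:
                entities.append(("".join(current), category))
            current, category = [], None
    if current:
        entities.append(("".join(current), category))
    return entities

# Illustrative tagging of "서울역으로 안내해줘." ("Guide me to Seoul Station.")
tokens = ["서울", "##역", "으로", "안내", "해줘", "."]
tags = ["B-LC", "I-LC", "O", "O", "O", "O"]
print(merge_bio(tokens, tags))  # [('서울역', 'LC')]
```

In practice the transformers pipeline can perform similar grouping itself, but a manual pass like this makes the BIO semantics explicit.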
Intended uses & limitations
This model is mainly intended for named entity recognition tasks in Korean text. However, its performance may be affected by the domain of the text and the quality of the training data.
Training and evaluation data
The model is trained on the named entity recognition (NER) dataset from the following source:
- Ministry of Culture, Sports and Tourism > National Institute of the Korean Language > Everyone's Corpus > Named Entity Analysis Corpus 2021
- https://corpus.korean.go.kr/request/reausetMain.do
🔧 Technical Details
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 15151
- num_epochs: 20
- mixed_precision_training: Native AMP
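As a side note, the warmup schedule appears to cover roughly the first fifth of training: with 3,788 optimizer steps per epoch (the step count at epoch 1.0 in the results below) and 20 epochs, the reported 15,151 warmup steps amount to almost exactly a 0.2 warmup ratio. A quick arithmetic check:

```python
# Relate the reported warmup_steps to the total number of optimizer steps.
steps_per_epoch = 3788   # step count at epoch 1.0 in the training results
num_epochs = 20
total_steps = steps_per_epoch * num_epochs
warmup_steps = 15151     # reported lr_scheduler_warmup_steps

print(total_steps)                           # 75760, the final step in the table
print(round(warmup_steps / total_steps, 3))  # 0.2
```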
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---------------|-------|------|-----------------|-----------|--------|----|----------|
| No log | 1.0 | 3788 | 0.3978 | 0.5986 | 0.5471 | 0.5717 | 0.9087 |
| No log | 2.0 | 7576 | 0.2319 | 0.6986 | 0.6953 | 0.6969 | 0.9345 |
| No log | 3.0 | 11364 | 0.1838 | 0.7363 | 0.7612 | 0.7486 | 0.9444 |
| No log | 4.0 | 15152 | 0.1610 | 0.7762 | 0.7745 | 0.7754 | 0.9509 |
| No log | 5.0 | 18940 | 0.1475 | 0.7862 | 0.8011 | 0.7936 | 0.9545 |
| No log | 6.0 | 22728 | 0.1417 | 0.7857 | 0.8181 | 0.8016 | 0.9563 |
| No log | 7.0 | 26516 | 0.1366 | 0.8022 | 0.8196 | 0.8108 | 0.9584 |
| No log | 8.0 | 30304 | 0.1346 | 0.8093 | 0.8236 | 0.8164 | 0.9596 |
| No log | 9.0 | 34092 | 0.1328 | 0.8085 | 0.8299 | 0.8190 | 0.9602 |
| No log | 10.0 | 37880 | 0.1332 | 0.8110 | 0.8368 | 0.8237 | 0.9608 |
| No log | 11.0 | 41668 | 0.1323 | 0.8157 | 0.8347 | 0.8251 | 0.9612 |
| No log | 12.0 | 45456 | 0.1353 | 0.8118 | 0.8402 | 0.8258 | 0.9611 |
| No log | 13.0 | 49244 | 0.1370 | 0.8152 | 0.8416 | 0.8282 | 0.9616 |
| No log | 14.0 | 53032 | 0.1368 | 0.8164 | 0.8415 | 0.8287 | 0.9616 |
| No log | 15.0 | 56820 | 0.1378 | 0.8187 | 0.8438 | 0.8310 | 0.9621 |
| No log | 16.0 | 60608 | 0.1389 | 0.8217 | 0.8438 | 0.8326 | 0.9626 |
| No log | 17.0 | 64396 | 0.1380 | 0.8266 | 0.8426 | 0.8345 | 0.9631 |
| No log | 18.0 | 68184 | 0.1428 | 0.8216 | 0.8445 | 0.8329 | 0.9625 |
| No log | 19.0 | 71972 | 0.1431 | 0.8232 | 0.8455 | 0.8342 | 0.9628 |
| 0.1712 | 20.0 | 75760 | 0.1431 | 0.8232 | 0.8449 | 0.8339 | 0.9628 |
Framework versions
- Transformers 4.27.4
- Pytorch 2.0.0+cu118
- Datasets 2.11.0
- Tokenizers 0.13.3