KR-FinBert & KR-FinBert-SC
Significant progress has been made in natural language processing (NLP), and many studies have shown that domain adaptation with a small-scale corpus, followed by fine-tuning on labeled data, effectively improves overall performance. We propose KR-FinBert for the financial domain, built by further pre-training a Korean BERT model on a financial corpus and then fine-tuning it for sentiment analysis (KR-FinBert-SC). As in prior work, the performance gains from domain adaptation and the downstream task were evident in our experiments.

Quick Start
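The fine-tuned sentiment model is published on the Hugging Face Hub as `snunlp/KR-FinBert-SC`. Below is a minimal inference sketch with the `transformers` pipeline API; the exact output label names depend on the hosted model configuration:

```python
# pip install transformers
from transformers import pipeline

# Load the sentiment classifier fine-tuned from KR-FinBert.
classifier = pipeline("text-classification", model="snunlp/KR-FinBert-SC")

# Korean financial news headline, roughly: "Isu Chemical posts a KRW 17.6bn
# operating profit in Q3, up 80% YoY" (a positive sample from this card).
print(classifier("이수화학, 3분기 영업익 176억…전년比 80%↑"))
# e.g. [{'label': ..., 'score': ...}]; label names come from the model config
```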
Features
- Domain-Specific Adaptation: KR-FinBert is further pre-trained on a Korean financial corpus, which benefits finance-related NLP tasks.
- Sentiment Analysis Fine-Tuning: KR-FinBert-SC is fine-tuned for sentiment analysis and accurately classifies the sentiment of financial texts.
Installation
Both models are hosted on the Hugging Face Hub, so the only requirement is the `transformers` library (for example, `pip install transformers`); the checkpoints are downloaded automatically on first use.
Documentation
Data
The training data for this model is an expansion of that of [KR-BERT-MEDIUM](https://huggingface.co/snunlp/KR-Medium), including texts from Korean Wikipedia, general news articles, legal texts crawled from the National Law Information Center, and the [Korean Comments dataset](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments). For transfer learning, corporate-related economic news articles from 72 media sources (such as the Financial Times and The Korean Economy Daily) and analyst reports from 16 securities companies (such as Kiwoom Securities and Samsung Securities) were added. The dataset contains 440,067 news titles with their content and 11,237 analyst reports, about 13.22 GB in total. For masked language modeling (MLM), the data was split line by line, giving 6,379,315 lines in total.

KR-FinBert was trained for 5.5M steps with a maximum sequence length of 512, a training batch size of 32, and a learning rate of 5e-5. Training took 67.48 hours on an NVIDIA TITAN Xp.
| Property | Details |
|----------|---------|
| Model Type | KR-FinBert |
| Training Data | Expanded from KR-BERT-MEDIUM: Korean Wikipedia, general news articles, legal texts, the Korean Comments dataset, corporate-related economic news from 72 media sources, and analyst reports from 16 securities companies. Total data size: about 13.22 GB; 6,379,315 lines for MLM training |
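The further pre-training setup described above can be sketched with the Hugging Face `transformers` Trainer API. This is a minimal illustration rather than the authors' released training script: the corpus file name and data-loading path are assumptions, while the sequence length (512), batch size (32), learning rate (5e-5), and step count (5.5M) are taken from the description above.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Start from the base Korean model that KR-FinBert expands on.
tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-Medium")
model = AutoModelForMaskedLM.from_pretrained("snunlp/KR-Medium")

# The corpus is split line by line for MLM training;
# "financial_corpus.txt" is a hypothetical file name.
dataset = load_dataset("text", data_files={"train": "financial_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kr-finbert",
    per_device_train_batch_size=32,  # batch size from the card
    learning_rate=5e-5,              # learning rate from the card
    max_steps=5_500_000,             # 5.5M steps from the card
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```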
Downstream tasks
Sentiment classification model (KR-FinBert-SC)
The downstream task performance was evaluated on 50,000 labeled examples.
| Model | Accuracy |
|-------|----------|
| KR-FinBert | 0.963 |
| KR-BERT-MEDIUM | 0.958 |
| KcBert-large | 0.955 |
| KcBert-base | 0.953 |
| KoBert | 0.817 |
Inference sample
The examples below are Korean news headlines classified by the model, translated here from the original Korean:

| Positive | Negative |
|----------|----------|
| Hyundai Bio surges 19% on the possibility that 'Polytaxel' can treat COVID-19 | When will the 'COVID ice age' for cinema stocks end… "CJ CGV could lose KRW 400bn this year" |
| Isu Chemical posts a KRW 17.6bn operating profit in Q3, up 80% YoY | Flights grounded by the corona shock… Korean Air posts a KRW 56.6bn operating loss in Q1 |
| "GKL expected to see double-digit sales growth for the first time in 7 years" | Chairman Choi Shin-won arrested over 'KRW 100bn-scale embezzlement and breach of trust'… SK Networks vows to "do its best to prevent a management vacuum" |
| Wysiwyg Studios proves its content power… tops KRW 100bn in sales for the first time | Parts supply disruption… Kia Motors halts all operations at its Gwangju plant |
| Samsung Electronics 'retakes the throne' as No. 1 in India's smartphone market after 2 years | Hyundai Steel's operating profit fell 67.7% YoY to KRW 331.3bn last year |
Citation
@misc{kr-FinBert-SC,
author = {Kim, Eunhee and Hyopil Shin},
title = {KR-FinBert: Fine-tuning KR-FinBert for Sentiment Analysis},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://huggingface.co/snunlp/KR-FinBert-SC}}
}