bert-base-kor-v1
A Korean BERT-base model trained from scratch on a Korean corpus built from AI Hub web data (about 52M sentences), using the NSP and MLM pretraining objectives. The vocabulary size is 10,022 (BertTokenizer).
Usage Examples
Basic Usage
from transformers import AutoTokenizer, BertForMaskedLM
import torch
import torch.nn.functional as F

# Load the tokenizer and the masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained('bongsoo/bert-small-kor-v1', do_lower_case=False)
model = BertForMaskedLM.from_pretrained('bongsoo/bert-small-kor-v1')
model.eval()

# Korean example sentences containing a [MASK] token to fill in:
# "The capital of Korea is [MASK]", "The capital of France is [MASK]",
# "Chungmugong Yi Sun-sin was the greatest general of [MASK]"
text = ['한국의 수도는 [MASK] 이다', '프랑스의 수도는 [MASK]이다', '충무공 이순신은 [MASK]에 최고의 장수였다']

tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')
with torch.no_grad():
    outputs = model(**tokenized_input)
logits = outputs.logits  # (batch_size, seq_len, vocab_size)

# Locate the [MASK] position in each tokenized input
mask_idx_list = []
for tokens in tokenized_input['input_ids'].tolist():
    token_str = [tokenizer.convert_ids_to_tokens(s) for s in tokens]
    mask_idx_list.append(token_str.index('[MASK]'))

# For each input, print the most probable token at the [MASK] position
for idx, mask_idx in enumerate(mask_idx_list):
    logits_pred = torch.argmax(F.softmax(logits[idx], dim=-1), dim=-1)
    mask_logits_idx = int(logits_pred[mask_idx])
    mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
    print('\n')
    print('*Input: {}'.format(text[idx]))
    print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
Advanced Usage
The snippet above covers basic masked-token prediction. For other scenarios, adjust the input sentences, max_length, and how the logits are decoded to fit your task.
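As one further illustration (not part of the original card), the same checkpoint can be loaded as a bare encoder with AutoModel and mean-pooled into fixed-size sentence vectors; the pooling helper and example sentences below are assumptions made for this sketch.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bongsoo/bert-small-kor-v1', do_lower_case=False)
encoder = AutoModel.from_pretrained('bongsoo/bert-small-kor-v1')
encoder.eval()

def embed(sentences):
    # Tokenize a batch of sentences and run the bare encoder
    batch = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, hidden)
    # Mean-pool over non-padding tokens to get one vector per sentence
    mask = batch['attention_mask'].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = embed(['한국의 수도는 서울이다', '프랑스의 수도는 파리이다'])
print(embeddings.shape)   # torch.Size([2, 512]) given hidden_size=512 in the config below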
Technical Details
Training
- Model: Bert-base
- Corpus: Korean corpus data based on AI Hub web data (about 52M sentences)
- Hyperparameters: lr = 1e-4, weight_decay = 0.0, batch_size = 256, token_max_len = 160, epoch = 8, do_lower_case = True
- Vocabulary: 10,022 (BertTokenizer)
- Training Time: 171 hours on a single GPU (18.5GB of 24GB memory used)
- Training Code: Refer to here; a rough sketch of the pretraining setup is shown after this list.
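The linked training code is not reproduced here. As a rough, non-authoritative sketch of how MLM+NSP pretraining with these hyperparameters could be wired up with the transformers Trainer (the vocab path, output directory, masking probability, and dataset are placeholders or assumptions, not details from the original):

from transformers import (BertConfig, BertForPreTraining, BertTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Tokenizer with the 10,022-token Korean vocabulary (path is a placeholder)
tokenizer = BertTokenizer.from_pretrained('path/to/korean-wordpiece-vocab', do_lower_case=True)

# Randomly initialized weights; dimensions follow the config shown below
config = BertConfig(vocab_size=10022, hidden_size=512, num_hidden_layers=4,
                    num_attention_heads=8, intermediate_size=2048)
model = BertForPreTraining(config)

# Hyperparameters as listed above (lr=1e-4, weight_decay=0.0, batch_size=256, epoch=8)
args = TrainingArguments(
    output_dir='bert-kor-v1-pretraining',
    learning_rate=1e-4,
    weight_decay=0.0,
    per_device_train_batch_size=256,
    num_train_epochs=8,
)

# MLM masking is applied on the fly; the dataset itself must supply NSP sentence pairs
# and labels, and inputs should be tokenized with max_length=160 (token_max_len above).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=collator)
# trainer.train()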
Model Config
{
"architectures": [
"BertForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 8,
"num_hidden_layers": 4,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.21.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 10022
}
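The same configuration can also be inspected programmatically. The short snippet below is just an illustration, assuming the checkpoint name used in the usage example; the printed fields mirror the JSON above.

from transformers import AutoConfig

config = AutoConfig.from_pretrained('bongsoo/bert-small-kor-v1')
print(config.model_type, config.vocab_size)                                       # bert 10022
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)   # 512 4 8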
License
This project is licensed under the Apache-2.0 license.
Citing & Authors
bongsoo