bert-base-kor-v1
A Korean BERT-base model trained from scratch on a Korean corpus built from AI Hub web data (about 52M sentences), using the NSP and MLM pretraining objectives. The vocabulary size is 10,022 (BertTokenizer).
Usage Examples
Basic Usage
from transformers import AutoTokenizer, BertForMaskedLM
import torch
import torch.nn.functional as F

# Load the tokenizer and the masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained('bongsoo/bert-small-kor-v1', do_lower_case=False)
model = BertForMaskedLM.from_pretrained('bongsoo/bert-small-kor-v1')
model.eval()

# Korean example sentences containing a [MASK] token to fill in:
# "The capital of Korea is [MASK]", "The capital of France is [MASK]",
# "Chungmugong Yi Sun-sin was the greatest general of [MASK]"
text = ['한국의 수도는 [MASK] 이다', '프랑스의 수도는 [MASK]이다', '충무공 이순신은 [MASK]에 최고의 장수였다']

tokenized_input = tokenizer(text, max_length=128, truncation=True, padding='max_length', return_tensors='pt')
with torch.no_grad():
    outputs = model(**tokenized_input)
logits = outputs.logits  # (batch_size, seq_len, vocab_size)

# Locate the [MASK] position in each tokenized input
mask_idx_list = []
for tokens in tokenized_input['input_ids'].tolist():
    token_str = [tokenizer.convert_ids_to_tokens(s) for s in tokens]
    mask_idx_list.append(token_str.index('[MASK]'))

# For each input, print the most probable token at the [MASK] position
for idx, mask_idx in enumerate(mask_idx_list):
    logits_pred = torch.argmax(F.softmax(logits[idx], dim=-1), dim=-1)
    mask_logits_idx = int(logits_pred[mask_idx])
    mask_logits_token = tokenizer.convert_ids_to_tokens(mask_logits_idx)
    print('\n')
    print('*Input: {}'.format(text[idx]))
    print('*[MASK] : {} ({})'.format(mask_logits_token, mask_logits_idx))
Advanced Usage
The snippet above covers basic masked-token prediction. For other scenarios, adjust the input sentences, max_length, and how the logits are decoded to fit your task.
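As one further illustration (not part of the original card), the same checkpoint can be loaded as a bare encoder with AutoModel and mean-pooled into fixed-size sentence vectors; the pooling helper and example sentences below are assumptions made for this sketch.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bongsoo/bert-small-kor-v1', do_lower_case=False)
encoder = AutoModel.from_pretrained('bongsoo/bert-small-kor-v1')
encoder.eval()

def embed(sentences):
    # Tokenize a batch of sentences and run the bare encoder
    batch = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, hidden)
    # Mean-pool over non-padding tokens to get one vector per sentence
    mask = batch['attention_mask'].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = embed(['한국의 수도는 서울이다', '프랑스의 수도는 파리이다'])
print(embeddings.shape)   # torch.Size([2, 512]) given hidden_size=512 in the config below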
Technical Details
Training
- Model: Bert-base
- Corpus: Korean corpus data based on AI Hub web data (about 52M sentences)
- Hyperparameters: lr = 1e-4, weight_decay = 0.0, batch_size = 256, token_max_len = 160, epoch = 8, do_lower_case = True
- Vocabulary: 10,022 (BertTokenizer)
- Training Time: 171 hours on a single GPU (18.5GB of 24GB memory used)
- Training Code: Refer to here; a rough sketch of the pretraining setup is shown after this list.
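The linked training code is not reproduced here. As a rough, non-authoritative sketch of how MLM+NSP pretraining with these hyperparameters could be wired up with the transformers Trainer (the vocab path, output directory, masking probability, and dataset are placeholders or assumptions, not details from the original):

from transformers import (BertConfig, BertForPreTraining, BertTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Tokenizer with the 10,022-token Korean vocabulary (path is a placeholder)
tokenizer = BertTokenizer.from_pretrained('path/to/korean-wordpiece-vocab', do_lower_case=True)

# Randomly initialized weights; dimensions follow the config shown below
config = BertConfig(vocab_size=10022, hidden_size=512, num_hidden_layers=4,
                    num_attention_heads=8, intermediate_size=2048)
model = BertForPreTraining(config)

# Hyperparameters as listed above (lr=1e-4, weight_decay=0.0, batch_size=256, epoch=8)
args = TrainingArguments(
    output_dir='bert-kor-v1-pretraining',
    learning_rate=1e-4,
    weight_decay=0.0,
    per_device_train_batch_size=256,
    num_train_epochs=8,
)

# MLM masking is applied on the fly; the dataset itself must supply NSP sentence pairs
# and labels, and inputs should be tokenized with max_length=160 (token_max_len above).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=collator)
# trainer.train()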
Model Config
{
"architectures": [
"BertForPreTraining"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 8,
"num_hidden_layers": 4,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.21.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 10022
}
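The same configuration can also be inspected programmatically. The short snippet below is just an illustration, assuming the checkpoint name used in the usage example; the printed fields mirror the JSON above.

from transformers import AutoConfig

config = AutoConfig.from_pretrained('bongsoo/bert-small-kor-v1')
print(config.model_type, config.vocab_size)                                       # bert 10022
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)   # 512 4 8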
License
This project is licensed under the Apache-2.0 license.
Citing & Authors
bongsoo