Model Card for kobart-base-v2
KoBART-base-v2 is a Korean encoder-decoder language model based on the BART architecture, trained on over 40GB of Korean text.
Quick Start
Use the code below to get started with the model.
from transformers import PreTrainedTokenizerFast, BartModel

# Load the KoBART tokenizer and base model from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
model = BartModel.from_pretrained('gogamza/kobart-base-v2')
Features
- Feature Extraction: This model can be used for the task of feature extraction.
- Korean Language Support: Specifically designed for the Korean language, trained on a large amount of Korean text.
Installation
No specific installation steps are provided in the original document; a minimal setup is sketched below.
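Assuming a standard Python environment, the snippets in this card only rely on the Hugging Face transformers library and PyTorch:
pip install transformers torch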
Usage Examples
Basic Usage
from transformers import PreTrainedTokenizerFast, BartModel
tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
model = BartModel.from_pretrained('gogamza/kobart-base-v2')
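Continuing from the snippet above, a minimal feature-extraction sketch (assuming PyTorch is installed; the Korean example sentence is arbitrary):
import torch

# Encode a Korean sentence and run it through the encoder-decoder
inputs = tokenizer("안녕하세요. 한국어 BART 입니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decoder output, shape (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
# The encoder's representation is also available as outputs.encoder_last_hidden_state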
Documentation
Model Details
BART (Bidirectional and Auto-Regressive Transformers) is trained as a denoising autoencoder: noise is added to parts of the input text and the model learns to restore the original text. Korean BART (hereinafter KoBART) is a Korean encoder-decoder language model trained on over 40GB of Korean text using the Text Infilling noise function described in the BART paper. We are releasing the derived KoBART-base model.
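As an illustration of the Text Infilling objective (a hypothetical example, not drawn from the actual training data or pipeline): a contiguous span of the input is replaced with a single mask token, and the encoder-decoder learns to reconstruct the original text.
# Text Infilling, illustrative only; the mask token name follows the usual BART convention
original  = "한국어 텍스트로 사전학습된 인코더-디코더 언어 모델입니다."
corrupted = "한국어 텍스트로 사전학습된 <mask>입니다."  # the span "인코더-디코더 언어 모델" becomes one <mask>
# Training pair: encoder input = corrupted, decoder target = original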
- Developed by: More information needed
- Shared by [Optional]: Heewon(Haven) Jeon
- Model type: Feature Extraction
- Language(s) (NLP): Korean
- License: MIT
- Parent Model: BART
- Resources for more information:
Uses
Direct Use
This model can be used for the task of Feature Extraction.
Out-of-Scope Use
The model should not be used to intentionally create hostile or alienating environments for people.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
Training Details
Training Data
| Data | # of Sentences |
|------|----------------|
| Korean Wiki | 5M |
| Other corpus | 0.27B |
In addition to the Korean Wikipedia, various data such as news, books, Modu Corpus v1.0 (conversations, news, ...), and Blue House Petitions were used for model training.
The vocabulary size is 30,000, and emoticons and emoji frequently used in conversation, together with text emoticons such as :-), :), -), and (-:, were added to improve the model's ability to recognize these tokens.
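A quick, illustrative check that such tokens map to known vocabulary entries (the example sentence is made up; the exact tokenization depends on the released vocabulary):
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
# Emoticons should be split into known tokens rather than the unknown token
print(tokenizer.tokenize("정말 재밌었어요 :-)"))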
Training Procedure
Tokenizer
The tokenizer was trained using the Character BPE tokenizer from the tokenizers package.
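A minimal sketch of training such a tokenizer with the tokenizers package (not the authors' actual script; the corpus path and special tokens are illustrative, and only the 30,000 vocabulary size is taken from this card):
from tokenizers import CharBPETokenizer

# Illustrative corpus file and special tokens; vocab_size matches the card above
tokenizer = CharBPETokenizer()
tokenizer.train(
    files=["korean_corpus.txt"],
    vocab_size=30000,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
tokenizer.save("kobart_char_bpe_tokenizer.json")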
Speeds, Sizes, Times
| Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|-------|-------------|------|-------------|------------|---------|-------------|
| KoBART-base | 124M | Encoder | 6 | 16 | 3072 | 768 |
|  |  | Decoder | 6 | 16 | 3072 | 768 |
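These dimensions can be cross-checked against the released checkpoint's configuration; a quick sketch (the printed values are expected to match the table above):
from transformers import BartConfig

config = BartConfig.from_pretrained('gogamza/kobart-base-v2')
print(config.encoder_layers, config.decoder_layers)                    # layers per stack
print(config.encoder_attention_heads, config.decoder_attention_heads)  # attention heads
print(config.encoder_ffn_dim, config.decoder_ffn_dim)                  # feed-forward dimension
print(config.d_model)                                                  # hidden size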
Evaluation
Testing Data, Factors & Metrics
More information needed for testing data, factors, and metrics.
Results
NSMC (Naver Sentiment Movie Corpus): the model authors report results in the GitHub Repo.
Model Examination
More information needed.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: More information needed
- Hours used: More information needed
- Cloud Provider: More information needed
- Compute Region: More information needed
- Carbon Emitted: More information needed
Technical Specifications [optional]
More information needed for model architecture, objective, compute infrastructure (hardware and software).
Citation
More information needed for BibTeX citation.
Glossary [optional]
More information needed.
More Information [optional]
More information needed.
Model Card Authors [optional]
Heewon(Haven) Jeon in collaboration with Ezi Ozoani and the Hugging Face team
Model Card Contact
The model authors note in the GitHub Repo:
Please post issues related to KoBART on the project's GitHub issue tracker.