# Zero-shot Classification Model for Korean NLI
This model performs zero-shot classification in Korean. It fine-tunes the klue/roberta-base model on the mnli and xnli subsets of the kor_nli dataset, offering high accuracy in Korean text classification tasks.
## Quick Start

### Prerequisites

This model is based on the following tutorial repository: https://github.com/Huffon/klue-transformers-tutorial.git
## Model Details

| Property | Details |
| --- | --- |
| Model Type | Fine-tuned klue/roberta-base on kor_nli |
| Training Data | kor_nli (mnli, xnli) |
| License | Apache-2.0 |
| Metrics | Accuracy |
| Pipeline Tag | Zero-shot Classification |
### Training Parameters

| train_loss | val_loss | acc | epoch | batch | lr |
| --- | --- | --- | --- | --- | --- |
| 0.326 | 0.538 | 0.811 | 3 | 32 | 2e-5 |
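The hyperparameters above correspond to a standard Hugging Face fine-tuning configuration along these lines. This is a hypothetical sketch for illustration only: the actual training script is not published here, and the `output_dir` name is an assumption.

```python
from transformers import TrainingArguments

# Illustrative configuration matching the reported hyperparameters
# (epoch=3, batch=32, lr=2e-5); not the author's actual script.
training_args = TrainingArguments(
    output_dir="roberta_with_kornli",   # assumed name
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)
```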
## Code Modification for the Zero-shot Pipeline

Models that do not use `token_type_ids`, such as RoBERTa, cannot be used with the zero-shot pipeline directly (as of `transformers==4.7.0`), so the following conversion code is needed. It is adapted from the GitHub repository linked above.
```python
from abc import ABC, abstractmethod

from transformers import AutoTokenizer

# The handler below needs the model's tokenizer for its separator token.
tokenizer = AutoTokenizer.from_pretrained("pongjin/roberta_with_kornli")


class ArgumentHandler(ABC):
    """
    Base interface for handling arguments for each :class:`~transformers.pipelines.Pipeline`.
    """

    @abstractmethod
    def __call__(self, *args, **kwargs):
        raise NotImplementedError()


class CustomZeroShotClassificationArgumentHandler(ArgumentHandler):
    """
    Handles arguments for zero-shot text classification by turning each possible
    label into an NLI premise/hypothesis pair.
    """

    def _parse_labels(self, labels):
        if isinstance(labels, str):
            labels = [label.strip() for label in labels.split(",")]
        return labels

    def __call__(self, sequences, labels, hypothesis_template):
        if len(labels) == 0 or len(sequences) == 0:
            raise ValueError("You must include at least one label and at least one sequence.")
        if hypothesis_template.format(labels[0]) == hypothesis_template:
            raise ValueError(
                (
                    'The provided hypothesis_template "{}" was not able to be formatted with the target labels. '
                    "Make sure the passed template includes formatting syntax such as {{}} where the label should go."
                ).format(hypothesis_template)
            )

        if isinstance(sequences, str):
            sequences = [sequences]
        labels = self._parse_labels(labels)

        # Join premise and hypothesis explicitly with the separator token,
        # since RoBERTa does not use token_type_ids to mark the boundary.
        sequence_pairs = []
        for sequence in sequences:
            for label in labels:
                sequence_pairs.append(
                    f"{sequence} {tokenizer.sep_token} {hypothesis_template.format(label)}"
                )

        return sequence_pairs, sequences
```
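To see what the handler produces, here is a minimal, self-contained sketch of the same pairing logic. The hard-coded `"</s>"` and the `build_pairs` helper are illustrative stand-ins; in the real handler the separator comes from `tokenizer.sep_token`.

```python
def build_pairs(sequence, labels, hypothesis_template, sep_token="</s>"):
    """Turn one premise and N candidate labels into N NLI input strings."""
    return [
        f"{sequence} {sep_token} {hypothesis_template.format(label)}"
        for label in labels
    ]

pairs = build_pairs(
    "금리가 올랐다",
    ["경제", "스포츠"],
    "이는 {}에 관한 것이다.",
)
# pairs[0] == "금리가 올랐다 </s> 이는 경제에 관한 것이다."
```

Each candidate label yields one premise/hypothesis string, which the NLI model then scores for entailment.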
### Applying the Modified Code

Pass the custom argument handler when defining the classifier:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    args_parser=CustomZeroShotClassificationArgumentHandler(),
    model="pongjin/roberta_with_kornli",
)
```
## 💻 Usage Examples

### Basic Usage

```python
sequence = "배당락 D-1 코스피, 2330선 상승세...외인·기관 사자"
candidate_labels = ["외환", "환율", "경제", "금융", "부동산", "주식"]

classifier(
    sequence,
    candidate_labels,
    hypothesis_template="이는 {}에 관한 것이다.",
)
```

Output:

```python
{'sequence': '배당락 D-1 코스피, 2330선 상승세...외인·기관 사자',
 'labels': ['주식', '금융', '경제', '외환', '환율', '부동산'],
 'scores': [0.5052872896194458,
  0.17972524464130402,
  0.13852974772453308,
  0.09460823982954025,
  0.042949128895998,
  0.038900360465049744]}
```
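In the pipeline's default single-label mode, the returned scores are softmax probabilities over the candidate labels, so they sum to (approximately) 1 and the labels come back sorted by score. A small sketch reading off the top prediction from the output above:

```python
labels = ["주식", "금융", "경제", "외환", "환율", "부동산"]
scores = [0.5052872896194458, 0.17972524464130402, 0.13852974772453308,
          0.09460823982954025, 0.042949128895998, 0.038900360465049744]

# The pipeline already sorts labels by score, so the top prediction is first.
top_label = labels[scores.index(max(scores))]
# top_label == "주식"; sum(scores) is ~1.0
```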
## License

This model is licensed under the Apache-2.0 license.