Open-source Korean Speech Recognition Model: stt_kr_conformer_ctc_medium - Optimized for Streaming and Customer Service Audio

stt_kr_conformer_ctc_medium

Developed by SungBeom

A Korean automatic speech recognition model based on the Conformer architecture, optimized for streaming and performing especially well in specific domains such as customer service calls.

Speech Recognition, Korean

Open Source License: Apache-2.0

#Korean speech recognition #Stream processing optimization #Specialized for customer service

Downloads 176

Release Time: 6/4/2023

Model Overview

This is a Korean automatic speech recognition model based on the Conformer-CTC architecture, fine-tuned on the AI Hub dataset. Compared to attention-based models, it degrades less under streaming and runs faster, making it well suited to real-time speech recognition.

Model Features

Stream Processing Optimization

Compared to attention-based models like Whisper, it degrades less under streaming and runs faster. With 2-second audio chunks, performance drops by about 20% relative to decoding the entire audio.

Efficient Inference

The real-time factor (RTF) is about 0.05 on a V100 GPU and 0.35 on a 7-core CPU. Both are well below 1.0, so the model decodes faster than real time.
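RTF is the time spent decoding divided by the duration of the audio decoded. The sketch below applies the reported figures to a hypothetical 60-second clip; the function names and the clip length are illustrative assumptions, not part of the model's code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent decoding / duration of the audio decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(rtf: float, audio_seconds: float) -> float:
    """Estimate decoding time for a clip from a reported RTF."""
    return rtf * audio_seconds


# Reported RTF values applied to a hypothetical 60-second call segment.
gpu_seconds = processing_time(0.05, 60.0)  # V100 GPU figure
cpu_seconds = processing_time(0.35, 60.0)  # 7-core CPU figure
```

Under these figures, a 60-second clip would take roughly 3 seconds on the GPU and 21 seconds on the CPU.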

Strong Domain Adaptability

In closed domains such as customer service calls, adding a KenLM language model reduced the word error rate (WER) from 13.45 to 5.27. Other domains did not see a significant gain.

Model Capabilities

Korean speech recognition

Real-time streaming speech recognition

Domain-specific speech recognition

Use Cases

Customer Service Domain

Customer Service Voice Transcription

Used for real-time voice transcription of customer service calls

Word error rate reduced from 13.45 to 5.27 when combined with KenLM

In-car Systems

In-car Voice Command Recognition

Used to recognize in-car conversations and voice commands

🚀 Conformer-ctc-medium-ko

This model is a fine-tuned version of RIVA Conformer ASR Korean on the AI Hub dataset. Unlike attention-based models such as Whisper, Conformer-based models maintain high performance during streaming and run fast.

🚀 Quick Start

This Conformer-based model is fine-tuned from RIVA Conformer ASR Korean on the AI Hub dataset. Unlike attention-based models such as Whisper, Conformer-based models do not suffer a significant performance drop during streaming, and they are fast.

The measured Real-Time Factor (RTF) is about 0.05 on a V100 GPU and 0.35 on a 7-core CPU. In a streaming test with 2-second audio chunks, performance degrades by about 20% compared with decoding the entire audio at once, which is still acceptable.
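The 2-second chunking used in the streaming test can be pictured as follows. This is a minimal sketch and not the repository's implementation: `iter_chunks` is a hypothetical helper, the 16 kHz rate is taken from the training hyperparameters and assumed to apply at inference, and a real streaming decoder may add overlap or context between chunks.

```python
from typing import Iterator, List

SAMPLE_RATE = 16000   # sample_rate from the training hyperparameters (assumed at inference)
CHUNK_SECONDS = 2.0   # chunk size used in the reported streaming test


def iter_chunks(samples: List[float], sample_rate: int = SAMPLE_RATE,
                chunk_seconds: float = CHUNK_SECONDS) -> Iterator[List[float]]:
    """Yield consecutive fixed-length chunks; the last chunk may be shorter."""
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


# Five seconds of silence stands in for real audio.
audio = [0.0] * (5 * SAMPLE_RATE)
chunk_lengths = [len(chunk) for chunk in iter_chunks(audio)]
```

Here the 5-second buffer splits into two full 32000-sample chunks and a 16000-sample tail, each of which would be passed to the recognizer in turn.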

Additionally, in closed domains such as customer service voice (non-open domain), adding KenLM improved the Word Error Rate (WER) significantly, from 13.45 to 5.27. In other domains, however, adding KenLM did not lead to a significant improvement.
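For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation script used for this model. It assumes whitespace-separated tokens and a result expressed in percent, and the example sentences are made up.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return 100.0 * prev[-1] / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement in percent when an error rate drops from before to after."""
    return 100.0 * (before - after) / before


# One substitution and one deletion against a four-word reference.
example_wer = word_error_rate("a b c d".split(), "a x c".split())
# Reported customer-service WER without and with KenLM.
kenlm_gain = relative_reduction(13.45, 5.27)
```

By this measure, the reported drop from 13.45 to 5.27 is a relative WER reduction of roughly 61%.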

The streaming code, along with a version that includes a denoising model, is available in the following GitHub repository: https://github.com/SUNGBEOMCHOI/Korean-Streaming-ASR

✨ Features

Streaming-friendly: Maintains relatively stable performance during streaming compared to attention-based models.
Fast speed: Achieves low RTF values on both GPU and CPU.
Domain-specific performance improvement: Adding KenLM can significantly improve WER in specific domains.

📚 Documentation

Training results

Training Loss	Epoch	WER
9.09	1.0	11.51

Dataset

Dataset Name	Number of Data Samples (train/test)
Customer Service Voice	2067668/21092
Korean Voice	620000/3000
Korean Conversation Voice	2483570/142399
Free Conversation Voice (General Men and Women)	1886882/263371
Welfare Sector Call Center Consultation Data	1096704/206470
In-vehicle Conversation Data	2624132/332787
Command Voice (Elderly Men and Women)	137467/237469
Total	10916423 (13946 hours)/1206588 (1474 hours)
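The totals row can be checked against the per-dataset counts, and dividing total hours by total samples gives a rough average clip length. The sketch below copies the numbers from the table; the variable and function names are illustrative.

```python
# (train, test) sample counts copied from the dataset table above.
counts = {
    "Customer Service Voice": (2067668, 21092),
    "Korean Voice": (620000, 3000),
    "Korean Conversation Voice": (2483570, 142399),
    "Free Conversation Voice (General Men and Women)": (1886882, 263371),
    "Welfare Sector Call Center Consultation Data": (1096704, 206470),
    "In-vehicle Conversation Data": (2624132, 332787),
    "Command Voice (Elderly Men and Women)": (137467, 237469),
}

train_total = sum(train for train, _ in counts.values())
test_total = sum(test for _, test in counts.values())


def mean_clip_seconds(total_hours: float, num_samples: int) -> float:
    """Average clip length in seconds from total hours and sample count."""
    return total_hours * 3600.0 / num_samples


train_mean = mean_clip_seconds(13946, train_total)
test_mean = mean_clip_seconds(1474, test_total)
```

The sums match the reported totals, and the average clip works out to about 4.6 seconds for training and about 4.4 seconds for testing.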

🔧 Technical Details

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 16
num_train_epoch: 1
sample_rate: 16000
max_duration: 20.0
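As a rough illustration of how `sample_rate` and `max_duration` interact, the sketch below assumes `max_duration` is a cutoff in seconds above which a clip is excluded from training, as is common in ASR training configs. The model card does not confirm this, and `keep_clip` is a hypothetical helper.

```python
# Hyperparameters copied from the list above.
hparams = {
    "learning_rate": 1e-05,
    "train_batch_size": 16,
    "eval_batch_size": 16,
    "num_train_epoch": 1,
    "sample_rate": 16000,
    "max_duration": 20.0,
}

# Longest clip length in samples under the assumed duration cutoff.
max_samples = int(hparams["sample_rate"] * hparams["max_duration"])


def keep_clip(num_samples: int) -> bool:
    """Return True if a clip fits within the assumed max_duration cutoff."""
    return num_samples <= max_samples


short_clip_ok = keep_clip(5 * hparams["sample_rate"])   # 5-second clip
long_clip_ok = keep_clip(25 * hparams["sample_rate"])   # 25-second clip
```

Under this assumption, a 20-second limit at 16 kHz corresponds to 320000 samples per clip.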

📄 License

This project is licensed under the Apache-2.0 license.
