Nue ASR: an open-source Japanese speech recognition model that integrates two pre-trained models for accurate, fast recognition


Developed by rinna

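Nue ASR is an end-to-end Japanese speech recognition model that integrates a pre-trained speech model and a pre-trained language model, giving high recognition accuracy and fast inference.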

Speech recognition

Transformers

License: Apache-2.0 (open source)

Tags: #Japanese speech recognition #End-to-end ASR #Pre-trained model integration

Downloads: 722

Release date: December 7, 2023

Model overview

The model provides end-to-end Japanese speech recognition with accuracy comparable to recent ASR models. On a GPU, it transcribes speech faster than real time.

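Model features

End-to-end speech recognition

Integrates a pre-trained speech model and a pre-trained language model into a fully end-to-end solution.

High performance

Achieves recognition accuracy comparable to recent ASR models, with inference that runs faster than real time.

Pre-trained model integration

Initialized from the pre-trained weights of japanese-hubert-base and japanese-gpt-neox-3.6b.

Large-scale training data

Trained on ReazonSpeech v1, a Japanese speech corpus of approximately 19,000 hours.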

Model capabilities

Japanese speech recognition

End-to-end speech-to-text conversion

Real-time speech processing

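Use cases

Speech transcription

Meeting minutes

Converts recordings of Japanese meetings to text in real time

High-accuracy meeting transcripts

Subtitle generation

Automatically generates subtitles for Japanese video content

Synchronized subtitle files

Voice assistants

Japanese voice command recognition

Recognizes and interprets Japanese voice commands

Accurate command recognition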

🚀 `rinna/nue-asr`

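Nue ASR is a new end-to-end speech recognition model that integrates a pre-trained speech model and a pre-trained language model. It provides end-to-end Japanese speech recognition with accuracy comparable to recent ASR models. On a GPU, it can recognize speech faster than real time.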

🚀 Quick start

To use this model, first install the inference code.

pip install git+https://github.com/rinnakk/nue-asr.git

Both a command-line interface and a Python interface are available.

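✨ Key features

A new end-to-end speech recognition model that integrates a pre-trained speech model and a pre-trained language model.
End-to-end Japanese speech recognition with accuracy comparable to recent ASR models.
Faster-than-real-time recognition when run on a GPU.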
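
"Faster than real time" is usually expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1 means the audio is transcribed more quickly than it plays. The sketch below shows that arithmetic only. The function names and the timings are illustrative assumptions, not part of nue-asr and not measured results.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when the audio was transcribed in less time than it takes to play."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical timings for illustration: 10 s of audio decoded in 2.5 s.
rtf = real_time_factor(2.5, 10.0)
print(rtf)  # 0.25
```

To measure this on your own hardware, one option is to time the transcription call with `time.perf_counter()` and divide by the clip length.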

📦 Installation

Installing the inference code

pip install git+https://github.com/rinnakk/nue-asr.git

Installing DeepSpeed-Inference

To use DeepSpeed-Inference, you also need to install DeepSpeed.

pip install deepspeed

💻 Usage examples

Command-line usage

nue-asr audio1.wav

You can also specify multiple audio files.

nue-asr audio1.wav audio2.flac audio3.mp3

To use DeepSpeed-Inference, pass the flag as follows.

nue-asr --use-deepspeed audio1.wav

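Python usage

import nue_asr

# Load the model and its tokenizer by repository name.
model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

# Transcribe one audio file and print the recognized text.
result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)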
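
The command line accepts several files at once. From Python, much the same effect can be sketched by loading the model once and looping over paths. The helper names `transcribe_many` and `build_nue_asr_transcriber` are hypothetical and not part of nue-asr; only `load_model`, `load_tokenizer`, and `transcribe` come from the example above.

```python
from typing import Callable, Dict, Iterable


def transcribe_many(paths: Iterable[str], transcribe_one: Callable[[str], str]) -> Dict[str, str]:
    """Map each audio path to its transcript, preserving input order."""
    return {path: transcribe_one(path) for path in paths}


def build_nue_asr_transcriber() -> Callable[[str], str]:
    """Load the model once and return a path -> text function."""
    import nue_asr  # imported lazily so transcribe_many has no dependencies

    model = nue_asr.load_model("rinna/nue-asr")
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")
    return lambda path: nue_asr.transcribe(model, tokenizer, path).text
```

A call would look like `transcribe_many(["audio1.wav", "audio2.wav"], build_nue_asr_transcriber())`. Passing the transcriber in as a function keeps the loop easy to test with a stand-in.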

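DeepSpeed-Inference can also be used to speed up inference.

import nue_asr

# use_deepspeed=True enables DeepSpeed-Inference (requires the deepspeed package).
model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)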
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
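
Because DeepSpeed is an optional install, a script may want to enable it only when the package is present. The following is a minimal sketch of that idea using the standard library's `importlib.util.find_spec`. The helper names `deepspeed_available` and `load_nue_asr` are hypothetical. Note that the check only confirms the package is importable, not that it works on the current hardware.

```python
import importlib.util


def deepspeed_available() -> bool:
    """Report whether the `deepspeed` package can be found for import."""
    return importlib.util.find_spec("deepspeed") is not None


def load_nue_asr():
    """Load model and tokenizer, enabling DeepSpeed only when it is installed."""
    import nue_asr  # lazy import: requires the nue-asr package

    model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=deepspeed_available())
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")
    return model, tokenizer
```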

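📚 Documentation

Overview

[Paper] [GitHub]

The name Nue ASR comes from the nue (鵺), a legendary Japanese yokai. The model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder. The HuBERT and GPT-NeoX components are each initialized from their respective pre-trained weights.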

Training

The model was trained on ReazonSpeech v1, a Japanese speech corpus of approximately 19,000 hours. Speech samples longer than 16 seconds were excluded before training.
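
The length filter described above can be pictured as a simple threshold on clip duration. This is only an illustrative sketch and not the authors' preprocessing code. The `Sample` type, the function name, and the durations are made up for the example.

```python
from typing import List, NamedTuple

MAX_SECONDS = 16.0  # threshold stated in the text


class Sample(NamedTuple):
    path: str
    seconds: float


def drop_long_samples(samples: List[Sample], max_seconds: float = MAX_SECONDS) -> List[Sample]:
    """Keep samples whose duration does not exceed max_seconds."""
    return [s for s in samples if s.seconds <= max_seconds]


# Made-up durations for illustration only.
corpus = [Sample("a.wav", 3.2), Sample("b.wav", 16.0), Sample("c.wav", 21.7)]
kept = drop_long_samples(corpus)
print([s.path for s in kept])  # ['a.wav', 'b.wav']
```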

ReazonSpeech

Contributors

Release date

December 7, 2023

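🔧 Technical details

Model architecture

The model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder. The HuBERT and GPT-NeoX components are each initialized from their respective pre-trained weights.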
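
To make the data flow concrete, the toy sketch below passes encoder frames through a bridge that maps them to the width the decoder expects. It is a shape-level illustration under strong assumptions: the internal design of the actual bridge network is not described here, so a single linear projection stands in for it, and the dimensions are tiny placeholders and not the real model sizes.

```python
import random
from typing import List

Matrix = List[List[float]]

ENCODER_DIM = 4  # toy size; stands in for the HuBERT output width
DECODER_DIM = 6  # toy size; stands in for the GPT-NeoX embedding width


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Build a rows x cols matrix of values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def bridge(frames: Matrix, weight: Matrix) -> Matrix:
    """Project each encoder frame (ENCODER_DIM) to the decoder width (DECODER_DIM)."""
    return [
        [sum(x * w for x, w in zip(frame, column)) for column in zip(*weight)]
        for frame in frames
    ]


rng = random.Random(0)
encoder_frames = random_matrix(5, ENCODER_DIM, rng)  # 5 audio frames from the encoder
bridge_weight = random_matrix(ENCODER_DIM, DECODER_DIM, rng)
decoder_inputs = bridge(encoder_frames, bridge_weight)  # vectors the decoder would consume
print(len(decoder_inputs), len(decoder_inputs[0]))  # 5 6
```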

Tokenization

The model uses the same SentencePiece-based tokenizer as japanese-gpt-neox-3.6b.

📄 License

The Apache 2.0 license

How to cite

@inproceedings{hono2024integrating,
    title = {Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
    author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    booktitle = {Findings of the Association for Computational Linguistics ACL 2024},
    month = {8},
    year = {2024},
    pages = {13289--13305},
    url = {https://aclanthology.org/2024.findings-acl.787}
}

@misc{rinna-nue-asr,
    title = {rinna/nue-asr},
    author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    url = {https://huggingface.co/rinna/nue-asr}
}

References

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

@article{hsu2021hubert,
    title = {{HuBERT}: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    author = {Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    month = {10},
    year = {2021},
    volume = {29},
    pages = {3451--3460},
    doi = {10.1109/TASLP.2021.3122291}
}

@software{andoniangpt2021gpt,
    title = {{GPT}-{N}eo{X}: Large Scale Autoregressive Language Modeling in {P}y{T}orch},
    author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
    month = {8},
    year = {2021},
    version = {0.0.1},
    doi = {10.5281/zenodo.5879544},
    url = {https://www.github.com/eleutherai/gpt-neox}
}

@inproceedings{aminabadi2022deepspeed,
    title = {{DeepSpeed-Inference}: enabling efficient inference of transformer models at unprecedented scale},
    author = {Aminabadi, Reza Yazdani and Rajbhandari, Samyam and Awan, Ammar Ahmad and Li, Cheng and Li, Du and Zheng, Elton and Ruwase, Olatunji and Smith, Shaden and Zhang, Minjia and Rasley, Jeff and others},
    booktitle = {SC22: International Conference for High Performance Computing, Networking, Storage and Analysis},
    year = {2022},
    pages = {1--15},
    doi = {10.1109/SC41404.2022.00051}
}