bert-restore-punctuation: an open-source model, free to deploy, that restores punctuation and capitalization in all-lowercase text

Bert Restore Punctuation

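Developed by speeqo

A model based on the bert-base-uncased architecture and fine-tuned on the Yelp reviews dataset for punctuation restoration. It predicts punctuation and capitalization for all-lowercase text.

Sequence labeling

Transformers

English
License: MIT
Tags: English punctuation restoration, ASR post-processing, text normalization

Downloads: 19

Released: March 22, 2022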

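Model overview

This model restores punctuation and capitalization in English text. It is suited to speech recognition output and to other text whose punctuation is missing.

Model features

Multiple punctuation marks

Restores the punctuation marks ! ? . , - : ; '

Capitalization restoration

Automatically restores the capitalization of the first letter of words.

Arbitrary-length text

Handles English text of any length.

GPU acceleration

Uses a GPU automatically when one is available, to speed up processing.

Model capabilities

Punctuation restoration

Capitalization restoration

Text normalization

Use cases

ASR post-processing

Normalizing ASR output text

Adds punctuation and capitalization to the unpunctuated text produced by a speech recognition system.

This improves readability and the quality of downstream processing.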
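Since the model expects plain, lowercase text, ASR output that already carries stray casing or punctuation may be worth flattening before restoration. The helper below is a minimal sketch of that preprocessing. `to_plain_lowercase` is a hypothetical name, not part of `rpunct`, and the choice to strip only marks that end a token (so that `state-of-the-art` and `3.5` survive) is an assumption rather than a documented requirement.

```python
import re

# Sentence punctuation at the end of a token. Hyphens and apostrophes are
# left alone because they usually sit inside words (assumption).
_TRAILING_MARKS = re.compile(r"[!?.,:;]+(?=\s|$)")


def to_plain_lowercase(segments):
    """Join ASR segments into one lowercase string without end-of-token marks."""
    text = " ".join(segments).lower()
    text = _TRAILING_MARKS.sub("", text)
    # Collapse runs of whitespace left over from joining and stripping.
    return " ".join(text.split())


print(to_plain_lowercase(["Hello there.", "  How are YOU?  "]))
# hello there how are you
```

The returned string can then be passed to `RestorePuncts().punctuate`, as in the quick-start section of this card.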

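Text preprocessing

Restoring text with missing punctuation

Restores text that lost its punctuation through format conversion or for other reasons.

This helps recover the original structure and meaning of the text.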

🚀 bert-restore-punctuation

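This is a bert-base-uncased model fine-tuned for punctuation restoration on Yelp reviews. It predicts the punctuation and capitalization of plain, lowercase text, for example speech recognition output or text that has otherwise lost its punctuation. It is intended to be used directly as a general-purpose English punctuation restoration model, and it can also be fine-tuned further for punctuation restoration on domain-specific text. The model restores the punctuation marks [! ? . , - : ; ' ] and the capitalization of words.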
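To make the sequence-labeling idea concrete, here is a minimal sketch of how per-word labels could be turned back into text. It uses the label names from the per-label evaluation table later in this card (`none`, `Upper`, `.`, `,+Upper`, and so on). The decoding rule is an assumption: the mark is appended after the word and `Upper` capitalizes the word's first letter. `apply_labels` is a hypothetical helper, not the `rpunct` implementation.

```python
def apply_labels(words, labels):
    """Rebuild text from lowercase words and one label per word."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    out = []
    for word, label in zip(words, labels):
        mark, _, case = label.partition("+")
        if mark in ("none", "Upper"):
            # No punctuation: the whole label is the case flag.
            mark, case = "", mark
        if case == "Upper":
            word = word[:1].upper() + word[1:]
        # Assumption: every mark attaches to the end of its word.
        out.append(word + mark)
    return " ".join(out)


words = "in 2018 cornell researchers built a detector".split()
labels = ["Upper", ",", "Upper", "none", "none", "none", "."]
print(apply_labels(words, labels))
# In 2018, Cornell researchers built a detector.
```

Combining the mark and the case flag in one label lets a single token-classification head predict both at once, at the cost of rare combined classes with very little support.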

🚀 Quick start

Using the model

The steps below get you started with the model.

First, install the package.

pip install rpunct

Sample Python code:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

The model handles English text of any length and uses a GPU if one is available.
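This card does not describe how arbitrary-length input is handled. BERT encoders accept at most 512 tokens per pass, so one common approach is to split the text into overlapping word windows, label each window, and stitch the results. The sketch below illustrates only the splitting and stitching. The function names and the `length` and `overlap` defaults are illustrative assumptions, not a description of `rpunct` internals.

```python
def split_into_windows(text, length=250, overlap=30):
    """Split text into word windows that overlap their neighbours."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    words = text.split()
    step = length - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + length])
        if start + length >= len(words):
            break  # this window already reaches the end of the text
    return windows


def merge_windows(windows, overlap=30):
    """Stitch windows back together, dropping each repeated overlap."""
    merged = list(windows[0]) if windows else []
    for window in windows[1:]:
        merged.extend(window[overlap:])
    return merged


text = " ".join(f"w{i}" for i in range(10))
windows = split_into_windows(text, length=4, overlap=1)
print(len(windows))
# 3
print(" ".join(merge_windows(windows, overlap=1)) == text)
# True
```

The overlap gives words near a window boundary some context on both sides. In a real pipeline the stitching step would choose which window's prediction to keep for the overlapping words; here the earlier window simply wins.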

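✨ Main features

Predicts punctuation and capitalization for plain English text.
Restores the following punctuation marks: [! ? . , - : ; ' ]
Restores the capitalization of words.
Handles English text of any length.
Uses a GPU if one is available.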

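📚 Documentation

Training data

The number of product reviews used to fine-tune the model is shown below.

Language	Text samples
English	560,000

The best convergence was reached at around 3 epochs, and that checkpoint is the model provided here.

Accuracy

The fine-tuned model achieved the following results on 45,990 held-out text samples.

Accuracy	Overall F1	Eval support
91%	90%	45,990

The per-label breakdown of the model's performance is shown below.

Label	Precision	Recall	F1-score	Support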
!	0.45	0.17	0.24	424
!+Upper	0.43	0.34	0.38	98
'	0.60	0.27	0.37	11
,	0.59	0.51	0.55	1522
,+Upper	0.52	0.50	0.51	239
-	0.00	0.00	0.00	18
.	0.69	0.84	0.75	2488
.+Upper	0.65	0.52	0.57	274
:	0.52	0.31	0.39	39
:+Upper	0.36	0.62	0.45	16
;	0.00	0.00	0.00	17
?	0.54	0.48	0.51	46
?+Upper	0.40	0.50	0.44	4
none	0.96	0.96	0.96	35352
Upper	0.84	0.82	0.83	5442
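The card does not say how the overall F1 was averaged. As a quick check, the sketch below recomputes two common averages from the F1 and support columns of the table above. The per-label supports sum to 45,990, and the support-weighted mean comes out at roughly 0.905, which is consistent with the reported 90%. The unweighted (macro) mean is only about 0.46, which shows how strongly the overall score is driven by the frequent `none` and `Upper` labels.

```python
# (label, f1, support) rows copied from the per-label table above.
ROWS = [
    ("!", 0.24, 424), ("!+Upper", 0.38, 98), ("'", 0.37, 11),
    (",", 0.55, 1522), (",+Upper", 0.51, 239), ("-", 0.00, 18),
    (".", 0.75, 2488), (".+Upper", 0.57, 274), (":", 0.39, 39),
    (":+Upper", 0.45, 16), (";", 0.00, 17), ("?", 0.51, 46),
    ("?+Upper", 0.44, 4), ("none", 0.96, 35352), ("Upper", 0.83, 5442),
]

total_support = sum(support for _, _, support in ROWS)
# Each label's F1 counts in proportion to how often the label occurs.
weighted_f1 = sum(f1 * support for _, f1, support in ROWS) / total_support
# Each label's F1 counts equally, however rare the label is.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)

print(total_support)
# 45990
print(f"{weighted_f1:.3f}")
# 0.905
print(f"{macro_f1:.3f}")
# 0.463
```

For use cases that depend on rarer marks such as `!`, `:`, `-` or `;`, the per-label rows are a better guide than the overall figures.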