bert-restore-punctuation open-source model: deploy it for free to restore punctuation and capitalization in text

Bert Restore Punctuation

Developed by felflare

A punctuation restoration model built on the bert-base-uncased architecture and fine-tuned on Yelp reviews. It predicts punctuation and capitalization for plain lowercase text.

Sequence labeling

Transformers

Language: English

Open-source license: MIT

#English punctuation restoration #ASR post-processing #Text enhancement

Downloads: 1,890

Released: 3/2/2022

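Model overview

This model restores punctuation and capitalization in English text. It suits speech recognition output and other text whose punctuation is missing. The marks it restores are ! ? . , - : ; ' and it also capitalizes the first letter of words where needed.

Model features

Multiple punctuation marks

Restores several common punctuation marks, including periods, commas, question marks, and exclamation marks.

Capitalization restoration

Restores the capital first letter of words, which makes the text easier to read.

Long text processing

Handles English text of arbitrary length, so it is suitable for long-form content.

GPU acceleration

Uses the GPU when one is available, which speeds up processing.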

Model capabilities

Punctuation restoration

Capitalization restoration

Text processing

Long text support

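Use cases

Speech recognition post-processing

Punctuation restoration for ASR output

Restores punctuation and capitalization in the unpunctuated text produced by a speech recognition system.

Improves readability and gives the text a more polished look.

Text preprocessing

Restoring text with missing punctuation

Processes text whose punctuation was lost in transfer or storage.

Recovers the original text format, which makes downstream analysis easier.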
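
Text that lost only part of its punctuation may still contain stray marks or mixed casing. One option is to bring it into the same lowercase, unpunctuated form as the sample input on this page before restoration. The sketch below is a minimal illustration of that idea; the helper name `normalize_for_restoration` and its exact rules are assumptions, not part of the model or of any library.

```python
import re


def normalize_for_restoration(text: str) -> str:
    """Lowercase text, drop trailing sentence punctuation, collapse whitespace.

    Hypothetical helper for illustration only. Hyphens and apostrophes are
    kept, since they usually sit inside words (e.g. "state-of-the-art").
    """
    lowered = text.lower()
    # Remove runs of marks only when followed by whitespace or end of string,
    # so that decimals such as "3.5" survive.
    stripped = re.sub(r"[!?.,:;]+(?=\s|$)", "", lowered)
    # Collapse any run of whitespace into a single space.
    return " ".join(stripped.split())


if __name__ == "__main__":
    print(normalize_for_restoration("Hello,  World!  Version 3.5 is out."))
```

Whether this step helps depends on the input; text that is already lowercase and unpunctuated passes through unchanged.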

🚀 bert-restore-punctuation

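This model is a bert-base-uncased model fine-tuned on Yelp reviews for punctuation restoration. It predicts punctuation and capitalization for plain lowercase text. Typical use cases include automatic speech recognition (ASR) output and other text that has lost its punctuation. It is intended to be used directly as a general-purpose English punctuation restoration model, and it can also be fine-tuned further for punctuation restoration on domain-specific text. The model restores the following punctuation marks: [! ? . , - : ; ' ]. It also restores word capitalization.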
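
Since the model is a sequence labeler, restoring text comes down to predicting one label per word and then applying that label. The sketch below covers only the second step. The label names (`none`, `Upper`, a mark such as `,`, or a combination such as `,+Upper`) are taken from the per-label performance table on this page; the function `apply_labels` and the rule that a mark is appended after its word are assumptions for illustration, not the library's actual implementation.

```python
from typing import Iterable, Tuple


def apply_labels(pairs: Iterable[Tuple[str, str]]) -> str:
    """Rebuild text from (word, label) pairs.

    Assumed label format: "none", "Upper", "<mark>", or "<mark>+Upper".
    """
    out = []
    for word, label in pairs:
        if label == "none":
            mark, capitalize = "", False
        elif label == "Upper":
            mark, capitalize = "", True
        else:
            # "<mark>" or "<mark>+Upper"
            mark, _, case = label.partition("+")
            capitalize = case == "Upper"
        if capitalize:
            word = word[:1].upper() + word[1:]
        out.append(word + mark)
    return " ".join(out)


if __name__ == "__main__":
    predicted = [
        ("in", "Upper"),
        ("2018", ","),
        ("cornell", "Upper"),
        ("researchers", "none"),
        ("agreed", "."),
    ]
    print(apply_labels(predicted))
```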

🚀 Quick start

How to use the model

The steps below get you up and running with the model.

First, install the package.

pip install rpunct

Sample Python code:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

The model works on English text of arbitrary length and uses the GPU if one is available.
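
BERT-style models accept only a limited number of tokens per forward pass, so arbitrary-length input is typically handled by splitting the text into windows and processing them one at a time. The sketch below shows one such scheme using overlapping word windows. The function name and the default sizes are illustrative assumptions; the source does not describe how the package splits text internally.

```python
from typing import List


def split_into_windows(text: str, window: int = 250, overlap: int = 30) -> List[List[str]]:
    """Split text into word windows; consecutive windows share `overlap` words.

    Hypothetical helper; `window` and `overlap` defaults are assumptions.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    windows: List[List[str]] = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + window])
        # Stop once a window reaches the end of the text.
        if start + window >= len(words):
            break
    return windows


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in split_into_windows(sample, window=4, overlap=1):
        print(chunk)
```

The overlap gives each window some context on both sides of a boundary; when the per-window results are merged, the shared words should be taken from one window only.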

📦 Installation

To use the model, install the following package.

pip install rpunct

💻 Usage examples

Basic usage

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("your input text here")

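📚 Documentation

Training data

The number of reviews used to fine-tune the model is shown below.

Language	Number of text samples
English	560,000

Convergence was best at around 3 epochs, and the model available for download is that checkpoint.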
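
Fine-tuning a sequence labeler for this task requires (word, label) pairs, which can be derived from ordinary punctuated text. The sketch below illustrates one way to do that, using the label names from the per-label table on this page. It is an assumption-laden illustration, not the author's preprocessing: `text_to_labels` is a hypothetical name, and only a single trailing mark per word is considered.

```python
from typing import List, Tuple

# Marks listed on this page as restorable by the model.
PUNCT = "!?.,-:;'"


def text_to_labels(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lowercase word, label) training pairs."""
    pairs = []
    for token in text.split():
        mark = token[-1] if token[-1] in PUNCT else ""
        word = token[:-1] if mark else token
        if not word:
            continue  # the token was a lone punctuation mark
        upper = word[0].isupper()
        if mark and upper:
            label = f"{mark}+Upper"
        elif mark:
            label = mark
        elif upper:
            label = "Upper"
        else:
            label = "none"
        pairs.append((word.lower(), label))
    return pairs


if __name__ == "__main__":
    print(text_to_labels("Hello, world. It works"))
```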

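Accuracy

On 45,990 held-out text samples, the fine-tuned model achieved the following results.

Accuracy	Overall F1 score	Evaluation support
91%	90%	45,990

The table below breaks down model performance by label.

Label	Precision	Recall	F1 score	Support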
!	0.45	0.17	0.24	424
!+Upper	0.43	0.34	0.38	98
'	0.60	0.27	0.37	11
,	0.59	0.51	0.55	1522
,+Upper	0.52	0.50	0.51	239
-	0.00	0.00	0.00	18
.	0.69	0.84	0.75	2488
.+Upper	0.65	0.52	0.57	274
:	0.52	0.31	0.39	39
:+Upper	0.36	0.62	0.45	16
;	0.00	0.00	0.00	17
?	0.54	0.48	0.51	46
?+Upper	0.40	0.50	0.44	4
none	0.96	0.96	0.96	35352
Upper	0.84	0.82	0.83	5442
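
The per-label figures can be cross-checked against the headline numbers. The sketch below copies the rows from the table above, sums the supports, recomputes each F1 score from its precision and recall, and computes a support-weighted average F1. Treating the overall F1 as a support-weighted average is an assumption; the source does not say how it was aggregated.

```python
# (label, precision, recall, f1, support), copied from the per-label table.
ROWS = [
    ("!", 0.45, 0.17, 0.24, 424),
    ("!+Upper", 0.43, 0.34, 0.38, 98),
    ("'", 0.60, 0.27, 0.37, 11),
    (",", 0.59, 0.51, 0.55, 1522),
    (",+Upper", 0.52, 0.50, 0.51, 239),
    ("-", 0.00, 0.00, 0.00, 18),
    (".", 0.69, 0.84, 0.75, 2488),
    (".+Upper", 0.65, 0.52, 0.57, 274),
    (":", 0.52, 0.31, 0.39, 39),
    (":+Upper", 0.36, 0.62, 0.45, 16),
    (";", 0.00, 0.00, 0.00, 17),
    ("?", 0.54, 0.48, 0.51, 46),
    ("?+Upper", 0.40, 0.50, 0.44, 4),
    ("none", 0.96, 0.96, 0.96, 35352),
    ("Upper", 0.84, 0.82, 0.83, 5442),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(row[4] for row in ROWS)
weighted_f1 = sum(row[3] * row[4] for row in ROWS) / total_support

if __name__ == "__main__":
    print(f"total support: {total_support}")
    print(f"support-weighted F1: {weighted_f1:.3f}")
    for label, p, r, reported, _ in ROWS:
        print(f"{label:8s} reported={reported:.2f} recomputed={f1_score(p, r):.2f}")
```

The supports sum to 45,990, matching the evaluation support, and the weighted average comes out at roughly 0.905, in line with the reported overall F1 of 90%. The average is dominated by the `none` and `Upper` labels; rare labels such as `-` and `;` score 0.00 without moving it much.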