bert-portuguese-ner open-source model - free to use for Portuguese named entity recognition in archival documents

Bert Portuguese Ner

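Developed by lfcc

A BERT-based named entity recognition model for Portuguese, fine-tuned specifically on archival documents.

Sequence labeling

Transformers

Open-source license: MIT #Portuguese NER #Archival document analysis #High-accuracy entity recognition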

Downloads: 2,229

Release date: 3/2/2022

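Model overview

This model is a fine-tuned version of neuralmind/bert-base-portuguese-cased for named entity recognition in Portuguese archival documents. It recognizes the entity classes date, profession, person, place, and organization.

Model features

High accuracy

Achieves 97% accuracy and an F1 score of 93.12% on the evaluation set.

Domain-specific optimization

Fine-tuned specifically on Portuguese archival documents, which makes it well suited to processing historical records.

Multi-class recognition

Recognizes multiple entity types: date, profession, person, place, and organization.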

Model capabilities

Portuguese text processing

Named entity recognition

Archival document analysis

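Use cases

Archive digitization

Automatic annotation of historical archives

Automatically identifies persons, places, and organizations in historical archives.

Improves archive search efficiency and supports the development of intelligent browsing tools.

Academic research

Historical document analysis

Extracts key entity information from Portuguese historical documents.

Supports historians in researching and analyzing documents.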

🚀 bert-portuguese-ner

This model is a fine-tuned version of neuralmind/bert-base-portuguese-cased. It addresses named entity recognition (NER) in Portuguese archival documents and performs well on the evaluation set.

🚀 Quick start

The model achieves the following results on the evaluation set:

Loss: 0.1140
Precision: 0.9147
Recall: 0.9483
F1: 0.9312
Accuracy: 0.9700
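
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two values. The sketch below is only that arithmetic; the card does not state whether the metrics are computed per token or per entity, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the evaluation set.
precision, recall = 0.9147, 0.9483
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9312"
```

The recomputed value agrees with the reported F1 of 0.9312 to four decimal places.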

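✨ Main features

Fine-tunes a pre-trained Portuguese BERT model for named entity recognition in Portuguese archival documents.
The annotated labels are date, profession, person, place, and organization, which covers a range of information extraction needs.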
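
A minimal inference sketch with the Transformers `pipeline` API might look like the following. It rests on assumptions not stated in this card: that the model is published on the Hugging Face Hub as `lfcc/bert-portuguese-ner` (composed from the author and model name), and that `transformers` and `torch` are installed. The helper names `load_ner` and `entities_by_type` are illustrative, and the exact label strings the model emits are not listed here, so inspect the output rather than relying on particular names.

```python
from typing import Callable, Dict, List, Optional

# Assumption: Hub identifier built from the author ("lfcc") and the model name.
MODEL_ID = "lfcc/bert-portuguese-ner"


def load_ner() -> Callable[[str], List[dict]]:
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import; requires transformers + torch

    return pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # merge word pieces into whole entities
    )


def entities_by_type(
    text: str, ner: Optional[Callable[[str], List[dict]]] = None
) -> Dict[str, List[str]]:
    """Group the recognized entity strings by their predicted label."""
    if ner is None:
        ner = load_ner()
    grouped: Dict[str, List[str]] = {}
    for entity in ner(text):
        grouped.setdefault(entity["entity_group"], []).append(entity["word"])
    return grouped


# Example call (needs network access to fetch the model):
# entities_by_type("João da Silva, lavrador, natural de Braga, em 1785.")
```

Passing a pre-built pipeline through the `ner` argument avoids reloading the model when many records are processed.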

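📚 Documentation

Model description

This model is fine-tuned on a token classification (NER) task over Portuguese archival documents. The annotated labels are date, profession, person, place, and organization.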
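
To illustrate what token classification produces, the sketch below merges per-token tags into entity spans. It assumes a BIO tagging scheme and uses hypothetical tag names (`PER`, `PROF`, `LOC`); this card does not specify the scheme or the label strings the model actually uses, so treat both as placeholders.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, text) entity spans."""
    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        """Close the open entity, if any, and reset the accumulator."""
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            current_tokens.append(token)  # continue the open entity
        elif prefix in ("B", "I"):
            flush()  # a stray I- tag is treated leniently as a new entity
            current_label, current_tokens = label, [token]
        else:
            flush()  # "O" ends any open entity
    flush()
    return spans


# Hypothetical tags for illustration only.
tokens = ["João", "Silva", ",", "lavrador", ",", "de", "Braga"]
tags = ["B-PER", "I-PER", "O", "B-PROF", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('PER', 'João Silva'), ('PROF', 'lavrador'), ('LOC', 'Braga')]
```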

Dataset

All training and evaluation data are available at: http://ner.epl.di.uminho.pt/

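Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 2e-05
Training batch size: 16
Evaluation batch size: 16
Random seed: 42
Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
Learning rate scheduler type: linear
Number of epochs: 4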
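
The linear scheduler listed above can be sketched in plain Python to show how the learning rate evolves over training. The total of 768 steps (4 epochs of 192 steps) comes from the training results table in this card; the absence of warmup is an assumption, since no warmup setting is listed, and `linear_lr` is an illustrative name rather than a library function.

```python
BASE_LR = 2e-5          # learning rate from the hyperparameter list
TOTAL_STEPS = 4 * 192   # 4 epochs x 192 steps per epoch (from the results table)
WARMUP_STEPS = 0        # assumption: the card lists no warmup


def linear_lr(
    step: int,
    base_lr: float = BASE_LR,
    total_steps: int = TOTAL_STEPS,
    warmup_steps: int = WARMUP_STEPS,
) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the end of each epoch.
for step in (0, 192, 384, 576, 768):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the rate starts at 2e-05, is halved by the end of epoch 2, and reaches zero at step 768. With a batch size of 16 and 192 steps per epoch, the training set would hold at most 3,072 examples, assuming a single device and no gradient accumulation.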

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
No log	1.0	192	0.1438	0.8917	0.9392	0.9148	0.9633
0.2454	2.0	384	0.1222	0.8985	0.9417	0.9196	0.9671
0.0526	3.0	576	0.1098	0.9150	0.9481	0.9312	0.9698
0.0372	4.0	768	0.1140	0.9147	0.9483	0.9312	0.9700
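
The table can be read programmatically to compare checkpoints. The sketch below copies the rows above into a small structure (the `EvalRow` type and its field names are illustrative) and picks the epoch with the lowest validation loss and the one with the highest F1.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    epoch: float
    step: int
    val_loss: float
    precision: float
    recall: float
    f1: float
    accuracy: float


# Values copied from the training results table.
ROWS = [
    EvalRow(1.0, 192, 0.1438, 0.8917, 0.9392, 0.9148, 0.9633),
    EvalRow(2.0, 384, 0.1222, 0.8985, 0.9417, 0.9196, 0.9671),
    EvalRow(3.0, 576, 0.1098, 0.9150, 0.9481, 0.9312, 0.9698),
    EvalRow(4.0, 768, 0.1140, 0.9147, 0.9483, 0.9312, 0.9700),
]

best_by_loss = min(ROWS, key=lambda row: row.val_loss)
best_by_f1 = max(ROWS, key=lambda row: row.f1)  # ties resolve to the earliest epoch

print("lowest validation loss: epoch", best_by_loss.epoch)
print("highest F1 (first on tie): epoch", best_by_f1.epoch)
```

Validation loss is lowest at epoch 3 (0.1098) and rises slightly at epoch 4 (0.1140), while F1 is identical at four decimal places. The evaluation results reported in the quick start match the epoch 4 row.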

Framework versions

Transformers 4.10.0.dev0
PyTorch 1.9.0+cu111
Datasets 1.10.2
Tokenizers 0.10.3

Citation

@Article{make4010003,
AUTHOR = {Cunha, Luís Filipe and Ramalho, José Carlos},
TITLE = {NER in Archival Finding Aids: Extended},
JOURNAL = {Machine Learning and Knowledge Extraction},
VOLUME = {4},
YEAR = {2022},
NUMBER = {1},
PAGES = {42--65},
URL = {https://www.mdpi.com/2504-4990/4/1/3},
ISSN = {2504-4990},
ABSTRACT = {The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country's history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.},
DOI = {10.3390/make4010003}
}