yacis-electra-small-japanese-cyberbullying open-source model - high-accuracy detection of cyberbullying remarks in Japanese

Yacis Electra Small Japanese Cyberbullying

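Developed by ptaszynski

A fine-tuned model for automatic detection of Japanese cyberbullying, based on ELECTRA Small. It was pretrained on the YACIS blog corpus and fine-tuned on a harmful BBS comments dataset and a Twitter cyberbullying dataset.

Text classification

Transformers

Japanese #Japanese cyberbullying detection #ELECTRA Small model #Social media harmful content identification

Downloads: 22

Release date: 3/2/2022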

Model overview

This model is designed specifically to detect cyberbullying content in Japanese and is suited to content moderation on social media and forums.

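Model features

Japanese-specific model

A cyberbullying detection model optimized for Japanese content

Multi-source training

Trained on a combination of data sources: blogs, BBS comments, and Twitter

ELECTRA architecture

Uses the ELECTRA pretraining architecture, which is more efficient than conventional BERT pretraining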

Model capabilities

Japanese text classification

Cyberbullying content identification

Harmful content detection

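Use cases

Content moderation

Social media monitoring

Automatically detects cyberbullying content on social media such as Twitter

Helps platforms identify and filter harmful content

Forum management

Used in automated content moderation systems for BBSs and forums

Reduces the manual review workload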
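
One way a forum could use the classifier to reduce manual review is to route each comment by its score. The sketch below is only an illustration: the thresholds, the `Decision` type, and the `triage` and `review_queue` helpers are hypothetical and not part of the model. The score is assumed to be the classifier's estimated probability that a comment is cyberbullying.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them on your own labelled data.
AUTO_HIDE_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.50


@dataclass
class Decision:
    text: str
    score: float  # assumed: estimated probability that the text is cyberbullying
    action: str   # "hide", "review", or "allow"


def triage(text: str, score: float) -> Decision:
    """Route one comment based on the classifier's cyberbullying score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score >= AUTO_HIDE_THRESHOLD:
        action = "hide"
    elif score >= REVIEW_THRESHOLD:
        action = "review"
    else:
        action = "allow"
    return Decision(text, score, action)


def review_queue(decisions):
    """Return the items that need a human reviewer, highest score first."""
    pending = [d for d in decisions if d.action == "review"]
    return sorted(pending, key=lambda d: d.score, reverse=True)
```

Only the middle band reaches human moderators, so the two thresholds control the trade-off between reviewer workload and the risk of wrongly hiding or allowing a comment.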

🚀 yacis-electra-small-cyberbullying

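This is an ELECTRA Small model for Japanese, fine-tuned for automatic cyberbullying detection. The original base model was pretrained on the 5.6-billion-word YACIS blog corpus and then fine-tuned on a balanced dataset created by merging two datasets: the "Harmful BBS Japanese comments dataset" and the "Twitter Japanese cyberbullying dataset".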
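
The card does not say how the merged dataset was balanced. As a rough illustration only, the sketch below merges several labelled datasets and downsamples every class to the size of the smallest one. The `merge_and_balance` helper, the `(text, label)` tuple format, and the toy rows are all assumptions made for this example, not the authors' procedure or data.

```python
import random
from collections import Counter


def merge_and_balance(datasets, seed=0):
    """Merge (text, label) datasets and downsample to equal class sizes."""
    merged = [example for dataset in datasets for example in dataset]
    by_label = {}
    for text, label in merged:
        by_label.setdefault(label, []).append((text, label))
    if not by_label:
        return []
    # Every class is reduced to the size of the rarest class.
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Toy placeholder rows, not taken from the real datasets.
bbs = [("bbs harmful 1", 1), ("bbs neutral 1", 0), ("bbs neutral 2", 0)]
twitter = [("tw harmful 1", 1), ("tw neutral 1", 0), ("tw neutral 2", 0)]
balanced = merge_and_balance([bbs, twitter])
counts = Counter(label for _, label in balanced)
```

Downsampling is only one option; oversampling the minority class or weighting the loss would also yield balanced training, and the source does not indicate which approach was used.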

🚀 Quick start

This model is fine-tuned specifically for automatic cyberbullying detection and is reported to perform well on Japanese text.
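
A minimal sketch of running the model with the Hugging Face `transformers` text-classification pipeline follows. The repository id is an assumption inferred from the model name and the author's base-model URL, and the meaning of each output label is not documented in this card, so inspect `id2label` before relying on it. The tokenizer may also need extra Japanese tokenization packages; that has not been verified here.

```python
# Assumed repository id, inferred from the model name and the author's namespace.
MODEL_ID = "ptaszynski/yacis-electra-small-japanese-cyberbullying"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def top_prediction(classifier, text: str) -> dict:
    """Return the top {"label": ..., "score": ...} dict for a single text."""
    return classifier(text)[0]


def demo():
    """Load the model and classify one harmless example sentence."""
    clf = load_classifier()
    # Check which label id corresponds to which class before interpreting results.
    print(clf.model.config.id2label)
    print(top_prediction(clf, "今日はいい天気ですね。"))
```

Calling `demo()` requires `transformers` with a backend such as PyTorch and network access for the first download.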

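✨ Key features

A Japanese model specialized for automatic cyberbullying detection
Pretrained on a large-scale blog corpus and fine-tuned on two cyberbullying datasets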

📦 Datasets

Property	Details
Training datasets	YACIS corpus, Harmful BBS Japanese comments dataset, Twitter Japanese cyberbullying dataset

📚 Documentation

Model architecture

The original model was pretrained with the ELECTRA Small configuration. The base model is available at the following link: https://huggingface.co/ptaszynski/yacis-electra-small-japanese

License

The fine-tuned model and all attached files are licensed under CC BY-SA 4.0, that is, the Creative Commons Attribution-ShareAlike 4.0 International License.

Citation

To cite this model, please use the following reference.

@inproceedings{shibata2022yacis-electra,
  title={日本語大規模ブログコーパスYACISに基づいたELECTRA事前学習済み言語モデルの作成及び性能評価}, 
%  title={Development and performance evaluation of ELECTRA pretrained language model based on YACIS large-scale Japanese blog corpus [in Japanese]}, %% for English citations
  author={柴田 祥伍 and プタシンスキ ミハウ and エロネン ユーソ and ノヴァコフスキ カロル and 桝井 文人}, 
%  author={Shibata, Shogo and Ptaszynski, Michal and Eronen, Juuso and Nowakowski, Karol and Masui, Fumito},  %% for English citations
  booktitle={言語処理学会第28回年次大会(NLP2022) (予定)}, 
%  booktitle={Proceedings of The 28th Annual Meeting of The Association for Natural Language Processing (NLP2022)},  %% for English citations
  pages={1--4},
  year={2022}
}

Please cite the two datasets used for fine-tuning with the following references.

Harmful BBS Japanese comments dataset:

@book{ptaszynski2018automatic,
  title={Automatic Cyberbullying Detection: Emerging Research and Opportunities},
  author={Ptaszynski, Michal E and Masui, Fumito},
  year={2018},
  publisher={IGI Global}
}

@article{松葉達明2009学校非公式サイトにおける有害情報検出,
  title={学校非公式サイトにおける有害情報検出},
  author={松葉達明 and 里見尚宏 and 桝井文人 and 河合敦夫 and 井須尚紀},
  journal={電子情報通信学会技術研究報告. NLC, 言語理解とコミュニケーション},
  volume={109},
  number={142},
  pages={93--98},
  year={2009},
  publisher={一般社団法人電子情報通信学会}
}

Twitter Japanese cyberbullying dataset:

TBA

The YACIS corpus was used for pretraining. Please cite it using at least one of the following references.

@inproceedings{ptaszynski2012yacis,
  title={YACIS: A five-billion-word corpus of Japanese blogs fully annotated with syntactic and affective information},
  author={Ptaszynski, Michal and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji and Momouchi, Yoshio},
  booktitle={Proceedings of the AISB/IACAP world congress},
  pages={40--49},
  year={2012},
  howpublished = "\url{https://github.com/ptaszynski/yacis-corpus}"
}

@article{ptaszynski2014automatically,
  title={Automatically annotating a five-billion-word corpus of Japanese blogs for sentiment and affect analysis},
  author={Ptaszynski, Michal and Rzepka, Rafal and Araki, Kenji and Momouchi, Yoshio},
  journal={Computer Speech \& Language},
  volume={28},
  number={1},
  pages={38--55},
  year={2014},
  publisher={Elsevier},
  howpublished = "\url{https://github.com/ptaszynski/yacis-corpus}"
}