gaBERT: an open-source monolingual Irish language model, trained on 7.9 million sentences and intended for fine-tuning on downstream tasks

Bert Base Irish Cased V1

Developed by DCU-NLP

gaBERT is a BERT-based monolingual model for Irish. It was trained on 7.9 million Irish sentences and is suited to fine-tuning on Irish downstream tasks.

Large language model

Transformers

#Irish pretraining #Monolingual BERT #Context-sensitive encoding

Downloads: 42

Release date: 3/2/2022

Model overview

An encoder-based Transformer model used to obtain features for fine-tuning on downstream Irish tasks.

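Model features

Irish-specific model

A monolingual BERT model trained specifically on Irish. The paper reports that it gives better representations than multilingual BERT on a downstream parsing task.

Large-scale training data

Trained on 7.9 million Irish sentences, covering a wide range of language use.

Suitability for downstream tasks

Designed for fine-tuning on a variety of Irish NLP tasks, such as text classification and named entity recognition.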

Model capabilities

Understanding Irish text

Feature extraction from Irish text

Masked-token prediction for Irish text
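The masked-token prediction capability listed above can be tried with the Hugging Face `fill-mask` pipeline. The following is a minimal sketch, not the authors' code. It assumes the model is published on the Hub under the id `DCU-NLP/bert-base-irish-cased-v1` and that it uses the standard BERT mask token `[MASK]`. The helper names `mask_word` and `predict_masked` are hypothetical.

```python
MODEL_ID = "DCU-NLP/bert-base-irish-cased-v1"  # assumption: Hub id of this model


def mask_word(sentence: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first whitespace-delimited occurrence of `word` with the mask token."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok == word:
            tokens[i] = mask_token
            return " ".join(tokens)
    raise ValueError(f"{word!r} not found in sentence")


def predict_masked(masked_sentence: str, top_k: int = 5):
    """Return (token, score) pairs for the single masked position."""
    # Imported lazily so that mask_word() can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    results = fill_mask(masked_sentence, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `predict_masked(mask_word("Is maith liom tae", "tae"))` would ask the model for candidates to complete "Is maith liom [MASK]". Note that `mask_word` splits on whitespace only, so a word followed directly by punctuation will not match.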

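Use cases

Natural language processing

Irish text classification

Sentiment analysis or topic classification of Irish text.

Irish named entity recognition

Identifies named entities such as person and place names in Irish text.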
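When fine-tuning a BERT-style model for named entity recognition, word-level labels have to be mapped onto subword tokens. The sketch below shows one common convention, which is an assumption here and not something the model card specifies: only the first subword of each word keeps its label, and every other position gets `-100` so that the loss ignores it. The `word_ids` list is what a fast Hugging Face tokenizer returns from `BatchEncoding.word_ids()` when called with `is_split_into_words=True`. The function name and the integer label scheme are hypothetical.

```python
from typing import List, Optional


def align_labels(word_ids: List[Optional[int]],
                 word_labels: List[int],
                 ignore_index: int = -100) -> List[int]:
    """Map word-level labels onto subword tokens.

    Special tokens (word id None) and continuation subwords receive
    ignore_index; the first subword of each word receives the word's label.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned
```

The aligned list can then be passed as the `labels` field of a token-classification training example.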

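Irish multiword expression identification

Identifies verbal multiword expressions in Irish.

According to the paper, the fine-tuned gaBERT model outperforms a fine-tuned multilingual BERT (mBERT) model on this task.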

🚀 bert-base-irish-cased-v1

gaBERT is a BERT-base model trained on 7.9 million Irish sentences. For details such as the hyperparameters and the corpora used for pretraining, see our paper (Barry et al., 2022).

🚀 Quick start

This section gives an overview of the bert-base-irish-cased-v1 model and its basic usage.
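As a starting point, the sketch below loads the model with the `transformers` library and turns Irish sentences into fixed-size feature vectors. Several things here are assumptions rather than facts from the model card: the Hub id `DCU-NLP/bert-base-irish-cased-v1`, the availability of PyTorch weights (if only TensorFlow weights are available, `from_pretrained(..., from_tf=True)` may be needed), and the choice of mean pooling over non-padding tokens, which is one common way to get a sentence vector and not necessarily what the authors use. The names `mean_pool` and `embed` are hypothetical.

```python
from typing import List, Sequence

MODEL_ID = "DCU-NLP/bert-base-irish-cased-v1"  # assumption: Hub id of this model


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def embed(sentences: List[str]) -> List[List[float]]:
    """Return one mean-pooled vector per input sentence."""
    # Imported lazily so that mean_pool() can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, seq_len, hidden)

    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed(["Dia duit.", "Is maith liom tae."])` would return two vectors of the model's hidden size, which can serve as input features for a downstream classifier.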

✨ Key features

An encoder-based Transformer model that can be used to obtain features for fine-tuning on downstream Irish tasks.

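📚 Documentation

Model description

An encoder-based Transformer used to obtain features for fine-tuning on downstream Irish tasks.

Intended uses and limitations

Some of the data used to pretrain gaBERT was scraped from the web and may contain ethically problematic text (bias, hate, adult content, and so on). Downstream tasks and applications that use gaBERT should therefore be thoroughly tested with respect to ethical considerations.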
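One narrow way to start the testing recommended above is to probe the model's masked-token completions on a set of template sentences and collect any completions that appear in a review list. The sketch below is a hypothetical harness, not a method described by the authors. The predictor is passed in as a plain function, so any fill-mask wrapper can be plugged in, and the templates and the list of flagged terms are left to the user. It covers only one kind of check and does not replace a task-specific evaluation.

```python
from typing import Callable, Dict, Iterable, List, Set

# A predictor maps a masked template to a list of candidate completions.
Predictor = Callable[[str], List[str]]


def audit_templates(predict: Predictor,
                    templates: Iterable[str],
                    flagged_terms: Set[str]) -> Dict[str, List[str]]:
    """Return, for each template, the completions found in flagged_terms.

    flagged_terms is expected to be lowercase; completions are stripped and
    lowercased before comparison. Templates with no hits are omitted.
    """
    report: Dict[str, List[str]] = {}
    for template in templates:
        hits = [tok for tok in predict(template)
                if tok.strip().lower() in flagged_terms]
        if hits:
            report[template] = hits
    return report
```

An empty report means only that none of the listed terms appeared among the returned candidates for these templates.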

Training hyperparameters

The following hyperparameters were used during training:

Optimizer: None
Training precision: float32

Framework versions

Transformers 4.20.1
TensorFlow 2.9.1
Datasets 2.3.2
Tokenizers 0.12.1
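To compare a local environment against the versions listed above, the standard library's `importlib.metadata` can be used. This is a small sketch; the package names are assumed to be the usual PyPI distribution names (`transformers`, `tensorflow`, `datasets`, `tokenizers`), and the function names are hypothetical. A mismatch does not necessarily mean the model will fail to load.

```python
from importlib import metadata
from typing import Callable, Dict

# Versions as listed in the model card.
EXPECTED = {
    "transformers": "4.20.1",
    "tensorflow": "2.9.1",
    "datasets": "2.3.2",
    "tokenizers": "0.12.1",
}


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def version_mismatches(expected: Dict[str, str],
                       lookup: Callable[[str], str] = installed_version) -> Dict[str, str]:
    """Return {package: found_version} for every package that differs from expected."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = found
    return mismatches
```

Calling `version_mismatches(EXPECTED)` returns an empty dictionary when the environment matches the listed versions exactly.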

BibTeX entry and citation info

If you use this model in your research, please consider citing our paper:

@inproceedings{barry-etal-2022-gabert,
    title = "ga{BERT} {---} an {I}rish Language Model",
    author = "Barry, James  and
      Wagner, Joachim  and
      Cassidy, Lauren  and
      Cowap, Alan  and
      Lynn, Teresa  and
      Walsh, Abigail  and
      {\'O} Meachair, M{\'\i}che{\'a}l J.  and
      Foster, Jennifer",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.511",
    pages = "4774--4788",
    abstract = "The BERT family of neural language models have become highly popular due to their ability to provide sequences of text with rich context-sensitive token encodings which are able to generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.",
}

Property	Details
Model type	Encoder-based Transformer
Training data	7.9 million Irish sentences
