ViSoBERT open-source language model - built for Vietnamese social media text processing, with strong performance

Developed by uitnlp

ViSoBERT is the first monolingual pre-trained language model built for Vietnamese social media text. It is based on the XLM-R architecture and performs strongly on several Vietnamese social media tasks.

Large language model

Transformers

Other tags: #Vietnamese social media #hate speech detection #sentiment analysis

Downloads: 2,260

Release date: 10/17/2023

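Model overview

ViSoBERT is a pre-trained language model designed specifically for Vietnamese social media text. It is suited to tasks such as sentiment analysis, hate speech detection, spam detection, and emotion recognition.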

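Model features

Monolingual pre-training

The first monolingual MLM built for Vietnamese social media text, tuned to the characteristics of the Vietnamese language.

Optimized for social media

Pre-trained on a large-scale, high-quality, and diverse corpus of Vietnamese social media text, so it is adapted to the way people write on social platforms.

Strong performance across tasks

Surpasses the previous best models on emotion recognition, hate speech detection, sentiment analysis, and spam comment detection.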

Model capabilities

Vietnamese text understanding

Sentiment analysis

Hate speech detection

Spam detection

Emotion recognition

Mask filling
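
As a rough illustration of the mask filling capability, the sketch below uses the Hugging Face `fill-mask` pipeline. The helper names `mask_word` and `fill_masked_word` are hypothetical, and the sketch assumes the `uitnlp/visobert` checkpoint loads with a masked language modeling head. The mask token is read from the tokenizer rather than hard-coded.

```python
def mask_word(sentence, index, mask_token):
    """Replace the whitespace-separated word at `index` with `mask_token`."""
    words = sentence.split()
    words[index] = mask_token
    return " ".join(words)


def fill_masked_word(sentence, index, top_k=5, model_name="uitnlp/visobert"):
    """Mask one word of `sentence` and return the model's top_k candidates."""
    # Imported lazily so that mask_word works without transformers installed.
    from transformers import pipeline

    # Downloads the checkpoint on first use (assumes it has an MLM head).
    unmasker = pipeline("fill-mask", model=model_name)
    masked = mask_word(sentence, index, unmasker.tokenizer.mask_token)
    return unmasker(masked, top_k=top_k)
```

Calling `fill_masked_word('hào quang rực rỡ', 2)` would return a list of candidate dictionaries with fields such as `token_str` and `score`. The actual candidates depend on the checkpoint.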

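Use cases

Social media content moderation

Hate speech detection

Automatically identifies hate speech content on Vietnamese social media

Detection accuracy exceeds that of the previous best models

Spam filtering

Detects spam comments on Vietnamese social media platforms

Efficiently identifies various types of spam

Sentiment analysis

User emotion recognition

Analyzes the emotional tendencies of Vietnamese social media users

Accurately identifies multiple emotional states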
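
The use cases above are classification tasks, which normally mean fine-tuning the encoder with a task-specific head. The following is a minimal sketch, assuming the standard `AutoModelForSequenceClassification` route in transformers. The label names and the helpers `build_label_maps` and `load_classifier` are illustrative and are not part of the ViSoBERT release.

```python
def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of label names."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_classifier(labels, model_name="uitnlp/visobert"):
    """Load the encoder with a new classification head sized to `labels`."""
    # Imported lazily so that build_label_maps works without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

For example, `load_classifier(["clean", "offensive", "hate"])` (hypothetical labels) returns a tokenizer and a model ready for fine-tuning. The classification head is newly initialized, so its predictions are not meaningful until the model has been trained on labeled data.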

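🚀 ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (EMNLP 2023 - Main)

ViSoBERT is a state-of-the-art language model for Vietnamese social media tasks. It is a monolingual MLM (XLM-R architecture) trained specifically on Vietnamese social media text. It outperforms existing monolingual, multilingual, and multilingual social media approaches, setting new state-of-the-art results on four downstream Vietnamese social media tasks.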

🚀 Quick start

To get started with ViSoBERT, install the required packages and then load the model with a few lines of code. The details follow.

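✨ Key features

A monolingual MLM (XLM-R architecture) built specifically for Vietnamese social media text.
Outperforms existing monolingual, multilingual, and multilingual social media approaches, setting new state-of-the-art results on four downstream Vietnamese social media tasks.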

📦 Installation

Install the transformers and SentencePiece packages. The usage example also imports torch, so PyTorch needs to be installed as well.

pip install transformers
pip install sentencepiece
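
Before loading the model, it can help to confirm that the packages are importable. This is a small sketch using only the standard library. The helper name `missing_packages` is hypothetical, and `torch` is listed only because the usage example imports it.

```python
import importlib.util

# Import names (not pip names) that the usage example relies on.
REQUIRED = ["transformers", "sentencepiece", "torch"]


def missing_packages(names):
    """Return the names from `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```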

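💻 Usage examples

Basic usage

from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub.
model = AutoModel.from_pretrained('uitnlp/visobert')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/visobert')

# Tokenize a sample Vietnamese phrase and return PyTorch tensors.
encoding = tokenizer('hào quang rực rỡ', return_tensors='pt')

# Run a forward pass without tracking gradients (inference only).
with torch.no_grad():
    output = model(**encoding)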
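
The `output` above holds one hidden vector per token. When a single vector per sentence is needed, one common option (an assumption here, not something the model card prescribes) is to average the token vectors while ignoring padding. `mean_pool` and `embed` are hypothetical helper names.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    last_hidden_state: tensor of shape (batch, seq_len, hidden)
    attention_mask:    tensor of shape (batch, seq_len), 1 for real tokens
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)  # guard against division by zero
    return summed / counts


def embed(texts, model_name="uitnlp/visobert"):
    """Return one mean-pooled vector per input text."""
    # Imported lazily so that mean_pool can be reused on existing tensors.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to a common length so several texts can be processed in one batch.
    encoding = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoding)
    return mean_pool(output.last_hidden_state, encoding["attention_mask"])
```

`embed(['hào quang rực rỡ'])` would return a tensor of shape `(1, hidden_size)`. Whether mean pooling or another strategy works best depends on the downstream task.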

📚 Documentation

Model overview

The general architecture and experimental results of ViSoBERT are described in the paper:

@inproceedings{nguyen-etal-2023-visobert,
    title = "{V}i{S}o{BERT}: A Pre-Trained Language Model for {V}ietnamese Social Media Text Processing",
    author = "Nguyen, Nam  and
      Phan, Thang  and
      Nguyen, Duc-Vu  and
      Nguyen, Kiet",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.315",
    pages = "5191--5207",
    abstract = "English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes. Disclaimer: This paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.",
}