ViSoBERT Open-Source Language Model - Excellent Performance for Vietnamese Social Media Text Processing

ViSoBERT

Developed by uitnlp

ViSoBERT is the first monolingual pre-trained language model designed specifically for Vietnamese social media text. It is based on the XLM-R architecture and outperforms previous models on several Vietnamese social media tasks.

Large Language Model

Transformers

Tags: Vietnamese social media, Hate speech detection, Sentiment analysis

Release Time: 10/17/2023

Model Overview

ViSoBERT is a pre-trained language model tailored for Vietnamese social media text processing, suitable for tasks such as sentiment analysis, hate speech detection, spam detection, and emotion recognition.

Model Features

Monolingual pre-training

The first monolingual MLM model specifically built for Vietnamese social media text, optimized for Vietnamese language characteristics.

Social media optimization

Pre-trained on a large-scale, high-quality, and diverse Vietnamese social media corpus, adapting to the characteristics of social media text.

Outstanding multi-task performance

Surpasses previous state-of-the-art models in tasks such as emotion recognition, hate speech detection, sentiment analysis, and spam comment detection.

Model Capabilities

Vietnamese text understanding

Sentiment analysis

Hate speech detection

Spam detection

Emotion recognition

Masked language modeling

Use Cases

Social media content moderation

Hate speech detection

Automatically identify hate speech content in Vietnamese social media

Detection accuracy surpasses previous state-of-the-art models

Spam filtering

Detect spam comments on Vietnamese social media platforms

Efficiently identifies various types of spam

Sentiment analysis

User emotion recognition

Analyze the emotional tendencies of Vietnamese social media users

Accurately identifies multiple emotional states
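
The use cases above are all text classification tasks, so one plausible way to approach them is to put a classification head on top of the pre-trained encoder and fine-tune it on labeled data. The sketch below is a minimal illustration under stated assumptions, not the authors' training setup: the three-way label set `ID2LABEL` and the helper names `logits_to_labels` and `load_classifier` are hypothetical. The head loaded this way is randomly initialized, so its predictions are meaningless until the model has been fine-tuned.

```python
import torch


def logits_to_labels(logits: torch.Tensor, id2label: dict) -> list:
    """Map each row of logits to the name of its highest-scoring class."""
    return [id2label[i] for i in logits.argmax(dim=-1).tolist()]


# Hypothetical label set for illustration; replace with the labels of your dataset.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def load_classifier(model_name: str = "uitnlp/visobert"):
    """Load the encoder with a new, randomly initialized classification head."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id={name: i for i, name in ID2LABEL.items()},
    )
    return tokenizer, model
```

After fine-tuning, the logits from `model(**encoding).logits` can be passed to `logits_to_labels` to obtain readable predictions.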

🚀 ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (EMNLP 2023 - Main)

ViSoBERT is a pre-trained language model tailored for Vietnamese social media text processing. It addresses the limitations of existing models in handling Vietnamese social media tasks and achieves state-of-the-art performance.

🚀 Quick Start

ViSoBERT is a language model designed for Vietnamese social media tasks. It is the first monolingual MLM (using the XLM-R architecture) built specifically for Vietnamese social media texts. It outperforms previous models on four downstream Vietnamese social media tasks.

✨ Features

Monolingual Focus: ViSoBERT is the first monolingual MLM built specifically for Vietnamese social media texts, leveraging the XLM-R architecture.
State-of-the-art Performance: It surpasses previous monolingual, multilingual, and multilingual social media approaches, achieving new state-of-the-art results on four downstream Vietnamese social media tasks.

📦 Installation

Install the transformers and SentencePiece packages. The usage example also imports PyTorch, so install it as well:

pip install transformers
pip install sentencepiece
pip install torch

💻 Usage Examples

Basic Usage

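from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('uitnlp/visobert')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/visobert')

# Tokenize a sample sentence into PyTorch tensors
encoding = tokenizer('hào quang rực rỡ', return_tensors='pt')

# Run a forward pass without tracking gradients (inference only)
with torch.no_grad():
    output = model(**encoding)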
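
The forward pass above returns one hidden vector per token. If a single vector per sentence is needed (for similarity search or as features for a separate classifier, for example), one common option is mean pooling over the non-padding tokens. The sketch below is an assumption about how one might do this, not a method described by the authors; `mean_pool` and `embed` are hypothetical helper names.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sequence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for a row that is entirely padding
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(texts, model_name="uitnlp/visobert"):
    """Return one vector per input text (downloads the checkpoint on first use)."""
    # Imported lazily so mean_pool can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    encoding = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoding)
    return mean_pool(output.last_hidden_state, encoding["attention_mask"])
```

Calling `embed(['hào quang rực rỡ'])` would return a tensor with one row per input text. Masking matters once several texts of different lengths are batched together, because padding tokens would otherwise skew the average.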

📚 Documentation

The general architecture and experimental results of ViSoBERT can be found in our paper.

@inproceedings{nguyen-etal-2023-visobert,
    title = "{V}i{S}o{BERT}: A Pre-Trained Language Model for {V}ietnamese Social Media Text Processing",
    author = "Nguyen, Nam  and
      Phan, Thang  and
      Nguyen, Duc-Vu  and
      Nguyen, Kiet",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.315",
    pages = "5191--5207",
    abstract = "English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes. Disclaimer: This paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.",
}

The pretraining dataset of our paper is available at: Pretraining dataset

⚠️ Important Note

The paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.

💡 Usage Tip

Please cite our paper when ViSoBERT is used to help produce published results or is incorporated into other software.

Property	Details
Pipeline Tag	fill-mask
Language	Vietnamese
Tags	Vietnamese, Social Media, Vietnamese Pre-trained Model, Sentiment Analysis, Hate Speech Detection, Spam Detection, Emotion Recognition
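
Because the model is tagged for fill-mask, it can presumably be queried directly for masked-token predictions through the transformers `pipeline` API. The following is a minimal sketch under that assumption: `with_mask` and `top_fills` are hypothetical helper names, and the `{mask}` placeholder is a convention of this example only. The mask token is read from the tokenizer rather than hard-coded, and the template is assumed to contain exactly one placeholder.

```python
def with_mask(template: str, mask_token: str) -> str:
    """Replace the {mask} placeholder with the tokenizer's actual mask token."""
    return template.replace("{mask}", mask_token)


def top_fills(template: str, model_name: str = "uitnlp/visobert", top_k: int = 5):
    """Return (token, score) pairs for the single masked position in the template."""
    # Imported lazily so with_mask can be used without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    text = with_mask(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

For example, `top_fills('hào quang {mask} rỡ')` would return the model's top five candidates for the missing word together with their scores.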
