vibert4news-base-cased Open-source Model - Free Deployment for Vietnamese News Sentiment Analysis

Vibert4news Base Cased

Developed by NlpHUST

This is a BERT model pre-trained on more than 20 GB of Vietnamese news data. It is suitable for tasks such as sentiment analysis and scored 0.90268 on the public leaderboard for the AIViVN comment dataset.

Release Time : 3/2/2022

Model Overview

This BERT model is designed specifically for Vietnamese. It was trained on more than 20 GB of news data and is suitable for natural language processing tasks such as sentiment analysis, word segmentation, and named entity recognition.

Model Features

Large-scale news data training

Trained on over 20GB of Vietnamese news datasets, with strong language understanding capabilities

Multi-task applicability

Suitable for various natural language processing tasks such as sentiment analysis, word segmentation, and named entity recognition

High-performance results

Achieved a score of 0.90268 on the public leaderboard for the AIViVN comment dataset, surpassing the winner's score of 0.90087

Model Capabilities

Vietnamese text understanding

Sentiment analysis

Word segmentation

Named entity recognition

Use Cases

Sentiment analysis

Comment sentiment analysis

Analyze the sentiment tendency of Vietnamese comments

Achieved a score of 0.90268 on the AIViVN dataset

Text processing

Vietnamese word segmentation

Perform word segmentation on Vietnamese text

Achieved an F1 score of 0.984 on the VLSP 2013 dataset

Named entity recognition

Identify named entities in Vietnamese text

Achieved an F1 score of 0.786 on the VLSP 2018 dataset

🚀 BERT for Vietnamese Trained on Over 20 GB News Dataset

This project applies BERT to Vietnamese language tasks, leveraging a large news dataset. It's used for sentiment analysis and integrated into a Vietnamese NLP toolkit, achieving high scores on relevant benchmarks.

🚀 Quick Start

This BERT model for Vietnamese is trained on a news dataset exceeding 20 GB. It's applied to the task of sentiment analysis using AIViVN's comments dataset.

The model achieved a score of 0.90268 on the public leaderboard (the winner's score was 0.90087). Bert4news is also used in the Vietnamese toolkit ViNLP (https://github.com/bino282/ViNLP) for word segmentation and named entity recognition.

We use word-level SentencePiece with basic BERT tokenization, and the same configuration as BERT base with lowercase = False (the model is cased).

You can download the trained model from the following links:

tensorflow.
pytorch.

Usage with huggingface/transformers

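import torch
from transformers import BertModel, BertTokenizer

# Load the pre-trained tokenizer and encoder from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
bert_model = BertModel.from_pretrained("NlpHUST/vibert4news-base-cased")

line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
# Encode the sentence, adding the [CLS] and [SEP] special tokens.
input_id = tokenizer.encode(line, add_special_tokens=True)
# Attend to every non-padding token (this assumes the padding id is 0).
att_mask = [int(token_id > 0) for token_id in input_id]
# Wrap both lists in a batch dimension of size 1.
input_ids = torch.tensor([input_id])
att_masks = torch.tensor([att_mask])
# Run the encoder without tracking gradients.
with torch.no_grad():
    features = bert_model(input_ids, attention_mask=att_masks)

print(features)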
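
The snippet above encodes a single sentence, so its mask is all ones. When several sentences of different lengths go into one batch, the shorter ones need padding and the mask has to mark the padded positions. The following is a minimal sketch of that step in plain Python. The helper name `pad_batch` and the example token ids are hypothetical, and the padding id of 0 is an assumption taken from the `token_id > 0` test used above.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: padding id is 0, as the `token_id > 0` mask implies


def pad_batch(
    batch_ids: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every id sequence to the longest one and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # real ids, then padding
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return padded, masks


# Hypothetical token ids standing in for two tokenizer.encode(...) results.
example_ids = [[2, 15, 27, 3], [2, 8, 3]]
padded_ids, att_masks = pad_batch(example_ids)
print(padded_ids)  # [[2, 15, 27, 3], [2, 8, 3, 0]]
print(att_masks)   # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Both lists can then be wrapped in torch.tensor and passed to the model as the input ids and the attention mask, in the same way as the single-sentence example.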

✨ Features

Vietnamese Toolkit with BERT

ViNLP is a Vietnamese annotation system. It fine-tunes the pre-trained Bert4news model for Vietnamese NLP components such as word segmentation and named entity recognition (NER), achieving high accuracy.

📦 Installation

git clone https://github.com/bino282/ViNLP.git
cd ViNLP
python setup.py develop build

💻 Usage Examples

Basic Usage

Test Segmentation

The model achieved an F1 score of 0.984 on the VLSP 2013 dataset.

Model	F1 (%)
BertVnTokenizer	98.40
DongDu	96.90
JvnSegmenter-Maxent	97.00
JvnSegmenter-CRFs	97.06
VnTokenizer	97.33
UETSegmenter	97.87
VnCoreNLP (i.e. RDRsegmenter)	97.90

from ViNLP import BertVnTokenizer
tokenizer = BertVnTokenizer()
sentences = tokenizer.split(["Tổng thống Donald Trump ký sắc lệnh cấm mọi giao dịch của Mỹ với ByteDance và Tecent - chủ sở hữu của 2 ứng dụng phổ biến TikTok và WeChat sau 45 ngày nữa."])
print(sentences[0])

Tổng_thống Donald_Trump ký sắc_lệnh cấm mọi giao_dịch của Mỹ với ByteDance và Tecent - chủ_sở_hữu của 2 ứng_dụng phổ_biến TikTok và WeChat sau 45 ngày nữa .
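
In the output above, syllables that belong to one word are joined with an underscore. A minimal sketch of how such output could be turned back into a word list and scored against a reference segmentation is shown below. The function names and the gold segmentation are hypothetical, and the span-level F1 is a common way to score word segmentation, not necessarily the exact script behind the VLSP 2013 figure. The sketch also assumes that no original token contains a literal underscore.

```python
from typing import List, Set, Tuple


def words_from_segmented(segmented: str) -> List[str]:
    """Split segmenter output on whitespace and turn '_' joins back into spaces."""
    return [token.replace("_", " ") for token in segmented.split()]


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map each word to its (start, end) syllable offsets in the sentence."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word.split())
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold: List[str], predicted: List[str]) -> float:
    """Span-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


predicted = words_from_segmented("Tổng_thống Donald_Trump ký sắc_lệnh cấm mọi giao_dịch")
# Hypothetical gold segmentation of the same fragment.
gold = ["Tổng thống", "Donald Trump", "ký", "sắc lệnh", "cấm", "mọi", "giao dịch"]
print(len(predicted))                    # 7
print(segmentation_f1(gold, predicted))  # 1.0
```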

Test Named Entity Recognition

The model achieved an F1 score of 0.786 on the VLSP 2018 dataset for all named entities including nested entities.

Model	F1 (%)
BertVnNer	78.60
VNER Attentive Neural Network	77.52
vietner CRF (ngrams + word shapes + cluster + w2v)	76.63
ZA-NER BiLSTM	74.70

from ViNLP import BertVnNer
bert_ner_model = BertVnNer()
sentence = "Theo SCMP, báo cáo của CSIS với tên gọi Định hình Tương lai Chính sách của Mỹ với Trung Quốc cũng cho thấy sự ủng hộ tương đối rộng rãi của các chuyên gia về việc cấm Huawei, tập đoàn viễn thông khổng lồ của Trung Quốc"
entities = bert_ner_model.annotate([sentence])
print(entities)

[{'ORGANIZATION': ['SCMP', 'CSIS', 'Huawei'], 'LOCATION': ['Mỹ', 'Trung Quốc']}]
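
The printed result is a list with one dictionary per input sentence, mapping each entity label to the mentions found. For downstream use it is often easier to work with flat (mention, label) pairs. The sketch below shows one way to do that. The helper `flatten_entities` is hypothetical, and the output structure is assumed from the single example printed above rather than from ViNLP documentation.

```python
from typing import Dict, List, Tuple


def flatten_entities(annotation: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Turn {label: [mention, ...]} into a list of (mention, label) pairs."""
    return [
        (mention, label)
        for label, mentions in annotation.items()
        for mention in mentions
    ]


# The annotation printed above, copied as a literal so the sketch runs without ViNLP.
entities = [{"ORGANIZATION": ["SCMP", "CSIS", "Huawei"], "LOCATION": ["Mỹ", "Trung Quốc"]}]
for mention, label in flatten_entities(entities[0]):
    print(f"{mention}\t{label}")
```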

Advanced Usage

Run Training with Base Config

python train_pytorch.py \
  --model_path=bert4news.pytorch \
  --max_len=200 \
  --batch_size=16 \
  --epochs=6 \
  --lr=2e-5

📄 Contact Information

For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).
