PhoBERT-Base-V2: Open-source Vietnamese Pre-trained Model for Free Deployment to Boost Various NLP Tasks

PhoBERT Base V2

Developed by vinai

PhoBERT is a state-of-the-art pretrained language model for Vietnamese. It follows the RoBERTa pre-training approach and performs strongly on several Vietnamese NLP tasks.

Release Time : 4/24/2023

Model Overview

PhoBERT is a large-scale monolingual pretrained language model for Vietnamese, optimized based on the RoBERTa architecture, suitable for various Vietnamese natural language processing tasks.

Model Features

Vietnamese Optimization

The first publicly available large-scale monolingual pretrained language model specifically for Vietnamese

High Performance

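Outperforms previous monolingual and multilingual approaches on four Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference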

Two Sizes

Offers model choices in two parameter sizes: base (135M) and large (370M)

Professional Segmentation

Uses VnCoreNLP's RDRSegmenter for Vietnamese text preprocessing

Model Capabilities

Vietnamese text understanding

Vietnamese part-of-speech tagging

Vietnamese syntactic analysis

Vietnamese named entity recognition

Vietnamese natural language inference

Use Cases

Academic Research

Vietnamese Linguistic Analysis

Used for research on Vietnamese grammar and syntactic structures

Provides accurate part-of-speech tagging and dependency parsing

Commercial Applications

Vietnamese Text Processing

Used in commercial scenarios such as Vietnamese customer service systems and content analysis

Improves the accuracy and efficiency of Vietnamese text processing

🚀 PhoBERT: Pre-trained language models for Vietnamese

PhoBERT offers state-of-the-art pre-trained language models tailored for Vietnamese, achieving new benchmarks on multiple NLP tasks.

🚀 Quick Start

PhoBERT provides two pre-trained models, "base" and "large", which are the first public large-scale monolingual language models for Vietnamese. Its pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure. PhoBERT outperforms previous methods on four downstream Vietnamese NLP tasks.

✨ Features

State-of-the-art: PhoBERT sets new performance records on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.
Monolingual focus: The "base" and "large" versions are the first public large-scale monolingual pre-trained models for Vietnamese.
Optimized pre-training: Based on RoBERTa, it optimizes the BERT pre-training process for better performance.

📦 Installation

Using with `transformers`

Install `transformers` with pip (`pip install transformers`), or install `transformers` from source. Note that a slow tokenizer for PhoBERT has been merged into the main `transformers` branch. Merging a fast tokenizer is still under discussion, as mentioned in this pull request. If you want to use the fast tokenizer, install `transformers` as follows:

git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .

Install tokenizers with pip: pip3 install tokenizers

For word segmentation (if input texts are raw)

pip install py_vncorenlp

💻 Usage Examples

Using PhoBERT with `transformers`

Basic Usage

import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds the contextual token embeddings

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")

Word segmentation example (using `py_vncorenlp`)

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local machine folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load the word and sentence segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

output = rdrsegmenter.word_segment(text)

print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
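
The two examples above can be chained: segment the raw text first, then encode each segmented sentence. The following is a minimal sketch rather than part of the official PhoBERT code. The helper name `encode_raw_text` is hypothetical. It assumes a segmenter object exposing `word_segment` (as `py_vncorenlp.VnCoreNLP` does above) and a tokenizer exposing `encode` with `truncation` and `max_length` arguments (as `transformers` tokenizers do).

```python
from typing import Any, List


def encode_raw_text(text: str, segmenter: Any, tokenizer: Any,
                    max_length: int = 256) -> List[List[int]]:
    """Word-segment raw Vietnamese text, then encode each sentence.

    Hypothetical helper for illustration only. `segmenter` is expected to
    provide word_segment(text) -> list of sentence strings, and `tokenizer`
    is expected to provide encode(sentence, truncation=..., max_length=...)
    -> list of token IDs.
    """
    # One word-segmented string per sentence, e.g. 'Bà Lan , vợ ông Chúc ...'
    sentences = segmenter.word_segment(text)
    # Encode sentence by sentence, cutting anything beyond max_length tokens
    return [
        tokenizer.encode(sentence, truncation=True, max_length=max_length)
        for sentence in sentences
    ]


# Intended use with the objects created in the examples above:
#   id_lists = encode_raw_text(text, rdrsegmenter, tokenizer)
```

Passing the segmenter and tokenizer as arguments keeps the helper independent of how they were loaded, so it can be tried with small stand-in objects before downloading any model files.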

📚 Documentation

Pre-trained models

Property	Details
Model Type	`vinai/phobert-base`, `vinai/phobert-large`, `vinai/phobert-base-v2`
#params	135M (for `vinai/phobert-base` and `vinai/phobert-base-v2`), 370M (for `vinai/phobert-large`)
Arch.	base (for `vinai/phobert-base` and `vinai/phobert-base-v2`), large (for `vinai/phobert-large`)
Max length	256
Training Data	20GB of Wikipedia and News texts (for `vinai/phobert-base` and `vinai/phobert-large`); 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301 (for `vinai/phobert-base-v2`)
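
The maximum length of 256 in the table above means longer inputs have to be truncated or split. In practice, passing `truncation=True, max_length=256` to the `transformers` tokenizer handles truncation. As an alternative, the sketch below splits a long list of token IDs into windows that each fit the limit once start and end tokens are added. The function name `split_into_windows` and the parameters `cls_id` and `sep_id` are assumptions for illustration; the real special-token IDs should be read from the tokenizer (for example `tokenizer.cls_token_id` and `tokenizer.sep_token_id`).

```python
from typing import List

MAX_LENGTH = 256  # maximum sequence length listed in the table above


def split_into_windows(token_ids: List[int], cls_id: int, sep_id: int,
                       max_length: int = MAX_LENGTH) -> List[List[int]]:
    """Split token IDs (without special tokens) into windows of at most
    max_length tokens each, counting the added start and end tokens."""
    body = max_length - 2  # room left after the start and end tokens
    if body <= 0:
        raise ValueError("max_length must be greater than 2")
    windows = []
    for start in range(0, len(token_ids), body):
        chunk = token_ids[start:start + body]
        windows.append([cls_id] + chunk + [sep_id])
    return windows


# Example with 600 dummy token IDs and placeholder special-token IDs 0 and 2
windows = split_into_windows(list(range(10, 610)), cls_id=0, sep_id=2)
print([len(w) for w in windows])  # [256, 256, 94]
```

The windows here do not overlap, so context is lost at each boundary. Whether that is acceptable depends on the downstream task.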

Using PhoBERT with `fairseq`

For details, see the fairseq instructions in the PhoBERT GitHub repository.

🔧 Technical Details

The general architecture and experimental results of PhoBERT can be found in our paper:

@inproceedings{phobert,
title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year      = {2020},
pages     = {1037--1042}
}

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

📄 License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

⚠️ Important Note

If the input texts are raw, i.e. not word-segmented, a word segmenter must be applied to produce word-segmented texts before they are fed to PhoBERT. Because PhoBERT used the RDRSegmenter from VnCoreNLP to pre-process its pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter on raw input texts in PhoBERT-based downstream applications.
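
To make the expected input format concrete, the sketch below builds a word-segmented string from words whose boundaries are already known, joining the syllables of each multi-syllable word with underscores. The helper name `join_segmented_words` is hypothetical. It only illustrates the format; it does not replace RDRSegmenter and performs no tone normalization.

```python
from typing import List


def join_segmented_words(words: List[str]) -> str:
    """Join words into the underscore-linked format shown in the usage example.

    Each item is one word. Syllables of a multi-syllable word are separated
    by spaces in the input and joined with "_" in the output.
    """
    return " ".join("_".join(word.split()) for word in words)


print(join_segmented_words(["Chúng tôi", "là", "những", "nghiên cứu viên", "."]))
# Chúng_tôi là những nghiên_cứu_viên .
```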
