vi-word-segmentation: Open-Source Vietnamese Word Segmentation Model

Vi Word Segmentation

Developed by NlpHUST

A Vietnamese word segmentation model based on the ELECTRA architecture and fine-tuned on the VLSP 2013 dataset.

Sequence Labeling

Transformers

Other #Vietnamese word segmentation #High-precision F1 score #ELECTRA fine-tuning

Downloads 1,756

Release Time: 10/30/2022

Model Overview

This model is designed for Vietnamese text segmentation. It identifies word boundaries in Vietnamese text and is suitable as a preprocessing step in natural language processing pipelines.

Model Features

High-precision segmentation

Achieves a 98.35% F1 score on the VLSP 2013 evaluation set

Based on ELECTRA architecture

Fine-tuned from the pre-trained ELECTRA model NlpHUST/electra-base-vn

Domain adaptation

Aimed at government documents and socio-economic texts; the reported metrics cover only the VLSP 2013 evaluation set

Model Capabilities

Vietnamese text segmentation

Terminology recognition

Compound word identification (joining syllables into multi-syllable words)

Use Cases

Government document processing

Parliament document analysis

Automatic segmentation of Vietnamese parliamentary discussion documents

Accurately segments professional terms and compound words in government documents

Socio-economic research

Socio-economic report processing

Automatic processing of Vietnamese socio-economic situation reports

Correctly identifies professional vocabulary in economic fields

🚀 Vi-Word Segmentation

This model is fine-tuned for Vietnamese word segmentation. It splits Vietnamese text into words and can serve as a preprocessing step for other NLP tasks.

🚀 Quick Start

This model is a fine-tuned version of [NlpHUST/electra-base-vn](https://huggingface.co/NlpHUST/electra-base-vn) on the VLSP 2013 Vietnamese word segmentation dataset. It achieves the following results on the evaluation set:

Loss: 0.0501
Precision: 0.9833
Recall: 0.9838
F1: 0.9835
Accuracy: 0.9911
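As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values reported here; the function name `f1_score` is illustrative and not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall from the evaluation results above.
f1 = f1_score(0.9833, 0.9838)
print(round(f1, 4))  # 0.9835
```

Because the inputs are already rounded to four decimals, the recomputed value can differ from the reported F1 in the last digit in other cases; here it agrees.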

✨ Features

Fine-tuned from [NlpHUST/electra-base-vn](https://huggingface.co/NlpHUST/electra-base-vn) for Vietnamese word segmentation.
Reaches 0.9833 precision, 0.9838 recall, 0.9835 F1, and 0.9911 accuracy on the evaluation set.

💻 Usage Examples

Basic Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and token-classification model.
tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vi-word-segmentation")
model = AutoModelForTokenClassification.from_pretrained("NlpHUST/vi-word-segmentation")

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)
example = "Phát biểu tại phiên thảo luận về tình hình kinh tế xã hội của Quốc hội sáng 28/10 , Bộ trưởng Bộ LĐ - TB&XH Đào Ngọc Dung khái quát , tại phiên khai mạc kỳ họp , lãnh đạo chính phủ đã báo cáo , đề cập tương đối rõ ràng về việc thực hiện các chính sách an sinh xã hội"

# Each result holds a token ("word") and its predicted tag ("entity").
ner_results = nlp(example)
example_tok = ""
for e in ner_results:
    if "##" in e["word"]:
        # WordPiece continuation: append to the previous token without a separator.
        example_tok = example_tok + e["word"].replace("##", "")
    elif e["entity"] == "I":
        # Tag "I" marks a syllable inside a word: join it with "_".
        example_tok = example_tok + "_" + e["word"]
    else:
        # Any other tag starts a new word.
        example_tok = example_tok + " " + e["word"]
# strip() drops the leading space added before the first word.
print(example_tok.strip())

Phát_biểu tại phiên thảo_luận về tình_hình kinh_tế xã_hội của Quốc_hội sáng 28 / 10 , Bộ_trưởng Bộ LĐ - TB [UNK] XH Đào_Ngọc_Dung khái_quát , tại phiên khai_mạc kỳ họp , lãnh_đạo chính_phủ đã báo_cáo , đề_cập tương_đối rõ_ràng về việc thực_hiện các chính_sách an_sinh xã_hội
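The merging loop above can also be written as a small reusable function. The following is a minimal sketch, not part of the original model card: the name `merge_bi_tokens` is hypothetical, and it assumes the pipeline returns dicts with `"word"` and `"entity"` keys and that `"I"` marks a syllable inside a word, as in the example above. The sample predictions are hand-written for illustration and are not real model output.

```python
def merge_bi_tokens(token_predictions):
    """Join word pieces and tagged syllables into a segmented string.

    Expects dicts with "word" and "entity" keys, shaped like the output of
    a token-classification pipeline. The "I" label is taken from the
    example above; any other label is treated as the start of a word.
    """
    words = []
    for pred in token_predictions:
        piece, label = pred["word"], pred["entity"]
        if piece.startswith("##") and words:
            # WordPiece continuation: glue onto the previous syllable.
            words[-1] += piece[2:]
        elif label == "I" and words:
            # Inside a word: join to the previous syllable with "_".
            words[-1] += "_" + piece
        else:
            # Beginning of a new word.
            words.append(piece)
    return " ".join(words)


# Hand-written predictions for illustration only, not real model output.
sample = [
    {"word": "Phát", "entity": "B"},
    {"word": "biểu", "entity": "I"},
    {"word": "tại", "entity": "B"},
    {"word": "phiên", "entity": "B"},
    {"word": "thảo", "entity": "B"},
    {"word": "luận", "entity": "I"},
]
print(merge_bi_tokens(sample))  # Phát_biểu tại phiên thảo_luận
```

Collecting words in a list and joining once avoids the leading space that string concatenation produces, and keeps the function testable without downloading the model.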

🔧 Technical Details

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 5.0
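Two of these values follow from the others, which the sketch below works through: the total train batch size is the per-step batch size times the gradient accumulation steps, and a linear scheduler decays the learning rate from its initial value to zero over training. This is an illustration under stated assumptions, not the training script: it assumes a single device and zero warmup steps (neither is stated in the list), takes the total of 23560 steps from the training results table, and uses hypothetical function names.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 23560  # final step in the training results table

print(effective_batch_size(8, 2))                # 16
print(linear_lr(0, TOTAL_STEPS))                 # 5e-05
print(linear_lr(TOTAL_STEPS // 2, TOTAL_STEPS))  # 2.5e-05
print(linear_lr(TOTAL_STEPS, TOTAL_STEPS))       # 0.0
```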

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0168	1.0	4712	0.0284	0.9813	0.9825	0.9819	0.9904
0.0107	2.0	9424	0.0350	0.9789	0.9814	0.9802	0.9895
0.005	3.0	14136	0.0364	0.9826	0.9843	0.9835	0.9909
0.0033	4.0	18848	0.0434	0.9830	0.9831	0.9830	0.9908
0.0017	5.0	23560	0.0501	0.9833	0.9838	0.9835	0.9911
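The table shows validation loss rising after the first epoch while F1 stays roughly flat, so the "best" checkpoint depends on which metric is used for selection. The sketch below, which simply re-enters the rows above as Python data, picks the best epoch under each criterion; it is an illustration and says nothing about how the published checkpoint was actually chosen.

```python
# (epoch, training_loss, validation_loss, precision, recall, f1, accuracy)
ROWS = [
    (1, 0.0168, 0.0284, 0.9813, 0.9825, 0.9819, 0.9904),
    (2, 0.0107, 0.0350, 0.9789, 0.9814, 0.9802, 0.9895),
    (3, 0.0050, 0.0364, 0.9826, 0.9843, 0.9835, 0.9909),
    (4, 0.0033, 0.0434, 0.9830, 0.9831, 0.9830, 0.9908),
    (5, 0.0017, 0.0501, 0.9833, 0.9838, 0.9835, 0.9911),
]

VAL_LOSS, F1 = 2, 5  # column indices in each row

best_by_loss = min(ROWS, key=lambda row: row[VAL_LOSS])
best_by_f1 = max(ROWS, key=lambda row: row[F1])

print(best_by_loss[0])  # 1
print(best_by_f1[0])    # 3 (epoch 5 ties on F1; max() returns the first match)
```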

Framework versions

Transformers 4.22.2
PyTorch 1.12.1+cu113
Datasets 2.4.0
Tokenizers 0.12.1
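To compare a local environment against these versions, something like the following could be used. It relies only on the standard library; the distribution names (`transformers`, `torch`, `datasets`, `tokenizers`) are the usual PyPI names and are an assumption here, as is the helper name `compare_versions`. Newer versions may well work; this only reports differences.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above; PyPI distribution names are assumed.
CARD_VERSIONS = {
    "transformers": "4.22.2",
    "torch": "1.12.1+cu113",
    "datasets": "2.4.0",
    "tokenizers": "0.12.1",
}


def compare_versions(expected):
    """Return {package: (expected, installed_or_None)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions(CARD_VERSIONS).items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name}: card={wanted} installed={installed} ({status})")
```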
