phobert-base-v2: Open-Source Pre-trained Language Model for Vietnamese NLP Tasks

Phobert Base V2

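Developed by vinai

PhoBERT is a state-of-the-art pre-trained language model for Vietnamese. Its pre-training approach is based on RoBERTa, and it performs strongly on several Vietnamese NLP tasks.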

Release date: 4/24/2023

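Model overview

PhoBERT is a large-scale monolingual pre-trained language model for Vietnamese. It follows the RoBERTa pre-training approach and is suitable for a wide range of Vietnamese natural language processing tasks.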

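Model features

Built for Vietnamese

The first public large-scale monolingual pre-trained language model for Vietnamese

High performance

Outperforms previous monolingual and multilingual approaches on four Vietnamese NLP tasks

Two sizes

Available in base (135M parameters) and large (370M parameters) sizes

Dedicated word segmentation

Uses the RDRSegmenter from VnCoreNLP to pre-process Vietnamese text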

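Model capabilities

Vietnamese text understanding

Vietnamese part-of-speech tagging

Vietnamese dependency parsing

Vietnamese named-entity recognition

Vietnamese natural language inference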

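Use cases

Academic research

Vietnamese linguistic analysis

Studying Vietnamese grammar and syntactic structure

Supports part-of-speech tagging and dependency parsing

Commercial applications

Vietnamese text processing

Vietnamese customer-service systems, content analysis and similar commercial scenarios

Can improve the accuracy and efficiency of Vietnamese text processing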

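🚀 PhoBERT: Pre-trained language models for Vietnamese

The pre-trained PhoBERT models are state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", a noodle soup, is a popular food in Vietnam):

The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper: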

@inproceedings{phobert,
title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year      = {2020},
pages     = {1037--1042}
}

Please cite our paper when PhoBERT is used to help produce published results or is incorporated into other software.

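🚀 Quick start

This section describes how to use PhoBERT for Vietnamese natural language processing tasks.

✨ Main features

The first public large-scale monolingual language models pre-trained for Vietnamese.
A pre-training procedure based on RoBERTa, for more robust performance.
New state-of-the-art performance on several downstream Vietnamese NLP tasks.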

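📦 Installation

Installing for use with `transformers`

Install `transformers` with pip: `pip install transformers`, or install `transformers` from source. Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. Merging a fast tokenizer for PhoBERT is still under discussion, as mentioned in this pull request. Users who want the fast tokenizer can install `transformers` as follows: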

git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .

Install `tokenizers` with pip: `pip3 install tokenizers`

Installing for use with `fairseq`

Please see the details here.

Installing `VnCoreNLP` for word segmentation

pip install py_vncorenlp
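
After installation, one quick way to confirm which of these packages the current environment can see is to probe for them without importing anything. The following is a small standard-library sketch; `check_dependencies` is an illustrative helper, not part of any of these libraries, and the module names are assumed to match the pip package names used above (`torch` is included only because the usage example imports it).

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_dependencies(modules: Iterable[str]) -> Dict[str, bool]:
    """Report, for each top-level module name, whether Python can locate it."""
    return {name: find_spec(name) is not None for name in modules}


# Module names are assumptions based on the install commands in this section.
status = check_dependencies(["transformers", "tokenizers", "torch", "py_vncorenlp"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A `missing` entry only means the module is not visible to the interpreter running the check, which is often a sign that pip installed into a different environment.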

💻 Usage examples

Basic usage with the `transformers` library

import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds one vector per sub-word token

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")

Example of word segmentation with `VnCoreNLP`

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local machine folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load the word and sentence segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

output = rdrsegmenter.word_segment(text)

print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
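
In this output, spaces separate words and an underscore joins the syllables of a single multi-syllable word (for example `làm_việc`). A minimal sketch of reading that format back into a word list follows; `segmented_to_words` and `count_multisyllable` are illustrative helpers and not part of `py_vncorenlp`.

```python
from typing import List


def segmented_to_words(segmented_sentence: str) -> List[str]:
    """Split a word-segmented sentence into words, restoring spaces inside each word."""
    return [token.replace("_", " ") for token in segmented_sentence.split()]


def count_multisyllable(segmented_sentence: str) -> int:
    """Count the tokens that the segmenter joined from more than one syllable."""
    return sum("_" in token for token in segmented_sentence.split())


sentence = "Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội ."
words = segmented_to_words(sentence)
print(words)
print(count_multisyllable(sentence))  # 5 joined words in this sentence
```

The sketch assumes that every underscore in the string was produced by the segmenter; raw text that already contains underscores would need separate handling.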

📚 Documentation

Pre-trained models

Model	#params	Arch.	Max length	Pre-training data
`vinai/phobert-base`	135M	base	256	20GB of Wikipedia and News texts
`vinai/phobert-large`	370M	large	256	20GB of Wikipedia and News texts
`vinai/phobert-base-v2`	135M	base	256	20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301
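
The maximum length of 256 is counted in sub-word tokens, so longer inputs have to be truncated or split before they reach the model. The sketch below splits a sequence of token ids into windows that fit. It assumes that the limit includes one start and one end token per sequence (the RoBERTa convention), and the defaults `bos_id=0` and `eos_id=2` are assumptions that should be replaced with `tokenizer.bos_token_id` and `tokenizer.eos_token_id` in practice. `split_into_windows` is an illustrative helper, not a library function.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 256,
                       bos_id: int = 0, eos_id: int = 2) -> List[List[int]]:
    """Split ids (given WITHOUT special tokens) into windows of at most
    max_len ids, each wrapped in a start token and an end token."""
    body = max_len - 2  # room left after the start and end tokens
    if body <= 0:
        raise ValueError("max_len must be greater than 2")
    windows = []
    for start in range(0, len(token_ids), body):
        windows.append([bos_id] + token_ids[start:start + body] + [eos_id])
    return windows


ids = list(range(10, 610))  # 600 dummy sub-word ids standing in for a long text
windows = split_into_windows(ids)
print([len(w) for w in windows])  # [256, 256, 94]
```

Ids without special tokens can be obtained with `tokenizer.encode(sentence, add_special_tokens=False)`. When simply cutting off the tail is acceptable, `tokenizer(sentence, truncation=True, max_length=256)` is the shorter route.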

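🔧 Technical details

The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
The pre-training data was pre-processed with the RDRSegmenter from VnCoreNLP, which performs word and sentence segmentation. Part-of-speech tagging, dependency parsing, named-entity recognition and natural language inference are the downstream tasks on which PhoBERT was evaluated, not tasks performed by the segmenter.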

📄 License

Copyright (c) 2023 VinAI Research

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

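⚠️ Important note

If the input text is raw, i.e. not yet word-segmented, a word segmenter must be applied first, and only the word-segmented text should be fed to PhoBERT. Because PhoBERT's pre-training data was pre-processed with the RDRSegmenter from VnCoreNLP (word and sentence segmentation), it is recommended to use the same word segmenter on raw input texts in PhoBERT-based downstream applications.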
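
The recommended order, raw text first through the word segmenter and then through the PhoBERT tokenizer, can be sketched as a small function that receives both components as arguments. The name `raw_text_to_ids` and the two stand-in components are hypothetical and exist only so the sketch runs without downloads; in real use, `segment` would be `rdrsegmenter.word_segment` and `encode` would be `tokenizer.encode` from the usage examples.

```python
from typing import Callable, List


def raw_text_to_ids(raw_text: str,
                    segment: Callable[[str], List[str]],
                    encode: Callable[[str], List[int]]) -> List[List[int]]:
    """Word-segment the raw text first, then encode each segmented sentence."""
    segmented_sentences = segment(raw_text)  # one string per sentence
    return [encode(sentence) for sentence in segmented_sentences]


# Stand-ins for illustration only (NOT the real segmenter or tokenizer).
def fake_segment(text: str) -> List[str]:
    return [text.replace("làm việc", "làm_việc")]


def fake_encode(sentence: str) -> List[int]:
    n_words = len(sentence.split())
    return [0] + list(range(10, 10 + n_words)) + [2]


ids = raw_text_to_ids("Bà Lan làm việc tại đây", fake_segment, fake_encode)
print(ids)
```

Passing the components in as arguments keeps the ordering explicit and makes it hard to encode unsegmented text by accident.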