Envibert open-source bilingual model: free to deploy, efficient processing of Vietnamese and English content

Envibert

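Developed by nguyenvulebinh

envibert is a bilingual model based on the RoBERTa architecture. It supports Vietnamese and English text and is tailored for production environments.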

Large Language Model

Transformers

Other #Vietnamese-English bilingual #lightweight RoBERTa #text feature extraction

Downloads: 84

Released: 3/2/2022

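Model overview

The model was trained on 100GB of text (50GB Vietnamese and 50GB English) and has 70 million parameters. It is suited to natural language processing tasks.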

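Model features

Bilingual support

Handles both Vietnamese and English, which suits bilingual scenarios.

Production-oriented design

The architecture is tailored for deployment in production environments.

Compact parameter count

With only 70 million parameters, the model aims to stay efficient at run time while retaining performance.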

Model capabilities

Text encoding

Feature extraction

Vietnamese processing

English processing

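Use cases

Natural language processing

Named entity recognition

Can be used for Vietnamese named entity recognition tasks

Has been used in related research to improve Vietnamese named entity recognition performance

Text feature extraction

Extracts deep feature representations of text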
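For the named entity recognition use case listed above, the source does not describe the downstream pipeline. The sketch below assumes a token classifier on top of the encoder that emits BIO tags (a common labeling scheme, but an assumption here) and shows how such tags could be grouped into entity spans. The helper name `decode_bio`, the word-level tokens, and the tags are hypothetical and hand-written for illustration; they are not model output.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            # A stray I- tag with a different type is treated as a new entity.
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or any unknown tag) closes the open entity, if there is one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-written illustrative tags, not predictions from envibert.
tokens = ["Đại", "học", "Bách", "Khoa", "Hà", "Nội", "."]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
print(decode_bio(tokens, tags))
```

In a real pipeline the encoder works on sub-word pieces, so the tags would first need to be aligned back to words before a grouping step like this one.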

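🚀 RoBERTa for Vietnamese and English (envibert)

This RoBERTa version was trained on 100GB of text (50GB Vietnamese and 50GB English), hence the name envibert. The architecture is customized for production environments and has only 70 million parameters.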

🚀 Quick start

The model can be used as follows:

💻 Usage example

Basic usage

import os
import shutil
from importlib.machinery import SourceFileLoader

from huggingface_hub import hf_hub_download
from transformers import RobertaModel

cache_dir = './cache'
model_name = 'nguyenvulebinh/envibert'

def download_tokenizer_files():
  # hf_hub_download stands in for cached_path/hf_bucket_url, which newer
  # transformers releases no longer provide in transformers.file_utils.
  os.makedirs(cache_dir, exist_ok=True)
  resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
  for item in resources:
    target = os.path.join(cache_dir, item)
    if not os.path.exists(target):
      # Download into the Hub cache, then copy the file next to the other resources.
      tmp_file = hf_hub_download(repo_id=model_name, filename=item, cache_dir=cache_dir)
      shutil.copy(tmp_file, target)

download_tokenizer_files()
# The repository ships its own tokenizer class; load it from the downloaded source file.
tokenizer = SourceFileLoader("envibert.tokenizer", os.path.join(cache_dir, 'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir)

# Encode text
text_input = 'Đại học Bách Khoa Hà Nội .'
text_ids = tokenizer(text_input, return_tensors='pt').input_ids
# tensor([[   0,  705,  131, 8751, 2878,  347,  477,    5,    2]])

# Extract features (request the per-layer hidden states explicitly)
text_features = model(text_ids, output_hidden_states=True)
text_features['last_hidden_state'].shape
# torch.Size([1, 9, 768])
len(text_features['hidden_states'])
# 7 (the embedding output plus one entry per transformer layer)
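
The usage example yields one 768-dimensional vector per token. The source does not say how to turn these into a single sentence vector; one common option (an assumption here, not the author's stated method) is masked mean pooling, which averages the token vectors while skipping padding positions. The sketch below uses plain Python lists, such as those produced by calling `.tolist()` on a tensor, so that it runs without any dependencies. The helper name `mean_pool` and the toy values are hypothetical.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero."""
    if len(token_vectors) != len(mask):
        raise ValueError("token_vectors and mask must have the same length")

    # Keep only the real tokens; padding positions carry mask value 0.
    kept = [vector for vector, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")

    dim = len(kept[0])
    return [sum(vector[i] for vector in kept) / len(kept) for i in range(dim)]


# Toy stand-in for text_features['last_hidden_state'][0].tolist(),
# shrunk to 4 tokens x 2 dimensions; the last position is padding.
toy_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
toy_mask = [1, 1, 1, 0]
print(mean_pool(toy_states, toy_mask))  # [3.0, 4.0]
```

With real model output the same averaging would normally be done with tensor operations; the list version only makes the arithmetic explicit.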

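Citation

If you use the contents of this repository to help produce published research, or integrate it into other software, please cite the following paper: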

@inproceedings{nguyen20d_interspeech,
  author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
  title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4263--4267},
  doi={10.21437/Interspeech.2020-1896}
}