Envibert: an open-source bilingual model for Vietnamese and English text, free to deploy

Envibert

Developed by nguyenvulebinh

envibert is a bilingual model based on the RoBERTa architecture. It supports Vietnamese and English and is designed for production use.

Large language model

Transformers

Tags: Vietnamese-English bilingual, lightweight RoBERTa, text feature extraction

Downloads: 84

Released: 3/2/2022

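Model overview

The model was trained on 100 GB of text (50 GB Vietnamese and 50 GB English) and has 70 million parameters. It is suited to natural language processing tasks.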

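Model features

Bilingual support

Handles both Vietnamese and English, which suits bilingual scenarios.

Production-oriented design

The architecture is tailored for deployment in production environments.

Compact parameter count

With only 70 million parameters, the model is designed to run efficiently while maintaining performance.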
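The 70-million-parameter figure can be checked once the model is loaded. The helper below is a minimal sketch: count_parameters is a hypothetical name that is not part of the model's repository, and it assumes a PyTorch-style object whose parameters() method yields tensors exposing numel() and requires_grad.

```python
def count_parameters(model, trainable_only=False):
    """Return the number of scalar parameters in a PyTorch-style model.

    Hypothetical helper: `model` is assumed to expose `parameters()`,
    yielding objects with `numel()` and `requires_grad`.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )
```

Calling count_parameters(model) on the RobertaModel loaded in the usage example should give a value close to 70 million if the stated size is accurate.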

Model capabilities

Text encoding

Feature extraction

Vietnamese processing

English processing
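Feature extraction yields one vector per token. When a single vector per sentence is needed, one common option is masked mean pooling; this is a generic technique, not something the model card prescribes. The sketch below uses plain Python lists so that it runs without dependencies. mean_pool is a hypothetical helper, and token_vectors is assumed to hold the token vectors of one sentence (for example, one row of last_hidden_state converted to nested lists).

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is truthy.

    token_vectors: list of equal-length lists of floats (one per token).
    attention_mask: list of 0/1 flags, one per token (0 marks padding).
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Average each dimension over the kept tokens only.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

With real model output the same averaging would normally be done on tensors; the list version is only meant to make the arithmetic explicit.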

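Use cases

Natural language processing

Named entity recognition

Can be used for Vietnamese named entity recognition tasks.

Has been used in related research to improve Vietnamese named entity recognition performance.

Text feature extraction

Extracts deep feature representations of text.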
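For the named entity recognition use case, a token-classification model typically emits one tag per token, and those tags then have to be grouped into entities. The sketch below assumes the common BIO tagging scheme and hypothetical labels such as ORG; neither is confirmed by the source, and bio_to_spans is an illustrative helper rather than part of envibert.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (label, start, end, text) spans; end is exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        # Open a span on "B-", or on a stray "I-" with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical predictions for the sentence used in the usage example.
example_tokens = "Đại học Bách Khoa Hà Nội .".split()
example_tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
example_spans = bio_to_spans(example_tokens, example_tags)
```

Here example_spans holds a single ORG span covering the first six tokens. Note that the model's tokenizer works on subword pieces, so real predictions would first need to be aligned back to words.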

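🚀 RoBERTa for Vietnamese and English (envibert)

This RoBERTa variant was trained on 100 GB of text (50 GB Vietnamese and 50 GB English), hence the name envibert. The architecture is tailored for production use and has only 70 million parameters.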

🚀 Quick start

The model is used as follows:

💻 Usage example

Basic usage

# Requires a transformers release that still ships cached_path and hf_bucket_url
# (both were removed from later releases).
from transformers import RobertaModel
from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
import os

cache_dir='./cache'
model_name='nguyenvulebinh/envibert'

def download_tokenizer_files():
  # Fetch the custom tokenizer module and its vocabulary files into cache_dir,
  # skipping files that are already present.
  resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
  for item in resources:
    if not os.path.exists(os.path.join(cache_dir, item)):
      tmp_file = hf_bucket_url(model_name, filename=item)
      tmp_file = cached_path(tmp_file,cache_dir=cache_dir)
      os.rename(tmp_file, os.path.join(cache_dir, item))

download_tokenizer_files()
# Load the tokenizer class from the downloaded module and point it at cache_dir.
tokenizer = SourceFileLoader("envibert.tokenizer", os.path.join(cache_dir,'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
model = RobertaModel.from_pretrained(model_name,cache_dir=cache_dir)

# Encode text
text_input = 'Đại học Bách Khoa Hà Nội .'
text_ids = tokenizer(text_input, return_tensors='pt').input_ids
# tensor([[   0,  705,  131, 8751, 2878,  347,  477,    5,    2]])

# Extract features
text_features = model(text_ids)
text_features['last_hidden_state'].shape
# torch.Size([1, 9, 768])
len(text_features['hidden_states'])
# 7 (the embedding output plus one entry per encoder layer)
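For newer transformers releases that no longer provide cached_path and hf_bucket_url, the tokenizer files can instead be fetched with hf_hub_download from the huggingface_hub package. This is a hedged sketch, not the author's method: it assumes huggingface_hub is installed and that the three files are still at the root of the model repository. missing_files is a hypothetical helper, and the download function is defined but not called here.

```python
import os
import shutil

MODEL_NAME = "nguyenvulebinh/envibert"
TOKENIZER_FILES = ["envibert_tokenizer.py", "dict.txt", "sentencepiece.bpe.model"]


def missing_files(cache_dir, filenames=TOKENIZER_FILES):
    """Return the tokenizer files that are not yet present in cache_dir."""
    return [
        name for name in filenames
        if not os.path.exists(os.path.join(cache_dir, name))
    ]


def download_tokenizer_files(cache_dir="./cache"):
    """Copy any missing tokenizer files from the Hugging Face Hub into cache_dir."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    os.makedirs(cache_dir, exist_ok=True)
    for name in missing_files(cache_dir):
        # hf_hub_download returns the local path of the downloaded file.
        local_path = hf_hub_download(repo_id=MODEL_NAME, filename=name)
        shutil.copy(local_path, os.path.join(cache_dir, name))
```

After download_tokenizer_files() has run, the tokenizer and model can be loaded as in the basic usage above.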

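Citation

If you use the contents of this repository to help produce published research, or integrate it into other software, please cite the following: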

@inproceedings{nguyen20d_interspeech,
  author={Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
  title={{Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4263--4267},
  doi={10.21437/Interspeech.2020-1896}
}