🚀 HoogBERTa
This repository contains the Thai pretrained language representation model (HoogBERTa_base) fine-tuned for the named-entity recognition (NER) task.
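🚀 Quick Start
Prerequisites
Because the model uses subword-nmt BPE encoding, input must be pre-tokenized according to the BEST standard before it is fed into HoogBERTa. The examples in this document use attacut for that step: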
pip install attacut
Initializing the model
To initialize the model from the model hub, use the following commands:
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch
tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
Running NER tagging
To tag named entities, use the following commands:
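from transformers import pipeline
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
all_sent = []
# Split on spaces, then word-tokenize each chunk with attacut.
sentences = sentence.split(" ")
for sent in sentences:
    # Escape literal underscores, because "_" is used below to mark original spaces.
    all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
# Rejoin the chunks, marking each original space with " _ ".
sentence = " _ ".join(all_sent)
print(nlp(sentence))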
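With `aggregation_strategy="none"`, the pipeline returns one dictionary per token, with keys such as `entity` and `word`. The sketch below shows one way to group those per-token predictions into entity spans. It assumes that labels follow a `B_`/`I_`/`E_` prefix scheme (with `_` or `-` as separator) and that everything else is an outside tag; check `model.config.id2label` for the label names the model actually uses. The helper names and the sample tokens are hypothetical, and the sample omits the other keys the pipeline returns.

```python
from typing import Dict, List, Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split 'B_LOC' or 'B-LOC' into ('B', 'LOC'); any other label counts as outside."""
    if len(label) > 2 and label[0] in "BIE" and label[1] in "_-":
        return label[0], label[2:]
    return "O", ""


def group_entities(tokens: List[Dict]) -> List[Dict]:
    """Merge consecutive tokens of the same entity type into spans."""
    spans: List[Dict] = []
    current = None  # span being built, or None when outside an entity
    for tok in tokens:
        prefix, etype = split_label(tok["entity"])
        if prefix == "O":
            current = None
            continue
        # Start a new span on B, after an outside token, or when the type changes.
        if prefix == "B" or current is None or current["type"] != etype:
            current = {"type": etype, "words": [tok["word"]]}
            spans.append(current)
        else:
            current["words"].append(tok["word"])
        if prefix == "E":
            current = None  # an explicit end tag closes the span
    return spans


# Hypothetical, simplified pipeline output (assumed label names).
sample = [
    {"entity": "B_DTM", "word": "12"},
    {"entity": "E_DTM", "word": "มีนาคม"},
    {"entity": "O", "word": "ฉัน"},
    {"entity": "B_LOC", "word": "กรุงเทพ"},
]
for span in group_entities(sample):
    print(span["type"], span["words"])
```

On real output, call `group_entities(nlp(sentence))`. The words in each span are the model's subword tokens, so they may still need to be merged back into surface text.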
Batch processing
from transformers import pipeline
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
inputList = []
for sentX in sentenceL:
    # Pre-tokenize each input exactly as in the single-sentence example.
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)
# Passing a list returns one list of token predictions per input.
print(nlp(inputList))
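The single-sentence and batch examples repeat the same pre-tokenization steps: split on spaces, word-tokenize each chunk, escape literal underscores as `[!und:]`, and rejoin the chunks with ` _ ` as the space marker. A minimal sketch of a reusable helper follows. The names `pretokenize`, `pretokenize_batch`, and `fake_tokenize` are hypothetical and not part of HoogBERTa or attacut. The word tokenizer is passed in as a parameter so the snippet runs without attacut installed; the stand-in tokenizer exists only for demonstration.

```python
from typing import Callable, List

SPACE_MARK = "_"                # stands for an original space, as in the examples above
UNDERSCORE_ESCAPE = "[!und:]"   # replaces literal underscores, as in the examples above


def pretokenize(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Apply the pre-tokenization used in the examples above to one string."""
    chunks = []
    for chunk in text.split(" "):
        words = " ".join(word_tokenize(chunk))
        chunks.append(words.replace("_", UNDERSCORE_ESCAPE))
    return f" {SPACE_MARK} ".join(chunks)


def pretokenize_batch(texts: List[str], word_tokenize: Callable[[str], List[str]]) -> List[str]:
    """Apply pretokenize to every string in a list."""
    return [pretokenize(text, word_tokenize) for text in texts]


def fake_tokenize(chunk: str) -> List[str]:
    """Demonstration-only stand-in: cuts a chunk into two-character pieces."""
    return [chunk[i:i + 2] for i in range(0, len(chunk), 2)]


print(pretokenize("ab_cd ef", fake_tokenize))
# With the real tokenizer (assumes attacut is installed and nlp is the pipeline above):
#   from attacut import tokenize
#   print(nlp(pretokenize_batch(sentenceL, tokenize)))
```

Keeping the space marker and the underscore escape as named constants makes it harder for the single-sentence and batch paths to drift apart.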
📚 Documentation
Huggingface Models
HoogBERTaEncoder
- [HoogBERTa](https://huggingface.co/lst-nectec/HoogBERTa): feature extraction and masked language modeling
HoogBERTaMuliTaskTagger
- [HoogBERTa-NER-lst20](https://huggingface.co/lst-nectec/HoogBERTa-NER-lst20): named-entity recognition (NER) based on the LST20 dataset
- [HoogBERTa-POS-lst20](https://huggingface.co/lst-nectec/HoogBERTa-POS-lst20): part-of-speech (POS) tagging based on the LST20 dataset
- [HoogBERTa-SENTENCE-lst20](https://huggingface.co/lst-nectec/HoogBERTa-SENTENCE-lst20): clause boundary classification based on the LST20 dataset
Citation
Please cite as:
@inproceedings{porkaew2021hoogberta,
title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
year = {2021},
address={Online}
}
Download the full-text PDF
Check out the code on Github