# 🚀 HoogBERTa
This repository contains the Thai pretrained language representation (HoogBERTa_base) fine-tuned for the Named-Entity Recognition (NER) task. It provides a powerful tool for identifying named entities in Thai text, improving the accuracy and efficiency of Thai information extraction.
## 🚀 Quick Start

### Prerequisite
Since we use subword-nmt BPE encoding, the input needs to be pre-tokenized using the BEST standard before being passed to HoogBERTa. You can install the necessary library with the following command:

```bash
pip install attacut
```
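For reference, the snippet below is a minimal sketch of what this pre-tokenization step produces (the exact word boundaries are decided by AttaCut and may differ from the example in the comment):

```python
# Minimal sketch: AttaCut splits raw Thai text into word tokens, which are then
# joined with spaces before being passed to the HoogBERTa tokenizer.
from attacut import tokenize

text = "ฉันจะไปเที่ยววัดพระแก้ว"
tokens = tokenize(text)          # a list of Thai word tokens, e.g. ['ฉัน', 'จะ', 'ไป', ...]
pretokenized = " ".join(tokens)  # space-delimited string used as model input
print(pretokenized)
```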
### Initializing the Model
To initialize the model from the hub, use the following commands:
```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch

tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
```
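Optionally, you can move the model to a GPU and switch it to evaluation mode before building the pipeline; this is a generic PyTorch sketch, not something the model requires:

```python
# Optional: use a GPU if one is available and disable dropout for inference.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
```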
### Performing NER Tagging

To perform NER tagging, use the following code:
```python
from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Pre-tokenize each space-separated chunk with attacut and mask literal
# underscores with the [!und:] placeholder, then re-join the chunks with " _ ".
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)

print(nlp(sentence))
```
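With `aggregation_strategy="none"`, the pipeline returns one dictionary per subword token (with `word`, `entity`, `score`, `start`, and `end` keys). The sketch below simply prints each predicted tag and restores literal underscores that were masked with the `[!und:]` placeholder:

```python
# Print each predicted subword with its NER tag and confidence score,
# undoing the [!und:] placeholder used to protect literal underscores.
for token in nlp(sentence):
    word = token["word"].replace("[!und:]", "_")
    print(f"{word}\t{token['entity']}\t{token['score']:.3f}")
```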
### Batch Processing
For batch processing, use the following code:
```python
from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

print(nlp(inputList))
```
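The pipeline returns one list of token-level predictions per input string, in the same order as `sentenceL`. A small sketch for pairing sentences with their tags:

```python
# Pair each original sentence with its token-level predictions.
results = nlp(inputList)
for original, predictions in zip(sentenceL, results):
    tags = [(t["word"].replace("[!und:]", "_"), t["entity"]) for t in predictions]
    print(original, tags)
```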
## 📚 Documentation

### Model Information
| Property | Details |
|----------|---------|
| Datasets | lst20 |
| Language | th |
| Library Name | transformers |
### Huggingface Models

**HoogBERTaEncoder**

- [HoogBERTa](https://huggingface.co/lst-nectec/HoogBERTa): Feature Extraction and Mask Language Modeling

**HoogBERTaMuliTaskTagger** (loaded as shown in the sketch after this list):

- [HoogBERTa-NER-lst20](https://huggingface.co/lst-nectec/HoogBERTa-NER-lst20): Named-entity recognition (NER) based on LST20
- [HoogBERTa-POS-lst20](https://huggingface.co/lst-nectec/HoogBERTa-POS-lst20): Part-of-speech tagging (POS) based on LST20
- [HoogBERTa-SENTENCE-lst20](https://huggingface.co/lst-nectec/HoogBERTa-SENTENCE-lst20): Clause Boundary Classification based on LST20
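The tagger checkpoints listed above can be loaded in the same way as the NER model shown earlier; the sketch below assumes they expose the same `RobertaForTokenClassification` interface and require the same AttaCut pre-tokenization.

```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification

# Part-of-speech tagger trained on LST20 (assumed to use the same token-classification head).
pos_tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-POS-lst20")
pos_model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-POS-lst20")
```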
## 📄 License

## Citation
Please cite as:
```bibtex
@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
```
Download full-text PDF
Check out the code on Github