# 🚀 HoogBERTa
This repository contains the Thai pretrained language representation (HoogBERTa_base) fine-tuned for the Named-Entity Recognition (NER) task. It provides a powerful tool for identifying named entities in Thai text, improving the accuracy and efficiency of Thai information extraction.
## 🚀 Quick Start

### Prerequisite
Since we use subword-nmt BPE encoding, the input needs to be pre-tokenized using the BEST standard before being passed to HoogBERTa. You can install the necessary library with the following command:

```bash
pip install attacut
```
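For reference, the snippet below is a minimal sketch of what this pre-tokenization step produces (the exact word boundaries are decided by AttaCut and may differ from the example in the comment):

```python
# Minimal sketch: AttaCut splits raw Thai text into word tokens, which are then
# joined with spaces before being passed to the HoogBERTa tokenizer.
from attacut import tokenize

text = "ฉันจะไปเที่ยววัดพระแก้ว"
tokens = tokenize(text)          # a list of Thai word tokens, e.g. ['ฉัน', 'จะ', 'ไป', ...]
pretokenized = " ".join(tokens)  # space-delimited string used as model input
print(pretokenized)
```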
### Initializing the Model
To initialize the model from the hub, use the following commands:
```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch

tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
```
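Optionally, you can move the model to a GPU and switch it to evaluation mode before building the pipeline; this is a generic PyTorch sketch, not something the model requires:

```python
# Optional: use a GPU if one is available and disable dropout for inference.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
```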
### Performing NER Tagging

To perform NER tagging, use the following code:
```python
from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Pre-tokenize each space-separated chunk with attacut and mask literal
# underscores with the [!und:] placeholder, then re-join the chunks with " _ ".
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)

print(nlp(sentence))
```
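With `aggregation_strategy="none"`, the pipeline returns one dictionary per subword token (with `word`, `entity`, `score`, `start`, and `end` keys). The sketch below simply prints each predicted tag and restores literal underscores that were masked with the `[!und:]` placeholder:

```python
# Print each predicted subword with its NER tag and confidence score,
# undoing the [!und:] placeholder used to protect literal underscores.
for token in nlp(sentence):
    word = token["word"].replace("[!und:]", "_")
    print(f"{word}\t{token['entity']}\t{token['score']:.3f}")
```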
### Batch Processing
For batch processing, use the following code:
```python
from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

print(nlp(inputList))
```
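The pipeline returns one list of token-level predictions per input string, in the same order as `sentenceL`. A small sketch for pairing sentences with their tags:

```python
# Pair each original sentence with its token-level predictions.
results = nlp(inputList)
for original, predictions in zip(sentenceL, results):
    tags = [(t["word"].replace("[!und:]", "_"), t["entity"]) for t in predictions]
    print(original, tags)
```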
## 📚 Documentation

### Model Information
| Property | Details |
|----------|---------|
| Datasets | lst20 |
| Language | th |
| Library Name | transformers |
### Huggingface Models

**HoogBERTaEncoder**

- [HoogBERTa](https://huggingface.co/lst-nectec/HoogBERTa): Feature Extraction and Mask Language Modeling

**HoogBERTaMuliTaskTagger** (loaded as shown in the sketch after this list):

- [HoogBERTa-NER-lst20](https://huggingface.co/lst-nectec/HoogBERTa-NER-lst20): Named-entity recognition (NER) based on LST20
- [HoogBERTa-POS-lst20](https://huggingface.co/lst-nectec/HoogBERTa-POS-lst20): Part-of-speech tagging (POS) based on LST20
- [HoogBERTa-SENTENCE-lst20](https://huggingface.co/lst-nectec/HoogBERTa-SENTENCE-lst20): Clause Boundary Classification based on LST20
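The tagger checkpoints listed above can be loaded in the same way as the NER model shown earlier; the sketch below assumes they expose the same `RobertaForTokenClassification` interface and require the same AttaCut pre-tokenization.

```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification

# Part-of-speech tagger trained on LST20 (assumed to use the same token-classification head).
pos_tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-POS-lst20")
pos_model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-POS-lst20")
```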
## 📄 License

## Citation
Please cite as:
```bibtex
@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
```
Download full-text PDF
Check out the code on Github