vietnamese-accent-marker-xlm-roberta: an open-source model that adds accent marks to Vietnamese text with high accuracy


Vietnamese Accent Marker Xlm Roberta

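Developed by peterhung

This model automatically adds accent marks to Vietnamese text written without them. It is fine-tuned from XLM-Roberta Large and reaches 97% accuracy.

Token classification

Transformers

License: Apache-2.0 #Vietnamese accent marking #97% accuracy #Fine-tuned from XLM-Roberta

Downloads: 188

Released: 3/2/2022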

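Model overview

This model is built specifically for Vietnamese text. It automatically adds the correct accent marks (diacritics) to Vietnamese words that are unaccented or only partially accented.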

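Model features

High accuracy

The model reaches 97% accuracy, compared with 91% for the earlier HMM-based approach.

Transformer-based architecture

The model is fine-tuned from XLM-Roberta Large, so each prediction can draw on the context of the whole sentence.

Token-level classification

Accent insertion is framed as a token classification task: every token receives a tag that describes how to accent it.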

Model capabilities

Accent marking for Vietnamese text

Automatic insertion of diacritics

Handling of partially accented text

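Use cases

Text processing

Vietnamese text normalization

Automatically adds the correct accent marks to unaccented Vietnamese text.

Converts 'Nhin nhung mua thu di' to 'Nhìn những mùa thu đi'.

Vietnamese learning aid

Helps learners understand and use Vietnamese accent marks correctly.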

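🚀 A Transformer model for inserting Vietnamese accent marks

This model inserts accent marks (diacritics) into Vietnamese text that has none, or in which some words are accented and others are not. For example, given the input "Nhin nhung mua thu di", the target output is "Nhìn những mùa thu đi".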
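To try the model on your own sentences, it can help to produce unaccented input from text that is already accented. The following is a minimal sketch of one way to do that with the standard library; the helper name `strip_vietnamese_accents` is hypothetical and is not part of the model or its repository.

```python
import unicodedata


def strip_vietnamese_accents(text: str) -> str:
    """Remove Vietnamese diacritics (hypothetical helper for building test input)."""
    # 'đ' and 'Đ' have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_vietnamese_accents("Nhìn những mùa thu đi"))  # Nhin nhung mua thu di
```

Comparing the model's output on the stripped text with the original sentence gives a quick, informal check of its predictions.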

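🚀 Quick start

The model treats accent insertion as a token classification problem: it assigns each input token a "tag" that converts the token into its accented form.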

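✨ Key features

High accuracy: the model reaches 97% accuracy, compared with 91% for the HMM version.
Transformer-based architecture: the model is fine-tuned from XLM-Roberta Large, which gives it stronger language understanding.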

📦 Installation

Before using the model, install the transformers, torch, and numpy packages.

💻 Usage examples

Basic usage

The steps below show how to use the model, with example code for each step.

Step 1: Load the model

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def load_trained_transformer_model():
    model_path = "peterhung/vietnamese-accent-marker-xlm-roberta"
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    return model, tokenizer

model, tokenizer = load_trained_transformer_model()

Step 2: Run the input text through the model

# use the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# switch to eval mode (disables dropout)
model.eval()

def insert_accents(text, model, tokenizer):
    our_tokens = text.strip().split()

    # the tokenizer may further split our tokens
    inputs = tokenizer(our_tokens,
                        is_split_into_words=True,
                        truncation=True,
                        padding=True,
                        return_tensors="pt"
                        )
    input_ids = inputs['input_ids']
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    tokens = tokens[1:-1]

    with torch.no_grad():
        # move the input tensors to the same device as the model
        inputs = inputs.to(device)
        outputs = model(**inputs)

    predictions = outputs["logits"].cpu().numpy()
    predictions = np.argmax(predictions, axis=2)

    # exclude output at index 0 and the last index, which correspond to '<s>' and '</s>'
    predictions = predictions[0][1:-1]

    assert len(tokens) == len(predictions)

    return tokens, predictions 


text = "Nhin nhung mua thu di, em nghe sau len trong nang."
tokens, predictions = insert_accents(text, model, tokenizer)

Step 3: Get the accented words

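def _load_tags_set(fpath):
    labels = []
    # the tag names contain Vietnamese characters, so read the file as UTF-8
    with open(fpath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                labels.append(line)

    return labels

# selected_tags_names.txt must be available in the working directory
label_list = _load_tags_set("./selected_tags_names.txt")
assert len(label_list) == 528, f"Expected 528 tags, got {len(label_list)}"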

print(tokens)
print(list(f"{pred} ({label_list[pred]})" for pred in predictions))

['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em', '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.']
['217 (i-ì)', '217 (i-ì)', '388 (u-ữ)', '388 (u-ữ)', '407 (ua-ùa)', '378 (u-u)', '120 (di-đi)', '0 (-)', '185 (e-e)', '185 (e-e)', '41 (au-âu)', '188 (e-ê)', '302 (o-o)', '14 (a-ắ)', '0 (-)']

TOKENIZER_WORD_PREFIX = "▁"

def merge_tokens_and_preds(tokens, predictions):
    """Merge sub-word tokens back into words and collect each word's label indexes."""
    merged_tokens_preds = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # cast to int so the sets hold plain Python ints, not NumPy scalars
        label_indexes = {int(predictions[i])}
        if tok.startswith(TOKENIZER_WORD_PREFIX):  # start a new word
            tok_no_prefix = tok[len(TOKENIZER_WORD_PREFIX):]
            cur_word_toks = [tok_no_prefix]
            # check if subsequent toks are part of this word
            j = i + 1
            while j < len(tokens):
                if not tokens[j].startswith(TOKENIZER_WORD_PREFIX):
                    cur_word_toks.append(tokens[j])
                    label_indexes.add(int(predictions[j]))
                    j += 1
                else:
                    break
            cur_word = ''.join(cur_word_toks)
            merged_tokens_preds.append((cur_word, label_indexes))
            i = j
        else:
            merged_tokens_preds.append((tok, label_indexes))
            i += 1

    return merged_tokens_preds


merged_tokens_preds = merge_tokens_and_preds(tokens, predictions)
print(merged_tokens_preds)

[('Nhin', {217}), ('nhung', {388}), ('mua', {407}), ('thu', {378}), ('di,', {120, 0}), ('em', {185}), ('nghe', {185}), ('sau', {41}), ('len', {188}), ('trong', {302}), ('nang.', {0, 14})]

def get_accented_words(merged_tokens_preds, label_list):
    accented_words = []
    for word_raw, label_indexes in merged_tokens_preds:
        # use the first label that changes word_raw
        for label_index in label_indexes:
            tag_name = label_list[int(label_index)]
            raw, vowel = tag_name.split("-")
            if raw and raw in word_raw:
                word_accented = word_raw.replace(raw, vowel)
                break
        else:
            word_accented = word_raw

        accented_words.append(word_accented)

    return accented_words


accented_words = get_accented_words(merged_tokens_preds, label_list)
print(accented_words)

['Nhìn', 'những', 'mùa', 'thu', 'đi,', 'em', 'nghe', 'sâu', 'lên', 'trong', 'nắng.']
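The list above can be turned back into a sentence by joining the words with spaces. This is a small sketch that reuses the printed output as its input; it relies on the fact that the text was split on whitespace in Step 2, so punctuation stays attached to its word.

```python
# Output of get_accented_words, copied from the example above.
accented_words = ['Nhìn', 'những', 'mùa', 'thu', 'đi,', 'em', 'nghe',
                  'sâu', 'lên', 'trong', 'nắng.']

# The input was split on whitespace, so a single space restores the sentence.
sentence = " ".join(accented_words)
print(sentence)  # Nhìn những mùa thu đi, em nghe sâu lên trong nắng.
```

Joining with a single space does not restore any extra spaces, tabs, or line breaks that were in the original input.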


📚 Documentation

This model is fine-tuned from XLM-Roberta Large. For details of the training process, see the author's blog post.

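🔧 Technical details

The problem is modeled as token classification: for each input token, the goal is to assign a "tag" that converts the token into its accented form.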
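As an illustration of what a tag does, the sketch below applies a single tag to a single word. It assumes the tag names follow the `raw-accented` pattern seen in the example output (for instance `ua-ùa`, `di-đi`, and `-` for "no change"); the helper name `apply_tag` is hypothetical.

```python
def apply_tag(word: str, tag: str) -> str:
    """Apply one 'raw-accented' tag to a word (hypothetical helper)."""
    raw, accented = tag.split("-")
    # An empty 'raw' part (the '-' tag) means the word is left unchanged.
    if raw and raw in word:
        return word.replace(raw, accented)
    return word


print(apply_tag("mua", "ua-ùa"))  # mùa
print(apply_tag("di", "di-đi"))   # đi
print(apply_tag(",", "-"))        # ,
```

Under this scheme the classifier only has to choose among a fixed set of tags (528 in the label file used in Step 3) rather than generate accented text directly.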

📄 License

This model is released under the Apache-2.0 license.

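⚠️ Important notes

The model accepts at most 512 tokens, a limit inherited from the pretrained XLM-Roberta base model.
Compared with the HMM version (91%), this model is more accurate (97%) but may take longer to run. More information is available at the link given on the original model card.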
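One way to stay within the 512-token limit is to split long input into smaller groups of words and process each group separately. The following is a minimal sketch, not part of the original model card: the function `chunk_words` and its `count_tokens` parameter are hypothetical, and the default budget of 510 assumes two positions are reserved for the `<s>` and `</s>` tokens.

```python
from typing import Callable, List


def chunk_words(
    words: List[str],
    max_tokens: int = 510,
    count_tokens: Callable[[str], int] = lambda word: 1,
) -> List[List[str]]:
    """Greedily group words so that each group stays within max_tokens.

    count_tokens estimates how many sub-word tokens a word produces. The
    default of 1 per word is only a placeholder; a tokenizer-based count
    such as len(tokenizer.tokenize(word)) would be more realistic.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for word in words:
        cost = count_tokens(word)
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(current)
    return chunks


print(chunk_words("a b c d e f g".split(), max_tokens=3))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

Each chunk can then be joined with spaces and passed to `insert_accents` on its own. Splitting at sentence boundaries instead would probably preserve more context for the model.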

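💡 Usage tip

You can use the Inference API widget on the right-hand side of the model's Hugging Face page (provided automatically by HF) to see the tag (index) assigned to each word.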
