deberta-v3-large-lemon-spell_5k: Open-Source English Grammar Error Correction Model

Deberta V3 Large Lemon Spell 5k

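Developed by manred1997

An English grammar error correction model fine-tuned from DeBERTa-v3-large, focused on detecting and correcting common grammatical errors

Sequence labeling

Transformers

#English grammar error correction #Educational tools #High-precision GEC

Downloads: 15

Released: 10/24/2024

Model Overview

This model is a grammar error correction (GEC) system fine-tuned from microsoft/deberta-v3-large. It detects and corrects grammatical errors in English text, such as errors in verb tense, noun inflection, and adjective usage.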

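Model Features

Multi-stage training

Trained in three stages, each using datasets of different difficulty and origin

Dual task heads

Combines an error detection head with a token classification head to improve grammatical error identification

Optimized for general English

Well suited to language learners and to applications that need better grammatical accuracy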
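
The dual-head design can be illustrated with the way detection targets are derived from correction labels. The sketch below mirrors, in plain Python, the label mapping used in the `forward` method of the quick-start code on this page. The function name `detection_labels` is hypothetical, and reading label id 0 as the "no edit" tag is an assumption based on how that code treats it.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions skipped by the loss (padding), as in the model code


def detection_labels(labels: List[int], unk_tag_idx: Optional[int] = None) -> List[int]:
    """Map correction-tag ids to detection classes.

    Label id 0 maps to class 0 (assumed to mean "no edit"), any other tag maps
    to class 1, and the unknown tag maps to class 2 when unk_tag_idx is given.
    Ignored positions (-100) stay ignored.
    """
    out = []
    for label in labels:
        if label == IGNORE_INDEX:
            out.append(IGNORE_INDEX)
        elif unk_tag_idx is not None and label == unk_tag_idx:
            out.append(2)
        elif label == 0:
            out.append(0)
        else:
            out.append(1)
    return out


# Hypothetical label ids: 0 = no edit, 5 = some edit tag, 7 = unknown tag.
print(detection_labels([0, 5, -100, 7], unk_tag_idx=7))
```

Because the detection targets are computed from the correction labels, the detection head needs no separate annotation. In the model code the two cross-entropy losses are simply added.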

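Model Capabilities

English grammatical error detection

English grammatical error correction

Verb tense correction

Noun inflection correction

Adjective usage correction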

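Use Cases

Educational technology

Writing assistants

Integrate into writing software to provide real-time grammar checking

Improve the writing accuracy of non-native speakers

Language learning applications

Help English learners identify and correct grammatical errors

Speed up language learning

Professional tools

Professional document proofreading

Check grammar in formal text such as business emails and academic papers

Make documents read more professionally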

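🚀 Grammar Error Correction Model

This model is a grammar error correction (GEC) system fine-tuned from microsoft/deberta-v3-large that detects and corrects grammatical errors in English text. It focuses on common errors such as verb tense, noun inflection, and adjective usage, which makes it useful for language learners and for applications that need better grammatical accuracy.

🚀 Quick Start

The following example code defines the model class and loads the model: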

from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import ModelOutput
from transformers.models.deberta_v2.modeling_deberta_v2 import (
    DebertaV2Model,
    DebertaV2PreTrainedModel,
)


@dataclass
class XGECToROutput(ModelOutput):
    """
    Output type of `XGECToRForTokenClassification.forward()`.
    loss (`torch.FloatTensor`, optional)
    logits_correction (`torch.FloatTensor`) : The correction logits for each token.
    logits_detection (`torch.FloatTensor`) : The detection logits for each token.
    hidden_states (`Tuple[torch.FloatTensor]`, optional)
    attentions (`Tuple[torch.FloatTensor]`, optional)
    """

    loss: Optional[torch.FloatTensor] = None
    logits_correction: torch.FloatTensor = None
    logits_detection: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None


class XGECToRDebertaV3(DebertaV2PreTrainedModel):
    """
    This class overrides the GECToR model to include an error detection head in addition to the token classification head.
    """

    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.unk_tag_idx = config.label2id.get("@@UNKNOWN@@", None)

        self.deberta = DebertaV2Model(config)

        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        if self.unk_tag_idx is not None:
            self.error_detector = nn.Linear(config.hidden_size, 3)
        else:
            self.error_detector = nn.Linear(config.hidden_size, 2)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.deberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]

        logits_correction = self.classifier(sequence_output)
        logits_detection = self.error_detector(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(
                logits_correction.view(-1, self.num_labels), labels.view(-1)
            )

            labels_detection = torch.ones_like(labels)
            labels_detection[labels == 0] = 0
            labels_detection[labels == -100] = -100  # ignore padding
            if self.unk_tag_idx is not None:
                labels_detection[labels == self.unk_tag_idx] = 2
                loss_detection = loss_fct(
                    logits_detection.view(-1, 3), labels_detection.view(-1)
                )
            else:
                loss_detection = loss_fct(
                    logits_detection.view(-1, 2), labels_detection.view(-1)
                )

            loss += loss_detection

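        if not return_dict:
            # DebertaV2Model returns no pooler output, so the extra outputs
            # (hidden states, attentions) start at index 1.
            output = (
                logits_correction,
                logits_detection,
            ) + outputs[1:]
            return ((loss,) + output) if loss is not None else output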

        return XGECToROutput(
            loss=loss,
            logits_correction=logits_correction,
            logits_detection=logits_detection,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def get_input_embeddings(self):
        return self.deberta.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.deberta.set_input_embeddings(value)


config = AutoConfig.from_pretrained("manred1997/deberta-v3-large-lemon-spell_5k")
tokenizer = AutoTokenizer.from_pretrained("manred1997/deberta-v3-large-lemon-spell_5k")
model = XGECToRDebertaV3.from_pretrained(
    "manred1997/deberta-v3-large-lemon-spell_5k", config=config
)
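
The model outputs per-token logits, so turning predictions into corrected text requires mapping each predicted label to an edit. The following is a minimal sketch of that step, assuming the label vocabulary follows the GECToR tag convention (`$KEEP`, `$DELETE`, `$APPEND_x`, `$REPLACE_x`). This page does not document the actual labels in `config.id2label`, so the tag names and the helper `apply_edit_tags` are assumptions for illustration. The sketch also assumes predictions have already been aligned to whole words.

```python
from typing import List


def apply_edit_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply one GECToR-style edit tag per source token and return the edited tokens."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    result = []
    for token, tag in zip(tokens, tags):
        if tag == "$DELETE":
            # Drop the source token.
            continue
        if tag.startswith("$REPLACE_"):
            # Substitute the source token with the word carried by the tag.
            result.append(tag[len("$REPLACE_"):])
        elif tag.startswith("$APPEND_"):
            # Keep the source token and add the tag's word after it.
            result.extend([token, tag[len("$APPEND_"):]])
        else:
            # "$KEEP" and any tag this sketch does not handle leave the token unchanged.
            result.append(token)
    return result


tokens = ["She", "go", "to", "school", "yesterday"]
tags = ["$KEEP", "$REPLACE_went", "$KEEP", "$KEEP", "$KEEP"]
print(" ".join(apply_edit_tags(tokens, tags)))
```

In practice the tags would come from `logits_correction.argmax(-1)` mapped through `config.id2label`, and GECToR-style systems typically repeat the tagging for a few iterations until no further edits are predicted.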

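✨ Key Features

Accurate correction: detects and corrects common grammatical errors in English text.
Fine-tunable: can be fine-tuned for a specific domain, such as academic writing or business communication, to improve correction accuracy in that context.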

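📚 Detailed Documentation

Model Details

Model Description

This model is a grammar error correction (GEC) system fine-tuned from the microsoft/deberta-v3-large model. It is designed to detect and correct grammatical errors in English text, focusing on common error types such as verb tense, noun inflection, and adjective usage. It is particularly useful for language learners and for applications that need better grammatical accuracy.

Property	Details
Model type	Token classification with sequence-to-sequence correction
Language	English
Fine-tuned from	`microsoft/deberta-v3-large`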

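Uses

Direct Use

The model can be used directly to detect and correct grammatical errors in English text, which makes it a good fit for writing assistants, educational software, and proofreading tools.

Downstream Use

The model can be fine-tuned for a specific domain, such as academic writing, business communication, or informal text correction, to improve correction accuracy in that context.

Out-of-Scope Use

The model is not suitable for non-English text, for corrections that are not grammatical (such as style, tone, or logic), or for detecting complex errors beyond basic grammatical structure.

Bias, Risks, and Limitations

The model was trained on general English corpora and may perform poorly on non-standard varieties (such as colloquial speech) or domain-specific jargon. Because of these limits in the training data, it may introduce errors or miss them when applied in such contexts.

Recommendations

Although the model performs well in general, users should review its corrections manually, especially in professional or creative contexts where grammar rules can be more flexible.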

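🔧 Technical Details

Training Details

Training Data

Training proceeds in three stages, each with its own dataset. The data used in each stage is described below:

Stage	Dataset	Description
Stage 1	9 million shuffled sentences from the PIE corpus (A1 part only)	9 million shuffled sentences from the PIE corpus, focusing on A1-level sentences.
Stage 2	Shuffled combination of the NUCLE, FCE, Lang8, and W&I + Locness datasets	The Lang8 dataset contains 947,344 sentences, and in 52.5% of them the source and target sentences differ. If you use a newer Lang8 dump, consider sampling.
Stage 3	Final shuffled version of the W&I + Locness dataset	Final shuffled version of the W&I + Locness dataset.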
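
The Stage 2 note suggests sampling when a newer, larger Lang8 dump is used. The sketch below shows one way this could be done. It is not the author's preprocessing, and the names `sample_pairs` and `changed_fraction` are hypothetical. It draws a fixed number of sentence pairs and reports the share of pairs whose source and target differ, which can be compared against the 52.5% reported above (roughly 497,000 of 947,344 pairs).

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, corrected target sentence)


def sample_pairs(pairs: List[Pair], max_pairs: int, seed: int = 0) -> List[Pair]:
    """Return at most max_pairs sentence pairs, drawn without replacement."""
    if len(pairs) <= max_pairs:
        return list(pairs)
    return random.Random(seed).sample(pairs, max_pairs)


def changed_fraction(pairs: List[Pair]) -> float:
    """Fraction of pairs whose source and target sentences differ."""
    if not pairs:
        return 0.0
    return sum(src != tgt for src, tgt in pairs) / len(pairs)


# Toy data standing in for a Lang8 dump.
pairs = [
    ("I has a dog .", "I have a dog ."),
    ("She is happy .", "She is happy ."),
    ("He go home .", "He goes home ."),
    ("We like tea .", "We like tea ."),
]
subset = sample_pairs(pairs, max_pairs=2)
print(len(subset), changed_fraction(pairs))
```

A fixed seed keeps the subset reproducible across runs. With a real dump, `max_pairs` could be set to 947,344 to match the size reported in the table.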