chatglm3-6b-csc-chinese-lora open-source model - accurate Chinese text correction, with polishing and rewriting

Chatglm3 6b Csc Chinese Lora

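Developed by shibing624

A LoRA fine-tuned model for Chinese spelling correction, built on ChatGLM3-6B. It offers high-accuracy text correction and also supports sentence polishing and rewriting.

Large language model

Safetensors

Language: Chinese | License: Apache-2.0 | Tags: #Chinese spelling correction #LoRA fine-tuning #Grammar rewriting

Downloads: 42

Release date: 11/2/2023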

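Model overview

This is a LoRA fine-tuned model for Chinese spelling correction, built on the THUDM/chatglm3-6b large language model. It identifies and corrects spelling errors in Chinese text and is suited to a wide range of proofreading scenarios.

Model features

High-accuracy correction

Performs well on the CSC test set, accurately identifying and correcting Chinese spelling errors.

Sentence polishing

Beyond correcting errors, it can polish and rewrite sentences to improve text quality.

LoRA fine-tuning

Fine-tunes ChatGLM3-6B efficiently with LoRA, improving correction performance while retaining the base model's capabilities.

Model capabilities

Chinese spelling correction

Text polishing

Sentence rewriting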

Use cases

Proofreading

Correcting student essays

Automatically detects and corrects spelling errors in student essays.

少先隊員因該為老人讓坐。 → 少先隊員應該為老人讓座。
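To make the correction in the example above concrete, the following minimal sketch lists which characters changed between the input and the corrected sentence. It uses only Python's standard `difflib`; the helper name `list_edits` is hypothetical and is not part of pycorrector or the model.

```python
from difflib import SequenceMatcher


def list_edits(source, corrected):
    """Return (position, wrong_text, corrected_text) for each differing span."""
    matcher = SequenceMatcher(a=source, b=corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record where the source differs and what replaced it.
            edits.append((i1, source[i1:i2], corrected[j1:j2]))
    return edits


if __name__ == "__main__":
    print(list_edits("少先隊員因該為老人讓坐。", "少先隊員應該為老人讓座。"))
```

For the sentence pair above this reports two single-character substitutions: 因 → 應 at index 4 and 坐 → 座 at index 10.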

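Proofreading formal documents

Checks formal documents for spelling errors to ensure the text is accurate.

Content creation assistance

Text polishing

Rewrites and improves existing text to raise the quality of expression.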

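🚀 ChatGLM3-6B Chinese correction LoRA model

This is a LoRA model for Chinese text correction built on ChatGLM3-6B, with high correction accuracy on the CSC test set. Beyond correcting errors, it can also polish and rewrite sentences.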

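🚀 Quick start

Installing dependencies

Install the required Python libraries before using the model. To use it through pycorrector:

pip install -U pycorrector

Or, to call it directly with Transformers (the example below also imports peft and torch):

pip install transformers peft torch

Usage examples

Calling the model with the `pycorrector` library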

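from pycorrector import GptCorrector

# Load the ChatGLM3-6B base model and apply the LoRA adapter on top of it.
model = GptCorrector("THUDM/chatglm3-6b", "chatglm", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
# correct_batch takes a list of sentences.
r = model.correct_batch(["少先隊員因該為老人讓坐。"])
print(r) # ['少先隊員應該為老人讓座。']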

Calling the model directly with HuggingFace Transformers

import os

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

# Allow duplicate OpenMP runtimes to coexist (works around a crash seen in some environments).
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# Load the base model in half precision on the GPU, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")

# Each input is the instruction "對下面文本糾錯" ("Correct the following text"), a blank line, then the sentence.
sents = ['對下面文本糾錯\n\n少先隊員因該為老人讓坐。',
         '對下面文本糾錯\n\n下個星期，我跟我朋唷打算去法國玩兒。']


def get_prompt(user_query):
    # Wrap the query in the vicuna-style template (prompt_template_name is vicuna in the training parameters).
    vicuna_prompt = "A chat between a curious user and an artificial intelligence assistant. " \
                    "The assistant gives helpful, detailed, and polite answers to the user's questions. " \
                    "USER: {query} ASSISTANT:"
    return vicuna_prompt.format(query=user_query)


for s in sents:
    q = get_prompt(s)
    input_ids = tokenizer(q).input_ids
    # Sampling is enabled, so the output can differ between runs.
    generation_kwargs = dict(max_new_tokens=128, do_sample=True, temperature=0.8)
    outputs = model.generate(input_ids=torch.as_tensor([input_ids]).to('cuda:0'), **generation_kwargs)
    # generate() returns the prompt tokens followed by the new tokens; keep only the new ones.
    output_tensor = outputs[0][len(input_ids):]
    response = tokenizer.decode(output_tensor, skip_special_tokens=True)
    print(response)

Example output (sampling is enabled, so results may vary between runs)

少先隊員應該為老人讓座。
下個星期，我跟我朋友打算去法國玩兒。
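The prompt handling in the script above can be factored into small pure functions that are checkable without a GPU. The sketch below is an illustrative refactoring, not part of the model card: `INSTRUCTION`, `VICUNA_TEMPLATE`, `build_query`, and `strip_prompt` are hypothetical names, while the template text and the instruction prefix are copied from the example.

```python
INSTRUCTION = "對下面文本糾錯\n\n"  # instruction prefix copied from the example above

VICUNA_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {query} ASSISTANT:"
)


def build_query(sentence):
    """Wrap one raw sentence in the instruction and the vicuna-style template."""
    return VICUNA_TEMPLATE.format(query=INSTRUCTION + sentence)


def strip_prompt(output_ids, prompt_ids):
    """Drop the echoed prompt tokens, keeping only the newly generated ids."""
    return output_ids[len(prompt_ids):]


if __name__ == "__main__":
    print(build_query("少先隊員因該為老人讓坐。"))
    # Made-up token ids: the first three stand in for the echoed prompt.
    print(strip_prompt([11, 12, 13, 21, 22], [11, 12, 13]))
```

Keeping the two steps separate makes it easier to confirm that the text sent to the model matches the training template before spending GPU time on generation.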

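✨ Key features

High accuracy: generated corrections reach high accuracy on the CSC test set.
Versatility: beyond Chinese spelling correction, the model can also polish and rewrite sentences.
Broad compatibility: built on the pycorrector project, which supports both native ChatGLM models and LoRA fine-tuned models.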

📚 Detailed documentation

Model evaluation

An example prediction from shibing624/chatglm3-6b-csc-chinese-lora on the CSC test set:

Input text	Prediction
對下面文本糾錯：少先隊員因該為老人讓坐。	少先隊員應該為老人讓座。
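The card does not state which metric underlies its accuracy claim. One common choice for spelling correction is sentence-level accuracy, which counts a prediction as correct only when it matches the reference exactly. The sketch below assumes that definition; `sentence_accuracy` is a hypothetical helper, not the author's evaluation script.

```python
def sentence_accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly (after strip)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["少先隊員應該為老人讓座。"]
    refs = ["少先隊員應該為老人讓座。"]
    print(sentence_accuracy(preds, refs))
```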

Model files

chatglm3-6b-csc-chinese-lora
    ├── adapter_config.json
    └── adapter_model.bin
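A quick way to sanity-check a downloaded adapter directory is to confirm that both files exist and to read the LoRA settings from `adapter_config.json`. This sketch assumes the directory layout shown above and the field names that PEFT normally writes for LoRA adapters (`peft_type`, `r`, `lora_alpha`, `target_modules`); the function name is hypothetical.

```python
import json
from pathlib import Path

EXPECTED_FILES = ("adapter_config.json", "adapter_model.bin")


def inspect_adapter(adapter_dir):
    """Check that the expected files exist and return selected LoRA settings."""
    adapter_dir = Path(adapter_dir)
    missing = [name for name in EXPECTED_FILES if not (adapter_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing adapter files: {missing}")
    config = json.loads((adapter_dir / "adapter_config.json").read_text(encoding="utf-8"))
    # .get() is used because the exact keys can depend on the PEFT version.
    keys = ("peft_type", "r", "lora_alpha", "target_modules")
    return {key: config.get(key) for key in keys}
```

Calling `inspect_adapter` on a local copy of the adapter directory returns a small dictionary of settings, with `None` for any key the config file does not contain.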

Training parameters

[Figure: training loss]

Parameter	Value
num_epochs	5
per_device_train_batch_size	6
learning_rate	2e-05
best steps	25100
train_loss	0.0834
lr_scheduler_type	linear
base model	THUDM/chatglm3-6b
warmup_steps	50
save_strategy	steps
save_steps	500
save_total_limit	10
bf16	false
fp16	true
optim	adamw_torch
ddp_find_unused_parameters	false
gradient_checkpointing	true
max_seq_length	512
max_length	512
prompt_template_name	vicuna
Hardware and training time	6 * V100 32GB, 48 hours of training
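The table specifies a linear scheduler with 50 warmup steps and a peak learning rate of 2e-05. The sketch below shows the shape of such a schedule (a linear ramp-up followed by a linear decay to zero), in the manner of the linear-with-warmup schedule in Transformers. The card does not give the total number of training steps, so `total_steps` is a hypothetical parameter here.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=50):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = 30000  # hypothetical value, chosen only for illustration
    for s in (0, 25, 50, 15025, 30000):
        print(s, linear_lr(s, total))
```

With these settings the rate reaches its peak after only 50 steps, so almost all of training runs in the decay phase.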

Training datasets

Chinese spelling correction dataset: https://huggingface.co/datasets/shibing624/CSC
Chinese grammar correction dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
General GPT-4 Q&A dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4

To train a text correction model yourself, see https://github.com/shibing624/pycorrector

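🔧 Technical details

This model is a LoRA fine-tune of THUDM/chatglm3-6b, trained with the parameters and datasets listed above, and it performs well on Chinese correction tasks. With LoRA, the model learns correction-specific knowledge without changing the structure of the base model: the base weights stay frozen and only the small adapter matrices are trained.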
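The idea of leaving the base model untouched can be made concrete with the standard LoRA formulation: a frozen weight matrix W is augmented by a low-rank product, giving an effective weight W + (alpha / r) · B A, where only A and B are trained. The pure-Python sketch below uses tiny made-up matrices; the rank, alpha, and values are illustrative assumptions and not this adapter's actual configuration.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    r = len(a)  # LoRA rank: A is r x in_features, B is out_features x r
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2x3 base weight
    a = [[1.0, 2.0, 3.0]]                   # trainable, 1x3 (rank r = 1)
    b = [[0.5], [0.0]]                      # trainable, 2x1
    print(lora_effective_weight(w, a, b, alpha=2.0))
```

Because A and B together hold far fewer values than W when r is small, the adapter can be stored and shared separately from the base model, which is why this repository contains only the adapter files.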

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@software{pycorrector,
  author = {Ming Xu},
  title = {pycorrector: Text Error Correction Tool},
  year = {2023},
  url = {https://github.com/shibing624/pycorrector},
}