chatglm3-6b-csc-chinese-lora open-source model: accurate Chinese text correction, plus polishing and rewriting

Chatglm3 6b Csc Chinese Lora

Developed by shibing624

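A LoRA fine-tuned model for Chinese spelling correction based on ChatGLM3-6B. It corrects text with high accuracy and also supports sentence polishing and rewriting.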

Large Language Model

Safetensors

Chinese · Open Source License: Apache-2.0 · #Chinese spelling correction #LoRA fine-tuning #Grammar rewriting

Downloads 42

Release Time : 11/2/2023

Model Overview

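This LoRA fine-tuned model targets Chinese spelling correction. Built on the THUDM/chatglm3-6b large language model, it identifies and corrects spelling errors in Chinese text and is suited to a range of proofreading scenarios.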

Model Features

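High-accuracy correction

Performs well on the CSC test set, accurately identifying and correcting Chinese spelling errors.

Sentence polishing

Beyond correcting errors, it can polish and rewrite sentences to improve text quality.

LoRA fine-tuning

Uses LoRA to fine-tune ChatGLM3-6B efficiently, improving correction performance while preserving the base model's capabilities.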
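To make the efficiency claim concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus LoRA. The matrix size (4096 × 4096) and the rank (8) are illustrative assumptions, not values read from this model's adapter configuration.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full_finetune_params, lora_params) for one d_out x d_in weight.

    LoRA trains two small matrices, B (d_out x rank) and A (rank x d_in),
    instead of the full d_out x d_in weight.
    """
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical sizes chosen only for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

With these assumed sizes, the LoRA matrices hold well under 1% of the parameters of the full weight, which is why the adapter files are small compared with the base model.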

Model Capabilities

Chinese spelling correction

Text polishing

Sentence rewriting

Use Cases

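Text proofreading

Student essay correction

Automatically detects and corrects spelling errors in student essays

少先队员因该为老人让坐。 → 少先队员应该为老人让座。 ("Young Pioneers should give up their seats to the elderly."; 因该 → 应该, 坐 → 座)

Formal document proofreading

Checks spelling in formal documents to ensure textual accuracy

Content creation assistance

Text polishing

Rewrites existing text to improve the quality of expression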

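🚀 ChatGLM3-6B Chinese Correction LoRA Model

This is a LoRA model for Chinese text correction based on ChatGLM3-6B, with high correction accuracy on the CSC test set. Beyond correcting errors, it can also polish and rewrite sentences.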

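🚀 Quick Start

Install dependencies

Install the required Python libraries before using the model:

pip install -U pycorrector

or, to call the model directly through HuggingFace Transformers (that example also imports torch and peft):

pip install transformers peft torch

Usage examples

Calling via the `pycorrector` library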

from pycorrector import GptCorrector

# Load the ChatGLM3-6B base model and apply the LoRA adapter on top of it
model = GptCorrector("THUDM/chatglm3-6b", "chatglm", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
r = model.correct_batch(["少先队员因该为老人让坐。"])
print(r) # ['少先队员应该为老人让座。']

Calling directly with HuggingFace Transformers

import os

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

# Allow duplicate OpenMP runtimes to coexist (avoids an initialization error on some setups)
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# Load the base model in half precision on the GPU, then attach the LoRA adapter
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")

# Each input is the instruction "对下面文本纠错" ("correct the following text") followed by the sentence
sents = ['对下面文本纠错\n\n少先队员因该为老人让坐。',
         '对下面文本纠错\n\n下个星期，我跟我朋唷打算去法国玩儿。']


def get_prompt(user_query):
    # Wrap the query in the vicuna prompt template (prompt_template_name in the training parameters)
    vicuna_prompt = "A chat between a curious user and an artificial intelligence assistant. " \
                    "The assistant gives helpful, detailed, and polite answers to the user's questions. " \
                    "USER: {query} ASSISTANT:"
    return vicuna_prompt.format(query=user_query)


for s in sents:
    q = get_prompt(s)
    input_ids = tokenizer(q).input_ids
    # Sampling is enabled, so the output can vary between runs
    generation_kwargs = dict(max_new_tokens=128, do_sample=True, temperature=0.8)
    outputs = model.generate(input_ids=torch.as_tensor([input_ids]).to('cuda:0'), **generation_kwargs)
    # Drop the prompt tokens and decode only the newly generated part
    output_tensor = outputs[0][len(input_ids):]
    response = tokenizer.decode(output_tensor, skip_special_tokens=True)
    print(response)

Example output

少先队员应该为老人让座。
下个星期，我跟我朋友打算去法国玩儿。
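For proofreading it is often useful to see which characters changed, not just the corrected sentence. The helper below is a minimal sketch of that post-processing step; `diff_corrections` is a hypothetical name and is not part of pycorrector or Transformers. It assumes spelling correction mostly substitutes characters one for one, and falls back to `difflib` when the lengths differ (for example after a rewrite).

```python
from difflib import SequenceMatcher


def diff_corrections(source, corrected):
    """Return a list of (position, original_text, corrected_text) edits."""
    if len(source) == len(corrected):
        # Same length: compare character by character.
        return [(i, s, c) for i, (s, c) in enumerate(zip(source, corrected)) if s != c]
    # Different length: let difflib align the two strings.
    edits = []
    matcher = SequenceMatcher(None, source, corrected)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((i1, source[i1:i2], corrected[j1:j2]))
    return edits


pairs = [
    ("少先队员因该为老人让坐。", "少先队员应该为老人让座。"),
    ("下个星期，我跟我朋唷打算去法国玩儿。", "下个星期，我跟我朋友打算去法国玩儿。"),
]
for src, tgt in pairs:
    print(diff_corrections(src, tgt))
```

For the two sentences above, the edits are 因 → 应 and 坐 → 座 in the first, and 唷 → 友 in the second.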

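✨ Key Features

High accuracy: generated corrections are highly accurate on the CSC test set.
Versatility: handles Chinese spelling correction as well as sentence polishing and rewriting.
Broad compatibility: built on the pycorrector project, which supports both native ChatGLM models and LoRA fine-tuned models.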

📚 Detailed Documentation

Model evaluation

A sample prediction from shibing624/chatglm3-6b-csc-chinese-lora on the CSC test set:

Input text	Predicted result
对下面文本纠错：少先队员因该为老人让坐。	少先队员应该为老人让座。
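The page does not state which metric underlies the accuracy claim. One common choice for spelling correction is sentence-level accuracy: a prediction counts only if it matches the reference exactly. The sketch below assumes that metric; `sentence_accuracy` is a hypothetical helper, and the second prediction is a made-up failure case included only to show a non-trivial score.

```python
def sentence_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference sentence."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


references = ["少先队员应该为老人让座。", "下个星期，我跟我朋友打算去法国玩儿。"]
predictions = [
    "少先队员应该为老人让座。",              # matches the reference
    "下个星期，我跟我朋唷打算去法国玩儿。",  # hypothetical miss: typo left uncorrected
]
print(sentence_accuracy(predictions, references))  # 0.5
```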

Model files

chatglm3-6b-csc-chinese-lora
    ├── adapter_config.json
    └── adapter_model.bin

Training parameters

[Figure: training loss curve]

Parameter	Value
num_epochs	5
per_device_train_batch_size	6
learning_rate	2e-05
best steps	25100
train_loss	0.0834
lr_scheduler_type	linear
base model	THUDM/chatglm3-6b
warmup_steps	50
save_strategy	steps
save_steps	500
save_total_limit	10
bf16	false
fp16	true
optim	adamw_torch
ddp_find_unused_parameters	false
gradient_checkpointing	true
max_seq_length	512
max_length	512
prompt_template_name	vicuna
Hardware and training time	6 × V100 32GB, 48 hours of training
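Two of the settings in the table interact in ways that are easy to misread, so here is a small sketch. It takes learning_rate (2e-05), warmup_steps (50), per_device_train_batch_size (6), and the GPU count (6) from the table; the total number of training steps and the gradient accumulation factor are not given on this page, so `total_steps=30_000` and `grad_accum=1` are placeholder assumptions. The schedule follows the usual shape of a linear scheduler with warmup: ramp up from zero, then decay linearly to zero.

```python
def linear_lr(step, base_lr=2e-5, warmup_steps=50, total_steps=30_000):
    """Learning rate at a given step for linear warmup followed by linear decay.

    total_steps is a placeholder assumption; the real value is not documented.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def effective_batch_size(per_device=6, num_gpus=6, grad_accum=1):
    """Samples consumed per optimizer step (grad_accum=1 is an assumption)."""
    return per_device * num_gpus * grad_accum


for step in (0, 25, 50, 15_025, 30_000):
    print(step, linear_lr(step))
print("effective batch size:", effective_batch_size())
```

Under these assumptions the learning rate peaks at 2e-05 at step 50, and each optimizer step sees 36 samples.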

Training datasets

Chinese spelling correction dataset: https://huggingface.co/datasets/shibing624/CSC
Chinese grammar correction dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
General GPT-4 Q&A dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4

To train your own text correction model, see https://github.com/shibing624/pycorrector

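🔧 Technical Details

This model is a LoRA fine-tune of THUDM/chatglm3-6b, trained with the parameters and datasets listed above, and performs well on Chinese correction tasks. LoRA lets the model learn correction-specific knowledge without changing the structure of the base model.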
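The idea that LoRA leaves the base model untouched can be shown with a toy forward pass. In the standard LoRA formulation, a frozen weight W is augmented with a low-rank product, giving y = W x + (alpha / r) · B (A x), and only A and B are trained. The sketch below uses tiny hand-written matrices in pure Python; all values are illustrative, and nothing here reflects the actual shapes or hyperparameters of this adapter.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, A and B are the adapter."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2 x 2)
A = [[0.5, -0.5]]              # adapter matrix A (r x d_in), r = 1
B_init = [[0.0], [0.0]]        # adapter matrix B (d_out x r), zero at initialization
B_trained = [[1.0], [2.0]]     # illustrative "trained" values

# With B at zero the adapter contributes nothing: output equals W x.
print(lora_forward([1.0, 1.0], W, A, B_init, alpha=1, r=1))     # [3.0, 7.0]
# After training, the low-rank term shifts the output while W is unchanged.
print(lora_forward([2.0, 0.0], W, A, B_trained, alpha=2, r=1))  # [4.0, 10.0]
```

Because W is never modified, the same base checkpoint can be shared across adapters, and removing the adapter restores the original model behavior.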

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@software{pycorrector,
  author = {Ming Xu},
  title = {pycorrector: Text Error Correction Tool},
  year = {2023},
  url = {https://github.com/shibing624/pycorrector},
}