chatglm3-6b-csc-chinese-lora Open Source Model - Accurately Correct Chinese Texts and Even Polish and Rewrite Them

Chatglm3 6b Csc Chinese Lora

Developed by shibing624

A LoRA fine-tuned model for Chinese spelling correction based on ChatGLM3-6B, featuring high-accuracy text correction capabilities and supporting sentence polishing and rewriting functions.

Large Language Model

Safetensors

ChineseOpen Source License:Apache-2.0 #Chinese Spelling Correction #LoRA Fine-tuning #Grammar Rewriting

Downloads 42

Release Time : 11/2/2023

Model Overview

This model is a LoRA fine-tuned model developed for Chinese text spelling correction tasks, based on the THUDM/chatglm3-6b large language model. It effectively identifies and corrects spelling errors in Chinese text, suitable for various proofreading scenarios.

Model Features

High Accuracy Correction

Performs excellently on the CSC test set, accurately identifying and correcting Chinese spelling errors.

Sentence Polishing Function

Not only corrects errors but also polishes and rewrites sentences to improve text quality.

LoRA Fine-tuning

Utilizes LoRA technology for efficient fine-tuning of ChatGLM3-6B, enhancing correction performance while preserving the original model's capabilities.

Model Capabilities

Chinese Spelling Correction

Text Polishing

Sentence Rewriting

Use Cases

Text Proofreading

Student Essay Correction

Automatically detects and corrects spelling errors in student essays

Young Pioneers should give up their seats to the elderly. → Young Pioneers should offer their seats to the elderly.

Formal Document Proofreading

Conducts spelling checks on formal documents to ensure text accuracy

Content Creation Assistance

Text Polishing

Optimizes and rewrites existing text to enhance expression quality

🚀 Chinese Spelling Correction LoRA Model

A LoRA model for Chinese spelling correction based on ChatGLM3-6B

shibing624/chatglm3-6b-csc-chinese-lora evaluates test data:

The overall performance of shibing624/chatglm3-6b-csc-chinese-lora on CSC test:

input_text	pred
Correct the following text: 少先队员因该为老人让坐。	少先队员应该为老人让座。

On the CSC test set, the generated results have a high accuracy rate for error correction. Since it is based on the THUDM/chatglm3-6b model, the results can often bring surprises. It can not only correct errors but also has the functions of sentence polishing and rewriting.

🚀 Quick Start

✨ Features

High accuracy in Chinese spelling correction on the CSC test set.
Based on the THUDM/chatglm3-6b model, capable of sentence polishing and rewriting.

📦 Installation

Install the necessary packages:

pip install -U pycorrector

💻 Usage Examples

Basic Usage

This project is open - sourced in the pycorrector project: pycorrector. It supports both the native ChatGLM model and the LoRA fine - tuned model. You can call it through the following commands:

from pycorrector import GptCorrector
model = GptCorrector("THUDM/chatglm3-6b", "chatglm", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
r = model.correct_batch(["少先队员因该为老人让坐。"])
print(r) # ['少先队员应该为老人让座。']

Advanced Usage

Without pycorrector, you can use the model like this:

First, you pass your input through the transformer model, then you get the generated sentence.

Install the package:

pip install transformers

import os

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")

sents = ['对下面文本纠错\n\n少先队员因该为老人让坐。',
         '对下面文本纠错\n\n下个星期，我跟我朋唷打算去法国玩儿。']


def get_prompt(user_query):
    vicuna_prompt = "A chat between a curious user and an artificial intelligence assistant. " \
                    "The assistant gives helpful, detailed, and polite answers to the user's questions. " \
                    "USER: {query} ASSISTANT:"
    return vicuna_prompt.format(query=user_query)


for s in sents:
    q = get_prompt(s)
    input_ids = tokenizer(q).input_ids
    generation_kwargs = dict(max_new_tokens=128, do_sample=True, temperature=0.8)
    outputs = model.generate(input_ids=torch.as_tensor([input_ids]).to('cuda:0'), **generation_kwargs)
    output_tensor = outputs[0][len(input_ids):]
    response = tokenizer.decode(output_tensor, skip_special_tokens=True)
    print(response)

Output:

少先队员应该为老人让座。
下个星期，我跟我朋友打算去法国玩儿。

🔧 Technical Details

Model File Structure

The model files are organized as follows:

chatglm3-6b-csc-chinese-lora
    ├── adapter_config.json
    └── adapter_model.bin

Training Parameters

loss

Property	Details
num_epochs	5
per_device_train_batch_size	6
learning_rate	2e - 05
best steps	25100
train_loss	0.0834
lr_scheduler_type	linear
base model	THUDM/chatglm3-6b
warmup_steps	50
save_strategy	steps
save_steps	500
save_total_limit	10
bf16	false
fp16	true
optim	adamw_torch
ddp_find_unused_parameters	false
gradient_checkpointing	true
max_seq_length	512
max_length	512
prompt_template_name	vicuna
Hardware	6 * V100 32GB, training 48 hours

Training Datasets

The training set includes the following data:

Chinese Spelling Correction Dataset: https://huggingface.co/datasets/shibing624/CSC
Chinese Grammar Correction Dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
General GPT4 Q&A Dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4

If you need to train a text correction model, please refer to https://github.com/shibing624/pycorrector

📄 License

This project is licensed under the apache - 2.0 license.

📚 Documentation

Citation

@software{pycorrector,
  author = {Ming Xu},
  title = {pycorrector: Text Error Correction Tool},
  year = {2023},
  url = {https://github.com/shibing624/pycorrector},
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご