Chinese Text Correction 7B: Open-Source Chinese Text Correction Model Fine-Tuned from Qwen2.5-7B-Instruct

Chinese Text Correction 7b

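Developed by shibing624

Chinese Text Correction 7b is a 7B-parameter Chinese text correction model, instruction fine-tuned from Qwen/Qwen2.5-7B-Instruct. It targets spelling and grammar correction of Chinese text.

Large language model

Transformers

Language: Chinese · License: Apache-2.0 · #Chinese text correction #Instruction fine-tuning #High-precision semantic understanding

Downloads: 522

Released: 10/12/2024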

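Model Overview

The model is built for Chinese text correction: given a sentence, it generates a corrected version with spelling and grammar errors fixed.

Model Features

Chinese instruction fine-tuning

Optimized for Chinese instructions, so it understands and carries out Chinese-language tasks more reliably.

Text correction

Supports Chinese text correction: it detects and fixes errors in the input text.

Large language model

Built on a 7B-parameter large language model with strong text generation and understanding capabilities.

Model Capabilities

Text generation

Text correction

Instruction understanding

Use Cases

Text correction

Chinese text correction

Identifies and corrects grammar, spelling, and word-choice errors in Chinese text.

Improves the accuracy and readability of the text.

Text generation

Chinese text generation

Generates coherent, fluent Chinese text from a given prompt.

The generated text follows the logic of its context and reads naturally.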

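🚀 Chinese Text Correction Model

This project provides a Chinese text correction model for spelling and grammar correction, improving the accuracy and consistency of written text.

🚀 Quick Start

Using `pycorrector`

The model is open-sourced as part of the pycorrector project, which supports fine-tuning large language models for text correction. Call it as follows.

Install the dependency: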

pip install -U pycorrector

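from pycorrector.gpt.gpt_corrector import GptCorrector

if __name__ == '__main__':
    # Sample sentences that contain deliberate spelling and grammar errors
    error_sentences = [
        '真麻煩你了。希望你們好好的跳無',
        '少先隊員因該為老人讓坐',
        '機七學習是人工智能領遇最能體現智能的一個分知',
        '一隻小魚船浮在平淨的河面上',
        '我的家鄉是有明的漁米之鄉',
    ]
    # Load the correction model by its Hugging Face model ID
    m = GptCorrector("shibing624/chinese-text-correction-7b")

    # Correct all sentences in one batch and print each result
    batch_res = m.correct_batch(error_sentences)
    for i in batch_res:
        print(i)
        print()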

Using `HuggingFace Transformers`

Without pycorrector, you can call the model directly as follows.

Pass the input text to the model, then decode the generated sentence.

Install the dependency:

pip install transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "shibing624/chinese-text-correction-7b"

device = "cuda"  # use "cpu" to run without a GPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Instruction prefix ("text correction:") followed by the sentence to correct
input_content = "文本糾錯：\n少先隊員因該為老人讓坐。"

# Format the request with the model's chat template
messages = [{"role": "user", "content": input_content}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

print(input_text)

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
# Greedy decoding (do_sample=False), so no sampling temperature is passed
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

# Decodes the full sequence: the prompt followed by the model's reply
print(tokenizer.decode(outputs[0]))

Output (the model's reply, which follows the echoed prompt in the decoded text):

少先隊員應該為老人讓座。
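
Because `tokenizer.decode(outputs[0])` returns the prompt together with the reply, it can help to pull out only the corrected sentence. The sketch below is one way to do that in plain Python. It assumes the decoded text uses the ChatML markers `<|im_start|>` and `<|im_end|>` that Qwen2.5 chat templates emit; `build_messages` and `extract_reply` are hypothetical helper names, not part of pycorrector or Transformers.

```python
from typing import Dict, List

# Assumption: the chat template wraps each turn in ChatML markers.
ASSISTANT_START = "<|im_start|>assistant\n"
TURN_END = "<|im_end|>"


def build_messages(sentence: str) -> List[Dict[str, str]]:
    """Wrap a sentence in the instruction prefix used in the example above."""
    return [{"role": "user", "content": "文本糾錯：\n" + sentence}]


def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    start = decoded.rfind(ASSISTANT_START)
    if start == -1:
        # No assistant marker found: fall back to the whole text
        return decoded.strip()
    reply = decoded[start + len(ASSISTANT_START):]
    end = reply.find(TURN_END)
    if end != -1:
        reply = reply[:end]
    return reply.strip()


if __name__ == "__main__":
    sample = (
        "<|im_start|>user\n文本糾錯：\n少先隊員因該為老人讓坐。<|im_end|>\n"
        "<|im_start|>assistant\n少先隊員應該為老人讓座。<|im_end|>"
    )
    print(extract_reply(sample))
```

An alternative is to slice off the prompt tokens before decoding and pass `skip_special_tokens=True` to `tokenizer.decode`; the string-based version above simply avoids touching the generation code.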

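✨ Key Features

Multiple error types: supports spelling and grammar correction, covering phonetically similar characters, visually similar characters, redundant characters, missing characters, and other error types.
Multiple ways to call: use the pycorrector project, or call the model directly with HuggingFace Transformers.
Multiple model sizes: models of different scales, such as chinese-text-correction-1.5b and chinese-text-correction-7b, cover different deployment needs.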

📚 Documentation

Model List

Model name	Base model	Download link
chinese-text-correction-1.5b	Qwen/Qwen2.5-1.5B-Instruct	🤗 Hugging Face
chinese-text-correction-1.5b-lora	Qwen/Qwen2.5-1.5B-Instruct	🤗 Hugging Face
chinese-text-correction-7b	Qwen/Qwen2.5-7B-Instruct	🤗 Hugging Face
chinese-text-correction-7b-lora	Qwen/Qwen2.5-7B-Instruct	🤗 Hugging Face

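Evaluation Results

Metric: F1
CSC (Chinese Spelling Correction): spelling correction models, which handle length-aligned errors such as phonetically similar characters, visually similar characters, and grammar errors.
CTC (Chinese Text Correction): text correction models, which handle length-aligned spelling and grammar errors as well as length-changing errors such as redundant or missing characters.
GPU: Tesla V100, 32 GB memory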
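
The difference between length-aligned and length-changing errors can be made concrete by diffing a source sentence against its correction. The following is a minimal sketch using Python's standard `difflib`; it is not the project's evaluation code, and `diff_edits` and `is_length_aligned` are hypothetical helper names. The second sentence pair is an invented example of a missing-character error.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (operation, wrong text, right text, position in source)
Edit = Tuple[str, str, str, int]


def diff_edits(source: str, target: str) -> List[Edit]:
    """List the character-level edits that turn source into target."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, source[i1:i2], target[j1:j2], i1))
    return edits


def is_length_aligned(source: str, target: str) -> bool:
    """True when every edit is a same-length substitution (CSC-style)."""
    return all(
        op == "replace" and len(wrong) == len(right)
        for op, wrong, right, _ in diff_edits(source, target)
    )


if __name__ == "__main__":
    # Substitution-only errors: positions line up character by character
    print(diff_edits("少先隊員因該為老人讓坐", "少先隊員應該為老人讓座"))
    # → [('replace', '因', '應', 4), ('replace', '坐', '座', 10)]

    # Hypothetical missing-character error: the correction is longer
    print(is_length_aligned("我喜歡吃蘋", "我喜歡吃蘋果"))
    # → False
```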

Model name	Model link	Base model	Average score	SIGHAN-2015 score	EC-LAW score	MCSC score	GPU/CPU	QPS
Kenlm-CSC	shibing624/chinese-kenlm-klm	kenlm	0.3409	0.3147	0.3763	0.3317	CPU	9
Mengzi-T5-CSC	shibing624/mengzi-t5-base-chinese-correction	mengzi-t5-base	0.3984	0.7758	0.3156	0.1039	GPU	214
ERNIE-CSC	PaddleNLP/ernie-csc	PaddlePaddle/ernie-1.0-base-zh	0.4353	0.8383	0.3357	0.1318	GPU	114
MacBERT-CSC	shibing624/macbert4csc-base-chinese	hfl/chinese-macbert-base	0.3993	0.8314	0.1610	0.2055	GPU	224
ChatGLM3-6B-CSC	shibing624/chatglm3-6b-csc-chinese-lora	THUDM/chatglm3-6b	0.4538	0.6572	0.4369	0.2672	GPU	3
Qwen2.5-1.5B-CTC	shibing624/chinese-text-correction-1.5b	Qwen/Qwen2.5-1.5B-Instruct	0.6802	0.3032	0.7846	0.9529	GPU	6
Qwen2.5-7B-CTC	shibing624/chinese-text-correction-7b	Qwen/Qwen2.5-7B-Instruct	0.8225	0.4917	0.9798	0.9959	GPU	3
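
The average column appears to be the unweighted mean of the three per-dataset F1 scores. The source does not state this, so treat it as an assumption; the short check below reproduces the reported averages for three rows of the table under that assumption.

```python
# Per-dataset F1 scores copied from the table: (SIGHAN-2015, EC-LAW, MCSC)
scores = {
    "MacBERT-CSC": (0.8314, 0.1610, 0.2055),
    "Qwen2.5-1.5B-CTC": (0.3032, 0.7846, 0.9529),
    "Qwen2.5-7B-CTC": (0.4917, 0.9798, 0.9959),
}


def average_f1(per_dataset):
    """Unweighted mean of the per-dataset F1 scores, rounded to 4 places."""
    return round(sum(per_dataset) / len(per_dataset), 4)


if __name__ == "__main__":
    for name, per_dataset in scores.items():
        print(name, average_f1(per_dataset))
```

Note that the ranking changes by dataset: the BERT-style CSC models lead on SIGHAN-2015, while the Qwen2.5 CTC models lead on EC-LAW and MCSC, so the average alone hides a real trade-off.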

Model Files

shibing624/chinese-text-correction-7b
|-- added_tokens.json
|-- config.json
|-- generation_config.json
|-- merges.txt
|-- model.safetensors
|-- model.safetensors.index.json
|-- README.md
|-- special_tokens_map.json
|-- tokenizer_config.json
|-- tokenizer.json
`-- vocab.json

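Training Parameters

Epochs (num_epochs): 8
Batch size (batch_size): 2
Training steps (steps): 36000
Evaluation loss (eval_loss): 0.12
Base model: Qwen/Qwen2.5-7B-Instruct
Training data: shibing624/chinese_text_correction
Training time: 10 days
Evaluation loss curve: (figure not available)
Training loss curve: (figure not available)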
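
As a rough sanity check on these figures, the arithmetic below derives the average time per step and the number of sequences processed. It assumes the 10 days were spent training continuously and that each step consumed exactly one batch of 2 sequences (no gradient accumulation, single device); neither assumption is confirmed by the source.

```python
# Values taken from the training parameters above
steps = 36_000
batch_size = 2
train_days = 10

# Assumption: training ran continuously for the full 10 days
seconds_per_step = train_days * 24 * 3600 / steps

# Assumption: one batch per step, no gradient accumulation, one device
sequences_processed = steps * batch_size

if __name__ == "__main__":
    print(f"average seconds per step: {seconds_per_step}")
    print(f"sequences processed: {sequences_processed}")
```

Under these assumptions the run averaged 24 seconds per step and processed 72,000 sequences in total.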

Training Dataset

Chinese text correction dataset

Data: shibing624/chinese_text_correction

Training Reference

To train your own Qwen-based correction model, see https://github.com/shibing624/pycorrector or https://github.com/shibing624/MedicalGPT

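🔧 Technical Details

The model is fine-tuned from Qwen/Qwen2.5-7B-Instruct on the shibing624/chinese_text_correction dataset to improve performance on Chinese text correction. Evaluated by F1, it averages 0.8225 across three test sets: 0.4917 on SIGHAN-2015, 0.9798 on EC-LAW, and 0.9959 on MCSC.

📄 License

This project is released under the Apache-2.0 license.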

📖 Citation

@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Implementation of language model finetune},
  year = {2024},
  url = {https://github.com/shibing624/pycorrector},
}