# 🚀 HeackMT5-ZhSum100k: A Summarization Model for Chinese Texts

HeackMT5-ZhSum100k is a fine-tuned mT5 model designed specifically for Chinese text summarization. Trained on diverse Chinese datasets, it can generate coherent and concise summaries for a wide variety of texts.
## 📚 Documentation

### Model Details

| Property | Details |
|---|---|
| Model Type | mT5 |
| Language | Chinese |
| Training Data | Primarily Chinese financial news sources (no BBC or CNN content); the training data contains 1M lines |
| Finetuning Epochs | 10 |
### Evaluation Results

The model achieved the following results:

- ROUGE-1: 56.46
- ROUGE-2: 45.81
- ROUGE-L: 52.98
- ROUGE-Lsum: 20.22
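The evaluation script is not included in this card. As a rough sketch of how comparable scores could be computed with the Hugging Face `evaluate` library (the character-level tokenizer and the example strings below are assumptions for Chinese text, not necessarily the setup behind the numbers above):

```python
import evaluate

rouge = evaluate.load("rouge")

# Assumption: character-level tokenization for Chinese; the original
# evaluation setup for the scores above is not specified in this card.
predictions = ["包头警方发布AI电信诈骗典型案例"]
references = ["包头警方发布一起利用AI实施电信诈骗的典型案例"]

results = rouge.compute(
    predictions=predictions,
    references=references,
    tokenizer=lambda text: list(text),  # split into individual characters
)
print(results)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```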
## 💻 Usage Examples

### Basic Usage

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")

chunk = """
财联社5月22日讯,据平安包头微信公众号消息,近日,包头警方发布一起利用人工智能(AI)实施电信诈骗的典型案例,福州市某科技公司法人代表郭先生10分钟内被骗430万元。
4月20日中午,郭先生的好友突然通过微信视频联系他,自己的朋友在外地竞标,需要430万保证金,且需要公对公账户过账,想要借郭先生公司的账户走账。
基于对好友的信任,加上已经视频聊天核实了身份,郭先生没有核实钱款是否到账,就分两笔把430万转到了好友朋友的银行卡上。郭先生拨打好友电话,才知道被骗。骗子通过智能AI换脸和拟声技术,佯装好友对他实施了诈骗。
值得注意的是,骗子并没有使用一个仿真的好友微信添加郭先生为好友,而是直接用好友微信发起视频聊天,这也是郭先生被骗的原因之一。骗子极有可能通过技术手段盗用了郭先生好友的微信。幸运的是,接到报警后,福州、包头两地警银迅速启动止付机制,成功止付拦截336.84万元,但仍有93.16万元被转移,目前正在全力追缴中。
"""

inputs = tokenizer.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```

Expected output:

```
包头警方发布一起利用AI实施电信诈骗典型案例:法人代表10分钟内被骗430万元
```
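If a GPU is available, the same example can run on it. A minimal sketch, assuming standard PyTorch device placement (the device-handling lines are an addition for illustration, not part of the original example):

```python
import torch

# Assumption: standard PyTorch device placement, added for illustration.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer.encode("summarize: " + chunk, return_tensors='pt',
                          max_length=512, truncation=True).to(device)
summary_ids = model.generate(inputs, max_length=150, num_beams=4,
                             length_penalty=1.5, no_repeat_ngram_size=2)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```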
### Advanced Usage

For texts longer than the model's input window, the snippet below splits the input into chunks at punctuation boundaries, summarizes each chunk, and joins the partial summaries (a usage sketch follows the code):

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model_heack = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer_heack = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")

def _split_text(text, length):
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start > length:
            # Scan up to 20 characters forward and backward from the
            # target cut point, looking for a punctuation boundary.
            pos_forward = start + length
            pos_backward = start + length
            pos = start + length
            while (pos_forward < len(text)) and (pos_backward >= 0) \
                    and (pos_forward < 20 + pos) and (pos_backward + 20 > pos) \
                    and text[pos_forward] not in {'.', '。', ',', ','} \
                    and text[pos_backward] not in {'.', '。', ',', ','}:
                pos_forward += 1
                pos_backward -= 1
            if pos_forward - pos >= 20 and pos_backward <= pos - 20:
                # No punctuation found nearby; cut at the target length.
                pos = start + length
            elif text[pos_backward] in {'.', '。', ',', ','}:
                pos = pos_backward
            else:
                pos = pos_forward
            chunks.append(text[start:pos + 1])
            start = pos + 1
        else:
            chunks.append(text[start:])
            break
    # Combine the last chunk with the previous one if it's too short.
    if len(chunks) > 1 and len(chunks[-1]) < 100:
        chunks[-2] += chunks[-1]
        chunks.pop()
    return chunks

def get_summary_heack(text, each_summary_length=150):
    chunks = _split_text(text, 300)
    summaries = []
    for chunk in chunks:
        inputs = tokenizer_heack.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
        summary_ids = model_heack.generate(inputs, max_length=each_summary_length, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
        summary = tokenizer_heack.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return " ".join(summaries)
```
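A quick usage sketch for the helpers above; `long_article` is a hypothetical placeholder for any long Chinese text:

```python
# Hypothetical input: any long Chinese article, e.g. the text from Basic Usage.
long_article = "财联社5月22日讯,包头警方发布一起利用AI实施电信诈骗的典型案例。" * 30

print(get_summary_heack(long_article, each_summary_length=150))
```

The design intent of `_split_text` is to cut each roughly 300-character chunk at the nearest period or comma within 20 characters of the target length, so each chunk stays well inside the 512-token encoder limit without breaking mid-sentence.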
## Credits

This model is trained and maintained by KongYang from Shanghai Jiao Tong University. For any questions, please reach out via WeChat ID: kongyang.
## 📄 License

### Definitions

"Derivative Works" refers to any variant directly or indirectly derived from this model through technical means such as quantization, pruning, distillation, or architectural modification, including but not limited to:

- Products of quantization format conversion, such as GGUF/GGML.
- Lightweight models obtained through knowledge distillation.
- Architectural adjustments based on the model's parameters (e.g., changes in the number of layers or attention mechanisms).
### Data and Training Cost Explanation

Training high-quality AI models requires substantial resources:

- Data cleaning and annotation costs account for over 60% of the total project investment. All data sources are domestic and compliant, avoiding the "hallucinatory translations" of Chinese contexts found in international media (e.g., BBC).
- This project adheres to neutral and objective corpora, aiming to promote the universality of technology, human understanding, and cultural exchange.
### Commercial License Terms

- Non-commercial Use: Free
- Commercial Use: If you need to use the model in commercial scenarios (including enterprise products/services), the following fees apply:

| Enterprise Type | Perpetual License Fee (CNY) |
|-----------------|------------------------------|
| Startups or individuals (annual turnover below 1 million CNY) | 1,000 |
| Medium-sized enterprises (non-listed companies with annual turnover above 1 million CNY) | 5,000 |
| Listed companies | 20,000 |

After scanning the QR code for payment, your Hugging Face account will obtain commercial usage rights. Each enterprise is limited to binding one main account. The scope of commercial authorization includes the commercial use of Derivative Works, regardless of format conversion or architectural modification.

Payment Method: (payment QR code)
### Raw Data Access

To obtain the uncleaned raw datasets (including multimodal collections), pay 5,000 CNY via the QR code above and email support@opentech.cn.
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kongyang2023heackmt5zhsum100k,
  title={HeackMT5-ZhSum100k: A Large-Scale Multilingual Abstractive Summarization for Chinese Texts},
  author={Kong Yang},
  year={2023}
}
```