🚀 Bengali Text Summarization Model
This model summarizes Bengali text. It was created by fine-tuning a pretrained model on a Bengali news summarization dataset, and it handles summarization of Bengali news articles effectively.
🚀 Quick Start
This model is intended for summarizing Bengali text. Note that because it was trained primarily on newspaper data, it performs poorly when summarizing Bengali stories, dialogue, or literary excerpts.
```python
from transformers import GPT2LMHeadModel, AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")
model.to("cuda")

# Special tokens that delimit the article and its summary
BEGIN_TOKEN = "<।summary_begin।>"
END_TOKEN = " <।summary_end।>"
BEGIN_TOKEN_ALT = "<।sum_begin।>"
END_TOKEN_ALT = " <।sum_end।>"
SUMMARY_TOKEN = "<।summary।>"


def processTxt(txt):
    # Normalize punctuation and collapse whitespace before tokenization
    txt = re.sub(r"।", "। ", txt)
    txt = re.sub(r",", ", ", txt)
    txt = re.sub(r"!", "। ", txt)
    txt = re.sub(r"\?", "। ", txt)
    txt = re.sub(r"\"", "", txt)
    txt = re.sub(r"'", "", txt)
    txt = re.sub(r"’", "", txt)
    txt = re.sub(r"‘", "", txt)
    txt = re.sub(r";", "। ", txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt


def index_of(val, in_text, after=0):
    # Return the index of `val` in `in_text`, or -1 if it is absent
    try:
        return in_text.index(val, after)
    except ValueError:
        return -1


def summarize(txt):
    txt = processTxt(txt.strip())
    txt = BEGIN_TOKEN + txt + SUMMARY_TOKEN

    inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt")
    inputs.to("cuda")

    output = model.generate(
        inputs["input_ids"],
        max_length=len(txt) + 220,
        pad_token_id=tokenizer.eos_token_id,
    )
    txt = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    # index_of returns -1 when SUMMARY_TOKEN is missing,
    # so start then equals len(SUMMARY_TOKEN) - 1
    start = index_of(SUMMARY_TOKEN, txt) + len(SUMMARY_TOKEN)
    print("Whole text completion: \n", txt)

    if start == len(SUMMARY_TOKEN) - 1:
        return "No Summary!"

    # The summary ends at the first end token (or a stray begin token);
    # fall back to the remainder of the completion if none is found
    end = index_of(END_TOKEN, txt, start)
    if end == -1:
        end = index_of(END_TOKEN_ALT, txt, start)
    if end == -1:
        end = index_of(BEGIN_TOKEN, txt, start)
    if end == -1:
        return txt[start:].strip()

    txt = txt[start:end].strip()

    # Trim anything after a repeated summary token
    end = index_of(SUMMARY_TOKEN, txt)
    if end == -1:
        return txt
    return txt[:end].strip()


summarize('your_bengali_text')
```
✨ Key Features
- Targeted training: fine-tuned on a Bengali news summarization dataset, making it well suited to summarizing Bengali news text.
- Model type: uses `GPT2LMHeadModel`, which learns the language patterns and semantics of the text effectively.
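The special tokens in the inference code suggest how fine-tuning examples were likely framed. A minimal sketch of that framing follows; note this format is an assumption inferred from the tokens above, not something documented by the author, and the `build_training_example` helper is hypothetical:

```python
# Assumed fine-tuning format, inferred from the special tokens used at
# inference time; the actual training script may differ.
BEGIN_TOKEN = "<।summary_begin।>"
SUMMARY_TOKEN = "<।summary।>"
END_TOKEN = " <।summary_end।>"


def build_training_example(article: str, summary: str) -> str:
    # Article, then its reference summary, delimited by the special tokens
    return BEGIN_TOKEN + article + SUMMARY_TOKEN + summary + END_TOKEN


example = build_training_example("article text", "short summary")
print(example)
```

At inference time only the `BEGIN_TOKEN + article + SUMMARY_TOKEN` prefix is supplied, and the model is expected to continue with the summary up to an end token.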
📦 Model Details
💻 Usage Examples
Basic Usage

```python
summarize('your_bengali_text')
```
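The summary-extraction step of `summarize` can also be checked in isolation, without a GPU or model download. The sketch below mirrors its slicing logic with a hand-written completion string standing in for real model output; the `extract_summary` helper is illustrative only:

```python
# Same special tokens as in the Quick Start code
BEGIN_TOKEN = "<।summary_begin।>"
END_TOKEN = " <।summary_end।>"
SUMMARY_TOKEN = "<।summary।>"


def extract_summary(completion: str) -> str:
    # Mirrors the post-processing in summarize(): take the text between
    # the summary token and the end token (or the rest of the string).
    start = completion.find(SUMMARY_TOKEN)
    if start == -1:
        return "No Summary!"
    start += len(SUMMARY_TOKEN)
    end = completion.find(END_TOKEN, start)
    if end == -1:
        return completion[start:].strip()
    return completion[start:end].strip()


fake_completion = BEGIN_TOKEN + "some article" + SUMMARY_TOKEN + " a short summary" + END_TOKEN
print(extract_summary(fake_completion))  # a short summary
```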
📄 Contact
If you have any questions or suggestions, you can reach the developer at: faridul.reza.sagor@gmail.com