🚀 GPT2 Bangla Summarizer
This model is designed to summarize Bengali text, condensing longer passages into short summaries.
🚀 Quick Start
Because the model was trained primarily on newspaper data, it may not perform well when summarizing Bengali stories, dialogues, or excerpts.
```python
from transformers import GPT2LMHeadModel, AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")

model.to("cuda")

BEGIN_TOKEN = "<।summary_begin।>"
END_TOKEN = " <।summary_end।>"
BEGIN_TOKEN_ALT = "<।sum_begin।>"
END_TOKEN_ALT = " <।sum_end।>"
SUMMARY_TOKEN = "<।summary।>"


def processTxt(txt):
    # Normalize punctuation and whitespace before tokenization.
    txt = re.sub(r"।", "। ", txt)
    txt = re.sub(r",", ", ", txt)
    txt = re.sub(r"!", "। ", txt)
    txt = re.sub(r"\?", "। ", txt)
    txt = re.sub(r"\"", "", txt)
    txt = re.sub(r"'", "", txt)
    txt = re.sub(r"’", "", txt)
    txt = re.sub(r"‘", "", txt)
    txt = re.sub(r";", "। ", txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt


def index_of(val, in_text, after=0):
    # Return the index of `val` in `in_text`, or -1 if not found.
    try:
        return in_text.index(val, after)
    except ValueError:
        return -1


def summarize(txt):
    txt = processTxt(txt.strip())
    txt = BEGIN_TOKEN + txt + SUMMARY_TOKEN

    inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt")
    inputs.to("cuda")

    output = model.generate(
        inputs["input_ids"],
        max_length=len(txt) + 220,
        pad_token_id=tokenizer.eos_token_id,
    )

    txt = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    # Extract the text between the summary marker and the first end marker.
    start = index_of(SUMMARY_TOKEN, txt) + len(SUMMARY_TOKEN)
    print("Whole text completion: \n", txt)

    if start == len(SUMMARY_TOKEN) - 1:
        return "No Summary!"

    end = index_of(END_TOKEN, txt, start)
    if end == -1:
        end = index_of(END_TOKEN_ALT, txt, start)
    if end == -1:
        end = index_of(BEGIN_TOKEN, txt, start)
    if end == -1:
        return txt[start:].strip()

    txt = txt[start:end].strip()

    # Drop anything generated after a second summary marker.
    end = index_of(SUMMARY_TOKEN, txt)
    if end == -1:
        return txt
    else:
        return txt[:end].strip()


summarize('your_bengali_text')
```
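The Quick Start snippet assumes a CUDA-capable GPU. If you are running on CPU only, a small variation is to pick the device dynamically; this is a sketch, not part of the original example, and the `device` name is illustrative:

```python
import torch

# Use the GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Inside summarize(), move the tokenized inputs to the same device:
# inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt").to(device)
```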
✨ Features
- Text Summarization: Specifically designed to summarize Bengali text.
- Fine-tuned Model: Fine-tuned on Bengali news summarization datasets (BANSData and a Prothom Alo summarization dataset) for better performance on news text.
📦 Installation
To use this model, install the Hugging Face `transformers` library (the Quick Start example also requires PyTorch):

```bash
pip install transformers
```
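After installing, a minimal sanity check (not part of the original instructions) is to confirm that both the tokenizer and the fine-tuned weights download correctly:

```python
from transformers import GPT2LMHeadModel, AutoTokenizer

# The tokenizer comes from the base model; the weights come from the fine-tuned summarizer.
tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")

print(model.config.model_type)  # expected output: gpt2
```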
📚 Documentation
Model Description
flax-community/gpt2-bengali was fine-tuned on the following datasets:
- BANSData: A Dataset for Bengali Abstractive News Summarization
- Bangla Summarization Dataset (Prothom Alo)
| Property | Details |
|----------|---------|
| Developed by | Faridul Reza Sagor & Abdul Wadud Shakib |
| Model Type | GPT2LMHeadModel |
| Language(s) (NLP) | Bengali |
| Finetuned from model | flax-community/gpt2-bengali |
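The special tokens used in the Quick Start code hint at how inputs are framed for the model: the article is wrapped between a begin marker and a summary marker, and generation is expected to stop at an end marker. The exact training format is not documented here, so the snippet below is only an assumption inferred from those tokens.

```python
# Assumed prompt framing, inferred from the special tokens used at inference time.
article = "..."  # Bengali news text

prompt = "<।summary_begin।>" + article + "<।summary।>"
# The model is then expected to continue with the summary and, ideally,
# terminate it with " <।summary_end।>" (or the alternative " <।sum_end।>").
```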
Uses
⚠️ Important Note
As this model was mainly trained on data from newspapers, it is not good at summarizing Bengali stories, dialogues, or excerpts.
📄 Contact
If you have any questions or need further assistance, you can contact the developer at faridul.reza.sagor@gmail.com.