🚀 GPT2 Bangla Summarizer
This model is designed to summarize Bengali text, condensing longer passages into short summaries.
🚀 Quick Start
Because the model was trained primarily on newspaper data, it may not perform well when summarizing Bengali stories, dialogues, or excerpts.
```python
from transformers import GPT2LMHeadModel, AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")

model.to("cuda")

BEGIN_TOKEN = "<।summary_begin।>"
END_TOKEN = " <।summary_end।>"
BEGIN_TOKEN_ALT = "<।sum_begin।>"
END_TOKEN_ALT = " <।sum_end।>"
SUMMARY_TOKEN = "<।summary।>"


def processTxt(txt):
    # Normalize punctuation and whitespace before tokenization.
    txt = re.sub(r"।", "। ", txt)
    txt = re.sub(r",", ", ", txt)
    txt = re.sub(r"!", "। ", txt)
    txt = re.sub(r"\?", "। ", txt)
    txt = re.sub(r"\"", "", txt)
    txt = re.sub(r"'", "", txt)
    txt = re.sub(r"’", "", txt)
    txt = re.sub(r"‘", "", txt)
    txt = re.sub(r";", "। ", txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt


def index_of(val, in_text, after=0):
    # Return the index of `val` in `in_text`, or -1 if not found.
    try:
        return in_text.index(val, after)
    except ValueError:
        return -1


def summarize(txt):
    txt = processTxt(txt.strip())
    txt = BEGIN_TOKEN + txt + SUMMARY_TOKEN

    inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt")
    inputs.to("cuda")

    output = model.generate(
        inputs["input_ids"],
        max_length=len(txt) + 220,
        pad_token_id=tokenizer.eos_token_id,
    )

    txt = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    # Extract the text between the summary marker and the first end marker.
    start = index_of(SUMMARY_TOKEN, txt) + len(SUMMARY_TOKEN)
    print("Whole text completion: \n", txt)

    if start == len(SUMMARY_TOKEN) - 1:
        return "No Summary!"

    end = index_of(END_TOKEN, txt, start)
    if end == -1:
        end = index_of(END_TOKEN_ALT, txt, start)
    if end == -1:
        end = index_of(BEGIN_TOKEN, txt, start)
    if end == -1:
        return txt[start:].strip()

    txt = txt[start:end].strip()

    # Drop anything generated after a second summary marker.
    end = index_of(SUMMARY_TOKEN, txt)
    if end == -1:
        return txt
    else:
        return txt[:end].strip()


summarize('your_bengali_text')
```
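The Quick Start snippet assumes a CUDA-capable GPU. If you are running on CPU only, a small variation is to pick the device dynamically; this is a sketch, not part of the original example, and the `device` name is illustrative:

```python
import torch

# Use the GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Inside summarize(), move the tokenized inputs to the same device:
# inputs = tokenizer(txt, max_length=800, truncation=True, return_tensors="pt").to(device)
```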
✨ Features
- Text Summarization: Specifically designed to summarize Bengali text.
- Fine-tuned Model: Fine-tuned on Bengali news summarization datasets (BANSData and a Prothom Alo summarization dataset) for better performance on news text.
📦 Installation
To use this model, install the Hugging Face `transformers` library (the Quick Start example also requires PyTorch):

```bash
pip install transformers
```
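After installing, a minimal sanity check (not part of the original instructions) is to confirm that both the tokenizer and the fine-tuned weights download correctly:

```python
from transformers import GPT2LMHeadModel, AutoTokenizer

# The tokenizer comes from the base model; the weights come from the fine-tuned summarizer.
tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt2-bengali")
model = GPT2LMHeadModel.from_pretrained("faridulreza/gpt2-bangla-summurizer")

print(model.config.model_type)  # expected output: gpt2
```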
📚 Documentation
Model Description
flax-community/gpt2-bengali was fine-tuned on the following datasets:
- BANSData: A Dataset for Bengali Abstractive News Summarization
- Bangla Summarization Dataset (Prothom Alo)
| Property | Details |
|----------|---------|
| Developed by | Faridul Reza Sagor & Abdul Wadud Shakib |
| Model Type | GPT2LMHeadModel |
| Language(s) (NLP) | Bengali |
| Finetuned from model | flax-community/gpt2-bengali |
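The special tokens used in the Quick Start code hint at how inputs are framed for the model: the article is wrapped between a begin marker and a summary marker, and generation is expected to stop at an end marker. The exact training format is not documented here, so the snippet below is only an assumption inferred from those tokens.

```python
# Assumed prompt framing, inferred from the special tokens used at inference time.
article = "..."  # Bengali news text

prompt = "<।summary_begin।>" + article + "<।summary।>"
# The model is then expected to continue with the summary and, ideally,
# terminate it with " <।summary_end।>" (or the alternative " <।sum_end।>").
```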
Uses
⚠️ Important Note
As this model was mainly trained on data from newspapers, it is not good at summarizing Bengali stories, dialogues, or excerpts.
📄 Contact
If you have any questions or need further assistance, you can contact the developer at faridul.reza.sagor@gmail.com.