mT5_m2o_hindi_crossSum Open-source Abstractive Summarization Model - Free Multilingual Text Summarization into Hindi

mT5_m2o_hindi_crossSum

Developed by csebuetnlp

An mT5 many-to-one summarization model fine-tuned on the CrossSum dataset. It summarizes text written in multiple languages into Hindi.

Text Generation

Transformers

Supports Multiple Languages #Multilingual Summarization #Hindi Summary Output #Cross-lingual Conversion

Release Time: 4/20/2022

Model Overview

This is a multilingual text summarization model based on the mT5 architecture and fine-tuned to generate Hindi summaries from input text in many languages. It accepts input in languages such as Chinese, English, and French, and produces an abstractive summary in Hindi.

Model Features

Multilingual Support

Accepts input text in more than 40 languages, covering a wide range of international content

Cross-lingual Summarization

Can summarize source texts in different languages into Hindi, enabling cross-lingual information processing

Based on mT5 Architecture

Built on mT5, the multilingual variant of the T5 encoder-decoder (text-to-text) Transformer

Model Capabilities

Multilingual Text Comprehension

Cross-lingual Summarization

Hindi Text Generation

Use Cases

News Media

International News Summarization

Summarize international news in various languages into Hindi, making it easier for Hindi readers to quickly grasp global events

Enhances efficiency for Hindi readers in accessing international information

Academic Research

Multilingual Paper Summarization

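Summarize academic papers published in different languages into Hindi to facilitate knowledge dissemination. The usage example truncates input to 512 tokens, so a full paper would have to be shortened or summarized section by section.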

Helps Hindi-speaking researchers quickly understand international academic progress

🚀 mT5-m2o-hindi-CrossSum

This repository hosts the many-to-one (m2o) mT5 checkpoint fine-tuned on all cross-lingual pairs of the CrossSum dataset whose target summary is in Hindi. In other words, the model aims to summarize text written in any of the source languages into Hindi. For detailed fine-tuning information and scripts, refer to the paper and the official repository.

✨ Features

Supports summarization from multiple languages including Amharic, Arabic, Azerbaijani, Bengali, Burmese, Chinese, English, French, Gujarati, Hausa, Hindi, Igbo, Indonesian, Japanese, Kinyarwanda, Korean, Kyrgyz, Marathi, Nepali, Oromo, Pashto, Persian, Nigerian Pidgin, Portuguese, Punjabi, Russian, Scottish Gaelic, Serbian, Sinhala, Somali, Spanish, Swahili, Tamil, Telugu, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, and Yoruba into Hindi.
Licensed under CC BY-NC-SA 4.0.

📦 Installation

The model runs through the Hugging Face transformers library, which can be installed with pip install transformers. The usage example below also requires PyTorch, since it requests PyTorch tensors (return_tensors="pt"); the mT5 tokenizer may additionally need the sentencepiece package.
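
Before downloading the checkpoint, it can help to confirm that the required packages are importable. The following is a minimal sketch, not part of the original README. The helper name `missing_packages` is hypothetical, and the inclusion of `sentencepiece` in the list is an assumption based on common mT5 tokenizer setups.

```python
import importlib.util

# Packages the usage example relies on. "torch" is implied by
# return_tensors="pt"; "sentencepiece" is an assumption (commonly needed by
# mT5 tokenizers), not something the original README states.
REQUIRED = ("transformers", "torch", "sentencepiece")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```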

💻 Usage Examples

Basic Usage

import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Collapse newlines and runs of whitespace into single spaces before tokenizing.
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

article_text = """Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company said.  The policy includes the termination of accounts of anti-vaccine influencers.  Tech giants have been criticised for not doing more to counter false health information on their sites.  In July, US President Joe Biden said social media platforms were largely responsible for people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue.  YouTube, which is owned by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation about Covid vaccines.  In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B.  "We're expanding our medical misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and effective by local health authorities and the WHO," the post said, referring to the World Health Organization."""

model_name = "csebuetnlp/mT5_m2o_hindi_crossSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the cleaned article; input beyond 512 tokens is truncated.
input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

# Generate the summary with beam search (4 beams), capped at 84 tokens.
output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

# Decode the generated token IDs back into text.
summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)
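
The example above summarizes a single article. As a hedged sketch that is not from the original README, the same steps can be wrapped in a helper that summarizes several articles in one batch. The names `normalize_whitespace`, `load_model`, and `summarize_batch` are hypothetical and introduced only for illustration; the generation settings are copied from the example. Unlike the example, this sketch pads to the longest article in the batch and passes the attention mask to `generate`. That is an assumption about sensible batching, not the authors' documented setup.

```python
import re

MODEL_NAME = "csebuetnlp/mT5_m2o_hindi_crossSum"


def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text.strip())


def load_model(model_name=MODEL_NAME):
    """Load the tokenizer and model; transformers is imported lazily here."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def summarize_batch(texts, tokenizer, model,
                    max_input_length=512, max_summary_length=84):
    """Return one summary string per input text, in input order."""
    encoded = tokenizer(
        [normalize_whitespace(t) for t in texts],
        return_tensors="pt",
        padding=True,        # pad to the longest article in the batch (assumption)
        truncation=True,     # cut inputs longer than max_input_length tokens
        max_length=max_input_length,
    )
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],  # marks padding positions
        max_length=max_summary_length,
        no_repeat_ngram_size=2,
        num_beams=4,
    )
    return tokenizer.batch_decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
```

Once the checkpoint is available locally, `tokenizer, model = load_model()` followed by `summarize_batch([article_text], tokenizer, model)` should return a one-element list. Because the padding strategy differs from the example, the output is not guaranteed to match it word for word.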

📚 Documentation

The model is based on the mT5 architecture and is fine-tuned on the CrossSum dataset for cross-lingual summarization. For further details, see the CrossSum paper listed under Citation.

📄 License

This model is licensed under the CC BY-NC-SA 4.0 license.

📖 Citation

If you use this model, please cite the following paper:

@article{hasan2021crosssum,
  author    = {Tahmid Hasan and Abhik Bhattacharjee and Wasi Uddin Ahmad and Yuan-Fang Li and Yong-bin Kang and Rifat Shahriyar},
  title     = {CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs},
  journal   = {CoRR},
  volume    = {abs/2112.08804},
  year      = {2021},
  url       = {https://arxiv.org/abs/2112.08804},
  eprinttype = {arXiv},
  eprint    = {2112.08804}
}

📋 Information Table

Property	Details
Model Type	mT5 many-to-one (m2o) checkpoint for cross-lingual summarization
Training Data	CrossSum dataset
Supported Languages	Amharic, Arabic, Azerbaijani, Bengali, Burmese, Chinese, English, French, Gujarati, Hausa, Hindi, Igbo, Indonesian, Japanese, Kinyarwanda, Korean, Kyrgyz, Marathi, Nepali, Oromo, Pashto, Persian, Nigerian Pidgin, Portuguese, Punjabi, Russian, Scottish Gaelic, Serbian, Sinhala, Somali, Spanish, Swahili, Tamil, Telugu, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yoruba
License	CC BY-NC-SA 4.0
