IndicBART-XLSum
IndicBART-XLSum is a pre-trained sequence-to-sequence model for Indian languages, based on the multilingual, independent-script variant of IndicBART (IndicBARTSS).
Downloads 290
Release Date: 5/11/2022
Model Overview
This model supports 7 Indian languages, is based on the mBART architecture, and is primarily used for text summarization tasks.
Model Features
Multilingual Support
Supports 7 Indian languages, not all of which are supported by mBART50 and mT5.
High Computational Efficiency
The model is much smaller than mBART and mT5 (base versions), resulting in lower computational costs during fine-tuning and decoding.
Independent Script Processing
Each language uses its own script without requiring any script mapping to Devanagari.
Model Capabilities
Multilingual Text Summarization
Sequence-to-Sequence Generation
Use Cases
News Summarization
Indian Language News Summarization
Automatically generate summaries for news articles in Indian languages.
🚀 IndicBART-XLSum
IndicBART-XLSum is a multilingual pre-trained sequence-to-sequence model based on IndicBART. It focuses on Indic languages, currently supports 7 Indian languages, and is built on the mBART architecture.
✨ Features
- Supported Languages: Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, and Telugu. Some of these languages are not supported by mBART50 and mT5.
- Lightweight: The model is much smaller than the mBART and mT5(-base) models, resulting in lower computational costs for fine-tuning and decoding.
- Training Data: Trained on the Indic portion of the XLSum corpora.
- Script Independence: Each language is written in its own script, eliminating the need for script mapping to/from Devanagari.
You can read more about IndicBARTSS in the IndicBART paper (https://arxiv.org/abs/2109.02903).
🚀 Quick Start
Installation
You need the transformers library. If it isn't installed already, install it with:
pip install transformers
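This model's tokenizer is SentencePiece-based (see the usage tips below), so you will most likely also need the sentencepiece package alongside transformers:
pip install sentencepiece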
Usage
from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
from transformers import AlbertTokenizer, AutoTokenizer
tokenizer = AlbertTokenizer.from_pretrained("ai4bharat/IndicBART-XLSum", do_lower_case=False, use_fast=False, keep_accents=True)
# Or use tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBART-XLSum", do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART-XLSum")
# Or use model = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART-XLSum")
# Some initial mapping
bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
eos_id = tokenizer._convert_token_to_id_with_added_voc("</s>")
pad_id = tokenizer._convert_token_to_id_with_added_voc("<pad>")
# To get lang_id use any of ['<2bn>', '<2gu>', '<2hi>', '<2mr>', '<2pa>', '<2ta>', '<2te>']
# First tokenize the input and outputs. The format below is how IndicBART-XLSum was trained, so the input should be "Sentence </s> <2xx>" where xx is the language code. Similarly, the output should be "<2yy> Sentence </s>".
inp = tokenizer("рдЯреЗрд╕рд╛ рдЬреЙрд╡рд▓ рдХрд╛ рдХрд╣рдирд╛ рд╣реИ рдХрд┐ рдореГрддрдХреЛрдВ рдФрд░ рд▓рд╛рдкрддрд╛ рд▓реЛрдЧреЛрдВ рдХреЗ рдкрд░рд┐рдЬрдиреЛрдВ рдХреА рдорджрдж рдХреЗ рд▓рд┐рдП рдПрдХ рдХреЗрдВрджреНрд░ рд╕реНрдерд╛рдкрд┐рдд рдХрд┐рдпрд╛ рдЬрд╛ рд░рд╣рд╛ рд╣реИ. рдЙрдиреНрд╣реЛрдВрдиреЗ рдЗрд╕ рд╣рд╛рджрд╕реЗ рдХреЗ рддреАрди рдХреЗ рдмрд╛рдж рднреА рдореГрддрдХреЛрдВ рдХреА рд╕реВрдЪреА рдЬрд╛рд░реА рдХрд░рдиреЗ рдореЗрдВ рд╣реЛ рд░рд╣реА рджреЗрд░реА рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рд╕реНрдкрд╖реНрдЯреАрдХрд░рдг рджреЗрддреЗ рд╣реБрдП рдХрд╣рд╛ рд╣реИ рд╢рд╡реЛрдВ рдХреА рдареАрдХ рдкрд╣рдЪрд╛рди рд╣реЛрдирд╛ рдЬрд╝рд░реВрд░реА рд╣реИ. рдкреБрд▓рд┐рд╕ рдХреЗ рдЕрдиреБрд╕рд╛рд░ рдзрдорд╛рдХреЛрдВ рдореЗрдВ рдорд╛рд░реЗ рдЧрдП рд▓реЛрдЧреЛрдВ рдХреА рд╕рдВрдЦреНрдпрд╛ рдЕрдм 49 рд╣реЛ рдЧрдИ рд╣реИ рдФрд░ рдЕрдм рднреА 20 рд╕реЗ рдЬрд╝реНрдпрд╛рджрд╛ рд▓реЛрдЧ рд▓рд╛рдкрддрд╛ рд╣реИрдВ. рдкреБрд▓рд┐рд╕ рдХреЗ рдЕрдиреБрд╕рд╛рд░ рд▓рдВрджрди рдкрд░ рд╣рдорд▓реЗ рдпреЛрдЬрдирд╛рдмрджреНрдз рддрд░реАрдХреЗ рд╕реЗ рд╣реБрдП рдФрд░ рднреВрдорд┐рдЧрдд рд░реЗрд▓рдЧрд╛рдбрд╝рд┐рдпреЛрдВ рдореЗрдВ рд╡рд┐рд╕реНрдлреЛрдЯ рддреЛ 50 рд╕реИрдХреЗрдВрдб рдХреЗ рднреАрддрд░ рд╣реБрдП. рдкрд╣рдЪрд╛рди рдХреА рдкреНрд░рдХреНрд░рд┐рдпрд╛ рдХрд┐рдВрдЧреНрд╕ рдХреНрд░реЙрд╕ рд╕реНрдЯреЗрд╢рди рдХреЗ рдкрд╛рд╕ рд╕реБрд░рдВрдЧ рдореЗрдВ рдзрдорд╛рдХреЗ рд╕реЗ рдХреНрд╖рддрд┐рдЧреНрд░рд╕реНрдд рд░реЗрд▓ рдХреЛрдЪреЛрдВ рдореЗрдВ рдЕрдм рднреА рдкрдбрд╝реЗ рд╢рд╡реЛрдВ рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рд╕реНрдерд┐рддрд┐ рд╕рд╛рдл рдирд╣реАрдВ рд╣реИ рдФрд░ рдкреБрд▓рд┐рд╕ рдиреЗ рдЖрдЧрд╛рд╣ рдХрд┐рдпрд╛ рд╣реИ рдХрд┐ рд╣рддрд╛рд╣рддреЛрдВ рдХреА рд╕рдВрдЦреНрдпрд╛ рдмрдврд╝ рд╕рдХрддреА рд╣реИ. рдкреБрд▓рд┐рд╕, рдиреНрдпрд╛рдпрд┐рдХ рдЕрдзрд┐рдХрд╛рд░рд┐рдпреЛрдВ, рдЪрд┐рдХрд┐рддреНрд╕рдХреЛрдВ рдФрд░ рдЕрдиреНрдп рд╡рд┐рд╢реЗрд╖рдЬреНрдЮреЛрдВ рдХрд╛ рдПрдХ рдЖрдпреЛрдЧ рдмрдирд╛рдпрд╛ рдЧрдпрд╛ рд╣реИ рдЬрд┐рд╕рдХреА рджреЗрдЦ-рд░реЗрдЦ рдореЗрдВ рд╢рд╡реЛрдВ рдХреА рдкрд╣рдЪрд╛рди рдХреА рдкреНрд░рдХреНрд░рд┐рдпрд╛ рдкреВрд░реА рд╣реЛрдЧреА. рдорд╣рддреНрд╡рдкреВрд░реНрдг рд╣реИ рдХрд┐ рдЧреБрд░реБрд╡рд╛рд░ рдХреЛ рд▓рдВрджрди рдореЗрдВ рдореГрддрдХреЛрдВ рдХреЗ рд╕рдореНрдорд╛рди рдореЗрдВ рд╕рд╛рд░реНрд╡рдЬрдирд┐рдХ рд╕рдорд╛рд░реЛрд╣ рд╣реЛрдЧрд╛ рдЬрд┐рд╕рдореЗрдВ рдЙрдиреНрд╣реЗрдВ рд╢реНрд░рджреНрдзрд╛рдБрдЬрд▓рд┐ рджреА рдЬрд╛рдПрдЧреА рдФрд░ рджреЛ рдорд┐рдирдЯ рдХрд╛ рдореМрди рд░рдЦрд╛ рдЬрд╛рдПрдЧрд╛. рдкреБрд▓рд┐рд╕ рдХрд╛ рдХрд╣рдирд╛ рд╣реИ рдХрд┐ рд╡рд╣ рдЗрд╕реНрд▓рд╛рдореА рдЪрд░рдордкрдВрдереА рд╕рдВрдЧрдарди рдЕрдмреВ рд╣рдлрд╝реНрд╕ рдЕрд▓-рдорд╛рд╕рд░реА рдмреНрд░рд┐рдЧреЗрдбреНрд╕ рдХрд╛ рдЗрди рдзрдорд╛рдХреЛрдВ рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдХрд┐рдП рдЧрдП рджрд╛рд╡реЗ рдХреЛ рдЧрдВрднреАрд░рддрд╛ рд╕реЗ рд▓реЗ рд░рд╣реА рд╣реИ. 'рдзрдорд╛рдХреЗ рдкрдЪрд╛рд╕ рд╕реЗрдХреЗрдВрдб рдореЗрдВ рд╣реБрдП' рдкреБрд▓рд┐рд╕ рдХреЗ рдЕрдиреБрд╕рд╛рд░ рд▓рдВрджрди рдкрд░ рд╣реБрдП рд╣рдорд▓реЗ рдпреЛрдЬрдирд╛рдмрджреНрдз рддрд░реАрдХреЗ рд╕реЗ рдХрд┐рдП рдЧрдП рдереЗ. рдкреБрд▓рд┐рд╕ рдХреЗ рдЕрдиреБрд╕рд╛рд░ рднреВрдорд┐рдЧрдд рд░реЗрд▓реЛрдВ рдореЗрдВ рддреАрди рдмрдо рдЕрд▓рдЧ-рдЕрд▓рдЧ рдЬрдЧрд╣реЛрдВ рд▓рдЧрднрдЧ рдЕрдЪрд╛рдирдХ рдлрдЯреЗ рдереЗ. рдЗрд╕рд▓рд┐рдП рдкреБрд▓рд┐рд╕ рдХреЛ рд╕рдВрджреЗрд╣ рд╣реИ рдХрд┐ рдзрдорд╛рдХреЛрдВ рдореЗрдВ рдЯрд╛рдЗрдорд┐рдВрдЧ рдЙрдкрдХрд░рдгреЛрдВ рдХрд╛ рдЙрдкрдпреЛрдЧ рдХрд┐рдпрд╛ рдЧрдпрд╛ рд╣реЛрдЧрд╛. 
рдпрд╣ рднреА рддрдереНрдп рд╕рд╛рдордиреЗ рдЖрдпрд╛ рд╣реИ рдХрд┐ рдзрдорд╛рдХреЛрдВ рдореЗрдВ рдЖрдзреБрдирд┐рдХ рдХрд┐рд╕реНрдо рдХреЗ рд╡рд┐рд╕реНрдлреЛрдЯрдХреЛрдВ рдХрд╛ рдЙрдкрдпреЛрдЧ рдХрд┐рдпрд╛ рдЧрдпрд╛ рдерд╛. рдкрд╣рд▓реЗ рдорд╛рдирд╛ рдЬрд╛ рд░рд╣рд╛ рдерд╛ рдХрд┐ рд╣рдорд▓реЛрдВ рдореЗрдВ рджреЗрд╕реА рд╡рд┐рд╕реНрдлреЛрдЯрдХреЛрдВ рдХрд╛ рдЗрд╕реНрддреЗрдорд╛рд▓ рдХрд┐рдпрд╛ рдЧрдпрд╛ рд╣реЛрдЧрд╛. рдкреБрд▓рд┐рд╕ рдореБрдЦреНрдпрд╛рд▓рдп рд╕реНрдХреЙрдЯрд▓реИрдВрдб рдпрд╛рд░реНрдб рдореЗрдВ рд╕рд╣рд╛рдпрдХ рдЙрдкрд╛рдпреБрдХреНрдд рдмреНрд░рд╛рдпрди рдкреИрдбрд┐рдХ рдиреЗ рдмрддрд╛рдпрд╛ рдХрд┐ рднреВрдорд┐рдЧрдд рд░реЗрд▓реЛрдВ рдореЗрдВ рддреАрди рдзрдорд╛рдХреЗ 50 рд╕реЗрдХреЗрдВрдб рдХреЗ рдЕрдВрддрд░рд╛рд▓ рдХреЗ рднреАрддрд░ рд╣реБрдП рдереЗ. рдзрдорд╛рдХреЗ рдЧреБрд░реБрд╡рд╛рд░ рд╕реБрдмрд╣ рдЖрда рдмрдЬрдХрд░ рдкрдЪрд╛рд╕ рдорд┐рдирдЯ рдкрд░ рд╣реБрдП рдереЗ. рд▓рдВрджрди рдЕрдВрдбрд░рдЧреНрд░рд╛рдЙрдВрдб рд╕реЗ рдорд┐рд▓реА рд╡рд┐рд╕реНрддреГрдд рддрдХрдиреАрдХреА рд╕реВрдЪрдирд╛рдУрдВ рд╕реЗ рдпрд╣ рддрдереНрдп рд╕рд╛рдордиреЗ рдЖрдпрд╛ рд╣реИ. рдЗрд╕рд╕реЗ рдкрд╣рд▓реЗ рдмрдо рдзрдорд╛рдХреЛрдВ рдореЗрдВ рдЕрдЪреНрдЫреЗ рдЦрд╛рд╕реЗ рдЕрдВрддрд░рд╛рд▓ рдХреА рдмрд╛рдд рдХреА рдЬрд╛ рд░рд╣реА рдереЗ.</s> <2hi>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids
out = tokenizer("<2hi>рдкрд░рд┐рдЬрдиреЛрдВ рдХреА рдорджрдж рдХреА рдЬрд╝рд┐рдореНрдореЗрджрд╛рд░реА рдордВрддреНрд░реА рдкрд░ </s>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids
model_outputs=model(input_ids=inp, decoder_input_ids=out[:,0:-1], labels=out[:,1:])
# For loss
model_outputs.loss ## This is not label smoothed.
# For logits
model_outputs.logits
# For generation. Note the decoder_start_token_id: it is the target-language tag.
model.eval() # Set dropouts to zero
model_output=model.generate(inp, use_cache=True, num_beams=4, max_length=20, min_length=1, early_stopping=True, pad_token_id=pad_id, bos_token_id=bos_id, eos_token_id=eos_id, decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2hi>"))
# Decode to get output strings
decoded_output=tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded_output) # लंदन धमाकों में मारे गए लोगों की सूची जारी
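For convenience, the tokenize-generate-decode steps above can be folded into a small helper. The sketch below is purely illustrative (the summarize name and its defaults are this document's assumption, not part of the model's API); it reuses the tokenizer, model, and special-token ids defined above and follows the "Sentence </s> <2xx>" input format described in the comments.
# Illustrative helper (an assumption of this document, not an official API):
# wraps the tokenize -> generate -> decode steps shown above.
def summarize(text, lang_code="hi", max_length=40):
    # Input format used during training: "Sentence </s> <2xx>".
    inp = tokenizer(f"{text} </s> <2{lang_code}>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids
    # Decoding starts from the target-language tag "<2yy>".
    out = model.generate(inp, use_cache=True, num_beams=4, max_length=max_length, min_length=1, early_stopping=True, pad_token_id=pad_id, bos_token_id=bos_id, eos_token_id=eos_id, decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc(f"<2{lang_code}>"))
    return tokenizer.decode(out[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
# Example: print(summarize(hindi_article, lang_code="hi"))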
📚 Documentation
Benchmarks
Scores on the IndicBART-XLSum test sets are as follows:
| Language | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| bn | 0.172331 | 0.051777 | 0.160245 |
| gu | 0.143240 | 0.039993 | 0.133981 |
| hi | 0.220394 | 0.065464 | 0.198816 |
| mr | 0.172568 | 0.062591 | 0.160403 |
| pa | 0.218274 | 0.066087 | 0.192010 |
| ta | 0.177317 | 0.058636 | 0.166324 |
| te | 0.156386 | 0.041042 | 0.144179 |
| average | 0.180073 | 0.055084 | 0.165137 |
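Scores of this kind can be reproduced only approximately; the sketch below is an assumption of this document rather than the authors' evaluation setup. It assumes the Hugging Face datasets and evaluate libraries, the csebuetnlp/xlsum dataset id for the XLSum corpora, and the summarize helper sketched above. Note that the default ROUGE tokenizer is English-oriented, so scores on Indic scripts may differ from those reported in the table.
# Hedged sketch: approximate ROUGE evaluation on a small Hindi XLSum sample.
# Assumptions (not from the model card): dataset id "csebuetnlp/xlsum" and
# evaluate's default ROUGE implementation.
from datasets import load_dataset
import evaluate

test = load_dataset("csebuetnlp/xlsum", "hindi", split="test")
rouge = evaluate.load("rouge")

preds, refs = [], []
for ex in test.select(range(100)):  # small sample, for illustration only
    preds.append(summarize(ex["text"], lang_code="hi"))  # helper sketched above
    refs.append(ex["summary"])

print(rouge.compute(predictions=preds, references=refs))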
💡 Usage Tips
- This model is compatible with the latest version of transformers, but it was developed with version 4.3.2. Consider using 4.3.2 if possible.
- While the example above only shows how to get the loss, the logits, and generated outputs, you can do nearly everything that the MBartForConditionalGeneration class supports, as described at https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartForConditionalGeneration.
- The tokenizer used here is based on SentencePiece, not BPE, which is why the AlbertTokenizer class is used instead of the MBartTokenizer class.