🚀 Hishab TituLM Llama 3.2-3B Model
This model is a continually pretrained version of Llama 3.2-3B with extended Bangla tokens, designed for high-quality Bangla text generation and language understanding.
🚀 Quick Start
Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.

Make sure to update your `transformers` installation via `pip install --upgrade transformers`.
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-3b-v2.0"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

pipe("আমাদের দেশের নাম")
```
✨ Features
- Continually pretrained on Bangla data with extended tokens to enhance Bangla text generation ability.
- Supports both Bengali (primary) and English (secondary) languages.
- Uses Grouped-Query Attention (GQA) for improved inference scalability.
📦 Installation
Ensure you have the `transformers` library installed. You can update it via the following command:

```bash
pip install --upgrade transformers
```
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-3b-v2.0"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

pipe("আমাদের দেশের নাম")
```
📚 Documentation
Model Information
This model is a continually pretrained version of the [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) architecture, extended with approximately 42K Bangla tokens and fine-tuned on extensive Bangla datasets. The primary goal of continual pretraining with token extension was to enhance the model's ability to generate high-quality Bangla text.
| Property | Details |
|----------|---------|
| Model Type | Llama 3.2, an auto-regressive language model with an optimized transformer architecture |
| Training Data | Hishab curated Bangla text corpus |
| Params | 3B (3.21B) |
| Input Modalities | Monolingual Text (Bangla) |
| Output Modalities | Monolingual Text (Bangla) |
| Context Length | 4096 |
| GQA | Yes |
| Shared Embeddings | Yes |
| Token Count | 37B tokens |
| Knowledge Cutoff | |
| Supported Languages | Bengali (primary) and English (secondary) |
| Model Release Date | October 24, 2024 |
| Status | A static model trained on an offline dataset. Future versions may be released to improve model capabilities |
| License | Similar to Llama 3.2, governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) |
| Paper | [TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking](https://arxiv.org/abs/2502.11187) |
Hardware and Software
We used the [llama-factory](https://github.com/hiyouga/LLaMA-Factory) training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
Training Data
We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered by different filtering criteria to ensure data quality. The collected data is roughly 268 GB in size, and the total number of trained tokens is 37B.
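The exact filtering criteria are not specified in this card. Purely as an illustration of the kind of quality filter commonly applied to raw Bangla web text, a minimal, hypothetical character-ratio check might look like the sketch below; the helper names and thresholds are assumptions, not the authors' pipeline.

```python
def bangla_char_ratio(text: str) -> float:
    """Fraction of characters in the Bengali Unicode block (U+0980-U+09FF)."""
    if not text:
        return 0.0
    bangla = sum(1 for ch in text if "\u0980" <= ch <= "\u09ff")
    return bangla / len(text)

def keep_document(text: str, min_ratio: float = 0.5, min_chars: int = 200) -> bool:
    # Hypothetical thresholds for illustration only; not the authors' actual criteria.
    return len(text) >= min_chars and bangla_char_ratio(text) >= min_ratio
```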
Token Extending
We trained a separate Bangla tokenizer using the Tiktoken library on a 48 GB Bangla dataset (sampled from the main pretraining data) with a vocabulary size of 48K, and set aside 42K of those tokens to add to the pretrained model. We extended the model's vocabulary with these tokens and continued the pretraining process on Bangla data. The updated vocabulary size is 170K, whereas the original Llama 3.2 vocabulary size is 128K.
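A quick way to see the effect of the vocabulary extension is to compare tokenizer sizes and how a Bangla sentence is segmented. This is a minimal sketch; the comparison assumes you have access to the base `meta-llama/Llama-3.2-3B` repository.

```python
from transformers import AutoTokenizer

# Extended TituLM tokenizer (~170K vocabulary).
titulm_tok = AutoTokenizer.from_pretrained("hishab/titulm-llama-3.2-3b-v2.0")
# Base Llama 3.2 tokenizer (~128K vocabulary); access to the gated repo is assumed.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

print("TituLM vocab size:", len(titulm_tok))
print("Base Llama 3.2 vocab size:", len(base_tok))

sentence = "আমাদের দেশের নাম বাংলাদেশ"
# The extended tokenizer should typically need fewer tokens for Bangla text.
print("TituLM tokens:", len(titulm_tok.tokenize(sentence)))
print("Base tokens:", len(base_tok.tokenize(sentence)))
```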
Benchmarks - Bangla Text
Evaluation Datasets
We evaluated our pretrained models on both Bangla and English benchmark datasets.
Bangla Benchmark Datasets:
- Bangla MMLU: A private multiple-choice question dataset developed by Hishab, curated from various sources.
- [CommonsenseQA Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset, translated using Expressive Semantic Translation (EST).
- [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset, translated using EST.
- [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset, translated using EST.
- BoolQ Bangla: Contains 15,942 examples, each a triplet of (question, passage, answer).
English Benchmark Datasets:
- MMLU: A massive multitask test with multiple-choice questions.
- CommonsenseQA: A multiple-choice question-answering dataset.
- OpenbookQA: Promotes research in advanced question - answering.
- Piqa: Focuses on physical commonsense reasoning.
- BoolQ: A question - answer dataset for yes/no questions.
Evaluation Results
Evaluation on Bangla Benchmark Datasets:
| Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
|-------|-------|-------------|----------|-------------------|----------------|---------|
| llama-3.2-3b | 0-shot | 0.36 | 0.55 | 0.26 | 0.31 | 0.56 |
| | 5-shot | 0.38 | - | 0.29 | 0.32 | 0.58 |
| titulm-llama-3.2-3b-v2.0 | 0-shot | 0.26 | 0.57 | 0.27 | 0.32 | 0.58 |
| | 5-shot | 0.24 | 0.59 | 0.33 | 0.34 | 0.60 |
Evaluation on English Benchmark Datasets:
| Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
|-------|-------|------|-------|----------------|-------------|------|
| llama-3.2-3b | 0-shot | 0.54 | 0.73 | 0.64 | 0.43 | 0.77 |
| | 5-shot | 0.56 | 0.74 | 0.67 | 0.45 | 0.80 |
| titulm-llama-3.2-3b-v2.0 | 0-shot | 0.24 | 0.49 | 0.20 | 0.22 | 0.57 |
| | 5-shot | 0.26 | 0.59 | 0.20 | 0.24 | 0.57 |
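The 0-shot and 5-shot settings above refer to how many solved examples are prepended to each test question. As a rough illustration only (not the authors' exact evaluation harness or prompt template), a few-shot multiple-choice prompt can be assembled like this:

```python
def build_prompt(question, choices, fewshot):
    """Assemble a multiple-choice prompt; the format is illustrative, not the authors' template."""
    lines = []
    for q, opts, answer in fewshot:  # 0 examples => 0-shot, 5 examples => 5-shot
        lines.append(f"প্রশ্ন: {q}")
        lines.extend(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(opts))
        lines.append(f"উত্তর: {answer}\n")
    lines.append(f"প্রশ্ন: {question}")
    lines.extend(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(choices))
    lines.append("উত্তর:")
    return "\n".join(lines)
```

Scoring then typically compares the model's likelihood of each candidate answer given such a prompt.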
Instruction Tuned Models
No detailed information provided in the original document.
Intended Use
- Bangla text generation
- Bangla language understanding tasks
- Bangla instruction fine-tuning tasks
🔧 Technical Details
The model is based on the Llama 3.2 architecture. Continual pretraining with extended Bangla tokens is used to enhance its performance on Bangla language tasks, and Grouped-Query Attention (GQA) helps improve inference scalability.
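To see the GQA setup concretely, you can read the key/value head count from the published model config without downloading the weights. This is a minimal sketch; the printed values depend on the config hosted with the model and are not restated here.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("hishab/titulm-llama-3.2-3b-v2.0")

# With GQA, several query heads share one key/value head,
# so num_key_value_heads is smaller than num_attention_heads.
print("Query heads:        ", config.num_attention_heads)
print("Key/value heads:    ", config.num_key_value_heads)
print("Vocab size:         ", config.vocab_size)       # extended Bangla vocabulary
print("Max position embeds:", config.max_position_embeddings)
```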
📄 License
We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
Citation
```bibtex
@misc{nahin2025titullmsfamilybanglallms,
      title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
      author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
      year={2025},
      eprint={2502.11187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11187},
}
```