🚀 TituLM Llama 3.2-1B Model
This project presents a continually pre-trained model based on Llama 3.2-1B, fine-tuned on extensive Bangla datasets to enhance Bangla text generation and understanding capabilities.
🚀 Quick Start
Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your `transformers` installation via `pip install --upgrade transformers`.
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-1b-v1.1"

# Build a text-generation pipeline in bfloat16, placing weights automatically
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Bangla prompt: "The name of our country"
pipe("আমাদের দেশের নাম")
```
✨ Features
- Continually pre-trained from the Llama 3.2-1B base model for improved Bangla text generation.
- Supports both Bengali (primary) and English (secondary) languages.
- Uses Grouped-Query Attention (GQA) for improved inference scalability.
📦 Installation
Ensure you have `transformers >= 4.43.0` installed. You can update it via:

```bash
pip install --upgrade transformers
```
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-1b-v1.1"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

pipe("আমাদের দেশের নাম")
```
📚 Documentation
Model Information
This model is a continually pre-trained version of the meta-llama/Llama-3.2-1B architecture, fine-tuned on extensive Bangla datasets. The main goal is to enhance the model's ability to generate high-quality Bangla text.
| Property | Details |
|---|---|
| Model Type | Llama 3.2 (auto-regressive language model with optimized transformer architecture) |
| Training Data | Hishab curated Bangla text corpus |
| Params | 1B (1.23B) |
| Input modalities | Monolingual text (Bangla) |
| Output modalities | Monolingual text (Bangla) |
| Context Length | 4096 |
| GQA | Yes |
| Shared Embeddings | Yes |
| Token count | 8.5B tokens |
| Knowledge cutoff | N/A |
Supported Languages: Bengali (primary) and English (secondary)
Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
Model Release Date: October 24, 2024
Status: This is a static model trained on an offline dataset. Future versions may be released to improve model capabilities.
License: We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).
More information can be found in the paper TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking and on the project page.
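To get a feel for how the tokenizer handles Bangla input (useful when budgeting prompts against the 4096-token context length above), you can inspect token counts directly. This is a small illustrative check, not part of the official evaluation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hishab/titulm-llama-3.2-1b-v1.1")

# Count tokens for a short Bangla sentence ("The name of our country is Bangladesh")
text = "আমাদের দেশের নাম বাংলাদেশ"
token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(f"{len(token_ids)} tokens for {len(text)} characters")
```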
Hardware and Software
Training Factors: We used the llama-factory training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
Training Data
Overview: We collected a large raw Bangla text dataset from various sources, including web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset was cleaned and filtered to ensure data quality, yielding roughly 268 GB of text. From this, we separated 33 GB of data for this training run, corresponding to a total of 8.5B trained tokens.
Data sources summary:
- Web documents: Extracted, cleaned, and filtered Common Crawl data
- Books: Extracted, cleaned, and filtered book data
- Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
- Translation data: Trained an English-Bangla translation LLM model and used it to translate English data to Bangla
- Code-mixed data: Trained an English-Bangla code-mixed LLM model and used it to generate code-mixed data
- Transliteration data: Trained a Bangla-English transliteration LLM model and used it to generate transliterated data
- Synthetic data: Generated synthetic data using a Bangla LLM model
- Others: Scraped selected websites, open-source datasets, and other miscellaneous sources
Benchmarks
Evaluation Datasets
We evaluated our pre-trained models on both Bangla and English benchmark datasets.
Bangla Benchmark datasets:
- Bangla MMLU: A private multiple-choice question dataset developed by Hishab, curated from various sources.
- CommonsenseQa Bangla: A Bangla translation of the CommonsenseQA dataset, translated using Expressive Semantic Translation (EST).
- OpenbookQA Bangla: A Bangla translation of the OpenbookQA dataset, translated using Expressive Semantic Translation (EST).
- Piqa Bangla: A Bangla translation of the Piqa dataset, translated using Expressive Semantic Translation (EST).
- BoolQ Bangla: Contains 15,942 examples, with each entry consisting of a triplet: (question, passage, answer).
English Benchmark datasets:
- MMLU: A massive multitask test with multiple-choice questions from various knowledge branches.
- CommonsenseQA: A multiple-choice question-answering dataset requiring commonsense knowledge.
- OpenbookQA: Promotes research in advanced question-answering.
- Piqa: Focuses on physical commonsense reasoning.
- BoolQ: A question-answering dataset for yes/no questions, containing 15,942 examples.
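The multiple-choice benchmarks above (MMLU, CommonsenseQA, OpenBookQA, PIQA and their Bangla translations) are typically scored by comparing the model's log-likelihood for each answer option. The snippet below is a rough sketch of that scoring idea with a made-up example item; it is not the exact evaluation harness used to produce the reported numbers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hishab/titulm-llama-3.2-1b-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(question + " " + option, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids.to(model.device)).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = full_ids[:, 1:].to(model.device)
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    # Score only the tokens that belong to the answer option
    start = prompt_ids.shape[1] - 1
    return token_lp[0, start:].sum().item()

# Hypothetical item: "Question: What color is the sky?" with options blue / green / red
question = "প্রশ্ন: আকাশের রং কী?"
options = ["নীল", "সবুজ", "লাল"]
print(max(options, key=lambda o: option_logprob(question, o)))
```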
Evaluation Results
Evaluation of Bangla Benchmark datasets:
| Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
|---|---|---|---|---|---|---|
| llama-3.2-1b | 0-shot | 0.29 | 0.55 | 0.22 | 0.33 | 0.53 |
| | 5-shot | 0.28 | - | 0.23 | 0.31 | 0.54 |
| hishab/titulm-llama-3.2-1b-v1.1 | 0-shot | 0.28 | 0.54 | 0.28 | 0.31 | 0.56 |
| | 5-shot | 0.28 | - | 0.31 | 0.34 | 0.57 |
Evaluation of English Benchmark datasets:
| Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
|---|---|---|---|---|---|---|
| llama-3.2-1b | 0-shot | 0.38 | 0.64 | 0.47 | 0.37 | 0.75 |
| | 5-shot | 0.309 | 0.662 | 0.317 | 0.396 | 0.759 |
| titulm-llama-3.2-1b-v1.1 | 0-shot | 0.26 | 0.62 | 0.34 | 0.35 | 0.73 |
| | 5-shot | 0.26 | 0.62 | 0.25 | 0.39 | 0.74 |
Instruction Tuned Models
Intended Use
- Bangla text generation
- Bangla language understanding tasks
- Bangla instruction fine-tuning tasks
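For the instruction fine-tuning use case, one common approach is parameter-efficient tuning with LoRA adapters via the `peft` library. The sketch below is a minimal illustration with a toy, made-up example pair; it is not the authors' training recipe, and the prompt template and hyperparameters are placeholders:

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "hishab/titulm-llama-3.2-1b-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(model_id)

# Train small LoRA adapters instead of all ~1.23B parameters (rank/targets are placeholders)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Toy instruction/response pair ("Where is the capital of Bangladesh?" / "The capital of Bangladesh is Dhaka.")
examples = [{"prompt": "বাংলাদেশের রাজধানী কোথায়?", "response": "বাংলাদেশের রাজধানী ঢাকা।"}]

def format_and_tokenize(ex):
    # Hypothetical prompt template; use whatever instruction format your data follows
    text = f"### নির্দেশনা:\n{ex['prompt']}\n### উত্তর:\n{ex['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = Dataset.from_list(examples).map(format_and_tokenize, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="titulm-bangla-sft", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```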
📄 License
We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).
🔧 Technical Details
The model is based on the Llama 3.2 architecture, which is an auto-regressive language model with an optimized transformer architecture. It uses Grouped-Query Attention (GQA) for improved inference scalability.
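You can confirm the GQA setup by inspecting the model configuration: with grouped-query attention, the number of key/value heads is smaller than the number of query heads, so several query heads share each KV head. The head counts are read from the published config rather than hard-coded here:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("hishab/titulm-llama-3.2-1b-v1.1")

# With GQA, num_key_value_heads < num_attention_heads
print("query heads:", config.num_attention_heads)
print("key/value heads:", config.num_key_value_heads)
print("query heads per KV head:", config.num_attention_heads // config.num_key_value_heads)
```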
📚 Citation
```bibtex
@misc{nahin2025titullmsfamilybanglallms,
      title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
      author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
      year={2025},
      eprint={2502.11187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11187},
}
```