🚀 Hishab TituLM Llama 3.2-3B Model
This model is a continually pretrained version of Llama 3.2-3B with extended Bangla tokens, designed for high-quality Bangla text generation and language understanding.
🚀 Quick Start
Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.

Make sure to update your `transformers` installation via `pip install --upgrade transformers`.
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-3b-v2.0"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

pipe("আমাদের দেশের নাম")
```
✨ Features
- Continually pretrained on Bangla data with extended tokens to enhance Bangla text generation ability.
- Supports both Bengali (primary) and English (secondary) languages.
- Uses Grouped-Query Attention (GQA) for improved inference scalability.
📦 Installation
Ensure you have the `transformers` library installed. You can update it via the following command:

```bash
pip install --upgrade transformers
```
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-3b-v2.0"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

pipe("আমাদের দেশের নাম")
```
📚 Documentation
Model Information
This model is a continually pretrained version of the [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) architecture, extended with approximately 42K Bangla tokens and fine-tuned on extensive Bangla datasets. The primary goal of continual pretraining with token extension was to enhance the model's ability to generate high-quality Bangla text.
| Property | Details |
|----------|---------|
| Model Type | Llama 3.2, an auto-regressive language model with an optimized transformer architecture |
| Training Data | Hishab curated Bangla text corpus |
| Params | 3B (3.21B) |
| Input Modalities | Monolingual Text (Bangla) |
| Output Modalities | Monolingual Text (Bangla) |
| Context Length | 4096 |
| GQA | Yes |
| Shared Embeddings | Yes |
| Token Count | 37B tokens |
| Knowledge Cutoff | |
| Supported Languages | Bengali (primary) and English (secondary) |
| Model Release Date | October 24, 2024 |
| Status | A static model trained on an offline dataset. Future versions may be released to improve model capabilities |
| License | Similar to Llama 3.2, governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) |
| Paper | [TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking](https://arxiv.org/abs/2502.11187) |
Hardware and Software
We used the [llama-factory](https://github.com/hiyouga/LLaMA-Factory) training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
Training Data
We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered by different filtering criteria to ensure data quality. The collected data is roughly 268 GB in size, and the total number of trained tokens is 37B.
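The exact filtering criteria are not specified in this card. Purely as an illustration of the kind of quality filter commonly applied to raw Bangla web text, a minimal, hypothetical character-ratio check might look like the sketch below; the helper names and thresholds are assumptions, not the authors' pipeline.

```python
def bangla_char_ratio(text: str) -> float:
    """Fraction of characters in the Bengali Unicode block (U+0980-U+09FF)."""
    if not text:
        return 0.0
    bangla = sum(1 for ch in text if "\u0980" <= ch <= "\u09ff")
    return bangla / len(text)

def keep_document(text: str, min_ratio: float = 0.5, min_chars: int = 200) -> bool:
    # Hypothetical thresholds for illustration only; not the authors' actual criteria.
    return len(text) >= min_chars and bangla_char_ratio(text) >= min_ratio
```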
Token Extending
We trained a separate Bangla tokenizer using the Tiktoken library on a 48 GB Bangla dataset (sampled from the main pretraining data) with a vocabulary size of 48K, and set aside 42K of those tokens to add to the pretrained model. We extended the model's vocabulary with these tokens and continued the pretraining process on Bangla data. The updated vocabulary size is 170K, whereas the original Llama 3.2 vocabulary size is 128K.
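A quick way to see the effect of the vocabulary extension is to compare tokenizer sizes and how a Bangla sentence is segmented. This is a minimal sketch; the comparison assumes you have access to the base `meta-llama/Llama-3.2-3B` repository.

```python
from transformers import AutoTokenizer

# Extended TituLM tokenizer (~170K vocabulary).
titulm_tok = AutoTokenizer.from_pretrained("hishab/titulm-llama-3.2-3b-v2.0")
# Base Llama 3.2 tokenizer (~128K vocabulary); access to the gated repo is assumed.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

print("TituLM vocab size:", len(titulm_tok))
print("Base Llama 3.2 vocab size:", len(base_tok))

sentence = "আমাদের দেশের নাম বাংলাদেশ"
# The extended tokenizer should typically need fewer tokens for Bangla text.
print("TituLM tokens:", len(titulm_tok.tokenize(sentence)))
print("Base tokens:", len(base_tok.tokenize(sentence)))
```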
Benchmarks - Bangla Text
Evaluation Datasets
We evaluated our pretrained models on both Bangla and English benchmark datasets.
Bangla Benchmark Datasets:
- Bangla MMLU: A private multiple-choice question dataset developed by Hishab, curated from various sources.
- [CommonsenseQA Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset, translated using Expressive Semantic Translation (EST).
- [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset, translated using EST.
- [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset, translated using EST.
- BoolQ Bangla: Contains 15,942 examples, each a triplet of (question, passage, answer).
English Benchmark Datasets:
- MMLU: A massive multitask test with multiple-choice questions.
- CommonsenseQA: A multiple-choice question-answering dataset.
- OpenbookQA: Promotes research in advanced question - answering.
- Piqa: Focuses on physical commonsense reasoning.
- BoolQ: A question - answer dataset for yes/no questions.
Evaluation Results
Evaluation on Bangla Benchmark Datasets:
| Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
|-------|-------|-------------|----------|-------------------|----------------|---------|
| llama-3.2-3b | 0-shot | 0.36 | 0.55 | 0.26 | 0.31 | 0.56 |
| | 5-shot | 0.38 | - | 0.29 | 0.32 | 0.58 |
| titulm-llama-3.2-3b-v2.0 | 0-shot | 0.26 | 0.57 | 0.27 | 0.32 | 0.58 |
| | 5-shot | 0.24 | 0.59 | 0.33 | 0.34 | 0.60 |
Evaluation on English Benchmark Datasets:
| Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
|-------|-------|------|-------|----------------|-------------|------|
| llama-3.2-3b | 0-shot | 0.54 | 0.73 | 0.64 | 0.43 | 0.77 |
| | 5-shot | 0.56 | 0.74 | 0.67 | 0.45 | 0.80 |
| titulm-llama-3.2-3b-v2.0 | 0-shot | 0.24 | 0.49 | 0.20 | 0.22 | 0.57 |
| | 5-shot | 0.26 | 0.59 | 0.20 | 0.24 | 0.57 |
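The 0-shot and 5-shot settings above refer to how many solved examples are prepended to each test question. As a rough illustration only (not the authors' exact evaluation harness or prompt template), a few-shot multiple-choice prompt can be assembled like this:

```python
def build_prompt(question, choices, fewshot):
    """Assemble a multiple-choice prompt; the format is illustrative, not the authors' template."""
    lines = []
    for q, opts, answer in fewshot:  # 0 examples => 0-shot, 5 examples => 5-shot
        lines.append(f"প্রশ্ন: {q}")
        lines.extend(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(opts))
        lines.append(f"উত্তর: {answer}\n")
    lines.append(f"প্রশ্ন: {question}")
    lines.extend(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(choices))
    lines.append("উত্তর:")
    return "\n".join(lines)
```

Scoring then typically compares the model's likelihood of each candidate answer given such a prompt.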
Instruction Tuned Models
No detailed information provided in the original document.
Intended Use
- Bangla text generation
- Bangla language understanding tasks
- Bangla instruction fine-tuning tasks
🔧 Technical Details
The model is based on the Llama 3.2 architecture. Continual pretraining with extended Bangla tokens is used to enhance its performance on Bangla language tasks, and Grouped-Query Attention (GQA) helps improve inference scalability.
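To see the GQA setup concretely, you can read the key/value head count from the published model config without downloading the weights. This is a minimal sketch; the printed values depend on the config hosted with the model and are not restated here.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("hishab/titulm-llama-3.2-3b-v2.0")

# With GQA, several query heads share one key/value head,
# so num_key_value_heads is smaller than num_attention_heads.
print("Query heads:        ", config.num_attention_heads)
print("Key/value heads:    ", config.num_key_value_heads)
print("Vocab size:         ", config.vocab_size)       # extended Bangla vocabulary
print("Max position embeds:", config.max_position_embeddings)
```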
📄 License
We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
Citation
```bibtex
@misc{nahin2025titullmsfamilybanglallms,
      title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
      author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
      year={2025},
      eprint={2502.11187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11187},
}
```