🚀 TituLM Llama 3.2-1B Model
This project presents a continually pre-trained model based on Llama 3.2-1B, fine-tuned on extensive Bangla datasets to enhance Bangla text generation and understanding capabilities.
🚀 Quick Start
Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your `transformers` installation via `pip install --upgrade transformers`.
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-1b-v1.1"

# Build a text-generation pipeline in bfloat16, placing weights automatically
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Bangla prompt: "The name of our country"
pipe("আমাদের দেশের নাম")
```
✨ Features
- Continually pre-trained from the Llama 3.2-1B base model for improved Bangla text generation.
- Supports both Bengali (primary) and English (secondary) languages.
- Uses Grouped-Query Attention (GQA) for improved inference scalability.
📦 Installation
Ensure you have `transformers >= 4.43.0` installed. You can update it via:

```bash
pip install --upgrade transformers
```
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_id = "hishab/titulm-llama-3.2-1b-v1.1"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

pipe("আমাদের দেশের নাম")
```
📚 Documentation
Model Information
This model is a continually pre-trained version of the meta-llama/Llama-3.2-1B architecture, fine-tuned on extensive Bangla datasets. The main goal is to enhance the model's ability to generate high-quality Bangla text.
| Property | Details |
|---|---|
| Model Type | Llama 3.2 (auto-regressive language model with optimized transformer architecture) |
| Training Data | Hishab curated Bangla text corpus |
| Params | 1B (1.23B) |
| Input modalities | Monolingual text (Bangla) |
| Output modalities | Monolingual text (Bangla) |
| Context Length | 4096 |
| GQA | Yes |
| Shared Embeddings | Yes |
| Token count | 8.5B tokens |
| Knowledge cutoff | N/A |
Supported Languages: Bengali (primary) and English (secondary)
Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
Model Release Date: October 24, 2024
Status: This is a static model trained on an offline dataset. Future versions may be released to improve model capabilities.
License: We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).
More information can be found in the paper TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking and on the project page.
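To get a feel for how the tokenizer handles Bangla input (useful when budgeting prompts against the 4096-token context length above), you can inspect token counts directly. This is a small illustrative check, not part of the official evaluation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hishab/titulm-llama-3.2-1b-v1.1")

# Count tokens for a short Bangla sentence ("The name of our country is Bangladesh")
text = "আমাদের দেশের নাম বাংলাদেশ"
token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(f"{len(token_ids)} tokens for {len(text)} characters")
```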
Hardware and Software
Training Factors: We used the llama-factory training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
Training Data
Overview: We collected a large raw Bangla text dataset from various sources, including web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset was cleaned and filtered to ensure data quality, yielding roughly 268 GB of text. From this, we separated 33 GB of data for this training run, corresponding to a total of 8.5B trained tokens.
Data sources summary:
- Web documents: Extracted, cleaned, and filtered Common Crawl data
- Books: Extracted, cleaned, and filtered book data
- Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
- Translation data: Trained an English-Bangla translation LLM model and used it to translate English data to Bangla
- Code-mixed data: Trained an English-Bangla code-mixed LLM model and used it to generate code-mixed data
- Transliteration data: Trained a Bangla-English transliteration LLM model and used it to generate transliterated data
- Synthetic data: Generated synthetic data using a Bangla LLM model
- Others: Scraped selected websites, open-source datasets, and other miscellaneous sources
Benchmarks
Evaluation Datasets
We evaluated our pre-trained models on both Bangla and English benchmark datasets.
Bangla Benchmark datasets:
- Bangla MMLU: A private multiple-choice question dataset developed by Hishab, curated from various sources.
- CommonsenseQa Bangla: A Bangla translation of the CommonsenseQA dataset, translated using Expressive Semantic Translation (EST).
- OpenbookQA Bangla: A Bangla translation of the OpenbookQA dataset, translated using Expressive Semantic Translation (EST).
- Piqa Bangla: A Bangla translation of the Piqa dataset, translated using Expressive Semantic Translation (EST).
- BoolQ Bangla: Contains 15,942 examples, with each entry consisting of a triplet: (question, passage, answer).
English Benchmark datasets:
- MMLU: A massive multitask test with multiple-choice questions from various knowledge branches.
- CommonsenseQA: A multiple-choice question-answering dataset requiring commonsense knowledge.
- OpenbookQA: Promotes research in advanced question-answering.
- Piqa: Focuses on physical commonsense reasoning.
- BoolQ: A question-answering dataset for yes/no questions, containing 15,942 examples.
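The multiple-choice benchmarks above (MMLU, CommonsenseQA, OpenBookQA, PIQA and their Bangla translations) are typically scored by comparing the model's log-likelihood for each answer option. The snippet below is a rough sketch of that scoring idea with a made-up example item; it is not the exact evaluation harness used to produce the reported numbers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hishab/titulm-llama-3.2-1b-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(question + " " + option, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids.to(model.device)).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = full_ids[:, 1:].to(model.device)
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    # Score only the tokens that belong to the answer option
    start = prompt_ids.shape[1] - 1
    return token_lp[0, start:].sum().item()

# Hypothetical item: "Question: What color is the sky?" with options blue / green / red
question = "প্রশ্ন: আকাশের রং কী?"
options = ["নীল", "সবুজ", "লাল"]
print(max(options, key=lambda o: option_logprob(question, o)))
```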
Evaluation Results
Evaluation of Bangla Benchmark datasets:
| Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
|---|---|---|---|---|---|---|
| llama-3.2-1b | 0-shot | 0.29 | 0.55 | 0.22 | 0.33 | 0.53 |
| | 5-shot | 0.28 | - | 0.23 | 0.31 | 0.54 |
| hishab/titulm-llama-3.2-1b-v1.1 | 0-shot | 0.28 | 0.54 | 0.28 | 0.31 | 0.56 |
| | 5-shot | 0.28 | - | 0.31 | 0.34 | 0.57 |
Evaluation of English Benchmark datasets:
| Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
|---|---|---|---|---|---|---|
| llama-3.2-1b | 0-shot | 0.38 | 0.64 | 0.47 | 0.37 | 0.75 |
| | 5-shot | 0.309 | 0.662 | 0.317 | 0.396 | 0.759 |
| titulm-llama-3.2-1b-v1.1 | 0-shot | 0.26 | 0.62 | 0.34 | 0.35 | 0.73 |
| | 5-shot | 0.26 | 0.62 | 0.25 | 0.39 | 0.74 |
Instruction Tuned Models
Intended Use
- Bangla text generation
- Bangla language understanding tasks
- Bangla instruction fine-tuning tasks
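For the instruction fine-tuning use case, one common approach is parameter-efficient tuning with LoRA adapters via the `peft` library. The sketch below is a minimal illustration with a toy, made-up example pair; it is not the authors' training recipe, and the prompt template and hyperparameters are placeholders:

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "hishab/titulm-llama-3.2-1b-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(model_id)

# Train small LoRA adapters instead of all ~1.23B parameters (rank/targets are placeholders)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Toy instruction/response pair ("Where is the capital of Bangladesh?" / "The capital of Bangladesh is Dhaka.")
examples = [{"prompt": "বাংলাদেশের রাজধানী কোথায়?", "response": "বাংলাদেশের রাজধানী ঢাকা।"}]

def format_and_tokenize(ex):
    # Hypothetical prompt template; use whatever instruction format your data follows
    text = f"### নির্দেশনা:\n{ex['prompt']}\n### উত্তর:\n{ex['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = Dataset.from_list(examples).map(format_and_tokenize, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="titulm-bangla-sft", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```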
📄 License
We are using a similar license to Llama 3.2. Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).
🔧 Technical Details
The model is based on the Llama 3.2 architecture, which is an auto-regressive language model with an optimized transformer architecture. It uses Grouped-Query Attention (GQA) for improved inference scalability.
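You can confirm the GQA setup by inspecting the model configuration: with grouped-query attention, the number of key/value heads is smaller than the number of query heads, so several query heads share each KV head. The head counts are read from the published config rather than hard-coded here:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("hishab/titulm-llama-3.2-1b-v1.1")

# With GQA, num_key_value_heads < num_attention_heads
print("query heads:", config.num_attention_heads)
print("key/value heads:", config.num_key_value_heads)
print("query heads per KV head:", config.num_attention_heads // config.num_key_value_heads)
```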
📚 Citation
```bibtex
@misc{nahin2025titullmsfamilybanglallms,
      title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
      author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
      year={2025},
      eprint={2502.11187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11187},
}
```