Granite-3.1-1B-A400M-Base
Granite-3.1-1B-A400M-Base extends the context length of its predecessor, Granite-3.0-1B-A400M-Base, from 4K to 128K tokens, enabling more comprehensive processing of long inputs.
Quick Start
Installation
Install the following libraries:
```bash
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
```
Usage
Copy the code snippet below to run the example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-1b-a400m-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()
# change input text as desired
input_text = "Where is the Thomas J. Watson Research Center located?"
# tokenize the text and move the tensors to the device the model was loaded on
input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_length=4000)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)
```
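The printed result is a list with one string containing the prompt followed by the model's continuation; pass `skip_special_tokens=True` to `batch_decode` if you want the special tokens stripped from the output.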
Features
- Extended Context Length: Extends the context length from 4K to 128K tokens using a progressive training strategy.
- Multilingual Support: Supports 12 languages, including English, German, Spanish, French, Japanese, and Chinese, and can be finetuned for other languages.
- Versatile Use Cases: Suitable for a range of text-to-text generation tasks such as summarization, text classification, extraction, and question-answering.
Documentation
Model Summary
Granite-3.1-1B-A400M-Base extends the context length of Granite-3.0-1B-A400M-Base from 4K to 128K using a progressive training strategy: the supported context length is increased in increments, with RoPE theta adjusted at each step, until the model has successfully adapted to the desired length of 128K. This long-context pre-training stage was performed using approximately 500B tokens.
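The snippet below is a minimal sketch of what such a progressive length-extension schedule can look like in code. The increments, RoPE theta values, and the idea of editing the config between steps are illustrative assumptions for this sketch, not the settings IBM used.

```python
from transformers import AutoConfig

model_path = "ibm-granite/granite-3.1-1b-a400m-base"
config = AutoConfig.from_pretrained(model_path)

# Illustrative schedule (assumed values): each step raises the supported
# context length and RoPE theta, and the model is further pre-trained on
# long-context data at that length before moving to the next step.
schedule = [(8_192, 100_000.0), (32_768, 500_000.0), (131_072, 5_000_000.0)]

for context_length, rope_theta in schedule:
    config.max_position_embeddings = context_length
    config.rope_theta = rope_theta
    # ... continue pre-training with the updated config, then reuse the
    # adapted weights as the starting point for the next increment ...
    print(f"step: context={context_length}, rope_theta={rope_theta}")
```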
Supported Languages
English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.
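As a rough illustration of that fine-tuning path, the sketch below continues causal-language-model training on a plain-text corpus with the Hugging Face Trainer. The corpus file name, sequence length, and hyperparameters are placeholder assumptions, not recommended settings.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_path = "ibm-granite/granite-3.1-1b-a400m-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path)

# Placeholder corpus: swap in text for the target language.
dataset = load_dataset("text", data_files={"train": "my_language_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="granite-3.1-1b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```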
Intended Use
Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering, and more. All Granite Base models are able to handle these tasks as they were trained on a large amount of data from various domains. Moreover, they can serve as a baseline for creating specialized models for specific application scenarios.
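As an illustration, the snippet below reuses the Quick Start setup to prompt the base model for summarization. The plain-text prompt pattern is a convention chosen for this example; as a base (non-instruct) model, it has no prescribed chat template for such tasks.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-1b-a400m-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

document = ("IBM Research is headquartered at the Thomas J. Watson Research Center "
            "in Yorktown Heights, New York, with additional labs around the world.")
# Show the task inline; the base model completes the text after "Summary:".
prompt = f"Document: {document}\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```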
Evaluation Results
HuggingFace Open LLM Leaderboard V1
Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
---|---|---|---|---|---|---|---|
Granite-3.1-8B-Base | 63.99 | 83.27 | 63.45 | 51.29 | 78.92 | 60.19 | 66.85 |
Granite-3.1-2B-Base | 53.58 | 77.67 | 52.86 | 39.02 | 72.84 | 47.99 | 57.32 |
Granite-3.1-3B-A800M-Base | 50.76 | 74.45 | 48.31 | 39.91 | 69.29 | 40.56 | 53.88 |
Granite-3.1-1B-A400M-Base | 39.42 | 66.13 | 26.53 | 37.67 | 2.03 | 18.87 | 31.78 |
HuggingFace Open LLM Leaderboard V2
Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
---|---|---|---|---|---|---|---|
Granite-3.1-8B-Base | 42.21 | 26.02 | 9.52 | 9.51 | 8.36 | 24.8 | 20.07 |
Granite-3.1-2B-Base | 35.22 | 16.84 | 5.59 | 3.69 | 3.9 | 13.9 | 13.19 |
Granite-3.1-3B-A800M-Base | 29.96 | 11.91 | 4 | 3.69 | 1.11 | 8.81 | 9.91 |
Granite-3.1-1B-A400M-Base | 25.19 | 6.43 | 2.19 | 0.22 | 1.76 | 1.55 | 6.22 |
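If you want to reproduce numbers like these locally, the command below is a minimal sketch using EleutherAI's lm-evaluation-harness; the exact task names, few-shot settings, and harness version behind the leaderboard may differ, so treat any local scores as approximate.

```bash
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=ibm-granite/granite-3.1-1b-a400m-base \
  --tasks hellaswag,winogrande,gsm8k \
  --batch_size 8
```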
Model Architecture
Granite-3.1-1B-A400M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss.
Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
---|---|---|---|---|
Embedding size | 2048 | 4096 | 1024 | 1536 |
Number of layers | 40 | 40 | 24 | 32 |
Attention head size | 64 | 128 | 64 | 64 |
Number of attention heads | 32 | 32 | 16 | 24 |
Number of KV heads | 8 | 8 | 8 | 8 |
MLP hidden size | 8192 | 12800 | 512 | 512 |
MLP activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
Number of experts | – | – | 32 | 40 |
MoE TopK | – | – | 8 | 8 |
Initialization std | 0.1 | 0.1 | 0.1 | 0.1 |
Sequence length | 128K | 128K | 128K | 128K |
Position embedding | RoPE | RoPE | RoPE | RoPE |
# Parameters | 2.5B | 8.1B | 1.3B | 3.3B |
# Active parameters | 2.5B | 8.1B | 400M | 800M |
# Training tokens | 12T | 12T | 10T | 10T |
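To cross-check the 1B MoE column against the released checkpoint, you can inspect its Hugging Face config as below. The attribute names are assumptions based on the GraniteMoe configuration class in recent transformers releases and may differ across versions, hence the defensive `getattr` calls.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-3.1-1b-a400m-base")

# Print the architecture fields that correspond to the table above.
print("embedding size:    ", config.hidden_size)
print("layers:            ", config.num_hidden_layers)
print("attention heads:   ", config.num_attention_heads)
print("KV heads:          ", getattr(config, "num_key_value_heads", None))
print("experts:           ", getattr(config, "num_local_experts", None))
print("experts per token: ", getattr(config, "num_experts_per_tok", None))
print("max context length:", config.max_position_embeddings)
```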
Training Data
This model is trained on a mix of open source and proprietary data following a three-stage training strategy.
- Stage 1 data: The data for stage 1 is sourced from diverse domains, such as: web, code, academic sources, books, and math data.
- Stage 2 data: The data for stage 2 comprises a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the modelโs performance on specific tasks.
- Stage 3 data: The data for stage 3 consists of the original stage-2 pretraining data plus additional synthetic long-context data in the form of QA/summary pairs in which the answer contains a recitation of the related paragraph before the answer itself, as sketched after this list.
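The helper below is a purely hypothetical illustration of that recite-then-answer format; the field names and prompt layout are assumptions for this sketch, not the actual data schema.

```python
def build_long_context_example(document: str, paragraph: str, question: str, answer: str) -> str:
    """Assemble one synthetic QA example in which the answer first recites
    the supporting paragraph (layout is an illustrative assumption)."""
    return (
        f"{document}\n\n"
        f"Question: {question}\n"
        f"Relevant passage: {paragraph}\n"
        f"Answer: {answer}"
    )

example = build_long_context_example(
    document="<long document text>",
    paragraph="The Thomas J. Watson Research Center is located in Yorktown Heights, New York.",
    question="Where is the Thomas J. Watson Research Center located?",
    answer="It is located in Yorktown Heights, New York.",
)
print(example)
```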
A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.
Infrastructure
We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations
The use of Large Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. Granite-3.1-1B-A400M-Base is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment and may therefore produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios, such as copying text verbatim from the training dataset, due to their reduced size and memorization capacity. This aspect is currently an active area of research, and we anticipate more rigorous exploration of it.
Technical Details
- Progressive Training Strategy: Increases the context length incrementally and adjusts RoPE theta at each step until the model adapts to the 128K context length.
- Three-Stage Training Data: Uses a mix of open-source and proprietary data from diverse domains, with a second stage focusing on high-quality and multilingual data and a third stage adding synthetic long-context data.
- Sparse MoE Transformer Architecture: Based on a decoder-only sparse Mixture of Experts architecture with key components such as fine-grained experts, dropless token routing, and load balancing loss.
License
Apache 2.0