# llm-jp-3.1-1.8b
LLM-jp-3.1 is a series of large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics. It builds upon the LLM-jp-3 series, incorporating mid-training (instruction pre-training), which significantly enhances instruction-following capabilities compared to the original LLM-jp-3 models. This repository provides the llm-jp-3.1-1.8b model.
## Quick Start
This section provides a high-level overview of getting started with the llm-jp-3.1-1.8b model. You can find detailed usage examples and requirements below.
## Features
- Enhanced Instruction-Following: Incorporating mid-training (instruction pre-training), the LLM-jp-3.1 models have significantly improved instruction-following capabilities compared to the original LLM-jp-3 models.
- Multiple Language Support: Trained on a diverse set of datasets including Japanese, English, code, Chinese, and Korean.
- Fine-Tuning Options: Fine-tuned with supervised fine-tuning and further aligned with Direct Preference Optimization.
## Installation

### Required Libraries and Their Versions
- torch>=2.3.0
- transformers>=4.40.1
- tokenizers>=0.19.1
- accelerate>=0.29.3
- flash-attn>=2.5.8
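To confirm that your environment satisfies these requirements before loading the model, you can query the installed package versions. This check is our addition, a minimal sketch using Python's standard `importlib.metadata`:

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions listed in this model card.
REQUIREMENTS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}

for package, minimum in REQUIREMENTS.items():
    try:
        print(f"{package}: installed {version(package)} (requires >= {minimum})")
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED (requires >= {minimum})")
```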
## Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3.1-1.8b")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-3.1-1.8b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]
print(tokenizer.decode(output))
```
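The sampling settings above (`top_p=0.95`, `temperature=0.7`) trade diversity for coherence. If you need reproducible output, for example when comparing checkpoints, you can switch to greedy decoding. This variant is our addition, not part of the original example, and reuses the `model`, `tokenizer`, and `tokenized_input` from the snippet above:

```python
# Greedy decoding: deterministic, typically flatter output.
with torch.no_grad():
    output = model.generate(tokenized_input, max_new_tokens=100, do_sample=False)[0]
print(tokenizer.decode(output, skip_special_tokens=True))
```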
## Documentation

### Model Details
| Property | Details |
|---|---|
| Model Type | Transformer-based Language Model |
| Checkpoints format | Hugging Face Transformers |
### Architectures

Dense model:
| Params | Layers | Hidden size | Heads | Context length | Embedding parameters | Non-embedding parameters |
|---|---|---|---|---|---|---|
| 1.8b | 24 | 2048 | 16 | 4096 | 407,498,752 | 1,459,718,144 |
| 13b | 40 | 5120 | 40 | 4096 | 1,018,746,880 | 12,688,184,320 |
MoE model:
| Params | Layers | Hidden size | Heads | Routed Experts | Activated Experts | Context length | Embedding parameters | Non-embedding parameters | Activated parameters | Total parameters |
|---|---|---|---|---|---|---|---|---|---|---|
| 8x13b | 40 | 5120 | 40 | 8 | 2 | 4096 | 1,018,746,880 | 72,144,081,920 | 22,200,806,400 | 73,162,828,800 |
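The embedding counts in both tables are consistent with a vocabulary of 99,487 entries (llm-jp-tokenizer v3) and untied input/output embedding matrices, i.e. 2 × vocab_size × hidden_size. The untied-embedding reading is our inference, not stated in the card; a quick arithmetic check:

```python
# Assumption: untied input and output embeddings, vocabulary size 99,487.
VOCAB_SIZE = 99_487

for name, hidden_size, reported in [
    ("1.8b", 2048, 407_498_752),
    ("13b", 5120, 1_018_746_880),
    ("8x13b", 5120, 1_018_746_880),
]:
    computed = 2 * VOCAB_SIZE * hidden_size
    print(name, computed, computed == reported)  # all three print True
```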
### Tokenizer
The tokenizer of this model is based on the huggingface/tokenizers Unigram byte-fallback model. The vocabulary entries were converted from [llm-jp-tokenizer v3.0](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v3.0b2). Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of llm-jp-tokenizer for details on the vocabulary construction procedure.
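A quick way to see the Unigram byte-fallback behavior in practice: characters outside the learned vocabulary fall back to byte-level tokens rather than an unknown-token placeholder. The snippet below is illustrative, not part of the original card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3.1-1.8b")
print(len(tokenizer))                            # vocabulary size
print(tokenizer.tokenize("自然言語処理とは何か"))  # Japanese segmented into subword units
```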
### Datasets

#### Pre-training
The models have been pre-trained using a blend of the following datasets:
| Language | Dataset | Tokens |
|---|---|---|
| Japanese | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 2.6B |
| | [Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 762.8B |
| | [WARP/PDF](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 237.3B |
| | [WARP/HTML](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 2.7B |
| | [Kaken](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 1.8B |
| English | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 4.7B |
| | [Dolma/CC-head](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 608.5B |
| | [Dolma/C4](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 181.6B |
| | [Dolma/Reddit](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 83.1B |
| | [Dolma/PeS2o](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 62.9B |
| | [Dolma/Gutenberg](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 5.5B |
| | [Dolma/Wiki](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 3.9B |
| Code | [The Stack](https://huggingface.co/datasets/bigcode/the-stack) | 114.1B |
| Chinese | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 0.8B |
| Korean | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 0.3B |
#### Mid-training
In the LLM-jp-3.1 series, continuous pre-training was performed based on [Instruction Pre-Training](https://aclanthology.org/2024.emnlp-main.148/). Approximately 90B tokens of instruction-response data were prepared and mixed with the pre-training datasets, and continuous pre-training was conducted on a total of 400B tokens. Each model was initialized from an existing checkpoint and underwent this continuous instruction pre-training. Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens.
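Conceptually, instruction pre-training renders instruction-response pairs as ordinary documents and mixes them into the raw corpus stream, so the model learns from them under the standard language-modeling objective. The sketch below is purely illustrative; the rendering template and document-level mixing are our simplifications, not the recipe actually used:

```python
import random

def render_pair(pair: dict) -> str:
    # Hypothetical template: instruction and response concatenated as plain text.
    return f"{pair['instruction']}\n{pair['response']}"

def mix_for_instruction_pretraining(corpus_docs: list, pairs: list, seed: int = 0) -> list:
    # In the actual run, ~90B of the 400B mid-training tokens (~22.5%)
    # were instruction data; here the ratio is whatever the inputs imply.
    docs = list(corpus_docs) + [render_pair(p) for p in pairs]
    random.Random(seed).shuffle(docs)
    return docs
```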
#### Post-training

The pre-trained checkpoint was fine-tuned with supervised fine-tuning and further aligned with Direct Preference Optimization.

##### Supervised Fine-Tuning
| Language | Dataset | Description |
|---|---|---|
| Japanese | [ichikara-instruction-004-002](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/) | A manually constructed instruction dataset. |
| | [AnswerCarefully (ver2.0)](https://huggingface.co/datasets/llm-jp/AnswerCarefully) | A manually constructed instruction dataset focusing on LLMs' safety. |
| | ichikara-instruction-format | A small subset of the ichikara-instruction dataset, edited with some constraints on the output format. |
| | [AutoMultiTurnByCalm3-22B](https://huggingface.co/datasets/kanhatakeyama/AutoMultiTurnByCalm3-22B) | A synthetic instruction dataset. |
| | [ramdom-to-fixed-multiturn-Calm3](https://huggingface.co/datasets/kanhatakeyama/ramdom-to-fixed-multiturn-Calm3) | A synthetic instruction dataset. |
| | [wizardlm8x22b-logical-math-coding-sft-ja](https://huggingface.co/datasets/llm-jp/wizardlm8x22b-logical-math-coding-sft-ja) | A synthetic instruction dataset. |
| | [magpie-sft-v1.0](https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0) | A synthetic instruction dataset we created. |
| | [jaster v1.4.1](https://github.com/llm-jp/llm-jp-eval/tree/v1.4.1) | - |
| | [extraction-wiki-ja](https://huggingface.co/datasets/llm-jp/extraction-wiki-ja) | A synthetic instruction dataset we created. |
| English | [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) | - |
| Japanese & English | [Synthetic-JP-EN-Coding-Dataset](https://huggingface.co/datasets/llm-jp/Synthetic-JP-EN-Coding-Dataset) | A synthetic instruction dataset. |
##### Direct Preference Optimization
For Direct Preference Optimization (DPO), rejection sampling was adopted. Prompts were sampled from the dataset used in SFT, and multiple responses were generated for each prompt. These responses were then scored by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples. DPO was conducted in two stages; in the second stage, [ac-self-inst](https://huggingface.co/datasets/llm-jp/ac-self-inst), a Japanese preference dataset focused on safety, was additionally used.
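Schematically, the rejection-sampling step looks like the sketch below. The function names and interfaces are hypothetical; in the actual pipeline, scoring was done by Qwen/Qwen2.5-32B-Instruct:

```python
def build_preference_pairs(prompts, generate, score, num_samples=8):
    """For each prompt, sample several responses, score them with a judge
    model, and keep the best/worst pair as (chosen, rejected) for DPO."""
    pairs = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(num_samples)]
        responses.sort(key=lambda r: score(prompt, r))
        pairs.append({"prompt": prompt, "chosen": responses[-1], "rejected": responses[0]})
    return pairs
```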
### Evaluation

#### MT Bench (Japanese and English)
The models were evaluated using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation.
| Model Name | JA | EN |
|---|---|---|
| gpt-35-turbo-1106 | 6.48 | 7.56 |
| gpt-4-0613 | 7.29 | 7.72 |
| gpt-4o-2024-08-06 | 8.10 | 8.38 |
| [sbintuitions/sarashina2.2-1b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-1b-instruct-v0.1) | 5.30 | 5.66 |
| [sbintuitions/sarashina2.2-3b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-3b-instruct-v0.1) | 7.07 | 6.96 |
| [Rakuten/RakutenAI-2.0-8x7B-instruct](https://huggingface.co/Rakuten/RakutenAI-2.0-8x7B-instruct) | 6.68 | 6.33 |
| [cyberagent/calm3-22b-chat](https://huggingface.co/cyberagent/calm3-22b-chat) | 6.86 | 6.77 |
| [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 7.07 | 7.99 |
| [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 7.64 | 8.27 |
| [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | 5.46 | 6.95 |
| [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) | 8.00 | 8.30 |
| [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | 8.36 | 8.33 |
| [tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4](https://huggingface.co/tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4) | 7.64 | 8.02 |
| [stockmark/Stockmark-2-100B-Instruct-beta](https://huggingface.co/stockmark/Stockmark-2-100B-Instruct-beta) | 7.42 | 7.17 |
| [llm-jp-3-1.8b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-1.8b-instruct3) | 4.64 | 4.09 |
| [llm-jp-3-13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct3) | 6.21 | 6.13 |
| [llm-jp-3-8x13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-8x13b-instruct3) | 6.60 | 6.49 |
| [llm-jp-3.1-1.8b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-1.8b-instruct4) | 6.30 | 5.70 |
| [llm-jp-3.1-13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-13b-instruct4) | 7.37 | 7.01 |
| [llm-jp-3.1-8x13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-8x13b-instruct4) | 7.50 | 7.05 |
#### AnswerCarefully-Eval

[AnswerCarefully-Eval](https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/Q4-19.pdf) assesses the safety of Japanese language model outputs using the LLM-as-a-Judge approach, based on the test set from [llm-jp/AnswerCarefully](https://huggingface.co/datasets/llm-jp/AnswerCarefully). The models were evaluated using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation.
| Model name | Score | Acceptance rate (%, ↑) | Violation rate (%, ↓) |
|---|---|---|---|
| gpt-35-turbo-1106 | 3.98 | 71.7 | 12.6 |
| gpt-4-0613 | 4.06 | 72.3 | 13.2 |
| gpt-4o-2024-08-06 | 4.09 | 72.7 | 12.5 |
| [llm-jp-3-1.8b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-1.8b-instruct3) | 4.03 | 75.9 | 12.2 |
| [llm-jp-3-13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct3) | 4.37 | 88.4 | 6.5 |
| [llm-jp-3-8x13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-8x13b-instruct3) | 4.48 | 91.6 | 4.3 |
| [llm-jp-3.1-1.8b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-1.8b-instruct4) | 3.66 | 64.7 | 24.3 |
| [llm-jp-3.1-13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-13b-instruct4) | 4.17 | 82.4 | 12.2 |
| [llm-jp-3.1-8x13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-8x13b-instruct4) | 4.26 | 83.1 | 11.6 |
## Technical Details

The models use a Transformer-based architecture, with checkpoints provided in the Hugging Face Transformers format. Training comprises three phases: pre-training, mid-training (instruction pre-training), and post-training (supervised fine-tuning followed by Direct Preference Optimization).
## License

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Risks and Limitations
The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
## Send Questions to

llm-jp(at)nii.ac.jp
## Model Card Authors
The names are listed in alphabetical order.
Hirokazu Kiyomaru and Takashi Kodama.