# llm-jp-3.1-1.8b
LLM-jp-3.1 is a series of large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics. It builds upon the LLM-jp-3 series, incorporating mid-training (instruction pre-training), which significantly enhances instruction-following capabilities compared to the original LLM-jp-3 models. This repository provides the llm-jp-3.1-1.8b model.
## Quick Start
This section provides a high-level overview of getting started with the llm-jp-3.1-1.8b model. You can find detailed usage examples and requirements below.
## Features
- Enhanced Instruction-Following: Incorporating mid-training (instruction pre-training), the LLM-jp-3.1 models have significantly improved instruction-following capabilities compared to the original LLM-jp-3 models.
- Multiple Language Support: Trained on a diverse set of datasets including Japanese, English, code, Chinese, and Korean.
- Fine-Tuning Options: Fine-tuned with supervised fine-tuning and further aligned with Direct Preference Optimization.
## Installation

### Required Libraries and Their Versions
- torch>=2.3.0
- transformers>=4.40.1
- tokenizers>=0.19.1
- accelerate>=0.29.3
- flash-attn>=2.5.8
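To confirm that your environment satisfies these requirements before loading the model, you can query the installed package versions. This check is our addition, a minimal sketch using Python's standard `importlib.metadata`:

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions listed in this model card.
REQUIREMENTS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}

for package, minimum in REQUIREMENTS.items():
    try:
        print(f"{package}: installed {version(package)} (requires >= {minimum})")
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED (requires >= {minimum})")
```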
## Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3.1-1.8b")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-3.1-1.8b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]
print(tokenizer.decode(output))
```
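The sampling settings above (`top_p=0.95`, `temperature=0.7`) trade diversity for coherence. If you need reproducible output, for example when comparing checkpoints, you can switch to greedy decoding. This variant is our addition, not part of the original example, and reuses the `model`, `tokenizer`, and `tokenized_input` from the snippet above:

```python
# Greedy decoding: deterministic, typically flatter output.
with torch.no_grad():
    output = model.generate(tokenized_input, max_new_tokens=100, do_sample=False)[0]
print(tokenizer.decode(output, skip_special_tokens=True))
```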
## Documentation

### Model Details
| Property | Details |
|---|---|
| Model Type | Transformer-based Language Model |
| Checkpoints format | Hugging Face Transformers |
### Architectures

Dense model:
| Params | Layers | Hidden size | Heads | Context length | Embedding parameters | Non-embedding parameters |
|---|---|---|---|---|---|---|
| 1.8b | 24 | 2048 | 16 | 4096 | 407,498,752 | 1,459,718,144 |
| 13b | 40 | 5120 | 40 | 4096 | 1,018,746,880 | 12,688,184,320 |
MoE model:
| Params | Layers | Hidden size | Heads | Routed Experts | Activated Experts | Context length | Embedding parameters | Non-embedding parameters | Activated parameters | Total parameters |
|---|---|---|---|---|---|---|---|---|---|---|
| 8x13b | 40 | 5120 | 40 | 8 | 2 | 4096 | 1,018,746,880 | 72,144,081,920 | 22,200,806,400 | 73,162,828,800 |
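The embedding counts in both tables are consistent with a vocabulary of 99,487 entries (llm-jp-tokenizer v3) and untied input/output embedding matrices, i.e. 2 × vocab_size × hidden_size. The untied-embedding reading is our inference, not stated in the card; a quick arithmetic check:

```python
# Assumption: untied input and output embeddings, vocabulary size 99,487.
VOCAB_SIZE = 99_487

for name, hidden_size, reported in [
    ("1.8b", 2048, 407_498_752),
    ("13b", 5120, 1_018_746_880),
    ("8x13b", 5120, 1_018_746_880),
]:
    computed = 2 * VOCAB_SIZE * hidden_size
    print(name, computed, computed == reported)  # all three print True
```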
### Tokenizer
The tokenizer of this model is based on the huggingface/tokenizers Unigram byte-fallback model. The vocabulary entries were converted from [llm-jp-tokenizer v3.0](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v3.0b2). Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of llm-jp-tokenizer for details on the vocabulary construction procedure.
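A quick way to see the Unigram byte-fallback behavior in practice: characters outside the learned vocabulary fall back to byte-level tokens rather than an unknown-token placeholder. The snippet below is illustrative, not part of the original card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3.1-1.8b")
print(len(tokenizer))                            # vocabulary size
print(tokenizer.tokenize("自然言語処理とは何か"))  # Japanese segmented into subword units
```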
### Datasets

#### Pre-training
The models have been pre-trained using a blend of the following datasets:
| Language | Dataset | Tokens |
|---|---|---|
| Japanese | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 2.6B |
| | [Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 762.8B |
| | [WARP/PDF](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 237.3B |
| | [WARP/HTML](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 2.7B |
| | [Kaken](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 1.8B |
| English | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 4.7B |
| | [Dolma/CC-head](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 608.5B |
| | [Dolma/C4](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 181.6B |
| | [Dolma/Reddit](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 83.1B |
| | [Dolma/PeS2o](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 62.9B |
| | [Dolma/Gutenberg](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 5.5B |
| | [Dolma/Wiki](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 3.9B |
| Code | [The Stack](https://huggingface.co/datasets/bigcode/the-stack) | 114.1B |
| Chinese | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 0.8B |
| Korean | [Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3) | 0.3B |
#### Mid-training
In the LLM-jp-3.1 series, continuous pre-training was performed based on [Instruction Pre-Training](https://aclanthology.org/2024.emnlp-main.148/). Approximately 90B tokens of instruction-response data were prepared and mixed with the pre-training datasets, and continuous pre-training was conducted on a total of 400B tokens. Each model was initialized from an existing checkpoint and underwent this continuous instruction pre-training. Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens.
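Conceptually, instruction pre-training renders instruction-response pairs as ordinary documents and mixes them into the raw corpus stream, so the model learns from them under the standard language-modeling objective. The sketch below is purely illustrative; the rendering template and document-level mixing are our simplifications, not the recipe actually used:

```python
import random

def render_pair(pair: dict) -> str:
    # Hypothetical template: instruction and response concatenated as plain text.
    return f"{pair['instruction']}\n{pair['response']}"

def mix_for_instruction_pretraining(corpus_docs: list, pairs: list, seed: int = 0) -> list:
    # In the actual run, ~90B of the 400B mid-training tokens (~22.5%)
    # were instruction data; here the ratio is whatever the inputs imply.
    docs = list(corpus_docs) + [render_pair(p) for p in pairs]
    random.Random(seed).shuffle(docs)
    return docs
```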
#### Post-training

The pre-trained checkpoint was fine-tuned with supervised fine-tuning and further aligned with Direct Preference Optimization.

##### Supervised Fine-Tuning
| Language | Dataset | Description |
|---|---|---|
| Japanese | [ichikara-instruction-004-002](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/) | A manually constructed instruction dataset. |
| | [AnswerCarefully (ver2.0)](https://huggingface.co/datasets/llm-jp/AnswerCarefully) | A manually constructed instruction dataset focusing on LLMs' safety. |
| | ichikara-instruction-format | A small subset of the ichikara-instruction dataset, edited with some constraints on the output format. |
| | [AutoMultiTurnByCalm3-22B](https://huggingface.co/datasets/kanhatakeyama/AutoMultiTurnByCalm3-22B) | A synthetic instruction dataset. |
| | [ramdom-to-fixed-multiturn-Calm3](https://huggingface.co/datasets/kanhatakeyama/ramdom-to-fixed-multiturn-Calm3) | A synthetic instruction dataset. |
| | [wizardlm8x22b-logical-math-coding-sft-ja](https://huggingface.co/datasets/llm-jp/wizardlm8x22b-logical-math-coding-sft-ja) | A synthetic instruction dataset. |
| | [magpie-sft-v1.0](https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0) | A synthetic instruction dataset we created. |
| | [jaster v1.4.1](https://github.com/llm-jp/llm-jp-eval/tree/v1.4.1) | - |
| | [extraction-wiki-ja](https://huggingface.co/datasets/llm-jp/extraction-wiki-ja) | A synthetic instruction dataset we created. |
| English | [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) | - |
| Japanese & English | [Synthetic-JP-EN-Coding-Dataset](https://huggingface.co/datasets/llm-jp/Synthetic-JP-EN-Coding-Dataset) | A synthetic instruction dataset. |
##### Direct Preference Optimization
For Direct Preference Optimization (DPO), rejection sampling was adopted. Prompts were sampled from the dataset used in SFT, and multiple responses were generated for each prompt. These responses were then scored by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples. DPO was conducted in two stages; in the second stage, [ac-self-inst](https://huggingface.co/datasets/llm-jp/ac-self-inst), a Japanese preference dataset focused on safety, was additionally used.
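Schematically, the rejection-sampling step looks like the sketch below. The function names and interfaces are hypothetical; in the actual pipeline, scoring was done by Qwen/Qwen2.5-32B-Instruct:

```python
def build_preference_pairs(prompts, generate, score, num_samples=8):
    """For each prompt, sample several responses, score them with a judge
    model, and keep the best/worst pair as (chosen, rejected) for DPO."""
    pairs = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(num_samples)]
        responses.sort(key=lambda r: score(prompt, r))
        pairs.append({"prompt": prompt, "chosen": responses[-1], "rejected": responses[0]})
    return pairs
```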
### Evaluation

#### MT Bench (Japanese and English)
The models were evaluated using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation.
| Model Name | JA | EN |
|---|---|---|
| gpt-35-turbo-1106 | 6.48 | 7.56 |
| gpt-4-0613 | 7.29 | 7.72 |
| gpt-4o-2024-08-06 | 8.10 | 8.38 |
| [sbintuitions/sarashina2.2-1b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-1b-instruct-v0.1) | 5.30 | 5.66 |
| [sbintuitions/sarashina2.2-3b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-3b-instruct-v0.1) | 7.07 | 6.96 |
| [Rakuten/RakutenAI-2.0-8x7B-instruct](https://huggingface.co/Rakuten/RakutenAI-2.0-8x7B-instruct) | 6.68 | 6.33 |
| [cyberagent/calm3-22b-chat](https://huggingface.co/cyberagent/calm3-22b-chat) | 6.86 | 6.77 |
| [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 7.07 | 7.99 |
| [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 7.64 | 8.27 |
| [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | 5.46 | 6.95 |
| [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) | 8.00 | 8.30 |
| [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | 8.36 | 8.33 |
| [tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4](https://huggingface.co/tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4) | 7.64 | 8.02 |
| [stockmark/Stockmark-2-100B-Instruct-beta](https://huggingface.co/stockmark/Stockmark-2-100B-Instruct-beta) | 7.42 | 7.17 |
| [llm-jp-3-1.8b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-1.8b-instruct3) | 4.64 | 4.09 |
| [llm-jp-3-13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct3) | 6.21 | 6.13 |
| [llm-jp-3-8x13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-8x13b-instruct3) | 6.60 | 6.49 |
| [llm-jp-3.1-1.8b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-1.8b-instruct4) | 6.30 | 5.70 |
| [llm-jp-3.1-13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-13b-instruct4) | 7.37 | 7.01 |
| [llm-jp-3.1-8x13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-8x13b-instruct4) | 7.50 | 7.05 |
#### AnswerCarefully-Eval

[AnswerCarefully-Eval](https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/Q4-19.pdf) assesses the safety of Japanese language model outputs using the LLM-as-a-Judge approach, based on the test set from [llm-jp/AnswerCarefully](https://huggingface.co/datasets/llm-jp/AnswerCarefully). The models were evaluated using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation.
| Model name | Score | Acceptance rate (%, ↑) | Violation rate (%, ↓) |
|---|---|---|---|
| gpt-35-turbo-1106 | 3.98 | 71.7 | 12.6 |
| gpt-4-0613 | 4.06 | 72.3 | 13.2 |
| gpt-4o-2024-08-06 | 4.09 | 72.7 | 12.5 |
| [llm-jp-3-1.8b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-1.8b-instruct3) | 4.03 | 75.9 | 12.2 |
| [llm-jp-3-13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct3) | 4.37 | 88.4 | 6.5 |
| [llm-jp-3-8x13b-instruct3](https://huggingface.co/llm-jp/llm-jp-3-8x13b-instruct3) | 4.48 | 91.6 | 4.3 |
| [llm-jp-3.1-1.8b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-1.8b-instruct4) | 3.66 | 64.7 | 24.3 |
| [llm-jp-3.1-13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-13b-instruct4) | 4.17 | 82.4 | 12.2 |
| [llm-jp-3.1-8x13b-instruct4](https://huggingface.co/llm-jp/llm-jp-3.1-8x13b-instruct4) | 4.26 | 83.1 | 11.6 |
## Technical Details

The models use a Transformer-based architecture, with checkpoints provided in the Hugging Face Transformers format. Training comprises three phases: pre-training, mid-training (instruction pre-training), and post-training (supervised fine-tuning followed by Direct Preference Optimization).
## License

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Risks and Limitations
The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
## Send Questions to

llm-jp(at)nii.ac.jp
## Model Card Authors
The names are listed in alphabetical order.
Hirokazu Kiyomaru and Takashi Kodama.