Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to excel at Japanese language tasks while maintaining strong English capabilities.
Quick Start
Shisa V2 models inherit the chat templates of their respective base models. They have been tested and validated for proper inference with both vLLM and SGLang.
In sampler sweeps, the models operate well across a variety of temperatures in most settings. For translation tasks, a lower temperature (0.2) is recommended to increase accuracy. For role-play and creative tasks, a higher temperature (e.g., 1.0) yields good results. To prevent cross-lingual token leakage, a `top_p` of 0.9 or `min_p` of 0.1 is recommended.
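Below is a minimal offline-inference sketch of these recommendations using vLLM (assuming a recent vLLM build with the offline `chat` API; the model choice and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; any Shisa V2 model should work the same way.
llm = LLM(model="shisa-ai/shisa-v2-llama3.1-8b")

# Lower temperature for translation accuracy; top_p guards against
# cross-lingual token leakage, per the recommendations above.
translation_params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=512)

# Higher temperature for role-play and creative tasks; min_p=0.1 is the
# alternative leakage guard mentioned above.
creative_params = SamplingParams(temperature=1.0, min_p=0.1, max_tokens=512)

messages = [{"role": "user",
             "content": "次の文を英語に翻訳してください: 猫は窓辺で眠っている。"}]
outputs = llm.chat(messages, translation_params)
print(outputs[0].outputs[0].text)
```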
Features
- Bilingual Excellence: Shisa V2 models are proficient in both Japanese and English, aiming to excel in Japanese language tasks while retaining robust English capabilities.
- Optimized Post-training: Instead of tokenizer extension and costly continued pre-training, the focus is on optimizing post-training, resulting in substantial performance gains.
- High-quality Output: All models demonstrate improved Japanese output quality compared to their respective base models and perform well against other models in their size class.
Documentation
Model Family Overview
The Shisa V2 family includes models with parameters ranging from 7B to 70B:
License | Model Name | Parameters | Context Length | JA AVG | EN AVG |
---|---|---|---|---|---|
Apache 2.0 | shisa-v2-qwen2.5-7b | 7B | 128K/8K | 71.06 | 54.86 |
Llama 3.1 | shisa-v2-llama3.1-8b¹ | 8B | 128K | 70.83 | 54.75 |
Apache 2.0 | shisa-v2-mistral-nemo-12b | 12B | 128K | 72.83 | 53.33 |
MIT | shisa-v2-unphi4-14b | 14B | 16K | 75.89 | 60.10 |
Apache 2.0 | shisa-v2-qwen2.5-32b | 32B | 128K/8K | 76.97 | 67.41 |
Llama 3.3 | shisa-v2-llama3.3-70b¹ | 70B | 128K | 79.72 | 67.71 |
These models were trained using the same datasets and training recipes, except that the learning rate was scaled with model size and the global batch size was modified for the 70B model.
Performance
All Shisa V2 models show improved Japanese output quality compared to their base models. Here are some performance comparisons:
Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
shisa-ai/shisa-v2-mistral-nemo-12b | 72.83 | 53.33 | 8.46 | 8.38 | 8.79 | 9.06 | 7.63 | 0.58 | 0.31 | 4.55 | 6.39 | 0.39 | 33.4 | 0.74 | 0.68 |
mistralai/Mistral-Nemo-Instruct-2407 | 58.44 | 48.07 | 7.68 | 7.29 | 8.03 | 8.68 | 6.73 | 0.55 | 0.13 | 3.60 | 2.11 | 0.31 | 30.0 | 0.64 | 0.68 |
The Shisa V2 models also perform well against other models in their size class:
License | Model Name | JA AVG | EN AVG | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MIT | shisa-ai/shisa-v2-unphi4-14b | 75.89 | 60.10 | 8.50 | 8.45 | 8.84 | 8.96 | 7.73 | 0.62 | 0.43 | 4.76 | 6.79 | 0.53 | 40.7 | 0.67 | 0.80 |
Gemma | google/gemma-3-12b-it | 75.15 | 62.10 | 8.48 | 8.34 | 8.67 | 9.02 | 7.88 | 0.60 | 0.35 | 4.64 | 7.40 | 0.44 | 45.3 | 0.83 | 0.76 |
Apache 2.0 | shisa-ai/shisa-v2-mistral-nemo-12b | 72.83 | 53.33 | 8.46 | 8.38 | 8.79 | 9.06 | 7.63 | 0.58 | 0.31 | 4.55 | 6.39 | 0.39 | 33.4 | 0.74 | 0.68 |
MIT | microsoft/phi-4 | 72.47 | 61.14 | 8.48 | 8.49 | 8.65 | 9.11 | 7.68 | 0.58 | 0.35 | 4.55 | 5.62 | 0.52 | 42.1 | 0.69 | 0.81 |
Apache 2.0 | cyberagent/Mistral-Nemo-Japanese-Instruct-2408 | 71.12 | 48.00 | 8.28 | 8.11 | 8.55 | 9.21 | 7.24 | 0.58 | 0.26 | 4.59 | 6.25 | 0.34 | 28.5 | 0.62 | 0.67 |
Apache 2.0 | Qwen/Qwen2.5-14B-Instruct | 71.02 | 62.54 | 8.27 | 8.15 | 8.64 | 8.70 | 7.59 | 0.63 | 0.34 | 4.51 | 5.03 | 0.52 | 41.4 | 0.81 | 0.76 |
Apache 2.0 | mistralai/Mistral-Nemo-Instruct-2407 | 58.44 | 48.07 | 7.68 | 7.29 | 8.03 | 8.68 | 6.73 | 0.55 | 0.13 | 3.60 | 2.11 | 0.31 | 30.0 | 0.64 | 0.68 |
Testing Notes
Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness. Shaberi ratings were performed with a PoLL (a jury of LLM judges), and the results were statistically validated to be comparable to both gpt-4-1106-preview and human-reviewed "gold standard" ratings.
Dynamic RoPE extension was utilized when necessary for testing models with context windows smaller than 8K tokens. All tests were performed using recent versions of vLLM or SGLang.
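As an illustrative sketch only, dynamic RoPE scaling can be enabled at load time in vLLM roughly as follows (the placeholder model name, scaling factor, and exact `rope_scaling` keys are assumptions; key names and the preferred override mechanism vary across vLLM/Transformers versions):

```python
from vllm import LLM

# Hypothetical model whose native context window is below 8K tokens.
# Dynamic NTK RoPE scaling stretches the usable context at load time.
# Newer vLLM releases may expect this via hf_overrides={"rope_scaling": ...}
# instead of a top-level rope_scaling argument.
llm = LLM(
    model="example-org/short-context-model",  # placeholder, not a real model
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
)
```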
Standard benchmarks used for evaluation include:
- ELYZA Tasks 100
- [JA MT-Bench](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge) (dataset)
- Rakuda
- Tengu Bench
- [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.4.1)
- MixEval
- LiveBench (2024-11-25)
- IFEval (Lighteval)
- EvalPlus
New Japanese Benchmarks
During model development, several new evaluations were created to measure performance on important Japanese downstream tasks:
- shisa-jp-ifeval: Inspired by IFEval, evaluates instruction-following abilities specific to Japanese grammar and linguistics (closed form).
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and character/persona-based multi-turn conversations, based on Aratako's [Japanese-RP-Bench](https://github.com/Aratako/Japanese-RP-Bench) (LLM judge).
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency (LLM judge, Bradley-Terry (BTL) pairwise comparison with logistic transformation scoring; see the scoring sketch below).
These benchmarks are expected to be useful and will be open-sourced in the near future to support the Japanese LLM research community.
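As a rough illustration of BTL-style scoring, the sketch below fits Bradley-Terry strengths from pairwise judge outcomes with the standard MM fixed-point update and then applies a logistic transformation. The model names and win counts are made up, and this is one common formulation rather than necessarily the exact procedure used by shisa-jp-tl-bench:

```python
import numpy as np

# Hypothetical pairwise results: wins[i][j] = times model i beat model j.
models = ["model-a", "model-b", "model-c"]
wins = np.array([[0., 7., 9.],
                 [3., 0., 6.],
                 [1., 4., 0.]])

# Bradley-Terry MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j),
# where W_i is model i's total wins and n_ij = wins[i][j] + wins[j][i].
p = np.ones(len(models))
for _ in range(200):
    new_p = np.empty_like(p)
    for i in range(len(models)):
        total_wins = wins[i].sum()
        denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                    for j in range(len(models)) if j != i)
        new_p[i] = total_wins / denom
    p = new_p / new_p.sum()  # normalize each round for numerical stability

# Logistic transformation of centered log-strengths to a bounded 0-1 score.
scores = 1.0 / (1.0 + np.exp(-(np.log(p) - np.log(p).mean())))
for name, score in zip(models, scores):
    print(f"{name}: {score:.3f}")
```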
Datasets
Supervised Fine-Tuning (SFT) Dataset
The SFT stage dataset consists of approximately 360K samples totaling roughly 420M Llama 3 tokens:
- [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt): A filtered, regenerated, and resampled version of the original Shisa V1 [augmxnt/ultra-orca-boros-en-ja-v1](https://huggingface.co/datasets/augmxnt/ultra-orca-boros-en-ja-v1) dataset. It was the backbone of Shisa V2 training and outperformed all existing mixes/additions.
- [shisa-ai/rewild-set-deepseek-subset](https://huggingface.co/datasets/shisa-ai/rewild-set-deepseek-subset): A filtered version of Rewild (WildChat) prompts translated into Japanese, with responses generated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).
- shisa-ai/magpie-ultra-set: Japanese generations based on [argilla/magpie-ultra-v1.0](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0).
- shisa-ai/magpie-advanced-questions-set: [Magpie](https://magpie-align.github.io/)-generated questions about advanced college-level topics across a variety of academic fields.
- shisa-ai/japan-magpie-set: [Magpie](https://magpie-align.github.io/)-generated questions about Japan's economy and history, as well as cultural and business practices.
- shisa-ai/shisa-v2-roleplaying-sft: Synthetically generated roleplaying data featuring a wide variety of characters, situations, and genres.
- shisa-ai/translation_expanded_master_set_filtered: A synthetic dataset covering a wide range of translation tasks, including essays, conversations, and fiction.
- shisa-ai/shisa-v2-instruction-following-sft: An instruction-following dataset based on prompts from [Aratako/Magpie-Tanuki-8B-annotated-96k](https://huggingface.co/datasets/Aratako/Magpie-Tanuki-8B-annotated-96k) and a list of instruction-following constraints.
Final DPO Mix
The final DPO mix is 113K samples totaling approximately 115M Llama 3 tokens:
- [shisa-ai/deepseekv3-ultrafeedback-armorm-dpo](https://huggingface.co/datasets/shisa-ai/deepseekv3-ultrafeedback-armorm-dpo): A version of [princeton-nlp/gemma2-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm) with `chosen` responses regenerated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324). It outperformed both JA/EN DPO sets and larger sets.
- shisa-ai/shisa-v2-roleplaying-dpo: A DPO variant of the roleplaying-sft set that uses an UltraFeedback-style rating system.
- shisa-ai/translation-no-extra-text-dpo-dataset: A DPO set that aims to reduce the tendency of models to output extraneous explanatory text for translations when it is not wanted.
- shisa-ai/shisa-v2-instruction-following-dpo: A DPO variant of the instruction-following-sft set to further enhance instruction-following performance.
- shisa-ai/politeness-dpo-set: A set that allows greater controllability of speaking style in Japanese responses.
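For context, DPO preference data is conventionally stored as prompt/chosen/rejected triples. The record below is a made-up illustration of the kind of pair the translation-no-extra-text set targets (the field names are an assumption, not these datasets' documented schema):

```python
# Illustrative DPO record: training prefers "chosen" over "rejected"
# for the same prompt.
example = {
    "prompt": "次の文を英語に翻訳してください: 会議は明日の午前10時に始まります。",
    "chosen": "The meeting starts at 10 a.m. tomorrow.",
    # Failure mode targeted by translation-no-extra-text-dpo-dataset:
    # a correct translation wrapped in unwanted explanatory text.
    "rejected": ("Sure! Here's the translation: The meeting starts at "
                 "10 a.m. tomorrow. Let me know if you need anything else!"),
}
```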
Training
Over 200 models were trained to empirically test a wide range of variables. Beyond hyperparameter and data-mix testing, numerous tests were run on data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, various forms of self-play, preference tuning, and some of the latest RL/verifiable-reward techniques.
A full discussion of these learnings is out of scope here, but the [shisa-v2 wiki](https://github.com/shisa-ai/shisa-v2/wiki) and the Shisa.AI website will be updated with forthcoming writeups.
Most of the training was done on a small AWS SageMaker-deployed 4-node H100 Slurm cluster. Training was mostly done with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl/) using DeepSpeed and [Liger Kernels](https://github.com/linkedin/Liger-Kernel). The Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. The training logs are [publicly available on Weights and Biases](https://wandb.ai/augmxnt/shisa-v2).
Credits
The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).
Compute was provided by Ubitus K.K. and METI GENIAC.
Thanks to [Meta Llama](https://huggingface.co/meta-llama), Microsoft Research, Mistral AI, and the Qwen Team for providing their models to the open source community, Unsloth for their [llamafied conversion of Phi-4](https://huggingface.co/unsloth/phi-4), the Tulu team for their detailed writeups and fast responses, and Chanvichet Vong of the Axolotl team for his tireless work in the Axolotl Discord.
Thanks also go to all open source AI developers and researchers. Without their publicly shared research, tooling, and datasets, this work would not be possible. The developers hope that their contributions will further support the broader community.
A special thanks to Jon Durbin for his work on Shisa V1.
For more details on the development and insights, please visit the [Shisa V2 GitHub repository](https://github.com/shisa-ai/shisa-v2) and the Shisa.AI website.
¹ Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".
License
The Shisa V2 models are released under different licenses depending on the model:
Model Name | License |
---|---|
[shisa-v2-qwen2.5-7b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-7b) | Apache 2.0 |
[shisa-v2-llama3.1-8b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-8b) | Llama 3.1 |
[shisa-v2-mistral-nemo-12b](https://huggingface.co/shisa-ai/shisa-v2-mistral-nemo-12b) | Apache 2.0 |
[shisa-v2-unphi4-14b](https://huggingface.co/shisa-ai/shisa-v2-unphi4-14b) | MIT |
[shisa-v2-qwen2.5-32b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-32b) | Apache 2.0 |
[shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | Llama 3.3 |

