Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to excel at Japanese language tasks while maintaining strong English capabilities.
Quick Start
Shisa V2 models inherit the chat templates of their respective base models. They have been tested and validated for proper inference with both vLLM and SGLang.
In sampler sweeps, the models operate well across a variety of temperatures in most settings. For translation tasks, a lower temperature (0.2) is recommended to increase accuracy. For role-play and creative tasks, a higher temperature (e.g., 1.0) yields good results. To prevent cross-lingual token leakage, a `top_p` of 0.9 or `min_p` of 0.1 is recommended.
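Below is a minimal offline-inference sketch of these recommendations using vLLM (assuming a recent vLLM build with the offline `chat` API; the model choice and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; any Shisa V2 model should work the same way.
llm = LLM(model="shisa-ai/shisa-v2-llama3.1-8b")

# Lower temperature for translation accuracy; top_p guards against
# cross-lingual token leakage, per the recommendations above.
translation_params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=512)

# Higher temperature for role-play and creative tasks; min_p=0.1 is the
# alternative leakage guard mentioned above.
creative_params = SamplingParams(temperature=1.0, min_p=0.1, max_tokens=512)

messages = [{"role": "user",
             "content": "次の文を英語に翻訳してください: 猫は窓辺で眠っている。"}]
outputs = llm.chat(messages, translation_params)
print(outputs[0].outputs[0].text)
```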
Features
- Bilingual Excellence: Shisa V2 models are proficient in both Japanese and English, aiming to excel in Japanese language tasks while retaining robust English capabilities.
- Optimized Post-training: Instead of tokenizer extension and costly continued pre-training, the focus is on optimizing post-training, resulting in substantial performance gains.
- High-quality Output: All models demonstrate improved Japanese output quality compared to their respective base models and perform well against other models in their size class.
Documentation
Model Family Overview
The Shisa V2 family includes models with parameters ranging from 7B to 70B:
License | Model Name | Parameters | Context Length | JA AVG | EN AVG |
---|---|---|---|---|---|
Apache 2.0 | shisa-v2-qwen2.5-7b | 7B | 128K/8K | 71.06 | 54.86 |
Llama 3.1 | shisa-v2-llama3.1-8b¹ | 8B | 128K | 70.83 | 54.75 |
Apache 2.0 | shisa-v2-mistral-nemo-12b | 12B | 128K | 72.83 | 53.33 |
MIT | shisa-v2-unphi4-14b | 14B | 16K | 75.89 | 60.10 |
Apache 2.0 | shisa-v2-qwen2.5-32b | 32B | 128K/8K | 76.97 | 67.41 |
Llama 3.3 | shisa-v2-llama3.3-70b¹ | 70B | 128K | 79.72 | 67.71 |
These models were trained using the same datasets and training recipes, except that the learning rate was scaled with model size and the global batch size was modified for the 70B model.
Performance
All Shisa V2 models show improved Japanese output quality compared to their base models. Here are some performance comparisons:
Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
shisa-ai/shisa-v2-mistral-nemo-12b | 72.83 | 53.33 | 8.46 | 8.38 | 8.79 | 9.06 | 7.63 | 0.58 | 0.31 | 4.55 | 6.39 | 0.39 | 33.4 | 0.74 | 0.68 |
mistralai/Mistral-Nemo-Instruct-2407 | 58.44 | 48.07 | 7.68 | 7.29 | 8.03 | 8.68 | 6.73 | 0.55 | 0.13 | 3.60 | 2.11 | 0.31 | 30.0 | 0.64 | 0.68 |
The Shisa V2 models also perform well against other models in their size class:
License | Model Name | JA AVG | EN AVG | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MIT | shisa-ai/shisa-v2-unphi4-14b | 75.89 | 60.10 | 8.50 | 8.45 | 8.84 | 8.96 | 7.73 | 0.62 | 0.43 | 4.76 | 6.79 | 0.53 | 40.7 | 0.67 | 0.80 |
Gemma | google/gemma-3-12b-it | 75.15 | 62.10 | 8.48 | 8.34 | 8.67 | 9.02 | 7.88 | 0.60 | 0.35 | 4.64 | 7.40 | 0.44 | 45.3 | 0.83 | 0.76 |
Apache 2.0 | shisa-ai/shisa-v2-mistral-nemo-12b | 72.83 | 53.33 | 8.46 | 8.38 | 8.79 | 9.06 | 7.63 | 0.58 | 0.31 | 4.55 | 6.39 | 0.39 | 33.4 | 0.74 | 0.68 |
MIT | microsoft/phi-4 | 72.47 | 61.14 | 8.48 | 8.49 | 8.65 | 9.11 | 7.68 | 0.58 | 0.35 | 4.55 | 5.62 | 0.52 | 42.1 | 0.69 | 0.81 |
Apache 2.0 | cyberagent/Mistral-Nemo-Japanese-Instruct-2408 | 71.12 | 48.00 | 8.28 | 8.11 | 8.55 | 9.21 | 7.24 | 0.58 | 0.26 | 4.59 | 6.25 | 0.34 | 28.5 | 0.62 | 0.67 |
Apache 2.0 | Qwen/Qwen2.5-14B-Instruct | 71.02 | 62.54 | 8.27 | 8.15 | 8.64 | 8.70 | 7.59 | 0.63 | 0.34 | 4.51 | 5.03 | 0.52 | 41.4 | 0.81 | 0.76 |
Apache 2.0 | mistralai/Mistral-Nemo-Instruct-2407 | 58.44 | 48.07 | 7.68 | 7.29 | 8.03 | 8.68 | 6.73 | 0.55 | 0.13 | 3.60 | 2.11 | 0.31 | 30.0 | 0.64 | 0.68 |
Testing Notes
Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness. Shaberi ratings were performed with a PoLL (a jury of LLM judges), and the results were statistically validated to be comparable to both gpt-4-1106-preview and human-reviewed "gold standard" ratings.
Dynamic RoPE extension was utilized when necessary for testing models with context windows smaller than 8K tokens. All tests were performed using recent versions of vLLM or SGLang.
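As an illustrative sketch only, dynamic RoPE scaling can be enabled at load time in vLLM roughly as follows (the placeholder model name, scaling factor, and exact `rope_scaling` keys are assumptions; key names and the preferred override mechanism vary across vLLM/Transformers versions):

```python
from vllm import LLM

# Hypothetical model whose native context window is below 8K tokens.
# Dynamic NTK RoPE scaling stretches the usable context at load time.
# Newer vLLM releases may expect this via hf_overrides={"rope_scaling": ...}
# instead of a top-level rope_scaling argument.
llm = LLM(
    model="example-org/short-context-model",  # placeholder, not a real model
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
)
```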
Standard benchmarks used for evaluation include:
- ELYZA Tasks 100
- [JA MT-Bench](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge) (dataset)
- Rakuda
- Tengu Bench
- [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.4.1)
- MixEval
- LiveBench (2024-11-25)
- IFEval (Lighteval)
- EvalPlus
New Japanese Benchmarks
During model development, several new evaluations were created to measure performance on important Japanese downstream tasks:
- shisa-jp-ifeval: Inspired by IFEval, evaluates instruction-following abilities specific to Japanese grammar and linguistics (closed form).
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and character/persona-based multi-turn conversations, based on Aratako's [Japanese-RP-Bench](https://github.com/Aratako/Japanese-RP-Bench) (LLM judge).
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency (LLM judge, Bradley-Terry (BTL) pairwise comparison with logistic transformation scoring; see the scoring sketch below).
These benchmarks are expected to be useful and will be open-sourced in the near future to support the Japanese LLM research community.
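As a rough illustration of BTL-style scoring, the sketch below fits Bradley-Terry strengths from pairwise judge outcomes with the standard MM fixed-point update and then applies a logistic transformation. The model names and win counts are made up, and this is one common formulation rather than necessarily the exact procedure used by shisa-jp-tl-bench:

```python
import numpy as np

# Hypothetical pairwise results: wins[i][j] = times model i beat model j.
models = ["model-a", "model-b", "model-c"]
wins = np.array([[0., 7., 9.],
                 [3., 0., 6.],
                 [1., 4., 0.]])

# Bradley-Terry MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j),
# where W_i is model i's total wins and n_ij = wins[i][j] + wins[j][i].
p = np.ones(len(models))
for _ in range(200):
    new_p = np.empty_like(p)
    for i in range(len(models)):
        total_wins = wins[i].sum()
        denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                    for j in range(len(models)) if j != i)
        new_p[i] = total_wins / denom
    p = new_p / new_p.sum()  # normalize each round for numerical stability

# Logistic transformation of centered log-strengths to a bounded 0-1 score.
scores = 1.0 / (1.0 + np.exp(-(np.log(p) - np.log(p).mean())))
for name, score in zip(models, scores):
    print(f"{name}: {score:.3f}")
```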
Datasets
Supervised Fine-Tuning (SFT) Dataset
The SFT stage dataset consists of approximately 360K samples totaling roughly 420M Llama 3 tokens:
- [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt): A filtered, regenerated, and resampled version of the original Shisa V1 [augmxnt/ultra-orca-boros-en-ja-v1](https://huggingface.co/datasets/augmxnt/ultra-orca-boros-en-ja-v1) dataset. It was the backbone of Shisa V2 training and outperformed all existing mixes/additions.
- [shisa-ai/rewild-set-deepseek-subset](https://huggingface.co/datasets/shisa-ai/rewild-set-deepseek-subset): A filtered version of Rewild (WildChat) prompts translated into Japanese, with responses generated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).
- shisa-ai/magpie-ultra-set: Japanese generations based on [argilla/magpie-ultra-v1.0](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0).
- shisa-ai/magpie-advanced-questions-set: [Magpie](https://magpie-align.github.io/)-generated questions about advanced college-level topics across a variety of academic fields.
- shisa-ai/japan-magpie-set: [Magpie](https://magpie-align.github.io/)-generated questions about Japan's economy and history, as well as cultural and business practices.
- shisa-ai/shisa-v2-roleplaying-sft: Synthetically generated roleplaying data featuring a wide variety of characters, situations, and genres.
- shisa-ai/translation_expanded_master_set_filtered: A synthetic dataset covering a wide range of translation tasks, including essays, conversations, and fiction.
- shisa-ai/shisa-v2-instruction-following-sft: An instruction-following dataset based on prompts from [Aratako/Magpie-Tanuki-8B-annotated-96k](https://huggingface.co/datasets/Aratako/Magpie-Tanuki-8B-annotated-96k) and a list of instruction-following constraints.
Final DPO Mix
The final DPO mix is 113K samples totaling approximately 115M Llama 3 tokens:
- [shisa-ai/deepseekv3-ultrafeedback-armorm-dpo](https://huggingface.co/datasets/shisa-ai/deepseekv3-ultrafeedback-armorm-dpo): A version of [princeton-nlp/gemma2-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm) with `chosen` responses regenerated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324). It outperformed both JA/EN DPO sets and larger sets.
- shisa-ai/shisa-v2-roleplaying-dpo: A DPO variant of the roleplaying-sft set that uses an UltraFeedback-style rating system.
- shisa-ai/translation-no-extra-text-dpo-dataset: A DPO set that aims to reduce the tendency of models to output extraneous explanatory text for translations when it is not wanted.
- shisa-ai/shisa-v2-instruction-following-dpo: A DPO variant of the instruction-following-sft set to further enhance instruction-following performance.
- shisa-ai/politeness-dpo-set: A set that allows greater controllability of speaking style in Japanese responses.
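For context, DPO preference data is conventionally stored as prompt/chosen/rejected triples. The record below is a made-up illustration of the kind of pair the translation-no-extra-text set targets (the field names are an assumption, not these datasets' documented schema):

```python
# Illustrative DPO record: training prefers "chosen" over "rejected"
# for the same prompt.
example = {
    "prompt": "次の文を英語に翻訳してください: 会議は明日の午前10時に始まります。",
    "chosen": "The meeting starts at 10 a.m. tomorrow.",
    # Failure mode targeted by translation-no-extra-text-dpo-dataset:
    # a correct translation wrapped in unwanted explanatory text.
    "rejected": ("Sure! Here's the translation: The meeting starts at "
                 "10 a.m. tomorrow. Let me know if you need anything else!"),
}
```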
Training
Over 200 models were trained to empirically test a wide range of variables. Beyond hyperparameter and data-mix testing, numerous tests were run on data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, various forms of self-play, preference tuning, and some of the latest RL/verifiable-reward techniques.
A full discussion of these learnings is out of scope here, but the [shisa-v2 wiki](https://github.com/shisa-ai/shisa-v2/wiki) and the Shisa.AI website will be updated with forthcoming writeups.
Most of the training was done on a small AWS SageMaker-deployed 4-node H100 Slurm cluster. Training was mostly done with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl/) using DeepSpeed and [Liger Kernels](https://github.com/linkedin/Liger-Kernel). The Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. The training logs are [publicly available on Weights and Biases](https://wandb.ai/augmxnt/shisa-v2).
Credits
The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).
Compute was provided by Ubitus K.K. and METI GENIAC.
Thanks to [Meta Llama](https://huggingface.co/meta-llama), Microsoft Research, Mistral AI, and the Qwen Team for providing their models to the open source community, Unsloth for their [llamafied conversion of Phi-4](https://huggingface.co/unsloth/phi-4), the Tulu team for their detailed writeups and fast responses, and Chanvichet Vong of the Axolotl team for his tireless work in the Axolotl Discord.
Thanks also go to all open source AI developers and researchers. Without their publicly shared research, tooling, and datasets, this work would not be possible. The developers hope that their contributions will further support the broader community.
A special thanks to Jon Durbin for his work on Shisa V1.
For more details on the development and insights, please visit the [Shisa V2 GitHub repository](https://github.com/shisa-ai/shisa-v2) and the Shisa.AI website.
¹ Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".
License
The Shisa V2 models are released under different licenses depending on the model:
Model Name | License |
---|---|
[shisa-v2-qwen2.5-7b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-7b) | Apache 2.0 |
[shisa-v2-llama3.1-8b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-8b) | Llama 3.1 |
[shisa-v2-mistral-nemo-12b](https://huggingface.co/shisa-ai/shisa-v2-mistral-nemo-12b) | Apache 2.0 |
[shisa-v2-unphi4-14b](https://huggingface.co/shisa-ai/shisa-v2-unphi4-14b) | MIT |
[shisa-v2-qwen2.5-32b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-32b) | Apache 2.0 |
[shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | Llama 3.3 |

