
Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to excel at Japanese language tasks while maintaining strong English capabilities.
Quick Start
Shisa V2 models inherit the chat templates of their respective base models. They have been tested and validated for proper inference with both vLLM and SGLang.
In sampler sweeps, the models performed well across a wide range of temperatures in most settings. For translation tasks, a lower temperature (0.2) is recommended for accuracy. For role-play and creative tasks, a higher temperature (e.g., 1.0) yields good results. To prevent cross-lingual token leakage, set top_p to 0.9 or min_p to 0.1.
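The snippet below is a minimal offline-inference sketch with vLLM that applies this guidance. The model name and sampling values come from this card; the prompt and the choice of top_p for translation versus min_p for creative output are illustrative assumptions, and defaults may vary across vLLM versions.

```python
# Minimal vLLM sketch applying the sampling guidance above. The model and
# sampling values come from this card; the prompt and the choice of top_p
# vs. min_p per task are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="shisa-ai/shisa-v2-llama3.1-8b")

# Translation: low temperature for accuracy, top_p to curb token leakage.
translation_params = SamplingParams(temperature=0.2, top_p=0.9)
# Role-play/creative: higher temperature, min_p as the leakage guard.
creative_params = SamplingParams(temperature=1.0, min_p=0.1)

messages = [{"role": "user", "content": "次の文を英語に翻訳してください: 犬も歩けば棒に当たる。"}]
outputs = llm.chat(messages, sampling_params=translation_params)
print(outputs[0].outputs[0].text)
```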
⨠Features
- Bilingual Excellence: Shisa V2 models are proficient in both Japanese and English, with a focus on improving Japanese language performance.
- Optimized Post-training: Instead of tokenizer extension and costly continued pre-training, the development team focused on optimizing post-training, resulting in significant performance gains.
- Scalable Performance: The training recipe shows robust scaling, improving Japanese language performance across all evaluated model sizes.
Documentation
Model Family Overview
The Shisa V2 family consists of models ranging from 7B to 70B parameters:
License | Model | Parameters | Context Length | JA AVG | EN AVG |
---|---|---|---|---|---|
Apache 2.0 | [shisa-v2-qwen2.5-7b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-7b) | 7B | 128K/8K | 71.06 | 54.86 |
Llama 3.1 | [shisa-v2-llama3.1-8b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-8b)¹ | 8B | 128K | 70.83 | 54.75 |
Apache 2.0 | [shisa-v2-mistral-nemo-12b](https://huggingface.co/shisa-ai/shisa-v2-mistral-nemo-12b) | 12B | 128K | 72.83 | 53.33 |
MIT | [shisa-v2-unphi4-14b](https://huggingface.co/shisa-ai/shisa-v2-unphi4-14b) | 14B | 16K | 75.89 | 60.10 |
Apache 2.0 | [shisa-v2-qwen2.5-32b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-32b) | 32B | 128K/8K | 76.97 | 67.41 |
Llama 3.3 | [shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b)¹ | 70B | 128K | 79.72 | 67.71 |
These models were trained with the same datasets and training recipe; the only adjustments were the learning rate, tuned per model size, and the global batch size for the 70B model.
Performance
All Shisa V2 models demonstrate improved Japanese output quality compared to their respective base models. Here are the performance comparisons:
Model | JA AVG | EN AVG | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[shisa-ai/shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | 79.72 | 67.71 | 8.86 | 8.98 | 9.03 | 9.32 | 8.11 | 0.63 | 0.42 | 4.72 | 8.37 | 0.59 | 48.7 | 0.84 | 0.79 |
[meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | 72.75 | 71.48 | 8.28 | 8.09 | 8.76 | 8.88 | 7.40 | 0.66 | 0.35 | 4.65 | 5.75 | 0.64 | 51.8 | 0.92 | 0.79 |
The Shisa V2 models also perform well against other models in their respective class sizes:
License | Model | JA AVG | EN AVG | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.3 | [shisa-ai/shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | 79.72 | 67.71 | 8.86 | 8.98 | 9.03 | 9.32 | 8.11 | 0.63 | 0.42 | 4.72 | 8.37 | 0.59 | 48.7 | 0.84 | 0.79 |
Qwen | Qwen/Qwen2.5-72B-Instruct | 77.57 | 68.12 | 8.81 | 8.97 | 8.83 | 9.23 | 8.22 | 0.67 | 0.47 | 4.52 | 6.39 | 0.54 | 53.8 | 0.86 | 0.79 |
Llama 3.3 | tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4 | 75.59 | 61.03 | 8.55 | 8.34 | 8.81 | 9.15 | 7.90 | 0.66 | 0.39 | 4.55 | 6.63 | 0.50 | 41.6 | 0.80 | 0.73 |
Llama 3.1 | allenai/Llama-3.1-Tulu-3-70B | 74.64 | 64.48 | 8.60 | 8.31 | 8.84 | 9.36 | 7.91 | 0.65 | 0.41 | 4.70 | 5.31 | 0.54 | 42.4 | 0.86 | 0.76 |
Llama 3.1 | cyberagent/Llama-3.1-70B-Japanese-Instruct-2407 | 73.67 | 64.47 | 8.68 | 8.93 | 8.61 | 9.14 | 8.06 | 0.63 | 0.36 | 4.05 | 6.25 | 0.56 | 43.6 | 0.85 | 0.73 |
Llama 3.3 | meta-llama/Llama-3.3-70B-Instruct | 72.75 | 71.48 | 8.28 | 8.09 | 8.76 | 8.88 | 7.40 | 0.66 | 0.35 | 4.65 | 5.75 | 0.64 | 51.8 | 0.92 | 0.79 |
Llama 3 | [shisa-ai/shisa-v1-llama3-70b](https://huggingface.co/shisa-ai/shisa-v1-llama3-70b) | 60.63 | 52.96 | 7.73 | 7.33 | 8.06 | 8.88 | 6.65 | 0.26 | 0.24 | 4.51 | 3.51 | 0.56 | 27.4 | 0.65 | 0.63 |
Testing Notes
Japanese functional tests were conducted using the [shisa-ai/shaberi](https://github.com/shisa-ai/shaberi/) fork of the [LightBlue Shaberi](https://github.com/lightblue-tech/japanese_llm_eval) evaluation harness. Shaberi ratings were performed with a PoLL (LLM jury) consisting of:
- [Athene-V2](https://huggingface.co/Nexusflow/Athene-V2-Chat)
- [Llama 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
- [Tulu 3 405B FP8](https://huggingface.co/shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic)
The results were statistically validated to be comparable to both gpt-4-1106-preview and human-reviewed "gold standard" ratings.
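As a rough illustration of how a PoLL aggregate could be computed, the sketch below averages independent judge ratings. The judge callables, prompt format, and rating scale are hypothetical; the actual logic lives in the shaberi fork linked above.

```python
# Hypothetical PoLL (LLM-jury) aggregation: each judge independently rates
# an answer and the final score is the mean. The judge callables and the
# numeric scale are illustrative; the real harness is the shaberi fork.
from statistics import mean

def poll_score(question: str, answer: str, judges) -> float:
    """Average the ratings returned by a panel of judge models."""
    return mean(judge(question, answer) for judge in judges)

# In practice each judge wraps an API call to one jury model
# (e.g. Athene-V2, Llama 3.3 70B, Tulu 3 405B); dummies shown here.
dummy_judges = [lambda q, a: 8.0, lambda q, a: 7.5, lambda q, a: 9.0]
print(poll_score("question", "answer", dummy_judges))  # ~8.17
```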
Dynamic RoPE extension was utilized when necessary for testing models with context windows smaller than 8K tokens. All tests were performed using recent versions of [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang).
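A hedged sketch of how dynamic RoPE extension could be configured in vLLM follows. The model name is a placeholder, and both the rope_scaling keyword and its key names ("rope_type" vs. "type") vary across vLLM/transformers versions, so verify against your installed version.

```python
# Hedged sketch of dynamic RoPE extension in vLLM, for evaluating a model
# whose native context is shorter than the 8K evaluation window. The model
# name is a placeholder; rope_scaling key names differ between versions.
from vllm import LLM

llm = LLM(
    model="some-org/short-context-model",  # hypothetical 4K-context model
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},  # ~4K -> ~8K
    max_model_len=8192,
)
```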
Standard benchmarks used for model evaluation include:
- [ELYZA Tasks 100](https://huggingface.co/datasets/elyza/ELYZA-tasks-100)
- [JA MT-Bench](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge) ([dataset](https://huggingface.co/datasets/shisa-ai/ja-mt-bench-1shot))
- [Rakuda](https://huggingface.co/datasets/yuzuai/rakuda-questions)
- Tengu Bench
- [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.4.1)
- MixEval
- LiveBench (2024-11-25)
- IFEval (Lighteval)
- EvalPlus
New Japanese Benchmarks
During model development, several new evaluations were created to measure performance on important Japanese downstream tasks:
- shisa-jp-ifeval: Inspired by IFEval; evaluates instruction-following abilities specific to Japanese grammar and linguistics (closed form).
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and character/persona-based multi-turn conversations, based on Aratako's [Japanese-RP-Bench](https://github.com/Aratako/Japanese-RP-Bench) (LLM judge).
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency (LLM judge, BTL pairwise comparison with logistic-transformation scoring; see the scoring sketch below).
These benchmarks will be open-sourced in the near future to support the Japanese LLM research community.
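As a rough sketch of how shisa-jp-tl-bench's scoring could work, the code below fits Bradley-Terry-Luce strengths from pairwise wins using the standard MM (Zermelo) iteration, then maps each model's log-strength gap versus a baseline through a sigmoid. The sigmoid step is an assumed reading of "logistic transformation scoring," and the win counts are made up; the real scorer may differ.

```python
# Sketch of BTL pairwise scoring: fit Bradley-Terry strengths with the
# standard MM/Zermelo iteration, then squash the log-strength gap vs. a
# baseline through a sigmoid. The sigmoid step is an assumed reading of
# "logistic transformation scoring"; the win counts below are made up.
import math

def fit_btl(wins, models, iters=200):
    """wins[(a, b)] = number of times model a beat model b."""
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            w_i = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = w_i / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

def logistic_score(p, model, baseline):
    """Sigmoid of the log-strength gap against a baseline model."""
    return 1.0 / (1.0 + math.exp(-(math.log(p[model]) - math.log(p[baseline]))))

models = ["model_a", "model_b", "baseline"]
wins = {("model_a", "baseline"): 7, ("baseline", "model_a"): 3,
        ("model_b", "baseline"): 4, ("baseline", "model_b"): 6,
        ("model_a", "model_b"): 6, ("model_b", "model_a"): 4}
strengths = fit_btl(wins, models)
print({m: round(logistic_score(strengths, m, "baseline"), 3) for m in models})
```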
Datasets
Supervised Fine-Tuning (SFT) Stage
The SFT stage dataset consists of approximately 360K samples totaling roughly 420M Llama 3 tokens:
- [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt): A filtered, regenerated, and resampled version of the original Shisa V1 [augmxnt/ultra-orca-boros-en-ja-v1](https://huggingface.co/datasets/augmxnt/ultra-orca-boros-en-ja-v1) dataset.
- [shisa-ai/rewild-set-deepseek-subset](https://huggingface.co/datasets/shisa-ai/rewild-set-deepseek-subset): A filtered version of Rewild (WildChat) prompts translated into Japanese, with responses generated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).
- shisa-ai/magpie-ultra-set: Japanese generations based on [argilla/magpie-ultra-v1.0](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0).
- shisa-ai/magpie-advanced-questions-set: [Magpie](https://magpie-align.github.io/)-generated questions about advanced college-level topics across a variety of academic fields.
- shisa-ai/japan-magpie-set: [Magpie](https://magpie-align.github.io/)-generated questions about Japan's economy and history, as well as its cultural and business practices.
- shisa-ai/shisa-v2-roleplaying-sft: Synthetically generated roleplaying data featuring a wide variety of characters, situations, and genres.
- shisa-ai/translation_expanded_master_set_filtered: A synthetic dataset covering a wide range of translation tasks, including essays, conversations, and fiction.
- shisa-ai/shisa-v2-instruction-following-sft: An instruction-following dataset based on prompts from [Aratako/Magpie-Tanuki-8B-annotated-96k](https://huggingface.co/datasets/Aratako/Magpie-Tanuki-8B-annotated-96k) and a list of instruction-following constraints.
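The sketch below loads one of the published SFT sources with the Hugging Face datasets library for a quick look. The "train" split name and a ShareGPT-style column layout are assumptions inferred from the dataset's name, not verified schema.

```python
# Quick inspection of one SFT source. The "train" split name and a
# ShareGPT-style layout are assumptions, not verified schema.
from datasets import load_dataset

ds = load_dataset("shisa-ai/shisa-v2-sharegpt", split="train")
print(ds)     # row count and column names
print(ds[0])  # one sample record
```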
Final DPO Mix
The final DPO mix is 113K samples totaling approximately 115M Llama 3 tokens:
- [shisa-ai/deepseekv3-ultrafeedback-armorm-dpo](https://huggingface.co/datasets/shisa-ai/deepseekv3-ultrafeedback-armorm-dpo): A version of [princeton-nlp/gemma2-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm) with the chosen responses regenerated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).
- shisa-ai/shisa-v2-roleplaying-dpo: A DPO variant of the roleplaying-sft set that uses an UltraFeedback-style rating system.
- shisa-ai/translation-no-extra-text-dpo-dataset: A DPO set that aims to reduce the tendency of models to output extraneous explanatory text for translations when it is not wanted.
- shisa-ai/shisa-v2-instruction-following-dpo: A DPO variant of the instruction-following-sft set to further enhance instruction-following performance.
- shisa-ai/politeness-dpo-set: A set that allows for greater controllability of speaking style in Japanese responses.
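Similarly, a hedged sketch for peeking at the DPO mix follows; the "prompt"/"chosen"/"rejected" field names follow common DPO conventions and are an assumption about this dataset's actual schema.

```python
# Peek at a DPO set. The prompt/chosen/rejected field names follow common
# DPO conventions and are an assumption about this dataset's schema.
from datasets import load_dataset

dpo = load_dataset("shisa-ai/deepseekv3-ultrafeedback-armorm-dpo", split="train")
sample = dpo[0]
for key in ("prompt", "chosen", "rejected"):
    print(key, "->", str(sample.get(key))[:100])
```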
Training
Over 200 models were trained to empirically test a wide range of variables. In addition to hyperparameter and data-mix testing, numerous tests were run on data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, various forms of self-play, preference tuning, and some of the latest RL/verifiable-reward techniques.
A full discussion of these findings will be published on the [shisa-v2 wiki](https://github.com/shisa-ai/shisa-v2/wiki) and the Shisa.AI website.
Most of the training was done on a small 4-node H100 Slurm cluster deployed via AWS SageMaker. Training was mostly done with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl/) using DeepSpeed and [Liger Kernels](https://github.com/linkedin/Liger-Kernel); the Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. The training logs are [publicly available on Weights and Biases](https://wandb.ai/augmxnt/shisa-v2).
Credits
The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).
Compute was provided by Ubitus K.K. and METI GENIAC.
Thanks to [Meta Llama](https://huggingface.co/meta-llama), Microsoft Research, Mistral AI, and Qwen Team for providing their models to the open-source community, Unsloth for their [llamafied conversion of Phi-4](https://huggingface.co/unsloth/phi-4), the Tulu team for their detailed writeups and fast responses, and Chanvichet Vong of the Axolotl team for his work in the Axolotl Discord.
Special thanks also go to all open-source AI developers and researchers, whose publicly shared research, tooling, and datasets made this work possible, and to Jon Durbin for his work on Shisa V1.
For more details, please visit the [Shisa V2 GitHub repository](https://github.com/shisa-ai/shisa-v2) and the Shisa.AI website.
¹ Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".
License
The Shisa V2 models are released under different licenses depending on the model:
Model | License |
---|---|
[shisa-v2-qwen2.5-7b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-7b) | Apache 2.0 |
[shisa-v2-llama3.1-8b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-8b) | Llama 3.1 |
[shisa-v2-mistral-nemo-12b](https://huggingface.co/shisa-ai/shisa-v2-mistral-nemo-12b) | Apache 2.0 |
[shisa-v2-unphi4-14b](https://huggingface.co/shisa-ai/shisa-v2-unphi4-14b) | MIT |
[shisa-v2-qwen2.5-32b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-32b) | Apache 2.0 |
[shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | Llama 3.3 |

