# Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to excel at Japanese language tasks while retaining strong English capabilities.
## Features
- Since the release of the original Shisa 7B, the baseline Japanese capabilities of open-weight language models have improved significantly. Shisa V2 therefore focuses on optimizing post-training, expanding and refining the synthetic-data-driven approach.
- The Shisa V2 family includes models ranging from 7B to 70B parameters, all trained with the same datasets and training recipes (apart from adjustments for model size).
- All Shisa V2 models show improved Japanese output quality compared to their respective base models and perform well against other models in their size classes.
## Model Family Overview

The Shisa V2 family consists of a series of models with different parameter sizes:

| License | Model Name | Parameters | Context Length | JA AVG | EN AVG |
|---|---|---|---|---|---|
| Apache 2.0 | shisa-v2-qwen2.5-7b | 7B | 128K/8K | 71.06 | 54.86 |
| Llama 3.1 | shisa-v2-llama3.1-8b¹ | 8B | 128K | 70.83 | 54.75 |
| Apache 2.0 | shisa-v2-mistral-nemo-12b | 12B | 128K | 72.83 | 53.33 |
| MIT | shisa-v2-unphi4-14b | 14B | 16K | 75.89 | 60.10 |
| Apache 2.0 | shisa-v2-qwen2.5-32b | 32B | 128K/8K | 76.97 | 67.41 |
| Llama 3.3 | shisa-v2-llama3.3-70b¹ | 70B | 128K | 79.72 | 67.71 |
## Performance
### Comparison with Base Models

All Shisa V2 models demonstrate improved Japanese output quality compared to their respective base models:

| Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT-Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| shisa-ai/shisa-v2-qwen2.5-7b | 71.06 | 54.86 | 8.21 | 7.81 | 8.49 | 8.91 | 7.62 | 0.59 | 0.32 | 4.49 | 5.98 | 0.44 | 32.9 | 0.70 | 0.73 |
| Qwen/Qwen2.5-7B-Instruct | 65.30 | 58.11 | 8.03 | 7.81 | 8.09 | 8.68 | 7.53 | 0.57 | 0.29 | 4.15 | 3.29 | 0.44 | 33.9 | 0.76 | 0.79 |
### Comparison with Other Models

The Shisa V2 models perform well against other models in their respective class sizes:

| License | Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT-Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apache 2.0 | shisa-ai/shisa-v2-qwen2.5-7b | 71.06 | 54.86 | 8.21 | 7.81 | 8.49 | 8.91 | 7.62 | 0.59 | 0.32 | 4.49 | 5.98 | 0.44 | 32.9 | 0.70 | 0.73 |
| Llama 3.1 | shisa-ai/shisa-v2-llama3.1-8b | 70.83 | 54.75 | 8.20 | 7.67 | 8.32 | 9.24 | 7.56 | 0.57 | 0.31 | 4.61 | 5.91 | 0.45 | 31.7 | 0.82 | 0.61 |
| Llama 3.1 | shisa-ai/shisa-v2-llama3.1-8b-preview | 68.03 | 54.56 | 8.12 | 7.55 | 8.57 | 9.03 | 7.33 | 0.56 | 0.19 | 4.67 | 5.18 | 0.46 | 32.0 | 0.79 | 0.62 |
| Llama 3.1 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 67.44 | 42.20 | 8.22 | 8.01 | 8.40 | 9.10 | 7.37 | 0.56 | 0.25 | 4.36 | 4.22 | 0.30 | 26.4 | 0.64 | 0.48 |
| Apache 2.0 | Qwen/Qwen2.5-7B-Instruct | 65.30 | 58.11 | 8.03 | 7.81 | 8.09 | 8.68 | 7.53 | 0.57 | 0.29 | 4.15 | 3.29 | 0.44 | 33.9 | 0.76 | 0.79 |
| Llama 3.1 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 63.80 | 53.94 | 7.93 | 7.57 | 8.26 | 8.61 | 7.28 | 0.39 | 0.22 | 4.53 | 4.17 | 0.46 | 30.4 | 0.77 | 0.62 |
| Llama 3 | elyza/Llama-3-ELYZA-JP-8B | 60.92 | 39.09 | 7.91 | 7.61 | 8.08 | 8.92 | 7.04 | 0.41 | 0.24 | 4.39 | 1.75 | 0.34 | 17.5 | 0.62 | 0.43 |
| Llama 3.1 | allenai/Llama-3.1-Tulu-3.1-8B | 60.86 | 54.21 | 7.42 | 6.84 | 7.69 | 8.61 | 6.52 | 0.51 | 0.22 | 4.39 | 2.90 | 0.40 | 31.3 | 0.82 | 0.63 |
| Apache 2.0 | llm-jp/llm-jp-3-7.2b-instruct3 | 56.05 | 23.46 | 7.66 | 6.99 | 7.70 | 9.16 | 6.79 | 0.47 | 0.20 | 3.03 | 1.49 | 0.22 | 5.2 | 0.49 | 0.18 |
| Llama 3.1 | meta-llama/Llama-3.1-8B-Instruct | 53.43 | 53.43 | 7.34 | 6.95 | 7.67 | 8.36 | 6.40 | 0.25 | 0.16 | 4.13 | 1.03 | 0.44 | 27.7 | 0.80 | 0.63 |
| Llama 3 | shisa-ai/shisa-v1-llama3-8b | 53.08 | 42.80 | 7.17 | 6.40 | 7.50 | 8.31 | 6.48 | 0.23 | 0.09 | 4.20 | 2.24 | 0.36 | 20.2 | 0.63 | 0.52 |
| Apache 2.0 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 52.25 | 27.04 | 7.10 | 6.97 | 6.58 | 8.40 | 6.46 | 0.23 | 0.17 | 3.67 | 2.02 | 0.24 | 14.4 | 0.38 | 0.32 |
| Apache 2.0 | augmxnt/shisa-gamma-7b-v1 | 48.88 | 20.88 | 6.20 | 5.74 | 5.93 | 7.28 | 5.87 | 0.52 | 0.13 | 3.20 | 1.43 | 0.26 | 2.2 | 0.37 | 0.18 |
### Testing Notes

- Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness. Shaberi ratings were performed with a PoLL (LLM jury) consisting of Athene-V2, Llama 3.3 70B, and Tulu 3 405B FP8.
- Dynamic RoPE extension was used when testing models with context windows smaller than 8K tokens (see the sketch after these notes). All tests were performed with recent versions of vLLM or SGLang.
- A custom "multieval" harness was developed to automate model evaluations. Standard benchmarks include ELYZA Tasks 100, JA MT-Bench (dataset), and others.
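As an illustration of the dynamic RoPE setup mentioned above, the sketch below loads a short-context model in vLLM with dynamic NTK scaling so that 8K-token evaluation prompts fit. The model name is a placeholder, and the exact `rope_scaling` keys vary across vLLM/transformers versions, so treat this as a sketch of the approach rather than the harness's actual configuration:

```python
from vllm import LLM, SamplingParams

# Placeholder model name: stands in for any model with a ~4K native context.
# Note: older versions use the dict key "type" instead of "rope_type".
llm = LLM(
    model="example-org/some-4k-context-model",
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},  # ~4K -> ~8K
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=256)
outputs = llm.generate(["こんにちは。自己紹介をしてください。"], params)
print(outputs[0].outputs[0].text)
```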
### New Japanese Benchmarks

- shisa-jp-ifeval: Inspired by IFEval, it evaluates instruction-following abilities specific to Japanese grammar and linguistics (closed form).
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and character/persona-based multi-turn conversations, based on Aratako's Japanese-RP-Bench (LLM judge).
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency (LLM judge, using BTL pairwise comparison with logistic-transformation scoring; see the formulation below).
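For reference, the Bradley-Terry-Luce (BTL) model underlying such pairwise scoring assigns each model a latent strength $\theta_i$ and models the probability that model $i$ beats model $j$ in a head-to-head judgment as

$$
P(i \succ j) = \sigma(\theta_i - \theta_j) = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}
$$

where $\sigma$ is the logistic function; the fitted strengths are then mapped through a logistic transformation onto the reported score scale. This is the textbook BTL formulation, not necessarily the benchmark's exact scoring code.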
## Usage
All Shisa V2 models inherit the chat templates of their respective base models and have been tested and validated for proper inference with both vLLM and SGLang.
### Usage Tips
- For translation tasks, a lower temperature (0.2) is recommended to improve accuracy.
- For role-play and creative tasks, a higher temperature (e.g., 1.0) tends to yield better results.
- To prevent cross-lingual token leakage, a top_p of 0.9 or min_p of 0.1 is recommended (a usage sketch follows this list).
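Putting these tips together, below is a minimal sketch of a translation request against a locally served Shisa V2 model. It assumes a vLLM (or SGLang) OpenAI-compatible server is already running; the `localhost:8000` endpoint is an assumption of the example:

```python
from openai import OpenAI

# Assumes something like `vllm serve shisa-ai/shisa-v2-llama3.1-8b` is running
# locally; the model's chat template is applied server-side.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="shisa-ai/shisa-v2-llama3.1-8b",
    messages=[{"role": "user",
               "content": "次の文を英語に翻訳してください:猫が屋根の上で寝ています。"}],
    temperature=0.2,  # lower temperature recommended for translation
    top_p=0.9,        # helps prevent cross-lingual token leakage
    # min_p is not part of the OpenAI schema; with vLLM it can typically be
    # passed as a server-side extension via extra_body={"min_p": 0.1}.
)
print(response.choices[0].message.content)
```

For role-play or creative generation, raise `temperature` toward 1.0 and keep the same `top_p`/`min_p` guardrails.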
## Important Note
No additional safety alignment has been done on these models, so they will largely inherit the base models' biases and safety profiles.
## Datasets
### Supervised Fine-Tuning (SFT) Stage Dataset

The SFT stage dataset consists of approximately 360K samples totaling about 420M Llama 3 tokens (a loading example follows the list):
- shisa-ai/shisa-v2-sharegpt: A filtered, regenerated, and resampled version of the original Shisa V1 augmxnt/ultra-orca-boros-en-ja-v1 dataset.
- shisa-ai/rewild-set-deepseek-subset: A filtered version of Rewild (WildChat) prompts translated into Japanese, with responses generated by DeepSeek-V3-0324.
- shisa-ai/magpie-ultra-set: Japanese generations based on argilla/magpie-ultra-v1.0.
- shisa-ai/magpie-advanced-questions-set: Magpie-generated questions about advanced college-level topics across various academic fields.
- shisa-ai/japan-magpie-set: Magpie-generated questions about Japan's economy, history, culture, and business practices.
- shisa-ai/shisa-v2-roleplaying-sft: Synthetically generated roleplaying data covering a wide variety of characters, situations, and genres.
- shisa-ai/translation_expanded_master_set_filtered: A synthetic dataset for a wide range of translation tasks, including essays, conversations, and fiction.
- shisa-ai/shisa-v2-instruction-following-sft: An instruction-following dataset based on prompts from Aratako/Magpie-Tanuki-8B-annotated-96k and a list of instruction-following constraints.
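All of these sets are published on the Hugging Face Hub, so any of them can be pulled down and inspected with the `datasets` library. A minimal sketch; the split name is an assumption, so check the individual dataset cards:

```python
from datasets import load_dataset

# Split name assumed; consult the dataset card for the actual configuration.
ds = load_dataset("shisa-ai/shisa-v2-sharegpt", split="train")

print(ds)     # schema and row count
print(ds[0])  # a single ShareGPT-style conversation record
```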
### Final DPO Mix

The final DPO mix contains 113K samples totaling approximately 115M Llama 3 tokens:

- shisa-ai/deepseekv3-ultrafeedback-armorm-dpo: A version of princeton-nlp/gemma2-ultrafeedback-armorm with `chosen` responses regenerated by DeepSeek-V3-0324.
- shisa-ai/shisa-v2-roleplaying-dpo: A DPO variant of the roleplaying SFT set, using an UltraFeedback-style rating system.
- shisa-ai/translation-no-extra-text-dpo-dataset: A DPO set to reduce the tendency of models to output extraneous explanatory text for translations.
- shisa-ai/shisa-v2-instruction-following-dpo: A DPO variant of the instruction-following SFT set to enhance instruction-following performance.
- shisa-ai/politeness-dpo-set: A set to control the speaking style of Japanese responses.
## Training

### Training Process

We trained over 200 models to test a wide range of variables, including hyperparameters, data mixes, data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, self-play, preference tuning, and some of the latest RL/verifiable-reward techniques.

### Training Environment

Most of the training was carried out on a small AWS SageMaker-deployed 4-node H100 Slurm cluster. Training was mainly done with Axolotl using DeepSpeed and [Liger Kernels](https://github.com/linkedin/Liger-Kernel). The Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. Our training logs are publicly available on Weights & Biases.
## Credits
- The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).
- Compute resources were provided by Ubitus K.K. and METI GENIAC.
- Thanks to Meta Llama, Microsoft Research, Mistral AI, and the Qwen Team for providing their models to the open-source community. Also, thanks to Unsloth for the [llamafied conversion of Phi-4](https://huggingface.co/unsloth/phi-4), the Tulu team, and Chanvichet Vong of the Axolotl team.
- Special thanks to all open-source AI developers and researchers, and to Jon Durbin for his work on Shisa V1.
For more details, please visit the Shisa V2 GitHub repository and the Shisa.AI website.
¹ Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".

