Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to perform exceptionally well in Japanese language tasks while maintaining strong English capabilities.
Quick Start
All Shisa V2 models inherit the chat templates of their respective base models. They have been tested and validated for proper inference with both vLLM and SGLang.
When running sampler sweeps, the models performed well across a wide range of temperatures in most settings. For translation tasks, a lower temperature (0.2) is recommended to increase accuracy; for role-play and creative tasks, a higher temperature (e.g., 1.0) yields good results. To prevent cross-lingual token leakage, a `top_p` of 0.9 or `min_p` of 0.1 is recommended.
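As a sketch of how these sampling recommendations might be applied when querying an OpenAI-compatible endpoint (such as one served by vLLM or SGLang), the helper below builds a request body. The endpoint, the task-to-temperature mapping as a function, and the exact parameter support are assumptions for illustration, not part of the upstream README:

```python
def make_chat_request(messages, task="translation"):
    """Build a chat-completions request body using the recommended sampling settings.

    Assumption: temperature 0.2 for translation, 1.0 for role-play/creative tasks,
    with top_p capped at 0.9 to limit cross-lingual token leakage.
    """
    temperature = 0.2 if task == "translation" else 1.0
    return {
        "model": "shisa-ai/shisa-v2-llama3.1-8b",
        "messages": messages,
        "temperature": temperature,
        "top_p": 0.9,
    }

req = make_chat_request(
    [{"role": "user", "content": "Translate to Japanese: Good morning."}]
)
```

The body can then be POSTed to a running server's `/v1/chat/completions` route with any HTTP client.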
Features
- Bilingual Excellence: Shisa V2 models are proficient in both Japanese and English, excelling in Japanese language tasks while retaining robust English capabilities.
- Optimized Post-training: Instead of tokenizer extension and costly continued pre-training, the focus is on optimizing post-training, resulting in significant performance gains.
- Scalable Performance: The models show robust scaling, with improved Japanese language performance across all evaluated model sizes.
Installation
The README does not provide specific installation steps, so this section is skipped.
Usage Examples
The README does not contain code examples, so this section is skipped.
Documentation
Model Family Overview
The Shisa V2 family consists of models ranging from 7B to 70B parameters.
These models were trained using the same datasets and training recipes, with the learning rate adjusted for model size and the global batch size adjusted for the 70B model.
Performance
All Shisa V2 models show improved Japanese output quality compared to their respective base models. Here are some performance comparisons:
| Model | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| shisa-ai/shisa-v2-llama3.1-8b | 70.83 | 54.75 | 8.20 | 7.67 | 8.32 | 9.24 | 7.56 | 0.57 | 0.31 | 4.61 | 5.91 | 0.45 | 31.7 | 0.82 | 0.61 |
| meta-llama/Llama-3.1-8B-Instruct | 53.43 | 53.88 | 7.34 | 6.95 | 7.67 | 8.36 | 6.40 | 0.25 | 0.16 | 4.13 | 1.03 | 0.44 | 27.7 | 0.80 | 0.63 |
Testing Notes
- Evaluation Harness: Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness.
- LLM Jury: Shaberi ratings were performed with a PoLL (LLM Jury) consisting of Athene-V2, Llama 3.3 70B, and Tulu 3 405B FP8.
- Testing Tools: Dynamic RoPE extension was used when necessary for models with context windows smaller than 8K tokens. All tests were performed using recent versions of vLLM or SGLang.
- Standard Benchmarks: A custom "multieval" harness was developed for model evaluations. Standard benchmarks include ELYZA Tasks 100, JA MT-Bench, and others.
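The PoLL (LLM Jury) setup described above combines ratings from several judge models. The README does not specify the aggregation rule, so the sketch below assumes a simple mean over the jury's scores; the function name and score values are illustrative:

```python
def poll_score(jury_scores):
    """Aggregate per-judge ratings into one score (assumed: arithmetic mean)."""
    return sum(jury_scores.values()) / len(jury_scores)

# Hypothetical ratings from the three jury models on a single response
scores = {"Athene-V2": 8.5, "Llama-3.3-70B": 8.0, "Tulu-3-405B-FP8": 7.5}
final = poll_score(scores)  # → 8.0
```

Averaging over a panel of judges reduces the bias any single judge model would introduce, which is the motivation behind PoLL-style evaluation.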
New Japanese Benchmarks
- shisa-jp-ifeval: Evaluates instruction-following abilities specific to Japanese grammar and linguistics.
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and multi-turn conversations.
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency.
Usage
- Inference: The models inherit the chat templates of their base models and support inference with vLLM and SGLang.
- Temperature Settings: Different tasks call for different temperatures: lower (0.2) for translation tasks and higher (e.g., 1.0) for role-play and creative tasks.
- Safety: The models inherit the biases and safety profiles of their base models as no additional safety alignment has been done.
Datasets
Training
- Model Testing: Over 200 models were trained to test variables including hyperparameters, data mixes, and data ordering.
- Training Tools: Most training was done on a small AWS SageMaker-deployed 4-node H100 Slurm cluster using Axolotl, DeepSpeed, and Liger Kernels. The Phi 4 and Llama 3.3 70B versions were trained with OpenRLHF.
- Training Logs: The training logs are publicly available on Weights and Biases.
Credits
The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI). Thanks go to many organizations and individuals, including Meta Llama, Microsoft Research, Mistral AI, and Qwen Team, as well as all open-source AI developers and researchers.
1: Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".
Technical Details
The README does not provide in-depth technical details, so this section is skipped.
License
The models in the Shisa V2 family have different licenses, including Apache 2.0, Llama 3.1, and MIT. For example, shisa-v2-qwen2.5-7b is under the Apache 2.0 license, while shisa-v2-llama3.1-8b is under the Llama 3.1 license.