
Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to excel at Japanese language tasks while maintaining strong English capabilities.
Quick Start
Shisa V2 models inherit the chat templates of their respective base models. They have been tested and validated for proper inference with both vLLM and SGLang.
In sampler sweeps, the models performed well across a wide range of temperatures in most settings. For translation tasks, a lower temperature (0.2) is recommended for accuracy. For role-play and creative tasks, a higher temperature (e.g., 1.0) yields good results. To prevent cross-lingual token leakage, set top_p to 0.9 or min_p to 0.1.
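The snippet below is a minimal offline-inference sketch with vLLM that applies this guidance. The model name and sampling values come from this card; the prompt and the choice of top_p for translation versus min_p for creative output are illustrative assumptions, and defaults may vary across vLLM versions.

```python
# Minimal vLLM sketch applying the sampling guidance above. The model and
# sampling values come from this card; the prompt and the choice of top_p
# vs. min_p per task are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="shisa-ai/shisa-v2-llama3.1-8b")

# Translation: low temperature for accuracy, top_p to curb token leakage.
translation_params = SamplingParams(temperature=0.2, top_p=0.9)
# Role-play/creative: higher temperature, min_p as the leakage guard.
creative_params = SamplingParams(temperature=1.0, min_p=0.1)

messages = [{"role": "user", "content": "次の文を英語に翻訳してください: 犬も歩けば棒に当たる。"}]
outputs = llm.chat(messages, sampling_params=translation_params)
print(outputs[0].outputs[0].text)
```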
⨠Features
- Bilingual Excellence: Shisa V2 models are proficient in both Japanese and English, with a focus on improving Japanese language performance.
- Optimized Post-training: Instead of tokenizer extension and costly continued pre-training, the development team focused on optimizing post-training, resulting in significant performance gains.
- Scalable Performance: The training recipe shows robust scaling, improving Japanese language performance across all evaluated model sizes.
Documentation
Model Family Overview
The Shisa V2 family consists of models ranging from 7B to 70B parameters:
License | Model | Parameters | Context Length | JA AVG | EN AVG |
---|---|---|---|---|---|
Apache 2.0 | [shisa-v2-qwen2.5-7b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-7b) | 7B | 128K/8K | 71.06 | 54.86 |
Llama 3.1 | [shisa-v2-llama3.1-8b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-8b)¹ | 8B | 128K | 70.83 | 54.75 |
Apache 2.0 | [shisa-v2-mistral-nemo-12b](https://huggingface.co/shisa-ai/shisa-v2-mistral-nemo-12b) | 12B | 128K | 72.83 | 53.33 |
MIT | [shisa-v2-unphi4-14b](https://huggingface.co/shisa-ai/shisa-v2-unphi4-14b) | 14B | 16K | 75.89 | 60.10 |
Apache 2.0 | [shisa-v2-qwen2.5-32b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-32b) | 32B | 128K/8K | 76.97 | 67.41 |
Llama 3.3 | [shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b)¹ | 70B | 128K | 79.72 | 67.71 |
These models were trained with the same datasets and training recipe; the only adjustments were the learning rate, tuned per model size, and the global batch size for the 70B model.
Performance
All Shisa V2 models demonstrate improved Japanese output quality compared to their respective base models. Here are the performance comparisons:
Model | JA AVG | EN AVG | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[shisa-ai/shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | 79.72 | 67.71 | 8.86 | 8.98 | 9.03 | 9.32 | 8.11 | 0.63 | 0.42 | 4.72 | 8.37 | 0.59 | 48.7 | 0.84 | 0.79 |
[meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | 72.75 | 71.48 | 8.28 | 8.09 | 8.76 | 8.88 | 7.40 | 0.66 | 0.35 | 4.65 | 5.75 | 0.64 | 51.8 | 0.92 | 0.79 |
The Shisa V2 models also perform well against other models in their respective class sizes:
License | Model | JA AVG | EN AVG | Shaberi AVG | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.3 | [shisa-ai/shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | 79.72 | 67.71 | 8.86 | 8.98 | 9.03 | 9.32 | 8.11 | 0.63 | 0.42 | 4.72 | 8.37 | 0.59 | 48.7 | 0.84 | 0.79 |
Qwen | Qwen/Qwen2.5-72B-Instruct | 77.57 | 68.12 | 8.81 | 8.97 | 8.83 | 9.23 | 8.22 | 0.67 | 0.47 | 4.52 | 6.39 | 0.54 | 53.8 | 0.86 | 0.79 |
Llama 3.3 | tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4 | 75.59 | 61.03 | 8.55 | 8.34 | 8.81 | 9.15 | 7.90 | 0.66 | 0.39 | 4.55 | 6.63 | 0.50 | 41.6 | 0.80 | 0.73 |
Llama 3.1 | allenai/Llama-3.1-Tulu-3-70B | 74.64 | 64.48 | 8.60 | 8.31 | 8.84 | 9.36 | 7.91 | 0.65 | 0.41 | 4.70 | 5.31 | 0.54 | 42.4 | 0.86 | 0.76 |
Llama 3.1 | cyberagent/Llama-3.1-70B-Japanese-Instruct-2407 | 73.67 | 64.47 | 8.68 | 8.93 | 8.61 | 9.14 | 8.06 | 0.63 | 0.36 | 4.05 | 6.25 | 0.56 | 43.6 | 0.85 | 0.73 |
Llama 3.3 | meta-llama/Llama-3.3-70B-Instruct | 72.75 | 71.48 | 8.28 | 8.09 | 8.76 | 8.88 | 7.40 | 0.66 | 0.35 | 4.65 | 5.75 | 0.64 | 51.8 | 0.92 | 0.79 |
Llama 3 | [shisa-ai/shisa-v1-llama3-70b](https://huggingface.co/shisa-ai/shisa-v1-llama3-70b) | 60.63 | 52.96 | 7.73 | 7.33 | 8.06 | 8.88 | 6.65 | 0.26 | 0.24 | 4.51 | 3.51 | 0.56 | 27.4 | 0.65 | 0.63 |
Testing Notes
Japanese functional tests were conducted using the [shisa-ai/shaberi](https://github.com/shisa-ai/shaberi/) fork of the [LightBlue Shaberi](https://github.com/lightblue-tech/japanese_llm_eval) evaluation harness. Shaberi ratings were performed with a PoLL (LLM jury) consisting of:
- [Athene-V2](https://huggingface.co/Nexusflow/Athene-V2-Chat)
- [Llama 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
- [Tulu 3 405B FP8](https://huggingface.co/shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic)
The results were statistically validated to be comparable to both gpt-4-1106-preview and human-reviewed "gold standard" ratings.
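As a rough illustration of how a PoLL aggregate could be computed, the sketch below averages independent judge ratings. The judge callables, prompt format, and rating scale are hypothetical; the actual logic lives in the shaberi fork linked above.

```python
# Hypothetical PoLL (LLM-jury) aggregation: each judge independently rates
# an answer and the final score is the mean. The judge callables and the
# numeric scale are illustrative; the real harness is the shaberi fork.
from statistics import mean

def poll_score(question: str, answer: str, judges) -> float:
    """Average the ratings returned by a panel of judge models."""
    return mean(judge(question, answer) for judge in judges)

# In practice each judge wraps an API call to one jury model
# (e.g. Athene-V2, Llama 3.3 70B, Tulu 3 405B); dummies shown here.
dummy_judges = [lambda q, a: 8.0, lambda q, a: 7.5, lambda q, a: 9.0]
print(poll_score("question", "answer", dummy_judges))  # ~8.17
```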
Dynamic RoPE extension was utilized when necessary for testing models with context windows smaller than 8K tokens. All tests were performed using recent versions of [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang).
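A hedged sketch of how dynamic RoPE extension could be configured in vLLM follows. The model name is a placeholder, and both the rope_scaling keyword and its key names ("rope_type" vs. "type") vary across vLLM/transformers versions, so verify against your installed version.

```python
# Hedged sketch of dynamic RoPE extension in vLLM, for evaluating a model
# whose native context is shorter than the 8K evaluation window. The model
# name is a placeholder; rope_scaling key names differ between versions.
from vllm import LLM

llm = LLM(
    model="some-org/short-context-model",  # hypothetical 4K-context model
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},  # ~4K -> ~8K
    max_model_len=8192,
)
```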
Standard benchmarks used for model evaluation include:
- [ELYZA Tasks 100](https://huggingface.co/datasets/elyza/ELYZA-tasks-100)
- [JA MT-Bench](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge) ([dataset](https://huggingface.co/datasets/shisa-ai/ja-mt-bench-1shot))
- [Rakuda](https://huggingface.co/datasets/yuzuai/rakuda-questions)
- Tengu Bench
- [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.4.1)
- MixEval
- LiveBench (2024-11-25)
- IFEval (Lighteval)
- EvalPlus
New Japanese Benchmarks
During model development, several new evaluations were created to measure performance on important Japanese downstream tasks:
- shisa-jp-ifeval: Inspired by IFEval; evaluates instruction-following abilities specific to Japanese grammar and linguistics (closed form).
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and character/persona-based multi-turn conversations, based on Aratako's [Japanese-RP-Bench](https://github.com/Aratako/Japanese-RP-Bench) (LLM judge).
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency (LLM judge, BTL pairwise comparison with logistic-transformation scoring; see the scoring sketch below).
These benchmarks will be open-sourced in the near future to support the Japanese LLM research community.
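As a rough sketch of how shisa-jp-tl-bench's scoring could work, the code below fits Bradley-Terry-Luce strengths from pairwise wins using the standard MM (Zermelo) iteration, then maps each model's log-strength gap versus a baseline through a sigmoid. The sigmoid step is an assumed reading of "logistic transformation scoring," and the win counts are made up; the real scorer may differ.

```python
# Sketch of BTL pairwise scoring: fit Bradley-Terry strengths with the
# standard MM/Zermelo iteration, then squash the log-strength gap vs. a
# baseline through a sigmoid. The sigmoid step is an assumed reading of
# "logistic transformation scoring"; the win counts below are made up.
import math

def fit_btl(wins, models, iters=200):
    """wins[(a, b)] = number of times model a beat model b."""
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            w_i = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = w_i / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

def logistic_score(p, model, baseline):
    """Sigmoid of the log-strength gap against a baseline model."""
    return 1.0 / (1.0 + math.exp(-(math.log(p[model]) - math.log(p[baseline]))))

models = ["model_a", "model_b", "baseline"]
wins = {("model_a", "baseline"): 7, ("baseline", "model_a"): 3,
        ("model_b", "baseline"): 4, ("baseline", "model_b"): 6,
        ("model_a", "model_b"): 6, ("model_b", "model_a"): 4}
strengths = fit_btl(wins, models)
print({m: round(logistic_score(strengths, m, "baseline"), 3) for m in models})
```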
Datasets
Supervised Fine-Tuning (SFT) Stage
The SFT stage dataset consists of approximately 360K samples totaling roughly 420M Llama 3 tokens:
- [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt): A filtered, regenerated, and resampled version of the original Shisa V1 [augmxnt/ultra-orca-boros-en-ja-v1](https://huggingface.co/datasets/augmxnt/ultra-orca-boros-en-ja-v1) dataset.
- [shisa-ai/rewild-set-deepseek-subset](https://huggingface.co/datasets/shisa-ai/rewild-set-deepseek-subset): A filtered version of Rewild (WildChat) prompts translated into Japanese, with responses generated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).
- shisa-ai/magpie-ultra-set: Japanese generations based on [argilla/magpie-ultra-v1.0](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0).
- shisa-ai/magpie-advanced-questions-set: [Magpie](https://magpie-align.github.io/)-generated questions about advanced college-level topics across a variety of academic fields.
- shisa-ai/japan-magpie-set: [Magpie](https://magpie-align.github.io/)-generated questions about Japan's economy and history, as well as its cultural and business practices.
- shisa-ai/shisa-v2-roleplaying-sft: Synthetically generated roleplaying data featuring a wide variety of characters, situations, and genres.
- shisa-ai/translation_expanded_master_set_filtered: A synthetic dataset covering a wide range of translation tasks, including essays, conversations, and fiction.
- shisa-ai/shisa-v2-instruction-following-sft: An instruction-following dataset based on prompts from [Aratako/Magpie-Tanuki-8B-annotated-96k](https://huggingface.co/datasets/Aratako/Magpie-Tanuki-8B-annotated-96k) and a list of instruction-following constraints.
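The sketch below loads one of the published SFT sources with the Hugging Face datasets library for a quick look. The "train" split name and a ShareGPT-style column layout are assumptions inferred from the dataset's name, not verified schema.

```python
# Quick inspection of one SFT source. The "train" split name and a
# ShareGPT-style layout are assumptions, not verified schema.
from datasets import load_dataset

ds = load_dataset("shisa-ai/shisa-v2-sharegpt", split="train")
print(ds)     # row count and column names
print(ds[0])  # one sample record
```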
Final DPO Mix
The final DPO mix is 113K samples totaling approximately 115M Llama 3 tokens:
- [shisa-ai/deepseekv3-ultrafeedback-armorm-dpo](https://huggingface.co/datasets/shisa-ai/deepseekv3-ultrafeedback-armorm-dpo): A version of [princeton-nlp/gemma2-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm) with the chosen responses regenerated by [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).
- shisa-ai/shisa-v2-roleplaying-dpo: A DPO variant of the roleplaying-sft set that uses an UltraFeedback-style rating system.
- shisa-ai/translation-no-extra-text-dpo-dataset: A DPO set that aims to reduce the tendency of models to output extraneous explanatory text for translations when it is not wanted.
- shisa-ai/shisa-v2-instruction-following-dpo: A DPO variant of the instruction-following-sft set to further enhance instruction-following performance.
- shisa-ai/politeness-dpo-set: A set that allows for greater controllability of speaking style in Japanese responses.
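Similarly, a hedged sketch for peeking at the DPO mix follows; the "prompt"/"chosen"/"rejected" field names follow common DPO conventions and are an assumption about this dataset's actual schema.

```python
# Peek at a DPO set. The prompt/chosen/rejected field names follow common
# DPO conventions and are an assumption about this dataset's schema.
from datasets import load_dataset

dpo = load_dataset("shisa-ai/deepseekv3-ultrafeedback-armorm-dpo", split="train")
sample = dpo[0]
for key in ("prompt", "chosen", "rejected"):
    print(key, "->", str(sample.get(key))[:100])
```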
Training
Over 200 models were trained to empirically test a wide range of variables. In addition to hyperparameter and data-mix testing, numerous tests were run on data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, various forms of self-play, preference tuning, and some of the latest RL/verifiable-reward techniques.
A full discussion of these findings will be published on the [shisa-v2 wiki](https://github.com/shisa-ai/shisa-v2/wiki) and the Shisa.AI website.
Most of the training was done on a small 4-node H100 Slurm cluster deployed via AWS SageMaker. Training was mostly done with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl/) using DeepSpeed and [Liger Kernels](https://github.com/linkedin/Liger-Kernel); the Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. The training logs are [publicly available on Weights and Biases](https://wandb.ai/augmxnt/shisa-v2).
Credits
The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).
Compute was provided by Ubitus K.K. and METI GENIAC.
Thanks to [Meta Llama](https://huggingface.co/meta-llama), Microsoft Research, Mistral AI, and Qwen Team for providing their models to the open-source community, Unsloth for their [llamafied conversion of Phi-4](https://huggingface.co/unsloth/phi-4), the Tulu team for their detailed writeups and fast responses, and Chanvichet Vong of the Axolotl team for his work in the Axolotl Discord.
Special thanks also go to all open-source AI developers and researchers, whose publicly shared research, tooling, and datasets made this work possible, and to Jon Durbin for his work on Shisa V1.
For more details, please visit the [Shisa V2 GitHub repository](https://github.com/shisa-ai/shisa-v2) and the Shisa.AI website.
¹ Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".
License
The Shisa V2 models are released under different licenses depending on the model:
Model | License |
---|---|
[shisa-v2-qwen2.5-7b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-7b) | Apache 2.0 |
[shisa-v2-llama3.1-8b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-8b) | Llama 3.1 |
[shisa-v2-mistral-nemo-12b](https://huggingface.co/shisa-ai/shisa-v2-mistral-nemo-12b) | Apache 2.0 |
[shisa-v2-unphi4-14b](https://huggingface.co/shisa-ai/shisa-v2-unphi4-14b) | MIT |
[shisa-v2-qwen2.5-32b](https://huggingface.co/shisa-ai/shisa-v2-qwen2.5-32b) | Apache 2.0 |
[shisa-v2-llama3.3-70b](https://huggingface.co/shisa-ai/shisa-v2-llama3.3-70b) | Llama 3.3 |

