# Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to excel at Japanese language tasks while retaining strong English capabilities.
## Features
- Since the release of the original Shisa 7B, the baseline Japanese capabilities of open-weight language models have improved significantly. Shisa V2 therefore focuses on optimizing post-training, expanding and refining the synthetic-data-driven approach.
- The Shisa V2 family includes models ranging from 7B to 70B parameters, all trained with the same datasets and training recipes (apart from adjustments for model size).
- All Shisa V2 models show improved Japanese output quality compared to their respective base models and perform well against other models in their size classes.
## Model Family Overview

The Shisa V2 family consists of a series of models with different parameter sizes:

| License | Model Name | Parameters | Context Length | JA AVG | EN AVG |
|---|---|---|---|---|---|
| Apache 2.0 | shisa-v2-qwen2.5-7b | 7B | 128K/8K | 71.06 | 54.86 |
| Llama 3.1 | shisa-v2-llama3.1-8b¹ | 8B | 128K | 70.83 | 54.75 |
| Apache 2.0 | shisa-v2-mistral-nemo-12b | 12B | 128K | 72.83 | 53.33 |
| MIT | shisa-v2-unphi4-14b | 14B | 16K | 75.89 | 60.10 |
| Apache 2.0 | shisa-v2-qwen2.5-32b | 32B | 128K/8K | 76.97 | 67.41 |
| Llama 3.3 | shisa-v2-llama3.3-70b¹ | 70B | 128K | 79.72 | 67.71 |
## Performance
### Comparison with Base Models

All Shisa V2 models demonstrate improved Japanese output quality compared to their respective base models:

| Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT-Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| shisa-ai/shisa-v2-qwen2.5-7b | 71.06 | 54.86 | 8.21 | 7.81 | 8.49 | 8.91 | 7.62 | 0.59 | 0.32 | 4.49 | 5.98 | 0.44 | 32.9 | 0.70 | 0.73 |
| Qwen/Qwen2.5-7B-Instruct | 65.30 | 58.11 | 8.03 | 7.81 | 8.09 | 8.68 | 7.53 | 0.57 | 0.29 | 4.15 | 3.29 | 0.44 | 33.9 | 0.76 | 0.79 |
### Comparison with Other Models

The Shisa V2 models perform well against other models in their respective class sizes:

| License | Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT-Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apache 2.0 | shisa-ai/shisa-v2-qwen2.5-7b | 71.06 | 54.86 | 8.21 | 7.81 | 8.49 | 8.91 | 7.62 | 0.59 | 0.32 | 4.49 | 5.98 | 0.44 | 32.9 | 0.70 | 0.73 |
| Llama 3.1 | shisa-ai/shisa-v2-llama3.1-8b | 70.83 | 54.75 | 8.20 | 7.67 | 8.32 | 9.24 | 7.56 | 0.57 | 0.31 | 4.61 | 5.91 | 0.45 | 31.7 | 0.82 | 0.61 |
| Llama 3.1 | shisa-ai/shisa-v2-llama3.1-8b-preview | 68.03 | 54.56 | 8.12 | 7.55 | 8.57 | 9.03 | 7.33 | 0.56 | 0.19 | 4.67 | 5.18 | 0.46 | 32.0 | 0.79 | 0.62 |
| Llama 3.1 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 67.44 | 42.20 | 8.22 | 8.01 | 8.40 | 9.10 | 7.37 | 0.56 | 0.25 | 4.36 | 4.22 | 0.30 | 26.4 | 0.64 | 0.48 |
| Apache 2.0 | Qwen/Qwen2.5-7B-Instruct | 65.30 | 58.11 | 8.03 | 7.81 | 8.09 | 8.68 | 7.53 | 0.57 | 0.29 | 4.15 | 3.29 | 0.44 | 33.9 | 0.76 | 0.79 |
| Llama 3.1 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 63.80 | 53.94 | 7.93 | 7.57 | 8.26 | 8.61 | 7.28 | 0.39 | 0.22 | 4.53 | 4.17 | 0.46 | 30.4 | 0.77 | 0.62 |
| Llama 3 | elyza/Llama-3-ELYZA-JP-8B | 60.92 | 39.09 | 7.91 | 7.61 | 8.08 | 8.92 | 7.04 | 0.41 | 0.24 | 4.39 | 1.75 | 0.34 | 17.5 | 0.62 | 0.43 |
| Llama 3.1 | allenai/Llama-3.1-Tulu-3.1-8B | 60.86 | 54.21 | 7.42 | 6.84 | 7.69 | 8.61 | 6.52 | 0.51 | 0.22 | 4.39 | 2.90 | 0.40 | 31.3 | 0.82 | 0.63 |
| Apache 2.0 | llm-jp/llm-jp-3-7.2b-instruct3 | 56.05 | 23.46 | 7.66 | 6.99 | 7.70 | 9.16 | 6.79 | 0.47 | 0.20 | 3.03 | 1.49 | 0.22 | 5.2 | 0.49 | 0.18 |
| Llama 3.1 | meta-llama/Llama-3.1-8B-Instruct | 53.43 | 53.43 | 7.34 | 6.95 | 7.67 | 8.36 | 6.40 | 0.25 | 0.16 | 4.13 | 1.03 | 0.44 | 27.7 | 0.80 | 0.63 |
| Llama 3 | shisa-ai/shisa-v1-llama3-8b | 53.08 | 42.80 | 7.17 | 6.40 | 7.50 | 8.31 | 6.48 | 0.23 | 0.09 | 4.20 | 2.24 | 0.36 | 20.2 | 0.63 | 0.52 |
| Apache 2.0 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 52.25 | 27.04 | 7.10 | 6.97 | 6.58 | 8.40 | 6.46 | 0.23 | 0.17 | 3.67 | 2.02 | 0.24 | 14.4 | 0.38 | 0.32 |
| Apache 2.0 | augmxnt/shisa-gamma-7b-v1 | 48.88 | 20.88 | 6.20 | 5.74 | 5.93 | 7.28 | 5.87 | 0.52 | 0.13 | 3.20 | 1.43 | 0.26 | 2.2 | 0.37 | 0.18 |
### Testing Notes

- Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness. Shaberi ratings were performed with a PoLL (LLM jury) consisting of Athene-V2, Llama 3.3 70B, and Tulu 3 405B FP8.
- Dynamic RoPE extension was used when testing models with context windows smaller than 8K tokens (see the sketch after these notes). All tests were performed with recent versions of vLLM or SGLang.
- A custom "multieval" harness was developed to automate model evaluations. Standard benchmarks include ELYZA Tasks 100, JA MT-Bench (dataset), and others.
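As an illustration of the dynamic RoPE setup mentioned above, the sketch below loads a short-context model in vLLM with dynamic NTK scaling so that 8K-token evaluation prompts fit. The model name is a placeholder, and the exact `rope_scaling` keys vary across vLLM/transformers versions, so treat this as a sketch of the approach rather than the harness's actual configuration:

```python
from vllm import LLM, SamplingParams

# Placeholder model name: stands in for any model with a ~4K native context.
# Note: older versions use the dict key "type" instead of "rope_type".
llm = LLM(
    model="example-org/some-4k-context-model",
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},  # ~4K -> ~8K
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=256)
outputs = llm.generate(["こんにちは。自己紹介をしてください。"], params)
print(outputs[0].outputs[0].text)
```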
### New Japanese Benchmarks

- shisa-jp-ifeval: Inspired by IFEval, it evaluates instruction-following abilities specific to Japanese grammar and linguistics (closed form).
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and character/persona-based multi-turn conversations, based on Aratako's Japanese-RP-Bench (LLM judge).
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency (LLM judge, using BTL pairwise comparison with logistic-transformation scoring; see the formulation below).
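For reference, the Bradley-Terry-Luce (BTL) model underlying such pairwise scoring assigns each model a latent strength $\theta_i$ and models the probability that model $i$ beats model $j$ in a head-to-head judgment as

$$
P(i \succ j) = \sigma(\theta_i - \theta_j) = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}
$$

where $\sigma$ is the logistic function; the fitted strengths are then mapped through a logistic transformation onto the reported score scale. This is the textbook BTL formulation, not necessarily the benchmark's exact scoring code.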
## Usage
All Shisa V2 models inherit the chat templates of their respective base models and have been tested and validated for proper inference with both vLLM and SGLang.
### Usage Tips
- For translation tasks, a lower temperature (0.2) is recommended to improve accuracy.
- For role-play and creative tasks, a higher temperature (e.g., 1.0) tends to yield better results.
- To prevent cross-lingual token leakage, a top_p of 0.9 or min_p of 0.1 is recommended (a usage sketch follows this list).
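Putting these tips together, below is a minimal sketch of a translation request against a locally served Shisa V2 model. It assumes a vLLM (or SGLang) OpenAI-compatible server is already running; the `localhost:8000` endpoint is an assumption of the example:

```python
from openai import OpenAI

# Assumes something like `vllm serve shisa-ai/shisa-v2-llama3.1-8b` is running
# locally; the model's chat template is applied server-side.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="shisa-ai/shisa-v2-llama3.1-8b",
    messages=[{"role": "user",
               "content": "次の文を英語に翻訳してください:猫が屋根の上で寝ています。"}],
    temperature=0.2,  # lower temperature recommended for translation
    top_p=0.9,        # helps prevent cross-lingual token leakage
    # min_p is not part of the OpenAI schema; with vLLM it can typically be
    # passed as a server-side extension via extra_body={"min_p": 0.1}.
)
print(response.choices[0].message.content)
```

For role-play or creative generation, raise `temperature` toward 1.0 and keep the same `top_p`/`min_p` guardrails.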
## Important Note
No additional safety alignment has been done on these models, so they will largely inherit the base models' biases and safety profiles.
## Datasets
### Supervised Fine-Tuning (SFT) Stage Dataset

The SFT stage dataset consists of approximately 360K samples totaling about 420M Llama 3 tokens (a loading example follows the list):
- shisa-ai/shisa-v2-sharegpt: A filtered, regenerated, and resampled version of the original Shisa V1 augmxnt/ultra-orca-boros-en-ja-v1 dataset.
- shisa-ai/rewild-set-deepseek-subset: A filtered version of Rewild (WildChat) prompts translated into Japanese, with responses generated by DeepSeek-V3-0324.
- shisa-ai/magpie-ultra-set: Japanese generations based on argilla/magpie-ultra-v1.0.
- shisa-ai/magpie-advanced-questions-set: Magpie-generated questions about advanced college-level topics across various academic fields.
- shisa-ai/japan-magpie-set: Magpie-generated questions about Japan's economy, history, culture, and business practices.
- shisa-ai/shisa-v2-roleplaying-sft: Synthetically generated roleplaying data covering a wide variety of characters, situations, and genres.
- shisa-ai/translation_expanded_master_set_filtered: A synthetic dataset for a wide range of translation tasks, including essays, conversations, and fiction.
- shisa-ai/shisa-v2-instruction-following-sft: An instruction-following dataset based on prompts from Aratako/Magpie-Tanuki-8B-annotated-96k and a list of instruction-following constraints.
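All of these sets are published on the Hugging Face Hub, so any of them can be pulled down and inspected with the `datasets` library. A minimal sketch; the split name is an assumption, so check the individual dataset cards:

```python
from datasets import load_dataset

# Split name assumed; consult the dataset card for the actual configuration.
ds = load_dataset("shisa-ai/shisa-v2-sharegpt", split="train")

print(ds)     # schema and row count
print(ds[0])  # a single ShareGPT-style conversation record
```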
### Final DPO Mix

The final DPO mix contains 113K samples totaling approximately 115M Llama 3 tokens:

- shisa-ai/deepseekv3-ultrafeedback-armorm-dpo: A version of princeton-nlp/gemma2-ultrafeedback-armorm with `chosen` responses regenerated by DeepSeek-V3-0324.
- shisa-ai/shisa-v2-roleplaying-dpo: A DPO variant of the roleplaying SFT set, using an UltraFeedback-style rating system.
- shisa-ai/translation-no-extra-text-dpo-dataset: A DPO set to reduce the tendency of models to output extraneous explanatory text for translations.
- shisa-ai/shisa-v2-instruction-following-dpo: A DPO variant of the instruction-following SFT set to enhance instruction-following performance.
- shisa-ai/politeness-dpo-set: A set to control the speaking style of Japanese responses.
## Training

### Training Process

We trained over 200 models to test a wide range of variables, including hyperparameters, data mixes, data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, self-play, preference tuning, and some of the latest RL/verifiable-reward techniques.

### Training Environment

Most of the training was carried out on a small AWS SageMaker-deployed 4-node H100 Slurm cluster. Training was mainly done with Axolotl using DeepSpeed and [Liger Kernels](https://github.com/linkedin/Liger-Kernel). The Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. Our training logs are publicly available on Weights & Biases.
## Credits
- The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).
- Compute resources were provided by Ubitus K.K. and METI GENIAC.
- Thanks to Meta Llama, Microsoft Research, Mistral AI, and the Qwen Team for providing their models to the open-source community. Also, thanks to Unsloth for the [llamafied conversion of Phi-4](https://huggingface.co/unsloth/phi-4), the Tulu team, and Chanvichet Vong of the Axolotl team.
- Special thanks to all open-source AI developers and researchers, and to Jon Durbin for his work on Shisa V1.
For more details, please visit the Shisa V2 GitHub repository and the Shisa.AI website.
¹ Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".

