Shisa V2
Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models developed by Shisa.AI. These models are designed to perform exceptionally well in Japanese language tasks while maintaining strong English capabilities.
Quick Start
All Shisa V2 models inherit the chat templates of their respective base models. They have been tested and validated for proper inference with both vLLM and SGLang.
When running sampler sweeps, the models performed well across a wide range of temperatures in most settings. For translation tasks, a lower temperature (0.2) is recommended to increase accuracy; for role-play and creative tasks, a higher temperature (e.g., 1.0) yields good results. To prevent cross-lingual token leakage, a `top_p` of 0.9 or `min_p` of 0.1 is recommended.
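As a sketch of how these sampling recommendations might be applied when querying an OpenAI-compatible endpoint (such as one served by vLLM or SGLang), the helper below builds a request body. The endpoint, the task-to-temperature mapping as a function, and the exact parameter support are assumptions for illustration, not part of the upstream README:

```python
def make_chat_request(messages, task="translation"):
    """Build a chat-completions request body using the recommended sampling settings.

    Assumption: temperature 0.2 for translation, 1.0 for role-play/creative tasks,
    with top_p capped at 0.9 to limit cross-lingual token leakage.
    """
    temperature = 0.2 if task == "translation" else 1.0
    return {
        "model": "shisa-ai/shisa-v2-llama3.1-8b",
        "messages": messages,
        "temperature": temperature,
        "top_p": 0.9,
    }

req = make_chat_request(
    [{"role": "user", "content": "Translate to Japanese: Good morning."}]
)
```

The body can then be POSTed to a running server's `/v1/chat/completions` route with any HTTP client.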
Features
- Bilingual Excellence: Shisa V2 models are proficient in both Japanese and English, excelling in Japanese language tasks while retaining robust English capabilities.
- Optimized Post-training: Instead of tokenizer extension and costly continued pre-training, the focus is on optimizing post-training, resulting in significant performance gains.
- Scalable Performance: The models show robust scaling, with improved Japanese language performance across all evaluated model sizes.
Installation
The README does not provide specific installation steps, so this section is skipped.
Usage Examples
The README does not contain code examples, so this section is skipped.
Documentation
Model Family Overview
The Shisa V2 family consists of models ranging from 7B to 70B parameters.
These models were trained using the same datasets and training recipes, with the learning rate adjusted for model size and the global batch size adjusted for the 70B model.
Performance
All Shisa V2 models show improved Japanese output quality compared to their respective base models. Here are some performance comparisons:
| Model | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| shisa-ai/shisa-v2-llama3.1-8b | 70.83 | 54.75 | 8.20 | 7.67 | 8.32 | 9.24 | 7.56 | 0.57 | 0.31 | 4.61 | 5.91 | 0.45 | 31.7 | 0.82 | 0.61 |
| meta-llama/Llama-3.1-8B-Instruct | 53.43 | 53.88 | 7.34 | 6.95 | 7.67 | 8.36 | 6.40 | 0.25 | 0.16 | 4.13 | 1.03 | 0.44 | 27.7 | 0.80 | 0.63 |
Testing Notes
- Evaluation Harness: Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness.
- LLM Jury: Shaberi ratings were performed with a PoLL (LLM Jury) consisting of Athene-V2, Llama 3.3 70B, and Tulu 3 405B FP8.
- Testing Tools: Dynamic RoPE extension was used when necessary for models with context windows smaller than 8K tokens. All tests were performed using recent versions of vLLM or SGLang.
- Standard Benchmarks: A custom "multieval" harness was developed for model evaluations. Standard benchmarks include ELYZA Tasks 100, JA MT-Bench, and others.
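The PoLL (LLM Jury) setup described above combines ratings from several judge models. The README does not specify the aggregation rule, so the sketch below assumes a simple mean over the jury's scores; the function name and score values are illustrative:

```python
def poll_score(jury_scores):
    """Aggregate per-judge ratings into one score (assumed: arithmetic mean)."""
    return sum(jury_scores.values()) / len(jury_scores)

# Hypothetical ratings from the three jury models on a single response
scores = {"Athene-V2": 8.5, "Llama-3.3-70B": 8.0, "Tulu-3-405B-FP8": 7.5}
final = poll_score(scores)  # → 8.0
```

Averaging over a panel of judges reduces the bias any single judge model would introduce, which is the motivation behind PoLL-style evaluation.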
New Japanese Benchmarks
- shisa-jp-ifeval: Evaluates instruction-following abilities specific to Japanese grammar and linguistics.
- shisa-jp-rp-bench: Assesses performance on Japanese role-play and multi-turn conversations.
- shisa-jp-tl-bench: Tests Japanese-English translation proficiency.
Usage
- Inference: The models inherit the chat templates of their base models and support inference with vLLM and SGLang.
- Temperature Settings: Different tasks call for different temperatures: lower (0.2) for translation tasks and higher (e.g., 1.0) for role-play and creative tasks.
- Safety: The models inherit the biases and safety profiles of their base models as no additional safety alignment has been done.
Datasets
Training
- Model Testing: Over 200 models were trained to test variables including hyperparameters, data mixes, and data ordering.
- Training Tools: Most training was done on a small AWS SageMaker-deployed 4-node H100 Slurm cluster using Axolotl, DeepSpeed, and Liger Kernels. The Phi 4 and Llama 3.3 70B versions were trained with OpenRLHF.
- Training Logs: The training logs are publicly available on Weights and Biases.
Credits
The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI). Thanks go to many organizations and individuals, including Meta Llama, Microsoft Research, Mistral AI, and Qwen Team, as well as all open-source AI developers and researchers.
1: Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b".
Technical Details
The README does not provide in-depth technical details, so this section is skipped.
License
The models in the Shisa V2 family have different licenses, including Apache 2.0, Llama 3.1, and MIT. For example, shisa-v2-qwen2.5-7b is under the Apache 2.0 license, while shisa-v2-llama3.1-8b is under the Llama 3.1 license.