shisa-v1-llama3-8b Model
This model is a fine-tuned version of Llama 3 that aims to deliver strong performance on a range of language tasks, especially in English and Japanese.
🚀 Quick Start
This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). You can refer to the following details for more information.
✨ Features
- High performance: fine-tuning yields strong results across multiple evaluation metrics.
- Multilingual support: trained on English-Japanese bilingual data, the model handles tasks in both languages well.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
No code examples are provided in the original document.
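As a minimal, untested sketch (not from the original card), the model can be loaded with the Hugging Face transformers library and queried through the llama3 chat template it was trained with. The repo ID below follows the name used in the comparison table; adjust it to the actual repository if it differs.

```python
# Minimal sketch (assumption, not an official example): chat with the model
# via transformers and its llama3 chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v1-llama3-8b"  # repo ID as listed in the comparison table

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "日本の首都はどこですか？"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Temperature and min_p mirror the evaluation settings reported below;
# min_p sampling requires a recent transformers release.
output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    min_p=0.1,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```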
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct |
| Training Data | augmxnt/ultra-orca-boros-en-ja-v1 |
| License | llama3 |
Evaluation Results
Per the Llama 3 Community License Agreement, the official name of this model is "Llama 3 shisa-v1-llama3-8b".

The 8e-6 learning-rate run has been moved in as the main model since it is slightly better; some cleanup and renaming will follow soon.

Each benchmark was run twice to reduce variance. All tests used temperature 0.2, min_p 0.1, and frequency penalty 0.5 (an illustrative mapping of these settings to sampling parameters is sketched after the notes below).
| Model | AVG Score | ELYZA100 | JA MT-Bench | Rakuda | Tengu-Bench | JA Char % |
|-------|-----------|----------|-------------|--------|-------------|-----------|
| shisa-v1-llama3-8b.lr-2e4 | 3.97 | 4.60 | 4.54 | 3.33 | 3.42 | 92.42% |
| shisa-v1-llama3-8b.lr-5e5 | 5.73 | 6.28 | 6.45 | 5.37 | 4.81 | 90.93% |
| shisa-v1-llama3-8b.2e5 | 6.33 | 6.51 | 6.66 | 6.68 | 5.48 | 91.51% |
| shisa-v1-llama3-8b (8e-6) | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 | 91.30% |
| shisa-v1-llama3-8b.5e6 | 6.42 | 6.33 | 6.76 | 7.15 | 5.45 | 91.56% |
| shisa-v1-llama3-8b.2e6 | 6.31 | 6.26 | 6.88 | 6.73 | 5.38 | 92.00% |
- The 2e-4 and 5e-5 runs are clearly overtrained and perform significantly worse.
- 2e-5 is borderline: WeightWatcher shows the embedding layer as slightly overtrained at 2e-5, although the NEFTune version is not.
- 8e-6 performs best, and 5e-6 also performs slightly better than 2e-5.
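As a point of reference only, here is how the sampling settings quoted above map onto vLLM's SamplingParams. This is an illustrative sketch, not the actual evaluation harness; the model ID and prompt are placeholders.

```python
# Illustrative sketch (assumption, not the actual benchmark harness):
# the evaluation sampling settings above expressed as vLLM SamplingParams.
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=0.2,        # temp 0.2
    min_p=0.1,              # min_p 0.1
    frequency_penalty=0.5,  # freq penalty 0.5
    max_tokens=1024,        # arbitrary cap for this sketch
)

llm = LLM(model="shisa-ai/shisa-v1-llama3-8b")  # repo ID as in the table below
# For real chat-style prompts, the llama3 chat template should be applied first.
outputs = llm.generate(["日本の文化について教えてください。"], sampling)
print(outputs[0].outputs[0].text)
```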
Comparison with Other Models
| Model | Average | ELYZA-tasks-100 | MT-Bench | Rakuda | Tengu-Bench |
|-------|---------|-----------------|----------|--------|-------------|
| gpt-4-turbo-2024-04-09 | 8.75 | 8.78 | 8.74 | 9.18 | 8.31 |
| gpt-4o-2024-05-13 | 8.72 | 8.88 | 8.69 | 9.15 | 8.16 |
| gemini-1.5-pro | 8.58 | 8.58 | 8.93 | 9.20 | 7.61 |
| claude-3-opus-20240229 | 8.55 | 8.64 | 8.58 | 8.75 | 8.23 |
| CohereForAI/c4ai-command-r-plus | 7.69 | 7.50 | 7.43 | 9.05 | 6.79 |
| shisa-ai/shisa-v1-llama3-70b | 7.30 | 7.34 | 7.67 | 8.15 | 6.04 |
| gpt-3.5-turbo-0125 | 7.17 | 7.24 | 6.98 | 7.64 | 6.82 |
| shisa-ai/shisa-v1-llama3-70b.2e5 | 7.17 | 7.16 | 7.45 | 7.98 | 6.09 |
| karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 7.00 | 7.18 | 6.30 | 7.98 | 6.55 |
| karakuri-ai/karakuri-lm-70b-chat-v0.1 | 6.84 | 6.86 | 6.43 | 7.85 | 6.23 |
| lightblue/ao-karasu-72B | 6.81 | 7.19 | 6.54 | 7.25 | 6.27 |
| shisa-ai/shisa-v1-llama3-8b | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 |
| shisa-ai/shisa-swallowmx-13a47b-v1 | 6.17 | 6.48 | 6.07 | 7.11 | 5.03 |
| lightblue/suzume-llama-3-8B-japanese | 5.96 | 6.68 | 4.96 | 6.68 | 5.53 |
| augmxnt/shisa-gamma-7b-v1 | 5.82 | 5.96 | 5.02 | 6.85 | 5.47 |
| shisa-ai/shisa-v1-phi3-14b | 5.77 | 6.28 | 5.26 | 6.55 | 5.01 |
| shisa-ai/shisa-v1-gemma-8b | 5.64 | 6.50 | 5.42 | 5.10 | 5.55 |
| Rakuten/RakutenAI-7B-chat | 5.58 | 5.92 | 4.60 | 6.58 | 5.24 |
| lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58 | 4.74 | 5.46 | 5.01 |
| shisa-ai/shisa-v1-mistral0.3-7b | 5.11 | 5.64 | 6.10 | 3.83 | 4.86 |
| cyberagent/calm2-7b-chat | 4.76 | 4.90 | 3.58 | 5.75 | 4.81 |
| mistralai/Mistral-7B-Instruct-v0.2 | 4.69 | 5.78 | 4.65 | 3.80 | 4.53 |
| shisa-ai/shisa-v1-yi1.5-9b | 4.63 | 5.98 | 4.28 | 3.26 | 5.00 |
| augmxnt/shisa-7b-v1 | 4.50 | 4.63 | 3.95 | 4.89 | 4.53 |
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (an illustrative transformers TrainingArguments equivalent is sketched after this list):
- learning_rate: 8e-06
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 3
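For anyone reproducing the run outside Axolotl, the following is a rough, untested mapping of these hyperparameters onto transformers TrainingArguments; the effective batch size of 64 comes from 1 sample per device × 8 GPUs × 8 gradient-accumulation steps, and the output directory is taken from the Axolotl config below.

```python
# Rough, untested sketch: the hyperparameters above expressed as
# transformers TrainingArguments. Effective batch size:
# 1 (per device) x 8 (GPUs) x 8 (gradient accumulation) = 64.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/lr-8e6",   # as in the Axolotl config below
    learning_rate=8e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.0,
    seed=42,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=1,
)
```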
Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 1.3951 | 0.0064 | 1 | 0.8645 |
| 0.8731 | 0.5020 | 79 | 0.5577 |
| 0.8405 | 1.0040 | 158 | 0.5138 |
| 0.6888 | 1.4853 | 237 | 0.4982 |
| 0.6674 | 1.9873 | 316 | 0.4870 |
| 0.5859 | 2.4694 | 395 | 0.4983 |
Framework Versions
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
Axolotl Config
Compute for training this model was generously provided by Ubitus.
[Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

See axolotl config
axolotl version: 0.4.0
```yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: llama3
datasets:
  - path: augmxnt/ultra-orca-boros-en-ja-v1
    type: sharegpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/lr-8e6

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

use_wandb: true
wandb_project: shisa-v2
wandb_entity: augmxnt
wandb_name: shisa-v1-llama3-8b.lr-8e6

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: linear
learning_rate: 8e-6

train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 0
debug:
deepspeed: axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.00
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
```
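The training data referenced in the config can be inspected with the datasets library. This is a small illustrative sketch; the column names depend on the dataset's ShareGPT-style schema.

```python
# Illustrative sketch: inspect the training dataset referenced in the config.
# The exact column names depend on the dataset's ShareGPT-style schema.
from datasets import load_dataset

ds = load_dataset("augmxnt/ultra-orca-boros-en-ja-v1")
print(ds)  # available splits, row counts, and column names
# e.g. ds["train"][0] to view one conversation record, if a "train" split exists
```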
🔧 Technical Details
No detailed technical implementation information is provided in the original document.
📄 License
The model is under the llama3 license.