shisa-v1-llama3-8b Model
This model is a fine-tuned version of Llama 3 that aims to deliver strong performance on a range of language tasks, especially in English and Japanese.
🚀 Quick Start
This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). You can refer to the following details for more information.
✨ Features
- High performance: fine-tuning yields strong results across multiple evaluation metrics.
- Multilingual support: trained on English-Japanese bilingual data, the model handles tasks in both languages well.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
No code examples are provided in the original document.
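As a minimal, untested sketch (not from the original card), the model can be loaded with the Hugging Face transformers library and queried through the llama3 chat template it was trained with. The repo ID below follows the name used in the comparison table; adjust it to the actual repository if it differs.

```python
# Minimal sketch (assumption, not an official example): chat with the model
# via transformers and its llama3 chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v1-llama3-8b"  # repo ID as listed in the comparison table

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "日本の首都はどこですか？"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Temperature and min_p mirror the evaluation settings reported below;
# min_p sampling requires a recent transformers release.
output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    min_p=0.1,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```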
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct |
| Training Data | augmxnt/ultra-orca-boros-en-ja-v1 |
| License | llama3 |
Evaluation Results
Per the Llama 3 Community License Agreement, the official name of this model is "Llama 3 shisa-v1-llama3-8b".

The 8e-6 learning-rate run has been moved in as the main model since it is slightly better; some cleanup and renaming will follow soon.

Each benchmark was run twice to reduce variance. All tests used temperature 0.2, min_p 0.1, and frequency penalty 0.5 (an illustrative mapping of these settings to sampling parameters is sketched after the notes below).
| Model | AVG Score | ELYZA100 | JA MT-Bench | Rakuda | Tengu-Bench | JA Char % |
|-------|-----------|----------|-------------|--------|-------------|-----------|
| shisa-v1-llama3-8b.lr-2e4 | 3.97 | 4.60 | 4.54 | 3.33 | 3.42 | 92.42% |
| shisa-v1-llama3-8b.lr-5e5 | 5.73 | 6.28 | 6.45 | 5.37 | 4.81 | 90.93% |
| shisa-v1-llama3-8b.2e5 | 6.33 | 6.51 | 6.66 | 6.68 | 5.48 | 91.51% |
| shisa-v1-llama3-8b (8e-6) | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 | 91.30% |
| shisa-v1-llama3-8b.5e6 | 6.42 | 6.33 | 6.76 | 7.15 | 5.45 | 91.56% |
| shisa-v1-llama3-8b.2e6 | 6.31 | 6.26 | 6.88 | 6.73 | 5.38 | 92.00% |
- The 2e-4 and 5e-5 runs are clearly overtrained and perform significantly worse.
- 2e-5 is borderline: WeightWatcher shows the embedding layer as slightly overtrained at 2e-5, although the NEFTune version is not.
- 8e-6 performs best, and 5e-6 also performs slightly better than 2e-5.
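As a point of reference only, here is how the sampling settings quoted above map onto vLLM's SamplingParams. This is an illustrative sketch, not the actual evaluation harness; the model ID and prompt are placeholders.

```python
# Illustrative sketch (assumption, not the actual benchmark harness):
# the evaluation sampling settings above expressed as vLLM SamplingParams.
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=0.2,        # temp 0.2
    min_p=0.1,              # min_p 0.1
    frequency_penalty=0.5,  # freq penalty 0.5
    max_tokens=1024,        # arbitrary cap for this sketch
)

llm = LLM(model="shisa-ai/shisa-v1-llama3-8b")  # repo ID as in the table below
# For real chat-style prompts, the llama3 chat template should be applied first.
outputs = llm.generate(["日本の文化について教えてください。"], sampling)
print(outputs[0].outputs[0].text)
```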
Comparison with Other Models
| Model | Average | ELYZA-tasks-100 | MT-Bench | Rakuda | Tengu-Bench |
|-------|---------|-----------------|----------|--------|-------------|
| gpt-4-turbo-2024-04-09 | 8.75 | 8.78 | 8.74 | 9.18 | 8.31 |
| gpt-4o-2024-05-13 | 8.72 | 8.88 | 8.69 | 9.15 | 8.16 |
| gemini-1.5-pro | 8.58 | 8.58 | 8.93 | 9.20 | 7.61 |
| claude-3-opus-20240229 | 8.55 | 8.64 | 8.58 | 8.75 | 8.23 |
| CohereForAI/c4ai-command-r-plus | 7.69 | 7.50 | 7.43 | 9.05 | 6.79 |
| shisa-ai/shisa-v1-llama3-70b | 7.30 | 7.34 | 7.67 | 8.15 | 6.04 |
| gpt-3.5-turbo-0125 | 7.17 | 7.24 | 6.98 | 7.64 | 6.82 |
| shisa-ai/shisa-v1-llama3-70b.2e5 | 7.17 | 7.16 | 7.45 | 7.98 | 6.09 |
| karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 7.00 | 7.18 | 6.30 | 7.98 | 6.55 |
| karakuri-ai/karakuri-lm-70b-chat-v0.1 | 6.84 | 6.86 | 6.43 | 7.85 | 6.23 |
| lightblue/ao-karasu-72B | 6.81 | 7.19 | 6.54 | 7.25 | 6.27 |
| shisa-ai/shisa-v1-llama3-8b | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 |
| shisa-ai/shisa-swallowmx-13a47b-v1 | 6.17 | 6.48 | 6.07 | 7.11 | 5.03 |
| lightblue/suzume-llama-3-8B-japanese | 5.96 | 6.68 | 4.96 | 6.68 | 5.53 |
| augmxnt/shisa-gamma-7b-v1 | 5.82 | 5.96 | 5.02 | 6.85 | 5.47 |
| shisa-ai/shisa-v1-phi3-14b | 5.77 | 6.28 | 5.26 | 6.55 | 5.01 |
| shisa-ai/shisa-v1-gemma-8b | 5.64 | 6.50 | 5.42 | 5.10 | 5.55 |
| Rakuten/RakutenAI-7B-chat | 5.58 | 5.92 | 4.60 | 6.58 | 5.24 |
| lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58 | 4.74 | 5.46 | 5.01 |
| shisa-ai/shisa-v1-mistral0.3-7b | 5.11 | 5.64 | 6.10 | 3.83 | 4.86 |
| cyberagent/calm2-7b-chat | 4.76 | 4.90 | 3.58 | 5.75 | 4.81 |
| mistralai/Mistral-7B-Instruct-v0.2 | 4.69 | 5.78 | 4.65 | 3.80 | 4.53 |
| shisa-ai/shisa-v1-yi1.5-9b | 4.63 | 5.98 | 4.28 | 3.26 | 5.00 |
| augmxnt/shisa-7b-v1 | 4.50 | 4.63 | 3.95 | 4.89 | 4.53 |
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (an illustrative transformers TrainingArguments equivalent is sketched after this list):
- learning_rate: 8e-06
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 3
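For anyone reproducing the run outside Axolotl, the following is a rough, untested mapping of these hyperparameters onto transformers TrainingArguments; the effective batch size of 64 comes from 1 sample per device × 8 GPUs × 8 gradient-accumulation steps, and the output directory is taken from the Axolotl config below.

```python
# Rough, untested sketch: the hyperparameters above expressed as
# transformers TrainingArguments. Effective batch size:
# 1 (per device) x 8 (GPUs) x 8 (gradient accumulation) = 64.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/lr-8e6",   # as in the Axolotl config below
    learning_rate=8e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.0,
    seed=42,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=1,
)
```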
Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 1.3951 | 0.0064 | 1 | 0.8645 |
| 0.8731 | 0.5020 | 79 | 0.5577 |
| 0.8405 | 1.0040 | 158 | 0.5138 |
| 0.6888 | 1.4853 | 237 | 0.4982 |
| 0.6674 | 1.9873 | 316 | 0.4870 |
| 0.5859 | 2.4694 | 395 | 0.4983 |
Framework Versions
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
Axolotl Config
Compute for training this model was generously provided by Ubitus.
[Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

See axolotl config
axolotl version: 0.4.0
```yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: llama3
datasets:
  - path: augmxnt/ultra-orca-boros-en-ja-v1
    type: sharegpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/lr-8e6

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

use_wandb: true
wandb_project: shisa-v2
wandb_entity: augmxnt
wandb_name: shisa-v1-llama3-8b.lr-8e6

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: linear
learning_rate: 8e-6

train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 0
debug:
deepspeed: axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.00
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
```
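The training data referenced in the config can be inspected with the datasets library. This is a small illustrative sketch; the column names depend on the dataset's ShareGPT-style schema.

```python
# Illustrative sketch: inspect the training dataset referenced in the config.
# The exact column names depend on the dataset's ShareGPT-style schema.
from datasets import load_dataset

ds = load_dataset("augmxnt/ultra-orca-boros-en-ja-v1")
print(ds)  # available splits, row counts, and column names
# e.g. ds["train"][0] to view one conversation record, if a "train" split exists
```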
🔧 Technical Details
No detailed technical implementation information is provided in the original document.
📄 License
The model is under the llama3 license.