Llama-3-Instruct-8B-SPPO-Iter3
This model, Llama-3-Instruct-8B-SPPO-Iter3, is a text-generation model fine-tuned with Self-Play Preference Optimization (SPPO) to improve the alignment and performance of the base language model.
Quick Start
This README provides detailed information about the Llama-3-Instruct-8B-SPPO-Iter3 model, including its development, evaluation results, and training hyperparameters.
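To try the model with the Hugging Face transformers library, a minimal sketch is shown below. The repository id UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3 is an assumption and not stated in this README; substitute the actual Hub id of this model if it differs.

```python
# Minimal sketch: load the model and run one chat turn with transformers.
# The repository id below is an assumption, not taken from this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize what SPPO fine-tuning does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```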
Features
- Optimized Architecture: Developed based on the meta-llama/Meta-Llama-3-8B-Instruct architecture, with further optimization through Self-Play Preference Optimization.
- Diverse Datasets: Utilizes prompt sets from the openbmb/UltraFeedback dataset, which is split for multiple iterations.
- Synthetic Responses: All responses used during development are synthetic.
Documentation
Model Development
This model was developed using Self-Play Preference Optimization (SPPO) at iteration 3. Starting from the meta-llama/Meta-Llama-3-8B-Instruct architecture, we used prompt sets from the openbmb/UltraFeedback dataset, split into three parts for the three iterations following snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.
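As an illustration of the prompt split (not the exact training pipeline), the UltraFeedback prompts could be divided into three shards with the datasets library; the column name `instruction` is an assumption here.

```python
# Illustrative sketch: divide the UltraFeedback prompt set into three parts,
# one per SPPO iteration. The actual splits follow
# snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset; the column name is assumed.
from datasets import load_dataset

ultrafeedback = load_dataset("openbmb/UltraFeedback", split="train")

iteration_prompts = [
    ultrafeedback.shard(num_shards=3, index=i)["instruction"]  # assumed column name
    for i in range(3)
]

for i, prompts in enumerate(iteration_prompts, start=1):
    print(f"iteration {i}: {len(prompts)} prompts")
```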
Links to Other Models
Model Description
| Property | Details |
|---|---|
| Model Type | An 8B parameter GPT-like model fine-tuned on synthetic datasets. |
| Language(s) (NLP) | Primarily English |
| License | Apache-2.0 |
| Finetuned from model | meta-llama/Meta-Llama-3-8B-Instruct |
Evaluation Results
Results are reported using lm-evaluation-harness v0.4.1.
Detailed results can be found here
| Metric | Value |
|---|---|
| Avg. | 23.68 |
| IFEval (0-Shot) | 68.28 |
| BBH (3-Shot) | 29.74 |
| MATH Lvl 5 (4-Shot) | 7.33 |
| GPQA (0-Shot) | 2.01 |
| MuSR (0-Shot) | 3.09 |
| MMLU-PRO (5-Shot) | 29.38 |
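The snippet below is a sketch of how such scores can be produced with the lm-evaluation-harness Python API. The task name, few-shot setting, and repository id are assumptions and depend on the installed harness version; it is not the exact configuration behind the numbers above.

```python
# Sketch: score the model on one benchmark with lm-evaluation-harness (v0.4.x API).
# Task name, few-shot setting, and repository id are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3,dtype=bfloat16",
    tasks=["ifeval"],   # illustrative; pick the task variants matching the table above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```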
Training hyperparameters
The following hyperparameters were used during training; a configuration sketch follows the list:
- learning_rate: 5e-07
- eta: 1000
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 6.0 (training stopped at epoch 1.0)
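Below is a sketch of how these values might map onto transformers TrainingArguments. The output directory, DeepSpeed config path, and bf16 flag are placeholders/assumptions, and `eta` is an SPPO-specific loss parameter handled by the SPPO trainer rather than a TrainingArguments field.

```python
# Sketch only: the listed hyperparameters expressed as transformers TrainingArguments.
# Output dir, DeepSpeed config path, and bf16 are assumptions; `eta` (1000) is an
# SPPO-specific loss parameter and has no TrainingArguments equivalent.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3-instruct-8b-sppo-iter3",   # placeholder
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    seed=42,
    optim="rmsprop",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=6.0,                # training was stopped at epoch 1.0
    deepspeed="ds_zero3_config.json",    # placeholder path to a ZeRO-3 config
    bf16=True,                           # assumption
)
```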
License
This model is licensed under the Apache-2.0 license.
Technical Details
This model is built on the Self-Play Preference Optimization (SPPO) method, described in the paper Self-Play Preference Optimization for Language Model Alignment. The method improves the alignment of language models by iteratively optimizing preferences over the model's own self-play responses.
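As a rough illustration (paraphrasing the paper's objective rather than reproducing the authors' implementation), each SPPO iteration regresses the log-probability ratio between the policy being trained and the previous-iteration policy toward `eta` times the centered win probability of a sampled response. A per-example squared loss of that shape might look like this:

```python
# Illustrative SPPO-style per-example loss, paraphrased from the cited paper;
# see the paper for the exact objective and the authors' code for the real trainer.
import torch

def sppo_example_loss(
    policy_logprob: torch.Tensor,     # log pi_theta(y | x) under the model being trained
    reference_logprob: torch.Tensor,  # log pi_t(y | x) under the previous-iteration model
    win_prob: torch.Tensor,           # estimated P(y is preferred over pi_t's responses | x)
    eta: float = 1000.0,              # matches the `eta` hyperparameter above
) -> torch.Tensor:
    log_ratio = policy_logprob - reference_logprob
    target = eta * (win_prob - 0.5)   # centered preference probability, scaled by eta
    return (log_ratio - target) ** 2
```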
Citation
@misc{wu2024self,
title={Self-Play Preference Optimization for Language Model Alignment},
author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
year={2024},
eprint={2405.00675},
archivePrefix={arXiv},
primaryClass={cs.LG}
}