đ Llama-3-8B-SFR-Iterative-DPO-R
We release a state-of-the-art instruct model that outperforms many models on multiple benchmarks and is trained with open-sourced datasets.
đ Quick Start
We introduce Llama-3-8B-SFR-Iterative-DPO-R, a cutting-edge instruct model. It excels on three popular instruct-model benchmarks: Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard, outperforming similarly sized models, most large open-sourced models, and strong proprietary models. The model is trained entirely on open-sourced datasets, without any additional human or GPT-4 labeling.
⨠Features
- High performance: Surpasses many models on various benchmarks.
- Cost-effective training: Uses a DPO-based online RLHF recipe, which is cheaper and simpler to train and tune than PPO-based approaches.
- Distribution-shift mitigation: The online component effectively reduces distribution shifts during policy optimization.
đĻ Installation
The model runs with the Hugging Face `transformers` library; install `transformers` and `torch` (for example, via `pip install transformers torch`) before running the examples below.
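A quick, optional way to confirm the environment is ready (a minimal sketch, assuming a standard PyTorch + Transformers install; not part of the original card):

```python
# Optional sanity check: verify the installed versions and GPU availability
import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```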
đģ Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("Salesforce/Llama-3-8B-SFR-Iterative-DPO-R")
tokenizer = AutoTokenizer.from_pretrained("Salesforce/Llama-3-8B-SFR-Iterative-DPO-R")

messages = [
    {"role": "user", "content": "I'm trying to teach myself to have nicer handwriting. Can you help?"},
]

# Format the conversation with the chat template; add_generation_prompt=True appends
# the assistant header so the model starts its reply rather than continuing the user turn
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
model_inputs = model_inputs.to(device)
model.to(device)

# Sample a response and decode it
output_tokens = model.generate(model_inputs, max_new_tokens=1024, do_sample=True)
model_outputs = tokenizer.batch_decode(output_tokens)
print(model_outputs[0])
```
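The decoded string above includes the formatted prompt and special tokens. If you only want the assistant's reply, a small follow-up along these lines works (it reuses the variables from the snippet above; this addition is not part of the original example):

```python
# Slice off the prompt tokens and decode only the newly generated part of the output
prompt_length = model_inputs.shape[1]
reply = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)
print(reply)
```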
đ Documentation
Model Releases
Training methods
We have developed a simple and efficient online RLHF recipe for LLM instruct training. Our recipe is DPO-based and thus much cheaper and simpler to train and tune than PPO-based approaches. Unlike the widely used offline DPO, the online component of our approach effectively mitigates distribution shifts during policy optimization. For a detailed exposition, please refer to our accompanying technical report.
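At a high level, each online iteration samples responses from the current policy, ranks them with a reward model, and runs a DPO update on the resulting preference pairs. The toy sketch below only illustrates that loop: every function is a hypothetical placeholder rather than the released training code, and details such as pair construction and reference refreshing follow the technical report, not this snippet.

```python
import random

# Toy, self-contained illustration of an online (iterative) DPO loop.
# All names below are hypothetical placeholders, not the actual training code.

def sample_responses(policy, prompt, n):
    # Stand-in for decoding n on-policy responses from the current model
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def reward(prompt, response):
    # Stand-in for the learned reward model used to rank candidates
    return random.random()

def dpo_update(policy, reference, pairs, beta=0.1):
    # Stand-in for one round of DPO training on (prompt, chosen, rejected) pairs,
    # regularized toward the frozen reference policy with strength beta
    return {**policy, "updates": policy["updates"] + len(pairs)}

def iterative_dpo(prompts, num_iterations=3, samples_per_prompt=8):
    policy = {"updates": 0}
    reference = dict(policy)  # frozen reference for the DPO loss
    for _ in range(num_iterations):
        pairs = []
        for prompt in prompts:
            candidates = sample_responses(policy, prompt, samples_per_prompt)  # on-policy samples
            ranked = sorted(candidates, key=lambda r: reward(prompt, r), reverse=True)
            pairs.append((prompt, ranked[0], ranked[-1]))  # best vs. worst candidate
        policy = dpo_update(policy, reference, pairs)
        reference = dict(policy)  # one possible choice: refresh the reference each iteration
    return policy

print(iterative_dpo(["Explain RLHF in one sentence."]))
```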
Chat Benchmarks
| Model | Size | Method | LC Alpaca-Eval-V2 | MT-Bench | Chat-Arena-Hard |
|---|---|---|---|---|---|
| **Small Open-Sourced Models** | | | | | |
| Gemma-7B-it | 7B | SFT | 10.4 | 6.38 | 7.5 |
| Zephyr-7B-beta | 7B | Vanilla DPO | 13.1 | 7.34 | - |
| Mistral-7B-v0.2-it | 7B | SFT | 17.1 | 7.51 | 12.6 |
| Open-Chat-0106 | 7B | SFT | 15.6 | 7.8 | - |
| Starling-7B-beta | 7B | PPO | 25.8 | 8.12 | 23.0 |
| LLaMA-3-8B-it | 8B | RS + DPO + PPO | 22.9 | 8.16 | 20.6 |
| **Ours** | | | | | |
| Ours (SFT baseline) | 8B | SFT | 10.2 | 7.69 | 5.6 |
| Ours (DPO baseline) | 8B | Vanilla DPO | 22.5 | 8.17 | 22.4 |
| Ours (Online RLHF) | 8B | Iterative DPO | 31.3 | 8.46 | 29.1 |
| **Large Open-Sourced Models** | | | | | |
| Vicuna-33b-v1.3 | 33B | SFT | 17.6 | 7.12 | 8.6 |
| Yi-34B-Chat | 34B | SFT | 27.2 | - | 23.1 |
| Mixtral-8x7B-it | 45B* | SFT | 23.7 | 8.30 | 23.4 |
| Tulu-2-DPO-70B | 70B | Vanilla DPO | 21.2 | 7.89 | 15.0 |
| LLaMA-3-70B-it | 70B | RS + DPO + PPO | 34.4 | 8.95 | 41.1 |
| Mixtral-8x22B-it | 141B* | SFT | 30.9 | 8.66 | 36.4 |
| **Proprietary Models** | | | | | |
| GPT-3.5-turbo-1106 | - | - | 19.3 | 8.35 | 18.9 |
| GPT-3.5-turbo-0613 | - | - | 22.7 | 8.39 | 24.8 |
| GPT-4-0613 | - | - | 30.2 | 9.18 | 37.9 |
| Claude-3-Opus | - | - | 40.5 | 9.00 | 60.4 |
| GPT-4 Turbo (04/09) | - | - | 55.0 | - | 82.6 |
Academic Benchmarks
| Model | Size | Method | GSM-8K | MMLU | HumanEval | TruthfulQA | ARC | MBPP |
|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B-it | 8B | RS + DPO + PPO | 79.6 | 66.0 | 61.6 | 43.9 | 59.5 | 61.1 |
| Ours (SFT baseline) | 8B | SFT | 74.2 | 64.7 | 65.2 | 53.4 | 61.4 | 62.3 |
| Ours (DPO baseline) | 8B | Vanilla DPO | 79.8 | 64.5 | 63.4 | 61.8 | 65.2 | 60.3 |
| Ours (Iterative RLHF) | 8B | Iterative DPO | 80.7 | 65.3 | 64.6 | 60.4 | 64.3 | 60.8 |
đ§ Technical Details
The online RLHF recipe is DPO-based and is designed to be cheaper and simpler to train and tune than PPO-based approaches. The online component helps reduce distribution shifts during policy optimization. For more details, refer to the accompanying technical report.
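For reference, each DPO step optimizes the standard DPO objective from the literature, where \\(\pi_\theta\\) is the policy being trained, \\(\pi_{\mathrm{ref}}\\) the frozen reference, \\((y_w, y_l)\\) the chosen and rejected responses, and \\(\beta\\) the KL-regularization strength:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

In the online variant, the preference pairs in the expectation are built from the current policy's own samples rather than a fixed offline dataset.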
đ License
The model is released under the Llama3 license.
â ī¸ Important Note
Llama-3-8B-SFR-Iterative-DPO-R is a research model developed as part of our RLHF initiative at Salesforce. While safety and ethical considerations are integral to our alignment process, the model may still generate offensive or unethical content, particularly under adversarial conditions. We are committed to continuously improving our models to minimize such risks and encourage responsible usage.
đĄ Usage Tip
Please cite our papers if you find our models useful.
```bibtex
@misc{dong2024rlhf,
  title={RLHF Workflow: From Reward Modeling to Online RLHF},
  author={Hanze Dong* and Wei Xiong* and Bo Pang* and Haoxiang Wang* and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
  year={2024},
  eprint={2405.07863},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@misc{xiong2024iterative,
  title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
  author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
  year={2024},
  eprint={2312.11456},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
We also recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. Refer to our standard AUP and [AI AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ai-acceptable-use-policy.pdf) for further guidance on use cases.