Smaug 34B V0.1
A large language model fine-tuned from jondurbin/bagel-34b-v0.2 and optimized for preference learning with the novel DPO-Positive (DPOP) technique
Downloads 2,694
Release Date: January 25, 2024
Model Overview
Smaug-34B-v0.1 is a 34B-parameter large language model that addresses shortcomings of standard DPO through DPOP, excelling in mathematical reasoning and general tasks.
Model Features
DPOP Optimization Technology
Addresses a failure mode of standard DPO on preference pairs with small edit distance, where the likelihood of the preferred completion can actually decrease during training, via the novel DPO-Positive loss function
Multi-Domain Performance Improvement
Outstanding performance on diverse datasets such as ARC, HellaSwag, and MetaMath
Open-Source Tech Stack
Training details and datasets are fully disclosed in the accompanying research paper, supporting community-driven optimization
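As a rough illustration (not the authors' implementation), the DPOP objective can be sketched as the standard DPO log-sigmoid margin with an extra penalty that fires whenever the policy's log-likelihood of the preferred completion drops below the reference model's. The hyperparameter values `beta` and `lam` below are illustrative placeholders, not the values used to train Smaug:

```python
import math

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.3, lam=50.0):
    """Sketch of a DPO-Positive loss for one preference pair.

    logp_w / logp_l         : policy log-probs of chosen / rejected completions
    ref_logp_w / ref_logp_l : reference-model log-probs of the same completions
    beta, lam               : illustrative hyperparameters (assumptions)
    """
    # Standard DPO implicit-reward margin between chosen and rejected.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # DPOP penalty: positive only when the policy assigns LESS probability
    # to the preferred completion than the reference model does.
    penalty = max(0.0, ref_logp_w - logp_w)
    # Negative log-sigmoid of the penalized margin.
    z = beta * (margin - lam * penalty)
    return -math.log(1.0 / (1.0 + math.exp(-z)))
```

When the policy keeps the chosen completion at least as likely as under the reference model, the penalty is zero and the loss reduces to standard DPO; otherwise the large `lam` term sharply increases the loss, pushing the preferred completion's likelihood back up.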
Model Capabilities
Complex text generation
Mathematical problem-solving
Common-sense reasoning
Open-domain question answering
Truthful answer generation
Use Cases
Education
Math Tutoring
Helps students solve grade-school math problems such as those in the GSM8K benchmark
GSM8K score of 72.18
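GSM8K scores like the 72.18 above are typically computed by exact-match comparison of final numeric answers; reference solutions end with a `#### <number>` line. A minimal sketch of this scoring convention (the exact evaluation harness used for Smaug's reported score is not specified here):

```python
import re

def extract_gsm8k_answer(text):
    """Pull the final numeric answer after the GSM8K '####' marker."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    # Strip thousands separators so '1,200' and '1200' compare equal.
    return m.group(1).replace(",", "") if m else None

def gsm8k_accuracy(predictions, references):
    """Fraction of items whose extracted final answers match exactly."""
    correct = sum(
        1 for p, r in zip(predictions, references)
        if extract_gsm8k_answer(p) == extract_gsm8k_answer(r)
    )
    return correct / len(references)
```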
Research
Preference Learning Research
Serves as a benchmark model for DPOP technology
Outperforms standard DPO in multiple tasks