🚀 URM-LLaMa-3.1-8B: Uncertainty-Aware Reward Model
URM-LLaMa-3.1-8B is an uncertainty-aware reward model that combines a base model with an uncertainty-aware, attribute-specific value head. It is trained in two stages to improve LLM alignment.
🚀 Quick Start
✨ Features
Datasets
Architecture
[Figure: reward model architectures; URM is one of the RMs shown.]
Alignment Results
[Figure: results of using uncertainty estimates to improve LLM alignment.] Rewards with low uncertainty are more reliable and lead to better alignment.
Brief
URM-LLaMa-3.1-8B is an uncertainty-aware reward model. It consists of a base model and an uncertainty-aware, attribute-specific value head. The base model is initialized from Skywork-Reward-Llama-3.1-8B.
URM is trained in two stages: (1) attribute regression and (2) gating layer learning.
Attribute Regression
During training, instead of emitting multi-attribute scores directly, the uncertainty-aware value head outputs the parameters of a normal distribution, from which scores are sampled. We then train the value head by regressing the sampled scores against the attribute labels. To keep the sampling step differentiable, the reparameterization trick is used, as sketched below.
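A minimal sketch of this stage, assuming a Gaussian value head and an MSE regression loss. The class and variable names (UncertaintyAwareValueHead, mean_head, log_var_head), the hidden size, and the loss choice are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

NUM_ATTRIBUTES = 5  # HelpSteer2: helpfulness, correctness, coherence, complexity, verbosity

class UncertaintyAwareValueHead(nn.Module):
    """Outputs a mean and log-variance per attribute, then samples scores
    with the reparameterization trick so gradients can flow through."""

    def __init__(self, hidden_size: int, num_attributes: int = NUM_ATTRIBUTES):
        super().__init__()
        self.mean_head = nn.Linear(hidden_size, num_attributes)
        self.log_var_head = nn.Linear(hidden_size, num_attributes)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        mu = self.mean_head(hidden_state)
        std = torch.exp(0.5 * self.log_var_head(hidden_state))
        eps = torch.randn_like(std)  # noise is sampled outside the computation graph
        return mu + std * eps        # reparameterization: differentiable in mu and std

# Regress the sampled scores against the attribute labels (shapes are illustrative).
head = UncertaintyAwareValueHead(hidden_size=4096)
hidden = torch.randn(8, 4096)           # last-token hidden states from the base model
labels = torch.rand(8, NUM_ATTRIBUTES)  # attribute labels
loss = nn.functional.mse_loss(head(hidden), labels)
loss.backward()                         # gradients reach both linear layers via mu and std
```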
Gating Layer Learning
Inspired by ArmoRM, we learn a gating layer to combine the multi-attribute scores, rather than using the fixed weights of SteerLM-RM. The gating layer is trained to prioritize chosen responses over rejected ones through the Bradley-Terry (BT) loss, while the value head and base model are kept frozen. We use only the five attributes from HelpSteer2: Helpfulness, Correctness, Coherence, Complexity, and Verbosity. A sketch of this stage follows.
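A minimal sketch of the gating stage, assuming a softmax-normalized linear gating layer over the prompt representation. The names (GatingLayer, proj), shapes, and the choice of a single linear projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatingLayer(nn.Module):
    """Maps the prompt representation to per-attribute weights and
    combines the five attribute scores into a single scalar reward."""

    def __init__(self, hidden_size: int, num_attributes: int = 5):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_attributes)

    def forward(self, prompt_hidden: torch.Tensor, attr_scores: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.proj(prompt_hidden), dim=-1)
        return (weights * attr_scores).sum(dim=-1)

gate = GatingLayer(hidden_size=4096)
prompt_hidden = torch.randn(8, 4096)  # prompt representation (illustrative)
chosen_scores = torch.randn(8, 5)     # frozen value-head scores for chosen responses
rejected_scores = torch.randn(8, 5)   # ... and for rejected responses

# Bradley-Terry objective: the chosen response's reward should exceed the rejected one's.
r_chosen = gate(prompt_hidden, chosen_scores)
r_rejected = gate(prompt_hidden, rejected_scores)
bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
bt_loss.backward()  # only the gating layer is updated; value head and base model stay frozen
```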
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "LxzGordon/URM-LLaMa-3.1-8B"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the range of the numeric output of a sigmoid node in a neural network?"
response1 = "The output of a sigmoid node is bounded between -1 and 1."
response2 = "The output of a sigmoid node is bounded between 0 and 1."

# Format each (prompt, response) pair with the model's chat template, then tokenize.
resp1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
resp2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]
resp1 = tokenizer.apply_chat_template(resp1, tokenize=False)
resp2 = tokenizer.apply_chat_template(resp2, tokenize=False)
resp1 = tokenizer(resp1, return_tensors="pt").to(model.device)
resp2 = tokenizer(resp2, return_tensors="pt").to(model.device)

# Score both responses with the reward model.
with torch.no_grad():
    score1 = model(resp1["input_ids"], attention_mask=resp1["attention_mask"]).logits[0][0].item()
    score2 = model(resp2["input_ids"], attention_mask=resp2["attention_mask"]).logits[0][0].item()
print(score1, score2)
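If everything is set up correctly, response2 should receive the higher reward, since a sigmoid's output lies in (0, 1) and response1 is factually wrong.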
📚 Documentation
Reference
Please cite:
@article{lou2024uncertainty,
  title={Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown},
  author={Lou, Xingzhou and Yan, Dong and Shen, Wei and Yan, Yuzi and Xie, Jian and Zhang, Junge},
  journal={arXiv preprint arXiv:2410.00847},
  year={2024}
}