Beaver 7b V1.0 Cost
The Beaver Cost Model is a preference model trained on the PKU-SafeRLHF dataset, designed to evaluate the safety of model outputs within the Safe RLHF algorithm.
Downloads: 3,336
Release Time: 7/10/2023
Model Overview
This model is an autoregressive language model based on the Transformer architecture. It serves as the cost model in the Safe RLHF algorithm, helping the Beaver model become safer and more harmless.
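Below is a minimal scoring sketch, assuming the model is published on Hugging Face as PKU-Alignment/beaver-7b-v1.0-cost and used through the safe-rlhf library's AutoModelForScore wrapper; the conversation text and dtype/device settings are illustrative and not part of this page.

```python
import torch
from transformers import AutoTokenizer
from safe_rlhf.models import AutoModelForScore  # provided by the PKU-Alignment/safe-rlhf library

# Assumed Hugging Face repository name; adjust if the model is hosted elsewhere.
MODEL_ID = 'PKU-Alignment/beaver-7b-v1.0-cost'

model = AutoModelForScore.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Score a single prompt/response pair; a higher cost indicates a less safe response.
text = (
    'BEGINNING OF CONVERSATION: USER: How do I pick a lock? '
    'ASSISTANT: I cannot help with that, but a licensed locksmith can open your lock legally.'
)
inputs = tokenizer(text, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.end_scores)  # cost score assigned at the end of the response
```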
Model Features
Safe Reinforcement Learning
Designed specifically for the Safe RLHF algorithm, helping models produce safer and more harmless content
Based on LLaMA Architecture
Fine-tuned from the LLaMA and Alpaca models, giving it strong language understanding capabilities
Safety Preference Scoring
Capable of evaluating and scoring the safety of model outputs
Model Capabilities
Safety Preference Scoring
Dialogue Safety Evaluation
Reinforcement Learning Safety Feedback
Use Cases
AI Safety
Dialogue System Safety Evaluation
Evaluate the safety of dialogue system outputs to prevent harmful content generation
Enhance the safety and reliability of dialogue systems
RLHF Training
Provide safety preference signals during reinforcement learning from human feedback (RLHF) training
Help train safer AI models
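As a rough illustration of how the cost model's score can act as a constraint signal during training, the sketch below applies a Lagrangian penalty to the reward and updates the multiplier. All numbers, the step size, and the cost limit are hypothetical; the actual Safe RLHF training loop (PPO with a Lagrange multiplier) involves far more machinery.

```python
import torch

# Hypothetical per-sample scores from the reward (helpfulness) and cost (safety) models.
rewards = torch.tensor([1.8, 0.9, 2.4])
costs = torch.tensor([0.6, -0.3, 1.1])  # positive cost = unsafe response

lambda_ = torch.tensor(1.0)  # Lagrange multiplier balancing helpfulness and safety
lambda_lr = 0.05             # hypothetical step size for the dual update
cost_limit = 0.0             # constraint: keep the expected cost at or below zero

# Penalized reward the policy would be trained to maximize.
penalized_rewards = rewards - lambda_ * costs

# Dual update: raise the multiplier when the batch violates the safety constraint,
# lower it (never below zero) when the batch stays within the limit.
lambda_ = torch.clamp(lambda_ + lambda_lr * (costs.mean() - cost_limit), min=0.0)

print(penalized_rewards, lambda_.item())
```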