
GPT2 Large Harmless Reward Model

Developed by Ray2333
A GPT2-large model trained on the harmless subset of the Anthropic/hh-rlhf dataset, intended for harmful response detection and for reinforcement learning from human feedback (RLHF).
Downloads: 1,489
Release Time: 1/14/2024

Model Overview

This model achieves an accuracy of 0.73698 on the test set, comparable to larger models. It is mainly used for harmful response detection and as a reward model in RLHF.
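A minimal scoring sketch follows. It assumes the model is published as a sequence-classification checkpoint under the repository id Ray2333/gpt2-large-harmless-reward_model with a single-logit reward head, and that dialogues use the hh-rlhf "\n\nHuman: ...\n\nAssistant: ..." layout; adjust these assumptions to the actual model card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repository id assumed from the model name above; adjust if it differs.
MODEL_ID = "Ray2333/gpt2-large-harmless-reward_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# The hh-rlhf dataset formats dialogues as "\n\nHuman: ...\n\nAssistant: ...";
# the same layout is assumed for scoring here.
dialogue = (
    "\n\nHuman: How do I stay safe online?"
    "\n\nAssistant: Use strong, unique passwords and enable two-factor authentication."
)

inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    # A single-logit reward head is assumed; a higher value means the response
    # is judged more harmless.
    reward = model(**inputs).logits[0, 0].item()

print(f"Harmlessness reward: {reward:.4f}")
```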

Model Features

High accuracy
Achieved an accuracy of 0.73698 on the test set, with performance approaching that of larger-scale models.
Specialized training
Trained specifically on the harmless subset of the Anthropic/hh-rlhf dataset, focusing on harmful response detection.
RLHF support
Supports reinforcement learning from human feedback (RLHF) and can supply reward signals for model alignment, as sketched after this list.
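For RLHF, the reward model is typically wrapped as a batch scoring function that an RL trainer calls on sampled responses. The sketch below is one way to do that under the same assumptions as above (repository id, single-logit head, hh-rlhf dialogue layout); it is not tied to any particular RL library.

```python
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repository id assumed from the model name above; adjust if it differs.
MODEL_ID = "Ray2333/gpt2-large-harmless-reward_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# GPT2 has no pad token by default; reuse EOS so batched scoring works.
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.pad_token_id


def harmlessness_rewards(prompts: List[str], responses: List[str]) -> List[float]:
    """Score (prompt, response) pairs; higher scores are treated as more harmless."""
    # The hh-rlhf dialogue layout is assumed here.
    texts = [f"\n\nHuman: {p}\n\nAssistant: {r}" for p, r in zip(prompts, responses)]
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=1024)
    with torch.no_grad():
        # A single-logit reward head is assumed (num_labels == 1).
        scores = reward_model(**batch).logits.squeeze(-1)
    return scores.tolist()


# Example: rank two candidate responses to the same prompt by harmlessness.
print(harmlessness_rewards(
    ["How can I get back at my neighbor?"] * 2,
    ["I can't help with revenge, but I can suggest ways to resolve the conflict.",
     "Here is a plan to damage their property."],
))
```

In an RLHF loop, the returned scores would be passed as per-sample rewards to the policy optimizer (e.g., a PPO step), while the reward model itself stays frozen.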

Model Capabilities

Harmful response detection
Text classification
Reinforcement learning feedback

Use Cases

Content security
Harmful content filtering
Detect harmful or inappropriate responses in conversations; a minimal filtering sketch follows at the end of this section.
Identifies harmful content with a reported test-set accuracy of 0.73698.
AI alignment
Multi-objective alignment
Used for multi-objective alignment (especially 'harmless' and 'helpful' alignment) in the Rewards-in-Context project (ICML 2024).
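For content filtering, one simple pattern is to keep only responses whose reward clears a threshold. The sketch below reuses the harmlessness_rewards helper from the RLHF sketch above; the threshold value is hypothetical and should be calibrated on labeled validation dialogues before any deployment.

```python
# Hypothetical cutoff; calibrate on labeled validation dialogues before deploying.
HARMLESS_THRESHOLD = 0.0


def filter_responses(prompts, responses, threshold=HARMLESS_THRESHOLD):
    """Drop responses whose harmlessness reward falls below the threshold."""
    rewards = harmlessness_rewards(prompts, responses)  # helper from the RLHF sketch above
    return [(p, r, s) for p, r, s in zip(prompts, responses, rewards) if s >= threshold]
```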