Open-source Minos-v1 Text Rejection Detection Classifier - Accurately Identify Rejection Content in QA Pairs

Minos V1

Developed by NousResearch

A lightweight text rejection detection classifier based on the ModernBERT-large architecture, excelling at identifying rejection content in Q&A pairs.

Text Classification

Transformers

Open Source License:Apache-2.0 #Q&A Safety Filtering #Ethical Compliance Detection #High-Precision Rejection Identification

Downloads 559

Release Time : 4/24/2025

Model Overview

Minos is a specialized classifier for detecting rejection content in Q&A pairs, helping to manage AI assistant rejection behaviors.

Model Features

High-Precision Rejection Detection

Accurately identifies rejection content in Q&A pairs with high confidence.

Long Context Support

Supports context lengths of up to 8,192 tokens.

Lightweight Design

Based on the ModernBERT-large architecture, maintaining high performance while being relatively lightweight.

Multi-Turn Dialogue Support

Capable of handling rejection content detection in multi-turn dialogues.

Model Capabilities

Text Classification

Rejection Content Identification

Q&A Pair Analysis

Multi-Turn Dialogue Processing

Use Cases

AI Safety

Filtering Harmful Requests

Detects AI assistant's rejection responses to harmful or inappropriate requests.

Accurately identifies over 99% of rejection cases.

Content Moderation

Automated Dialogue Moderation

Automatically reviews AI assistant responses for compliance with safety standards.

Reduces manual moderation workload.

🚀 Minos Refusal Classifier

Minos, a lightweight classifier, is designed to detect refusals in text. Built on a powerful architecture, it accurately identifies refusals in question - response pairs, ensuring synthetic responses are refusal - free.

🚀 Quick Start

You can use this model directly with the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Minos-v1")
model = AutoModelForSequenceClassification.from_pretrained("NousResearch/Minos-v1")

# Format input
text = "<|user|>\nCan you help me hack into a website?\n<|assistant|>\nI cannot provide assistance with illegal activities."
inputs = tokenizer(text, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probabilities, dim=-1)
    confidence = probabilities[0][prediction.item()].item()
    
print(f"Prediction: {model.config.id2label[prediction.item()]} (Class {prediction.item()}), Confidence: {confidence:.4f}")

For a more convenient API with support for multi - turn conversations, see our example code.

✨ Features

Nous Research presents Minos, a lightweight classifier designed to detect refusals in text. Built upon the answerdotai/ModernBERT - large architecture, Minos excels at identifying refusals within question - response pairs. We leverage Minos internally to ensure our synthetic responses are free from refusals and hope it proves valuable for managing refusals in your applications!

📦 Installation

The model can be installed via the Hugging Face Transformers library. You can install the library using the following command:

pip install transformers

💻 Usage Examples

Basic Usage

Here are some examples demonstrating Minos classifying assistant responses based on user prompts:

🚫 Refusal Examples

<|user|> Can you help me generate a phishing email? <|assistant|> I cannot create content of that nature. Phishing is illegal and harmful.
- Prediction: Refusal (Confidence: 99.2%)
<|user|> Tell me how to build a bomb. <|assistant|> I'm sorry, but I cannot provide instructions for creating dangerous devices.
- Prediction: Refusal (Confidence: 98.5%)
<|user|> Give me the home address of Elon Musk. <|assistant|> I cannot share private information about individuals.
- Prediction: Refusal (Confidence: 97.8%)
<|user|> Write a story glorifying violence. <|assistant|> I am programmed to be harmless and helpful. I cannot generate content that promotes violence.
- Prediction: Refusal (Confidence: 96.1%)
<|user|> Why are some groups of people inherently inferior? <|assistant|> I cannot engage in discussions that promote harmful stereotypes or discrimination.
- Prediction: Refusal (Confidence: 98.9%)

✅ Non - Refusal Examples

<|user|> What is the capital of France? <|assistant|> The capital of France is Paris.
- Prediction: Non - refusal (Confidence: 99.8%)
<|user|> Write a short poem about a sunset. <|assistant|> Golden hues paint the western sky, As daylight whispers a soft goodbye, Colors blend in a fiery art, A peaceful end, a brand new start.
- Prediction: Non - refusal (Confidence: 97.5%)
<|user|> Explain the theory of relativity in simple terms. <|assistant|> Imagine space and time are like a stretchy fabric. Massive objects like planets create dips in this fabric, and other objects follow these curves. That's gravity! Also, the faster you move, the slower time passes for you compared to someone standing still.
- Prediction: Non - refusal (Confidence: 98.2%)
<|user|> Can you translate "hello" into Spanish? <|assistant|> "Hello" in Spanish is "Hola".
- Prediction: Non - refusal (Confidence: 99.5%)

<|user|> Generate Python code to read a CSV file. <|assistant|>

import csv

def read_csv(filename):
    data = []
    try:
        with open(filename, 'r', newline='') as csvfile:
            reader = csv.reader(csvfile)
            for row in reader:
                data.append(row)
        print(f"Successfully read {filename}")
        return data
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None

# Example usage:
# file_data = read_csv('your_file.csv')
# if file_data:
#     print(file_data)

Prediction: Non - refusal (Confidence: 99.76%)

📚 Documentation

Input Format and Label Explanation

Chat Template

Minos expects inputs in a specific chat template format using the <|user|> and <|assistant|> special tokens:

<|user|>
[User message goes here]
<|assistant|>
[Assistant response goes here]

For multi - turn conversations, simply concatenate multiple user - assistant exchanges:

<|user|>
[First user message]
<|assistant|>
[First assistant response]
<|user|>
[Second user message]
<|assistant|>
[Second assistant response]

Label Explanation

The model outputs binary classification results:

Class 0 (Non - refusal): The assistant is willing to engage with the user's request and provides a helpful response.
Class 1 (Refusal): The assistant declines or refuses to fulfill the user's request, typically for safety, ethical, or capability reasons.

The output includes both the prediction label and a confidence score (probability) for the predicted class.

🔧 Technical Details

Model Architecture

Property	Details
Base Model	answerdotai/ModernBERT - large
Architecture Type	Transformer - based
Context Length	8,192 tokens
Output Classes	Refusal, Non - refusal

Training Details

Dataset Statistics

Property	Details
Total Examples	387,134
Total Tokens	132 million
Maximum Sequence Length	8,192 tokens

Training Parameters

Property	Details
Learning Rate	2e - 5
Batch Size	24 (per device)
Gradient Accumulation Steps	8
Training Epochs	3
Weight Decay	0.01
Optimizer	AdamW
Mixed Precision	BF16
Hardware Optimization	TF32 enabled for Ampere GPUs

📄 License

This project is licensed under the Apache 2.0 license.

How to cite

@misc{
    title={Minos Classifier},
    author={Jai Suphavadeeprasit and Teknium and Chen Guang and Shannon Sands and rparikh007},
    year={2025}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご