# ThinkPRM-1.5B Model Card
ThinkPRM-1.5B is a generative Process Reward Model (PRM) based on the R1-Distill-Qwen-1.5B architecture. It performs step-by-step verification of reasoning processes (such as mathematical solutions) by generating an explicit verification chain-of-thought (CoT) with every step labeled. This model is highly data-efficient, requiring significantly less supervision data than traditional discriminative PRMs while achieving strong performance.
An example of the model's verification output is shown in the Quick Start section below.
## Documentation
### Model Description
ThinkPRM-1.5B produces step-level verification scores by generating natural language critiques and correctness judgments for each step in a given solution prefix. It leverages the underlying reasoning capabilities of the base Large Reasoning Model (LRM) and enhances them through fine-tuning on a small (1K examples) dataset of synthetically generated verification CoTs. These synthetic CoTs were produced by prompting QwQ-32B-Preview and filtered against ground-truth step labels from the PRM800K dataset to ensure quality.
The model uses a standard language modeling objective, which makes it interpretable and allows it to scale process verification compute by generating longer or multiple verification CoTs. It outperformed LLM-as-a-judge and discriminative PRM baselines (based on the same R1-Distill-Qwen-1.5B model but trained on ~100x more labels) on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
- Finetuned from model: [R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
### Direct Use
ThinkPRM-1.5B is designed for verifying the correctness of step-by-step reasoning processes. Its primary uses include:
- Scoring Solutions: Assign step-level or overall scores to candidate solutions for ranking in Best-of-N sampling or guiding tree search in reasoning tasks (a parsing sketch follows this list).
- Generating Verification Rationales/CoTs: Produce detailed chain-of-thought verifications that explain why a particular step is correct or incorrect, aiding interpretability.
- Standalone Verification: Evaluate the correctness of a given problem-solution pair.
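For Best-of-N ranking, step-level labels can be read directly off the generated verification CoT. The following is a minimal sketch, assuming the CoT marks each step with `\boxed{correct}` or `\boxed{incorrect}` as in the Quick Start example below; the helper names are illustrative and not part of a released API.

```python
import re

def parse_step_labels(verification_cot: str) -> list[bool]:
    """Extract per-step correctness labels from a verification CoT that marks
    each step with \\boxed{correct} or \\boxed{incorrect}."""
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [label == "correct" for label in labels]

def solution_score(verification_cot: str) -> float:
    """Aggregate a verification CoT into a single score for ranking candidates
    (here: the fraction of steps judged correct; 0.0 if no labels were found)."""
    labels = parse_step_labels(verification_cot)
    return sum(labels) / len(labels) if labels else 0.0
```

Candidate solutions can then be ranked by `solution_score` over their verification CoTs in a Best-of-N loop.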
The model has been evaluated on mathematical reasoning (MATH, AIME), scientific QA (GPQA), and code generation (LiveCodeBench). See our paper for more details.
### Limitations
- Overconfidence: Generative PRMs like ThinkPRM can sometimes produce scores clustered near 0 or 1, potentially not reflecting true uncertainty (an illustrative soft-scoring sketch follows this list).
- Step Label Interference: The autoregressive nature might cause an early incorrect step judgment to negatively bias the evaluation of subsequent steps.
- Sensitivity to Formatting/Prompting: Performance might be sensitive to the exact format of the input solution and the prompt used for verification (though fine-tuning likely reduces this compared to LLM-as-a-judge).
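One illustrative way to obtain a softer score than the hard Yes/No label is to compare the token-level probabilities of "Yes" and "No" at the final judgment position. This is only a sketch: it assumes the `llm`, `tokenizer`, and `prompt` objects from the Quick Start section below, and that the verification CoT ends with "Is the solution correct? Yes/No" as in the example there; it is not part of the released code.

```python
import math

# Assumes `llm`, `tokenizer`, and `prompt` from the Quick Start section below.
# Request top-k logprobs so "Yes" and "No" can be compared at the judgment position.
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096, logprobs=20)
outputs = llm.generate(prompt, sampling_params)
gen = outputs[0].outputs[0]

soft_score = None
# Scan backwards for the last generated "Yes"/"No" token (the final judgment).
for pos in range(len(gen.token_ids) - 1, -1, -1):
    token_text = tokenizer.decode([gen.token_ids[pos]]).strip().lower()
    if token_text in ("yes", "no"):
        candidates = gen.logprobs[pos]  # dict: token_id -> Logprob
        p_yes = p_no = 0.0
        for token_id, lp in candidates.items():
            decoded = tokenizer.decode([token_id]).strip().lower()
            if decoded == "yes":
                p_yes = math.exp(lp.logprob)
            elif decoded == "no":
                p_no = math.exp(lp.logprob)
        if p_yes + p_no > 0:
            soft_score = p_yes / (p_yes + p_no)
        break

print(f"Soft correctness score: {soft_score}")
```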
### Quick Start
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "launch/ThinkPRM-1.5B"

# The tokenizer is only needed to build the prompt via the chat template;
# generation itself runs through vLLM.
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=16384)
# A problem and a candidate solution whose second step is deliberately wrong
problem = "Solve for x: 2x + 3 = 7"
prefix = "Step 1: Subtract 3 from both sides: 2x = 4\nStep 2: Divide by 2: x = 1"
prompt = f"""You are given a math problem and a proposed step-by-step solution:
[Math Problem]
{problem}
[Solution]
{prefix}
Review and critique each step in the proposed solution to determine whether each step is correct. If the solution is incomplete, only verify the provided steps
"""
# Wrap the query in the chat template and append the verifier's standard opening line
prompt = tokenizer.apply_chat_template(
    [{'role': "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
) + "\nLet's verify step by step:"
# Greedy decoding of a single verification chain-of-thought
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
    stop=None,
)
outputs = llm.generate(prompt, sampling_params)
verification_cot = outputs[0].outputs[0].text
print(verification_cot)
"""
Step 1: Subtract 3 from both sides: 2x = 4
Critique: Starting with the equation 2x + 3 = 7, subtracting 3 from both sides is a correct operation to isolate the term with the variable. So, 2x + 3 - 3 = 7 - 3, which simplifies to 2x = 4. This step seems correct.
Step 2: Divide by 2: x = 1
Critique: Now, to solve for x, we need to divide both sides of the equation by 2. So, 2x / 2 = 4 / 2, which simplifies to x = 2. Wait a minute, the solution says x = 1, but according to this calculation, it should be x = 2. This seems incorrect.
Therefore, the first step is correct, but the second step has an error.
**Final Output:**
Let's verify step by step:
Step 1: Subtract 3 from both sides: 2x = 4
Critique: This step is correct. Subtracting 3 from both sides of the equation 2x + 3 = 7 properly isolates the term with the variable, resulting in 2x = 4.
Step 1 is \boxed{correct}
Step 2: Divide by 2: x = 1
Critique: This step is incorrect. Dividing both sides of the equation 2x = 4 by 2 should yield x = 2, not x = 1.
Step 2 is \boxed{incorrect}
</think>
Is the solution correct? No
"""
## License
This model is released under the Apache 2.0 license (`apache-2.0`).