TinyV 1.5B
Fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct model using the TinyV reward system, which provides more accurate reward signals during reinforcement learning (RL) post-training, significantly improving RL efficiency and the performance of the final model.
Downloads 1,124
Release Date: 4/13/2025
Model Overview
This model is a fine-tuned large language model that uses the TinyV reward system to improve the efficiency of reinforcement learning training and the performance of the resulting model.
Model Features
TinyV reward system
Provides more accurate reward signals via a small LLM-based verifier, significantly improving reinforcement learning efficiency and model performance.
Efficient reinforcement learning
Incurs only about 6% additional computational cost while significantly improving training efficiency and the performance of the final model.
False negative detection
Capable of detecting false negatives that rule-based verifiers produce, providing more accurate training feedback.
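To illustrate the kind of false negative the model card refers to, here is a minimal sketch (the function names and the tolerant check are illustrative assumptions, not the TinyV implementation): a strict string-match verifier rejects a correct answer written in a different form, while a more tolerant verifier, such as an LLM-based one like TinyV, can accept it.

```python
from fractions import Fraction

def rule_based_verify(answer: str, reference: str) -> bool:
    # Strict string comparison, as in many rule-based RL verifiers.
    return answer.strip() == reference.strip()

def tolerant_verify(answer: str, reference: str) -> bool:
    # Hypothetical stand-in for an LLM-based verifier like TinyV:
    # here we simply compare numeric values, so "0.5" matches "1/2".
    try:
        return Fraction(answer.strip()) == Fraction(reference.strip())
    except ValueError:
        return answer.strip() == reference.strip()

# "0.5" is a correct answer, but the strict check rejects it:
print(rule_based_verify("0.5", "1/2"))  # False -> a false negative
print(tolerant_verify("0.5", "1/2"))    # True -> accurate reward signal
```

In real training, the tolerant check would be a small LLM judging semantic equivalence rather than a numeric comparison; the point is that the reward signal stops penalizing correct answers that merely differ in surface form.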
Model Capabilities
Text generation
Reinforcement learning optimization
Reward signal provision
Use Cases
Reinforcement learning training
Efficient RL training
Use the TinyV reward system during reinforcement learning training to significantly improve RL efficiency and the performance of the final model.
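The role a verifier plays in such a training loop can be sketched as follows. This is a simplified illustration under stated assumptions: the `verify` callable and `compute_rewards` helper are hypothetical names, and in practice `verify` would invoke the TinyV model rather than a local function.

```python
from typing import Callable, List

def compute_rewards(
    completions: List[str],
    references: List[str],
    verify: Callable[[str, str], bool],
) -> List[float]:
    # Binary reward per sampled completion; the pluggable verifier
    # decides correctness. Swapping a rule-based `verify` for a
    # TinyV-style LLM verifier changes only this callable, which is
    # how it can reduce false negatives at small extra cost.
    return [
        1.0 if verify(completion, reference) else 0.0
        for completion, reference in zip(completions, references)
    ]

def exact_match(completion: str, reference: str) -> bool:
    # Baseline rule-based verifier: strict string equality.
    return completion.strip() == reference.strip()

rewards = compute_rewards(["42", "41"], ["42", "42"], exact_match)
print(rewards)  # [1.0, 0.0]
```

The rewards produced this way would then feed a policy-gradient RL algorithm; the model card's claim is that a more accurate `verify` yields better gradients for roughly 6% extra compute.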