# Reinforcement Learning Optimization

## Mmada 8B MixCoT
MIT · Gen-Verse · Text-to-Image, Transformers · 601 downloads · 3 likes

MMaDA is a novel class of multimodal diffusion foundation models, excelling in various domains such as text reasoning, multimodal understanding, and text-to-image generation.

## Reasongen R1
Apache-2.0 · Franklin0 · Text-to-Image, Transformers · 142 downloads · 1 like

ReasonGen-R1 is an autoregressive image generation model that integrates chain-of-thought reasoning. It enhances the logic and quality of image generation through SFT and RL.

## Thinkless 1.5B Warmup
Apache-2.0 · Vinnnf · Large Language Model, Transformers · 966 downloads · 1 like

Thinkless is a learnable framework that enables large models to adaptively choose between short-form and long-chain reasoning based on task complexity and their own capabilities.

## Qwen2.5 VL 3B UI R1 E
MIT · LZXzju · Image-to-Text, Safetensors, English · 75 downloads · 3 likes

UI-R1-E-3B is an efficient GUI grounding model fine-tuned from Qwen2.5-VL-3B-Instruct. It specializes in visual question answering, particularly locating and identifying actionable elements in user-interface screenshots.

## Llama 3.1 Nemotron Nano 8B V1 GGUF
Other · unsloth · Large Language Model, Transformers, English · 22.18k downloads · 3 likes

Llama-3.1-Nemotron-Nano-8B-v1 is a reasoning model derived from Meta's Llama-3.1-8B-Instruct, post-trained to improve reasoning ability, alignment with human chat preferences, and task execution.

## INFRL Qwen2.5 VL 72B Preview Q8 With Bf16 Output And Bf16 Embedding.gguf
Apache-2.0 · GeorgyGUF · Text-to-Image, English · 64 downloads · 0 likes

An improved multimodal vision-language model based on Qwen2.5-VL-72B-Instruct, excelling in multiple visual reasoning benchmarks.

## INFRL Qwen2.5 VL 72B Preview Bf16.gguf
Apache-2.0 · GeorgyGUF · Text-to-Image, English · 40 downloads · 0 likes

An optimized vision-language model based on Qwen2.5-VL-72B-Instruct, excelling in multiple visual reasoning benchmarks.

## Llama 3.1 8B Instruct
RedHatAI · Large Language Model, Safetensors, Supports Multiple Languages · 292 downloads · 1 like

An 8B-parameter multilingual large language model from the Meta Llama 3.1 series, optimized for multilingual dialogue use cases and supporting eight languages.

## RM R1 DeepSeek Distilled Qwen 14B
MIT · gaotang · Large Language Model, Transformers, English · 95 downloads · 1 like

RM-R1 is a training framework for reasoning reward models (ReasRM), which evaluate candidate answers by generating scoring criteria or reasoning traces, providing explainable judgments.

## II Medical 7B Preview
Intelligent-Internet · Large Language Model, Transformers · 112 downloads · 9 likes

A medical reasoning model fine-tuned from Qwen/Qwen2.5-7B-Instruct, excelling in multiple medical QA benchmarks.

## Skywork VL Reward 7B
MIT · Skywork · Multimodal Fusion, Transformers · 30 downloads · 8 likes

Skywork-VL-Reward-7B is a 7B-parameter multimodal reward model built on the Qwen2.5-VL-7B-Instruct architecture, with a value head added for reward-model training (the value-head pattern is sketched below).

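The "value head" mentioned above is a common pattern for turning a pretrained backbone into a reward model: a small linear layer maps the final hidden state to a scalar score. Below is a minimal, self-contained PyTorch sketch of that pattern, using a small text-only backbone for illustration; the backbone name, pooling choice, and head are assumptions, not Skywork's actual implementation.

```python
# Minimal sketch of the value-head pattern: a scalar head on top of a
# pretrained backbone's last hidden state. The backbone, pooling choice,
# and head are illustrative assumptions, not Skywork's implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-0.5B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Value head: final hidden state -> scalar reward.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score the last non-padding token of each sequence (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)  # (batch,) scalar scores

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = RewardModel()
batch = tokenizer(["Q: 2+2? A: 4", "Q: 2+2? A: 5"],
                  return_tensors="pt", padding=True)
print(model(**batch))  # after preference training, higher = preferred
```

Such heads are usually trained with a pairwise preference loss (e.g., Bradley-Terry), pushing the chosen response's score above the rejected one's.
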
## Deepcoder 1.5B Preview GGUF
MIT · Mungert · Large Language Model, English · 888 downloads · 2 likes

A code-reasoning large language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B, using distributed reinforcement learning to scale to long context lengths (a local-inference sketch follows).

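GGUF files like this one are meant to run locally through llama.cpp-compatible runtimes. A minimal sketch with llama-cpp-python follows; the repo id and quantization filename are assumptions to verify on the hub before use.

```python
# Minimal local-inference sketch for a GGUF build via llama-cpp-python.
# Repo id and quant filename are assumptions; check the hub for the
# exact files available in this repository.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Mungert/DeepCoder-1.5B-Preview-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                        # pick an available quant
    n_ctx=8192,                                     # room for long code reasoning
)
out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```
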
## Tinyllava Video R1
Apache-2.0 · Zhang199 · Video-to-Text, Transformers · 123 downloads · 2 likes

TinyLLaVA-Video-R1 is a small-scale video reasoning model built on the fully traceable TinyLLaVA-Video. Reinforcement learning significantly enhances its reasoning and thinking abilities, and it exhibits the emergent property of 'aha moments'.

## Deepcoder 14B Preview Exl2
cgus · Large Language Model, English · 46 downloads · 2 likes

DeepCoder-14B-Preview is a code generation model developed from DeepSeek-R1-Distill-Qwen-14B, focusing on solving verifiable programming problems.

## Deepcoder 1.5B Preview Exl2 4.65bpw
MIT · async0x42 · Large Language Model, Transformers, English · 14 downloads · 3 likes

A code-reasoning large language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B, using distributed reinforcement learning to improve long-context processing.

## Quasar 3.0 Instract V2
silx-ai · Large Language Model, Transformers · 314 downloads · 8 likes

Quasar-3.0-7B is the distilled version of the upcoming 400B Quasar 3.0 model, showcasing the early strength and potential of the Quasar architecture.

## Quasar 3.0 Final
silx-ai · Large Language Model, Transformers · 118 downloads · 4 likes

Quasar-3.0-Max is a 7B-parameter distilled model from SILX INC, showcasing the early potential of the Quasar architecture through its innovative TTM training process and reinforcement learning techniques.

## VARGPT V1.1
Apache-2.0 · VARGPT-family · Text-to-Image, Transformers, English · 954 downloads · 6 likes

VARGPT-v1.1 is a visual autoregressive unified large model, enhanced through iterative instruction tuning and reinforcement learning, capable of performing both visual understanding and generation tasks.

## VARGPT V1.1 Edit
Apache-2.0 · VARGPT-family · Text-to-Image, Transformers, English · 169 downloads · 1 like

VARGPT-v1.1 is a visual autoregressive unified large model enhanced through iterative instruction tuning and reinforcement learning, supporting visual understanding and generation tasks.

## Qwen2.5 VL 3B UI R1
MIT · LZXzju · Text-to-Image, English · 96 downloads · 6 likes

UI-R1 is a vision-language model enhanced by reinforcement learning for GUI agent action prediction, built upon Qwen2.5-VL-3B-Instruct.

## R1 Aqa
Apache-2.0 · mispeech · Audio-to-Text, Transformers · 791 downloads · 14 likes

R1-AQA is an audio question answering model based on Qwen2-Audio-7B-Instruct, optimized with the Group Relative Policy Optimization (GRPO) algorithm and achieving state-of-the-art performance on the MMAU benchmark (the group-relative advantage at the core of GRPO is sketched below).

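GRPO replaces PPO's learned critic with a group-relative baseline: for each prompt, several responses are sampled and scored, and each response's advantage is its reward normalized against its own group. A minimal sketch of that computation, assuming scalar task rewards (e.g., answer correctness), follows; it illustrates the core idea only, not this model's actual training code.

```python
# Minimal sketch of GRPO's group-relative advantage, assuming scalar task
# rewards (e.g., 1.0 for a correct answer, 0.0 otherwise).
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """group_rewards: (num_prompts, group_size) rewards for sampled responses."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    # Each response is scored relative to the other samples for the same
    # prompt, so no learned value network (critic) is required.
    return (group_rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # prompt 1: two correct samples
                        [0.0, 0.0, 1.0, 0.0]])  # prompt 2: one correct sample
print(grpo_advantages(rewards))
```
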
## Light R1 14B DS
Apache-2.0 · qihoo360 · Large Language Model, Transformers · 2,890 downloads · 33 likes

Light-R1-14B-DS is a 14B-parameter state-of-the-art math model trained with reinforcement learning, excelling on the AIME24/25 and GPQA benchmarks.

## Visualthinker R1 Zero
MIT · turningpoint-ai · Image-to-Text, Safetensors, English · 578 downloads · 6 likes

The first multimodal reasoning model to reproduce the 'aha moment' and increased response length on just a 2B model without supervised fine-tuning.

## DPO A5 Nlp
EraCoding · Large Language Model, Transformers · 26 downloads · 1 like

A model trained with TRL, a reinforcement learning library built on top of the Transformers ecosystem for training and fine-tuning language models (a DPO fine-tuning sketch follows).

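For context, here is a minimal sketch of preference fine-tuning with TRL's DPOTrainer, assuming a small instruct model and a toy (prompt, chosen, rejected) dataset; this is illustrative only, not the recipe behind this particular checkpoint.

```python
# Minimal sketch of preference fine-tuning with TRL's DPOTrainer, using a
# small instruct model and a toy dataset. Illustrative only.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO learns from (prompt, chosen, rejected) preference triples.
train_dataset = Dataset.from_dict({
    "prompt":   ["What is 2 + 2?"],
    "chosen":   ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})

args = DPOConfig(output_dir="dpo-sketch",
                 per_device_train_batch_size=1,
                 max_steps=10,
                 beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer)  # tokenizer= on older TRL
trainer.train()
```
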
## Qwen2.5vl 3B VLM R1 REC 500steps
omlab · Text-to-Image, Safetensors, English · 976 downloads · 22 likes

A vision-language model based on Qwen2.5-VL-3B-Instruct, enhanced with VLM-R1 reinforcement learning and focused on referring expression comprehension tasks (an inference sketch follows).

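Referring expression comprehension asks the model to localize the region a textual expression describes. A minimal inference sketch using the standard Qwen2.5-VL transformers API follows; the hub id, image path, and prompt wording are assumptions.

```python
# Minimal REC inference sketch for a Qwen2.5-VL-family checkpoint.
# The repo id below is assumed from this entry; verify it on the hub.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps"  # assumed hub id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Locate 'the person holding a red umbrella' "
                             "and answer with a bounding box."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image],
                   return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
reply = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)[0]
print(reply)  # typically coordinates such as [x1, y1, x2, y2]
```
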
## Text2graph R1 Qwen2.5 0.5b
Apache-2.0 · Ihor · Knowledge Graph, English · 199 downloads · 20 likes

A text-to-graph information extraction model based on Qwen-2.5-0.5B, jointly trained with reinforcement learning (GRPO) and supervised learning.

## Cycleresearcher 12B Original
Other · WestlakeNLP · Large Language Model, Transformers, Supports Multiple Languages · 250 downloads · 1 like

CycleResearcher is an automated research system based on reinforcement learning and iterative feedback, trained specifically for machine learning research, covering fields such as computer vision and natural language processing.

## T5 Query Reformulation RL
Apache-2.0 · prhegde · Large Language Model, Transformers, Supports Multiple Languages · 366 downloads · 6 likes

A generative model designed for search query rewriting, employing a sequence-to-sequence architecture and a reinforcement learning framework to produce diverse and relevant query rewrites (a usage sketch follows).

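Since the entry describes a sequence-to-sequence rewriter, the usual way to obtain diverse rewrites is sampling at generation time. A minimal sketch follows; the hub id is inferred from the author and model name above, so verify it before use.

```python
# Minimal usage sketch for the seq2seq query rewriter. The hub id is
# inferred from the author and model name above; verify it before use.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "prhegde/t5-query-reformulation-RL"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

query = "how to tighten a loose bike chain"
inputs = tokenizer(query, return_tensors="pt")
# Sampling (rather than greedy decoding) yields diverse rewrites.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                             top_k=50, num_return_sequences=3)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```
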
## Speechless Llama2 Luban Orca Platypus 13b
uukuguy · Large Language Model, Transformers, English · 94 downloads · 4 likes

A merge of AIDC-ai-business/Luban-13B and Open-Orca/OpenOrca-Platypus2-13B, forming a 13-billion-parameter large language model based on the Llama 2 architecture.

## Ppo LunarLanderContinuous V2
sb3 · Physics Model · 15 downloads · 0 likes

A reinforcement learning agent trained with the PPO algorithm for the LunarLanderContinuous-v2 environment, capable of landing the lunar lander smoothly (a training sketch follows).

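A minimal stable-baselines3 sketch of the same setup is shown below, training a fresh PPO agent on LunarLanderContinuous-v2 (requires gymnasium's box2d extra). Loading sb3's exact pretrained checkpoint is also possible via the huggingface_sb3 helper, which is not shown here.

```python
# Minimal sketch: train and roll out a fresh PPO agent on the same task
# with stable-baselines3 (needs gymnasium's box2d extra).
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("LunarLanderContinuous-v2")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # toy budget; the hub agent trained far longer

obs, _ = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```
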
## Bart Rl
alaggung · Text Generation, Transformers, Korean · 18 downloads · 0 likes

A Korean dialogue summarization model based on the BART architecture, trained by the 'Alaggung Dalaggung' team for the 2021 Hunminjeongeum Korean Speech & Natural Language AI Competition.