# Reinforcement Learning Optimization

## Mmada 8B MixCoT
MIT · Gen-Verse · Text-to-Image, Transformers · 601 downloads · 3 likes

MMaDA is a novel class of multimodal diffusion foundation models, excelling in various domains such as text reasoning, multimodal understanding, and text-to-image generation.

## Reasongen R1
Apache-2.0 · Franklin0 · Text-to-Image, Transformers · 142 downloads · 1 like

ReasonGen-R1 is an autoregressive image generation model that integrates chain-of-thought reasoning. It enhances the logic and quality of image generation through SFT and RL.

## Thinkless 1.5B Warmup
Apache-2.0 · Vinnnf · Large Language Model, Transformers · 966 downloads · 1 like

Thinkless is a learnable framework that enables large models to adaptively choose between short-form and long-chain reasoning based on task complexity and their own capabilities.

## Qwen2.5 VL 3B UI R1 E
MIT · LZXzju · Image-to-Text, Safetensors, English · 75 downloads · 3 likes

UI-R1-E-3B is an efficient GUI grounding model fine-tuned from Qwen2.5-VL-3B-Instruct. It specializes in visual question answering, particularly locating and identifying actionable elements in user-interface screenshots.

## Llama 3.1 Nemotron Nano 8B V1 GGUF
Other · unsloth · Large Language Model, Transformers, English · 22.18k downloads · 3 likes

Llama-3.1-Nemotron-Nano-8B-v1 is a reasoning model derived from Meta's Llama-3.1-8B-Instruct, post-trained to improve reasoning ability, alignment with human chat preferences, and task execution.

## INFRL Qwen2.5 VL 72B Preview Q8 With Bf16 Output And Bf16 Embedding.gguf
Apache-2.0 · GeorgyGUF · Text-to-Image, English · 64 downloads · 0 likes

An improved multimodal vision-language model based on Qwen2.5-VL-72B-Instruct, excelling in multiple visual reasoning benchmarks.

## INFRL Qwen2.5 VL 72B Preview Bf16.gguf
Apache-2.0 · GeorgyGUF · Text-to-Image, English · 40 downloads · 0 likes

An optimized vision-language model based on Qwen2.5-VL-72B-Instruct, excelling in multiple visual reasoning benchmarks.

## Llama 3.1 8B Instruct
RedHatAI · Large Language Model, Safetensors, Supports Multiple Languages · 292 downloads · 1 like

An 8B-parameter multilingual large language model from the Meta Llama 3.1 series, optimized for multilingual dialogue use cases and supporting eight languages.

## RM R1 DeepSeek Distilled Qwen 14B
MIT · gaotang · Large Language Model, Transformers, English · 95 downloads · 1 like

RM-R1 is a training framework for reasoning reward models (ReasRM), which evaluate candidate answers by generating scoring criteria or reasoning traces, providing explainable judgments.

## II Medical 7B Preview
Intelligent-Internet · Large Language Model, Transformers · 112 downloads · 9 likes

A medical reasoning model fine-tuned from Qwen/Qwen2.5-7B-Instruct, excelling in multiple medical QA benchmarks.

## Skywork VL Reward 7B
MIT · Skywork · Multimodal Fusion, Transformers · 30 downloads · 8 likes

Skywork-VL-Reward-7B is a 7B-parameter multimodal reward model built on the Qwen2.5-VL-7B-Instruct architecture, with a value head added for reward-model training (the value-head pattern is sketched below).

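The "value head" mentioned above is a common pattern for turning a pretrained backbone into a reward model: a small linear layer maps the final hidden state to a scalar score. Below is a minimal, self-contained PyTorch sketch of that pattern, using a small text-only backbone for illustration; the backbone name, pooling choice, and head are assumptions, not Skywork's actual implementation.

```python
# Minimal sketch of the value-head pattern: a scalar head on top of a
# pretrained backbone's last hidden state. The backbone, pooling choice,
# and head are illustrative assumptions, not Skywork's implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-0.5B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Value head: final hidden state -> scalar reward.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score the last non-padding token of each sequence (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)  # (batch,) scalar scores

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = RewardModel()
batch = tokenizer(["Q: 2+2? A: 4", "Q: 2+2? A: 5"],
                  return_tensors="pt", padding=True)
print(model(**batch))  # after preference training, higher = preferred
```

Such heads are usually trained with a pairwise preference loss (e.g., Bradley-Terry), pushing the chosen response's score above the rejected one's.
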
## Deepcoder 1.5B Preview GGUF
MIT · Mungert · Large Language Model, English · 888 downloads · 2 likes

A code-reasoning large language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B, using distributed reinforcement learning to scale to long context lengths (a local-inference sketch follows).

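GGUF files like this one are meant to run locally through llama.cpp-compatible runtimes. A minimal sketch with llama-cpp-python follows; the repo id and quantization filename are assumptions to verify on the hub before use.

```python
# Minimal local-inference sketch for a GGUF build via llama-cpp-python.
# Repo id and quant filename are assumptions; check the hub for the
# exact files available in this repository.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Mungert/DeepCoder-1.5B-Preview-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                        # pick an available quant
    n_ctx=8192,                                     # room for long code reasoning
)
out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```
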
## Tinyllava Video R1
Apache-2.0 · Zhang199 · Video-to-Text, Transformers · 123 downloads · 2 likes

TinyLLaVA-Video-R1 is a small-scale video reasoning model built on the fully traceable TinyLLaVA-Video. Reinforcement learning significantly enhances its reasoning and thinking abilities, and it exhibits the emergent property of 'aha moments'.

## Deepcoder 14B Preview Exl2
cgus · Large Language Model, English · 46 downloads · 2 likes

DeepCoder-14B-Preview is a code generation model developed from DeepSeek-R1-Distill-Qwen-14B, focusing on solving verifiable programming problems.

## Deepcoder 1.5B Preview Exl2 4.65bpw
MIT · async0x42 · Large Language Model, Transformers, English · 14 downloads · 3 likes

A code-reasoning large language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B, using distributed reinforcement learning to improve long-context processing.

## Quasar 3.0 Instract V2
silx-ai · Large Language Model, Transformers · 314 downloads · 8 likes

Quasar-3.0-7B is the distilled version of the upcoming 400B Quasar 3.0 model, showcasing the early strength and potential of the Quasar architecture.

## Quasar 3.0 Final
silx-ai · Large Language Model, Transformers · 118 downloads · 4 likes

Quasar-3.0-Max is a 7B-parameter distilled model from SILX INC, showcasing the early potential of the Quasar architecture through its innovative TTM training process and reinforcement learning techniques.

## VARGPT V1.1
Apache-2.0 · VARGPT-family · Text-to-Image, Transformers, English · 954 downloads · 6 likes

VARGPT-v1.1 is a visual autoregressive unified large model, enhanced through iterative instruction tuning and reinforcement learning, capable of performing both visual understanding and generation tasks.

## VARGPT V1.1 Edit
Apache-2.0 · VARGPT-family · Text-to-Image, Transformers, English · 169 downloads · 1 like

VARGPT-v1.1 is a visual autoregressive unified large model enhanced through iterative instruction tuning and reinforcement learning, supporting visual understanding and generation tasks.

## Qwen2.5 VL 3B UI R1
MIT · LZXzju · Text-to-Image, English · 96 downloads · 6 likes

UI-R1 is a vision-language model enhanced by reinforcement learning for GUI agent action prediction, built upon Qwen2.5-VL-3B-Instruct.

## R1 Aqa
Apache-2.0 · mispeech · Audio-to-Text, Transformers · 791 downloads · 14 likes

R1-AQA is an audio question answering model based on Qwen2-Audio-7B-Instruct, optimized with the Group Relative Policy Optimization (GRPO) algorithm and achieving state-of-the-art performance on the MMAU benchmark (the group-relative advantage at the core of GRPO is sketched below).

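GRPO replaces PPO's learned critic with a group-relative baseline: for each prompt, several responses are sampled and scored, and each response's advantage is its reward normalized against its own group. A minimal sketch of that computation, assuming scalar task rewards (e.g., answer correctness), follows; it illustrates the core idea only, not this model's actual training code.

```python
# Minimal sketch of GRPO's group-relative advantage, assuming scalar task
# rewards (e.g., 1.0 for a correct answer, 0.0 otherwise).
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """group_rewards: (num_prompts, group_size) rewards for sampled responses."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    # Each response is scored relative to the other samples for the same
    # prompt, so no learned value network (critic) is required.
    return (group_rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # prompt 1: two correct samples
                        [0.0, 0.0, 1.0, 0.0]])  # prompt 2: one correct sample
print(grpo_advantages(rewards))
```
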
## Light R1 14B DS
Apache-2.0 · qihoo360 · Large Language Model, Transformers · 2,890 downloads · 33 likes

Light-R1-14B-DS is a 14B-parameter state-of-the-art math model trained with reinforcement learning, excelling on the AIME24/25 and GPQA benchmarks.

## Visualthinker R1 Zero
MIT · turningpoint-ai · Image-to-Text, Safetensors, English · 578 downloads · 6 likes

The first multimodal reasoning model to reproduce the 'aha moment' and increased response length on just a 2B model without supervised fine-tuning.

## DPO A5 Nlp
EraCoding · Large Language Model, Transformers · 26 downloads · 1 like

A model trained with TRL, a reinforcement learning library built on top of the Transformers ecosystem for training and fine-tuning language models (a DPO fine-tuning sketch follows).

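For context, here is a minimal sketch of preference fine-tuning with TRL's DPOTrainer, assuming a small instruct model and a toy (prompt, chosen, rejected) dataset; this is illustrative only, not the recipe behind this particular checkpoint.

```python
# Minimal sketch of preference fine-tuning with TRL's DPOTrainer, using a
# small instruct model and a toy dataset. Illustrative only.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO learns from (prompt, chosen, rejected) preference triples.
train_dataset = Dataset.from_dict({
    "prompt":   ["What is 2 + 2?"],
    "chosen":   ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})

args = DPOConfig(output_dir="dpo-sketch",
                 per_device_train_batch_size=1,
                 max_steps=10,
                 beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer)  # tokenizer= on older TRL
trainer.train()
```
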
## Qwen2.5vl 3B VLM R1 REC 500steps
omlab · Text-to-Image, Safetensors, English · 976 downloads · 22 likes

A vision-language model based on Qwen2.5-VL-3B-Instruct, enhanced with VLM-R1 reinforcement learning and focused on referring expression comprehension tasks (an inference sketch follows).

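Referring expression comprehension asks the model to localize the region a textual expression describes. A minimal inference sketch using the standard Qwen2.5-VL transformers API follows; the hub id, image path, and prompt wording are assumptions.

```python
# Minimal REC inference sketch for a Qwen2.5-VL-family checkpoint.
# The repo id below is assumed from this entry; verify it on the hub.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps"  # assumed hub id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Locate 'the person holding a red umbrella' "
                             "and answer with a bounding box."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image],
                   return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
reply = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)[0]
print(reply)  # typically coordinates such as [x1, y1, x2, y2]
```
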
## Text2graph R1 Qwen2.5 0.5b
Apache-2.0 · Ihor · Knowledge Graph, English · 199 downloads · 20 likes

A text-to-graph information extraction model based on Qwen-2.5-0.5B, jointly trained with reinforcement learning (GRPO) and supervised learning.

## Cycleresearcher 12B Original
Other · WestlakeNLP · Large Language Model, Transformers, Supports Multiple Languages · 250 downloads · 1 like

CycleResearcher is an automated research system based on reinforcement learning and iterative feedback, trained specifically for machine learning research, covering fields such as computer vision and natural language processing.

## T5 Query Reformulation RL
Apache-2.0 · prhegde · Large Language Model, Transformers, Supports Multiple Languages · 366 downloads · 6 likes

A generative model designed for search query rewriting, employing a sequence-to-sequence architecture and a reinforcement learning framework to produce diverse and relevant query rewrites (a usage sketch follows).

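Since the entry describes a sequence-to-sequence rewriter, the usual way to obtain diverse rewrites is sampling at generation time. A minimal sketch follows; the hub id is inferred from the author and model name above, so verify it before use.

```python
# Minimal usage sketch for the seq2seq query rewriter. The hub id is
# inferred from the author and model name above; verify it before use.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "prhegde/t5-query-reformulation-RL"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

query = "how to tighten a loose bike chain"
inputs = tokenizer(query, return_tensors="pt")
# Sampling (rather than greedy decoding) yields diverse rewrites.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                             top_k=50, num_return_sequences=3)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```
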
## Speechless Llama2 Luban Orca Platypus 13b
uukuguy · Large Language Model, Transformers, English · 94 downloads · 4 likes

A merge of AIDC-ai-business/Luban-13B and Open-Orca/OpenOrca-Platypus2-13B, forming a 13-billion-parameter large language model based on the Llama 2 architecture.

## Ppo LunarLanderContinuous V2
sb3 · Physics Model · 15 downloads · 0 likes

A reinforcement learning agent trained with the PPO algorithm for the LunarLanderContinuous-v2 environment, capable of landing the lunar lander smoothly (a training sketch follows).

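A minimal stable-baselines3 sketch of the same setup is shown below, training a fresh PPO agent on LunarLanderContinuous-v2 (requires gymnasium's box2d extra). Loading sb3's exact pretrained checkpoint is also possible via the huggingface_sb3 helper, which is not shown here.

```python
# Minimal sketch: train and roll out a fresh PPO agent on the same task
# with stable-baselines3 (needs gymnasium's box2d extra).
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("LunarLanderContinuous-v2")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # toy budget; the hub agent trained far longer

obs, _ = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```
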
## Bart Rl
alaggung · Text Generation, Transformers, Korean · 18 downloads · 0 likes

A Korean dialogue summarization model based on the BART architecture, trained by the 'Alaggung Dalaggung' team for the 2021 Hunminjeongeum Korean Speech & Natural Language AI Competition.