SeerAttention-Decode-Qwen3-4B-AttnGates Open-source Model - Supports Qwen3-4B inference tasks and provides weights for the decoding stage

SeerAttention-Decode-Qwen3-4B-AttnGates

Developed by SeerAttention

Provides the decoding-phase AttnGate weights described in the SeerAttention-R paper, for use with the Qwen3-4B model on reasoning tasks.

Large Language Model

Transformers

Open Source License: MIT #Decoding Attention Gates #Inference Task Optimization #Multi-budget Support

Downloads 4,295

Release Time: 6/9/2025

Model Overview

This model contains the decoding-phase attention gate (AttnGate) weights from the SeerAttention-R paper. They are used with the Qwen3-4B model on reasoning tasks.

Model Features

Attention Optimization in Decoding Phase

Provides attention gate weights for the decoding phase of inference

Multi-budget Support

Supports reasoning tasks under different token budgets (1k to 8k in the reported results)

Compatibility with Qwen3 Series

Designed specifically for the Qwen3-4B model

Model Capabilities

Inference Task Optimization

Attention Mechanism Enhancement

Text Generation

Use Cases

Academic Reasoning

AIME Math Contest Problem Solving

Solve AIME math contest problems

Reaches 55.42% to 72.50% pass@1 accuracy on AIME24 with token budgets from 2k to 8k (71.25% with full attention)

GPQA Question Answering

Solve GPQA test questions

Reaches 39.61% to 55.90% pass@1 accuracy on GPQA Diamond with token budgets from 1k to 6k (56.19% with full attention)

Mathematical Problem Solving

MATH500 Math Problem Solving

Solve math problems in the MATH500 dataset

Reaches 84.80% to 93.60% pass@1 accuracy on MATH500 with token budgets from 1k to 6k (93.93% with full attention)

🚀 SeerAttention-R

This repository contains the decode-stage AttnGate weights from the SeerAttention-R paper. The gates are intended for use with the supported models on reasoning tasks.

🚀 Quick Start

This repo provides the AttnGate weights for the decoding stage, to be used together with the corresponding base model. You can access the weights through the following links:
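As a hedged sketch of one way to fetch the weights, the snippet below uses `snapshot_download` from the `huggingface_hub` package. The repository id is an assumption inferred from the model name and developer shown on this page, and hosting on the Hugging Face Hub is likewise assumed. `download_attn_gates` is a hypothetical helper name.

```python
# Assumed repo id: inferred from the model name and developer on this page,
# not confirmed by the source text.
ASSUMED_REPO_ID = "SeerAttention/SeerAttention-Decode-Qwen3-4B-AttnGates"


def download_attn_gates(repo_id: str = ASSUMED_REPO_ID) -> str:
    """Download a snapshot of the repository and return its local directory."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling `download_attn_gates()` requires network access and returns the path of the local copy. Loading the gates into a model is done by the SeerAttention code and is not shown here.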

✨ Features

SeerAttention-R provides AttnGate weights for several models. The results below report reasoning accuracy for each model under a range of token budgets, alongside full attention.
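To make the idea of a token budget concrete, here is a minimal sketch that assumes the gate produces one score per block of cached keys and values, and that the budget is spent on the highest-scoring blocks. The block size of 64 tokens, the function name `select_blocks`, and the toy scores are illustrative assumptions, not the repository's API.

```python
def select_blocks(block_scores: list[float], token_budget: int,
                  block_size: int = 64) -> list[int]:
    """Return the indices of the blocks kept under the budget, in position order."""
    # Number of whole blocks the budget pays for (at least one block is kept).
    num_keep = max(1, token_budget // block_size)
    # Rank block indices by gate score, highest first.
    ranked = sorted(range(len(block_scores)),
                    key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Hypothetical gate scores for six blocks of the KV cache.
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
kept = select_blocks(scores, token_budget=192)  # 192 // 64 = 3 blocks
```

With these toy scores, a 192-token budget keeps blocks 1, 3, and 5. A budget larger than the cached sequence keeps every block, which corresponds to full attention.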

📚 Documentation

Results of Reasoning Tasks

The tables below report reasoning accuracy under different token budgets. All numbers are pass@1 (%), averaged over 64 samples per query for AIME, 16 samples per query for GPQA, and 8 samples per query for MATH-500.
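The averaging can be sketched as follows, assuming each sampled answer is graded as correct or incorrect. The function names and the toy data are hypothetical and do not come from the paper's evaluation code.

```python
from statistics import mean


def pass_at_1(samples_correct: list[bool]) -> float:
    """Fraction of the sampled answers to one query that are correct."""
    return sum(samples_correct) / len(samples_correct)


def averaged_pass_at_1(per_query_samples: list[list[bool]]) -> float:
    """Mean of the per-query pass@1 over the benchmark, as a percentage."""
    return 100.0 * mean(pass_at_1(s) for s in per_query_samples)


# Toy data: 2 queries with 4 samples each (the reported runs use 64, 16, or 8).
toy = [[True, True, False, True], [False, False, True, False]]
score = averaged_pass_at_1(toy)  # (0.75 + 0.25) / 2 = 0.5, i.e. 50.0
```

Averaging over many samples per query reduces the variance that comes from sampling a single answer per problem.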

AIME24

Model	2k	4k	6k	8k	Full Attention
Qwen3-4B	55.42	68.75	70.94	72.50	71.25
Qwen3-8B	56.56	72.29	74.22	75.05	74.48
Qwen3-14B	62.24	75.78	78.02	78.65	78.91
DeepSeek-R1-Distill-Qwen-14B	55.78	66.35	67.50	66.82	67.50

AIME25

Model	2k	4k	6k	8k	Full Attention
Qwen3-4B	45.73	57.60	60.20	62.90	66.41
Qwen3-8B	42.60	56.77	60.31	64.17	67.86
Qwen3-14B	46.67	62.66	67.19	69.01	70.21
DeepSeek-R1-Distill-Qwen-14B	38.44	47.19	52.25	50.05	50.00

MATH500

Model	1k	2k	4k	6k	Full Attention
Qwen3-4B	84.80	92.20	93.60	93.60	93.93
Qwen3-8B	82.82	91.53	94.17	94.53	94.43
Qwen3-14B	85.13	93.20	94.77	94.80	95.22
DeepSeek-R1-Distill-Qwen-14B	87.65	92.10	93.05	93.12	93.30

GPQA Diamond

Model	1k	2k	4k	6k	Full Attention
Qwen3-4B	39.61	51.20	55.20	55.90	56.19
Qwen3-8B	37.59	54.32	59.60	60.48	60.54
Qwen3-14B	44.54	59.72	63.76	64.20	65.25
DeepSeek-R1-Distill-Qwen-14B	51.26	56.79	56.41	57.48	57.80
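One way to read the tables is to ask, for each benchmark, which is the smallest budget that comes within a chosen tolerance of full attention. The sketch below does this for the Qwen3-4B rows, with the numbers copied from the tables above. The 1-point tolerance and the helper name `smallest_budget_within` are arbitrary illustrative choices.

```python
# Qwen3-4B rows from the tables: (budget, pass@1) pairs ordered from the
# smallest to the largest budget, followed by the full-attention score.
QWEN3_4B = {
    "AIME24": ([("2k", 55.42), ("4k", 68.75), ("6k", 70.94), ("8k", 72.50)], 71.25),
    "AIME25": ([("2k", 45.73), ("4k", 57.60), ("6k", 60.20), ("8k", 62.90)], 66.41),
    "MATH500": ([("1k", 84.80), ("2k", 92.20), ("4k", 93.60), ("6k", 93.60)], 93.93),
    "GPQA Diamond": ([("1k", 39.61), ("2k", 51.20), ("4k", 55.20), ("6k", 55.90)], 56.19),
}


def smallest_budget_within(rows, full, tolerance=1.0):
    """Return the first budget whose score is within `tolerance` points of full attention."""
    for label, score in rows:
        if full - score <= tolerance:
            return label
    return None  # no listed budget gets that close


summary = {name: smallest_budget_within(rows, full)
           for name, (rows, full) in QWEN3_4B.items()}
```

For Qwen3-4B this gives 6k on AIME24, 4k on MATH500 and GPQA Diamond, and no listed budget on AIME25, where the 8k result is still 3.51 points below full attention.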

📄 License

This project is licensed under the MIT license.
