Z1 7B
Z1 is a large language model based on Qwen2.5-Coder-7B-Instruct, focused on efficient reasoning through shifted thinking.
Downloads 125
Release Time: 4/1/2025
Model Overview
This model achieves efficient reasoning through a shifted thinking pattern and is particularly well suited to code generation and complex problem-solving tasks.
Model Features
Shifted Thinking Reasoning
Achieves more efficient reasoning through its distinctive shifted thinking pattern
Code Optimization
Specially trained for code generation and optimization tasks
Efficient Test-time Scaling
Supports efficient scaling of compute at test time
Model Capabilities
Text Generation
Code Generation
Complex Problem Solving
Reasoning Tasks
Use Cases
Programming Assistance
Code Generation
Generate code from natural language descriptions (see the sketch after this list)
Code Optimization
Optimize and improve existing code
Problem Solving
Complex Reasoning
Solve complex problems requiring multi-step reasoning
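A minimal quick-start sketch for the code-generation use case, assuming the standard transformers chat-template workflow of a Qwen2.5-Coder-based instruct model; the prompt and generation settings below are illustrative, not taken from the model card:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "efficientscaling/Z1-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" assumes accelerate is installed
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Describe the desired code in natural language (illustrative prompt)
messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))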
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
library_name: transformers
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
Z1: Efficient Test-time Scaling with Code
Train Large Language Model to Reason with Shifted Thinking
[📜 Paper] •
[🤗 HF Models] •
[🐱 GitHub]
Model Details
To begin with the shifted thinking mode, please refer to https://github.com/efficientscaling/Z1.
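As a rough conceptual sketch of the shifted thinking mode (distilled from the Gradio demo below, not an official API): the model first spends a capped thinking budget, and if decoding is cut off by that budget, a rethinking hint is appended and the model produces the final answer with the full budget. The generate callable and budget arguments here are hypothetical placeholders:

from typing import Callable, Tuple

RETHINKING_HINT = "\n\nI overthought it, the final answer should be:\n\n"

def shifted_thinking(generate: Callable[[str, int], Tuple[str, bool]],
                     prompt: str,
                     thinking_budget: int,
                     total_budget: int) -> str:
    # `generate(text, max_tokens)` is a hypothetical helper returning the
    # completion and whether decoding stopped because the token limit was hit.
    thought, truncated = generate(prompt, thinking_budget)
    if not truncated:
        # The model finished within the thinking budget; nothing to shift.
        return thought
    # Shift out of the thinking phase: append the hint and ask for the answer.
    answer, _ = generate(prompt + thought + RETHINKING_HINT, total_budget)
    return thought + RETHINKING_HINT + answer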
Evaluation
Gradio Demo
import copy
from typing import List
from dataclasses import dataclass

import gradio as gr
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

BOX = r"\boxed{}"
ANSWER_WITH_BOX = f"\n\nI overthought it, the final answer in {BOX} should be:\n\n"
ANSWER_WITHOUT_BOX = "\n\nI overthought it, the final answer should be:\n\n"

model_name = "efficientscaling/Z1-7B"


@dataclass
class ThinkingLLM(LLM):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def thinking_generate(self, prompts: List[str], sampling_params: SamplingParams = None, max_tokens_for_thinking: int = None):
        # A SamplingParams object must be provided; its max_tokens is the overall
        # budget, while max_tokens_for_thinking caps the first (thinking) pass.
        if sampling_params is None:
            raise ValueError("sampling_params can't be None!")

        all_max_tokens = sampling_params.max_tokens
        # Override max_tokens with the thinking budget for the first pass
        sampling_params.max_tokens = max_tokens_for_thinking
        print(f"All tokens: {all_max_tokens}")
        print(f"Tokens for thinking: {max_tokens_for_thinking}")

        trajectories = self.generate(prompts, sampling_params)

        rethinking_str = ANSWER_WITHOUT_BOX
        sampling_params.max_tokens = all_max_tokens

        answers = copy.deepcopy(trajectories)

        unfinished_id = []
        thinking_token = 0
        new_prompts = []
        for id, traj in enumerate(trajectories):
            # Trajectories cut off by the thinking budget get a second, full-budget
            # pass with the rethinking hint appended (the shifted answer phase)
            if traj.outputs[0].finish_reason == 'length':
                unfinished_id.append(id)
                new_prompts.append(prompts[id] + traj.outputs[0].text + rethinking_str)
            thinking_token += len(traj.outputs[0].token_ids)

        avg_thinking_token = thinking_token / len(prompts)

        if new_prompts:
            print(new_prompts[0])
            o = self.generate(
                new_prompts,
                sampling_params=sampling_params,
            )
            for i, uid in enumerate(unfinished_id):
                answers[uid] = o[i]

        return new_prompts, answers


def generate_text(prompt, max_tokens_for_thinking, max_tokens, temperature, top_p):
    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        skip_special_tokens=False,
    )

    trajectories, outputs = llm.thinking_generate(
        [prompt],
        sampling_params,
        max_tokens_for_thinking=max_tokens_for_thinking,
    )

    # If the answer was shifted, show the shifted prompt (prompt + thinking +
    # rethinking hint) followed by the final answer; otherwise the single pass
    if trajectories:
        return trajectories[0] + '\n\n' + outputs[0].outputs[0].text
    return outputs[0].outputs[0].text


llm = ThinkingLLM(
    model=model_name,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.96,
)

with gr.Blocks() as demo:
    gr.Markdown("# Reason with shifted thinking")

    with gr.Row():
        with gr.Column():
            prompt_input = gr.Textbox(
                label="Prompt",
                placeholder="Input",
                lines=5,
            )
            max_tokens_for_thinking_input = gr.Slider(
                label="shifted_thinking_window_size",
                minimum=1,
                maximum=32786,
                value=4000,
                step=1,
            )
            max_tokens_input = gr.Slider(
                label="all_max_tokens",
                minimum=1,
                maximum=32786,
                value=32786,
                step=1,
            )
            temperature_input = gr.Slider(
                label="Temperature",
                minimum=0.0,
                maximum=2.0,
                value=0,
                step=0.1,
            )
            top_p_input = gr.Slider(
                label="Top-p",
                minimum=0.0,
                maximum=1.0,
                value=1,
                step=0.01,
            )
            generate_button = gr.Button("Generate")

        with gr.Column():
            output_text = gr.Textbox(
                label="Shifted Thinking Window",
                placeholder="Text is here...",
                lines=10,
            )

    generate_button.click(
        fn=generate_text,
        inputs=[prompt_input, max_tokens_for_thinking_input, max_tokens_input, temperature_input, top_p_input],
        outputs=output_text,
    )

if __name__ == "__main__":
    demo.launch()
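The ThinkingLLM wrapper can also be called directly, without the Gradio UI. A hedged usage sketch, reusing the `llm` instance constructed in the demo above (the prompt and token budgets are illustrative):

params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=8192,            # overall budget (illustrative)
    skip_special_tokens=False,
)
shifted_prompts, results = llm.thinking_generate(
    ["Write a Python function that returns the n-th Fibonacci number."],
    sampling_params=params,
    max_tokens_for_thinking=4000,   # thinking budget before shifting
)
print(results[0].outputs[0].text)

Saved as a script, the demo itself launches the Gradio interface locally via demo.launch() when run with Python.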
Citation
@misc{yu2025efficientscaling,
title={Z1: Efficient Test-time Scaling with Code},
author={Zhaojian Yu and Yinghao Wu and Yilun Zhao and Arman Cohan and Xiao-Ping Zhang},
year={2025},
eprint={2504.00810},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.00810},
}