Qwen3-8B-Base (Unsloth bnb 4-bit)
Qwen3-8B-Base belongs to Qwen3, the latest generation of large language models in the Qwen series, which offers a comprehensive set of dense and mixture-of-experts (MoE) models built on significant improvements in training data, model architecture, and optimization techniques.
Downloads: 6,214
Release date: April 28, 2025
Model Overview
Qwen3-8B-Base is a pre-trained causal language model with 8.2 billion parameters and a native context length of 32,768 tokens, suitable for a wide range of language tasks.
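A minimal usage sketch with Hugging Face Transformers is shown below. The repository id is an assumption based on this page's title and may differ; the checkpoint is already bitsandbytes 4-bit quantized, so bitsandbytes must be installed, but no extra quantization arguments should be needed.

```python
# Minimal sketch: load the (assumed) 4-bit checkpoint and generate a short completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-8B-Base-unsloth-bnb-4bit"  # assumed repo id, taken from this page's title

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place layers on the available GPU(s)
)

# Base (non-instruct) model: plain text continuation, no chat template.
inputs = tokenizer("The three laws of thermodynamics are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```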
Model Features
Expanded high-quality pre-training corpus
Pre-trained on 36 trillion tokens across 119 languages and dialects, tripling the language coverage of Qwen2.5 with a richer mix of high-quality data.
Improvements in training technology and model architecture
Adopts a global-batch load-balancing loss (for the MoE models in the series) and QK layer normalization to improve training stability and overall performance; see the attention sketch at the end of this section.
Three-stage pre-training
The first stage focuses on language modeling and general knowledge acquisition, the second strengthens reasoning ability, and the third extends long-context understanding.
Hyperparameter tuning based on scaling laws
Key hyperparameters are tuned systematically through comprehensive scaling-law studies, yielding better training dynamics and final performance.
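To make the QK layer normalization feature above concrete, here is an illustrative PyTorch attention block that RMS-normalizes each query and key head before the dot product. This is a sketch of the general technique, not Qwen3's actual implementation; the module and layer names are hypothetical, and `nn.RMSNorm` requires PyTorch 2.4 or newer.

```python
# Illustrative sketch of QK-Norm (query/key layer normalization) in attention.
# Not Qwen3's actual code; it only shows where the normalization is applied.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # RMSNorm applied per head to queries and keys (the "QK layer normalization").
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Normalize queries and keys before computing attention scores,
        # which keeps the attention logits well scaled during training.
        q, k = self.q_norm(q), self.k_norm(k)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, d))
```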
Model Capabilities
Text generation
Language modeling
Multilingual support
Long context understanding
Logical reasoning
Use Cases
Natural language processing
Multilingual text generation
Generate high-quality multilingual text, suitable for scenarios such as translation and content creation.
Long document understanding
Process and understand long documents up to roughly 32k tokens, suitable for tasks such as document summarization and question answering (see the token-budget sketch below).
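Because the native window is about 32k tokens, a simple preflight check of a document's token count helps avoid silent truncation. A minimal sketch, reusing the assumed repository id from above; the constants and helper name are hypothetical.

```python
# Sketch: check a long document against the ~32k-token context window before generation.
from transformers import AutoTokenizer

MAX_CONTEXT = 32768          # native context length reported for Qwen3-8B-Base
RESERVED_FOR_OUTPUT = 1024   # leave room for the generated summary

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-8B-Base-unsloth-bnb-4bit")  # assumed repo id

def fits_in_context(document: str) -> bool:
    """Return True if a summarization prompt built from `document` leaves room for generation."""
    prompt = f"Summarize the following document:\n\n{document}\n\nSummary:"
    n_tokens = len(tokenizer.encode(prompt))
    return n_tokens <= MAX_CONTEXT - RESERVED_FOR_OUTPUT
```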
Coding and STEM
Code generation and completion
Generate and complete code snippets, supporting multiple programming languages.
Logical reasoning and mathematical calculation
Solve complex logical reasoning and mathematical calculation problems.