Qwen3-1.7B-Base
Qwen3-1.7B-Base is a powerful causal language model in the Qwen3 series, offering high-performance language processing capabilities.
Quick Start
The code for Qwen3 has been integrated into the latest Hugging Face `transformers`, and we recommend that you use the latest version of `transformers`.

With `transformers<4.51.0`, you will encounter the following error:

`KeyError: 'qwen3'`
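As a minimal usage sketch (assuming `transformers>=4.51.0` and the Hugging Face model ID `Qwen/Qwen3-1.7B-Base`), the base model can be loaded and used for plain text completion:

```python
# Minimal text-completion sketch for the base (non-instruct) model.
# Assumes transformers>=4.51.0; device_map="auto" additionally requires accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B-Base"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # let transformers pick the checkpoint's dtype
    device_map="auto",    # place weights on the available device(s)
)

# Base models are trained for continuation, not chat, so prompt with plain text.
prompt = "The key advantages of mixture-of-experts models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this is a pre-trained base checkpoint, so it performs free-form continuation rather than instruction following.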
Features
Qwen3 Highlights
Qwen3 is the latest generation of large language models in the Qwen series, providing a comprehensive set of dense and mixture-of-experts (MoE) models. Based on extensive improvements in training data, model architecture, and optimization techniques, Qwen3 offers the following key enhancements compared to the previously released Qwen2.5:
- Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages, tripling the language coverage of Qwen2.5. It uses a much richer mix of high-quality data, including coding, STEM, reasoning, books, multilingual, and synthetic data.
- Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, such as the global-batch load balancing loss for MoE models and QK layernorm for all models, which improve stability and overall performance.
- Three-stage Pre-training: Stage 1 focuses on broad language modeling and general knowledge acquisition; Stage 2 improves reasoning skills in areas like STEM, coding, and logical reasoning; Stage 3 enhances long-context comprehension by extending the training sequence length up to 32k tokens.
- Scaling-Law-Guided Hyperparameter Tuning: Through comprehensive scaling-law studies across the three-stage pre-training pipeline, Qwen3 systematically tunes critical hyperparameters, such as the learning-rate scheduler and batch size, separately for dense and MoE models, resulting in better training dynamics and final performance across different model scales.
Model Overview
Qwen3-1.7B-Base has the following features:
| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Stage | Pretraining |
| Number of Parameters | 1.7B |
| Number of Parameters (Non-Embedding) | 1.4B |
| Number of Layers | 28 |
| Number of Attention Heads (GQA) | 16 for Q and 8 for KV |
| Context Length | 32,768 |
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
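As an illustrative check (the field names below follow the standard `transformers` config conventions and are an assumption, as is the model ID), the architecture details in the table can be read back from the model configuration without downloading the weights:

```python
# Sketch: inspect the architecture from the Hugging Face config only.
# Assumes the model ID Qwen/Qwen3-1.7B-Base and standard transformers config field names.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-1.7B-Base")

print(config.num_hidden_layers)        # expected: 28 layers
print(config.num_attention_heads)      # expected: 16 query heads (GQA)
print(config.num_key_value_heads)      # expected: 8 key/value heads (GQA)
print(config.max_position_embeddings)  # expected: 32768 context length
```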
Documentation
Evaluation & Performance
Detailed evaluation results are reported in this blog.
Citation
If you find our work helpful, feel free to cite us.
@misc{qwen3,
title = {Qwen3},
url = {https://qwenlm.github.io/blog/qwen3/},
author = {Qwen Team},
month = {April},
year = {2025}
}
License
This project is licensed under the Apache-2.0 license.