🚀 Qwen2.5 7B Instruct GGUF - llamafile
Run LLMs locally with a single file - No installation required!
Our goal is to make open-source large language models more accessible to both developers and end users. We achieve this by combining llama.cpp with Cosmopolitan Libc into one framework. This collapses the complexity of LLMs into a single-file executable (a "llamafile") that runs locally on most computers without installation.
🚀 Quick Start
The easiest way to try it is to download our example llamafile. With llamafile, all inference occurs locally, and no data leaves your computer.
- Download the llamafile.
- Open your computer's terminal.
- If you're using macOS, Linux, or BSD, grant permission for your computer to execute the new file. (Do this only once.)
```bash
chmod +x qwen2.5-7b-instruct-q8_0.gguf
```
- If on Windows, rename the file by adding ".exe" at the end.
- Run the llamafile. For example:
```bash
./qwen2.5-7b-instruct-q8_0.gguf
```
- Your browser should open automatically and display a chat interface. If not, open your browser and go to http://localhost:8080.
- When done chatting, return to your terminal and press Control-C to shut down llamafile.
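Beyond the browser chat, llamafile's built-in server also exposes an OpenAI-compatible HTTP API. Below is a minimal sketch for querying it from a second terminal, assuming the default port 8080; the "model" field is illustrative, since a single-model server typically ignores it:
```bash
# Send a chat completion request to the local llamafile server.
# Assumes the default port 8080; no data leaves your machine.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'
```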
⚠️ Important Note
llamafile is still under active development. Some features may not work as described in the most recent documentation.
✨ Features
- Single-file Execution: Run LLMs locally with just one file, no installation needed.
- Local Inference: All data processing happens on your computer, ensuring data privacy.
📦 Installation
No installation is required. Just download the llamafile and run it.
💻 Usage Examples
Basic Usage
The basic steps to use the llamafile are as follows:
```bash
chmod +x qwen2.5-7b-instruct-q8_0.gguf
./qwen2.5-7b-instruct-q8_0.gguf
```
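Any additional flags are passed through to the embedded llama.cpp engine, so the usual server options apply. For example, a sketch using standard llama.cpp server flags (exact availability can vary by llamafile version):
```bash
# Serve on a non-default port and keep all layers on the CPU (-ngl 0).
./qwen2.5-7b-instruct-q8_0.gguf --port 8081 -ngl 0
```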
Advanced Usage
For a chatbot-like experience, start llama-cli in conversation mode:
```bash
./llama-cli -m <gguf-file-path> \
  -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
  -fa -ngl 80 -n 512
```
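For reference, per llama.cpp's documentation: -co colorizes output, -cnv enables conversation (chat) mode, -p sets the system prompt, -fa enables Flash Attention, -ngl 80 offloads up to 80 layers to the GPU, and -n 512 caps each response at 512 generated tokens.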
📚 Documentation
How to Use (adapted from the GitHub README)
The steps are described in the "Quick Start" section above.
Settings for Qwen2.5 7B Instruct GGUF llamafiles
- Model creator: Qwen
- Quantized GGUF files used: Qwen/Qwen2.5-7B-Instruct-GGUF
  - Commit message: "upload fp16 weights"
  - Commit hash: bb5d59e06d9551d752d08b292a50eb208b07ab1f
- llamafile version used: Mozilla-Ocho/llamafile
  - Commit message: "Merge pull request #687 from Xydane/main Add Support for DeepSeek-R1 models"
  - Commit hash: 29b5f27172306da39a9c70fe25173da1b1564f82
.args content format (example):
```
-m
qwen2.5-7b-instruct-q8_0.gguf
...
```
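The .args file embedded in a llamafile supplies default command-line arguments, one per line; per the upstream llamafile README, the trailing ... line marks where any arguments given at run time are inserted. For context, a llamafile like this one is typically assembled by appending the GGUF weights and the .args file to the llamafile launcher using the project's zipalign tool. A minimal sketch (file names and paths are assumptions):
```bash
# Start from the generic llamafile launcher binary, then embed the GGUF
# weights and the default-arguments file as a zip payload.
# File names and paths here are illustrative.
cp /usr/local/bin/llamafile qwen2.5-7b-instruct-q8_0.llamafile
zipalign -j0 qwen2.5-7b-instruct-q8_0.llamafile \
  qwen2.5-7b-instruct-q8_0.gguf \
  .args
```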
Qwen2.5-7B-Instruct-GGUF Introduction
Qwen2.5 is the latest series of Qwen large language models. It brings significant improvements over Qwen2 in knowledge, coding, mathematics, instruction following, long-text generation, structured data understanding, and more.
| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Stage | Pretraining & Post-training |
| Architecture | transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias |
| Number of Parameters | 7.61B |
| Number of Parameters (Non-Embedding) | 6.53B |
| Number of Layers | 28 |
| Number of Attention Heads (GQA) | 28 for Q and 4 for KV |
| Context Length | Full 32,768 tokens; generation up to 8,192 tokens |
| Quantization | q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0 |
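As a rough illustration of what the GQA layout saves (assuming a head dimension of 128, which is not listed above): an fp16 KV cache costs 2 × 28 layers × 4 KV heads × 128 dims × 2 bytes ≈ 56 KiB per token, or about 1.75 GiB at the full 32,768-token context; caching all 28 query heads instead would take roughly 7× as much.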
For more details, refer to blog, GitHub, and Documentation.
Quickstart
Clone llama.cpp and install it following the official guide. You can also manually download the GGUF file, or use huggingface-cli:
- Install:
```bash
pip install -U huggingface_hub
```
- Download:
```bash
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include "qwen2.5-7b-instruct-q5_k_m*.gguf" --local-dir . --local-dir-use-symlinks False
```
- (Optional) Merge split files:
```bash
./llama-gguf-split --merge qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf qwen2.5-7b-instruct-q5_k_m.gguf
```
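With a local GGUF file in hand, you can chat with it using the same llama-cli invocation shown under Advanced Usage above, for example:
```bash
# Run the downloaded (and merged, if split) q5_k_m file in conversation mode.
./llama-cli -m qwen2.5-7b-instruct-q5_k_m.gguf \
  -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
  -fa -ngl 80 -n 512
```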
Evaluation & Performance
Detailed evaluation results are reported in the 📑 blog. Benchmark results for quantized models against bfloat16 models are here. GPU memory requirements and throughput results are here.
🔧 Technical Details
This llamafile combines llama.cpp with Cosmopolitan Libc to package Qwen2.5-7B-Instruct-GGUF as a single-file executable for local LLM inference. The underlying model is a transformer architecture with RoPE, SwiGLU, RMSNorm, and attention QKV bias.
📄 License
This project is licensed under the Apache 2.0 License.
Citation
If you find our work helpful, you can cite us as follows:
```bibtex
@misc{qwen2.5,
title = {Qwen2.5: A Party of Foundation Models},
url = {https://qwenlm.github.io/blog/qwen2.5/},
author = {Qwen Team},
month = {September},
year = {2024}
}
@article{qwen2,
title={Qwen2 Technical Report},
author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal={arXiv preprint arXiv:2407.10671},
year={2024}
}
```