Phi-3 Mini-128K-Instruct ONNX models
This repository hosts optimized versions of Phi-3-mini-128k-instruct to accelerate inference with ONNX Runtime. Phi-3 Mini is a lightweight, state-of-the-art open model. It is built on the datasets used for Phi-2, with a focus on high-quality, reasoning-dense data. The model belongs to the Phi-3 family, with 4K and 128K context-length variants. It has undergone rigorous enhancements, including supervised fine-tuning and direct preference optimization.
Quick Start
To get started with Phi-3 quickly, use the newly introduced ONNX Runtime Generate() API. See here for instructions on how to run it.
⨠Features
- Optimized for ONNX: The models are published in ONNX format to run with ONNX Runtime on CPU and GPU across various devices, including servers, desktops, and mobile CPUs.
- DirectML Support: DirectML support enables hardware acceleration on Windows devices across AMD, Intel, and NVIDIA GPUs.
- New API: A new API is introduced to wrap generative AI inferencing, making it easy to integrate LLMs into your app.
Usage Examples
Basic Usage
python model-qa.py -m /*{YourModelPath}*/onnx/cpu_and_mobile/phi-3-mini-4k-instruct-int4-cpu -k 40 -p 0.95 -t 0.8 -r 1.0
*Input:* <|user|>Tell me a joke<|end|><|assistant|>
*Output:* Why don't scientists trust atoms?
Because they make up everything!
This joke plays on the double meaning of "make up." In science, atoms are the fundamental building blocks of matter, literally making up everything. However, in a colloquial sense, "to make up" can mean to fabricate or lie, hence the humor.
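The input above wraps the user message in chat markers, and the command passes four sampling flags. The sketch below assembles both in plain Python. It assumes that `-k`, `-p`, `-t`, and `-r` stand for top-k, top-p, temperature, and repetition penalty, which this README does not spell out; the function names and dictionary keys are hypothetical and are not part of any documented API.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the chat markers shown in the example input."""
    return f"<|user|>{user_message}<|end|><|assistant|>"


def build_search_options(top_k: int = 40, top_p: float = 0.95,
                         temperature: float = 0.8,
                         repetition_penalty: float = 1.0) -> dict:
    """Collect sampling settings after basic range checks.

    The defaults mirror the values in the example command; the key names
    are illustrative assumptions.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in the range (0, 1]")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    return {
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "repetition_penalty": repetition_penalty,
    }


# Example: reproduce the prompt and settings from the command above.
print(build_prompt("Tell me a joke"))
print(build_search_options())
```

Keeping prompt construction separate from the sampling settings makes it easy to reuse the same options across several prompts.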
Documentation
ONNX Models
Here are some of the optimized configurations:
- ONNX model for int4 DML: For AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
- ONNX model for fp16 CUDA: For NVIDIA GPUs.
- ONNX model for int4 CUDA: For NVIDIA GPUs using int4 quantization via RTN.
- ONNX model for int4 CPU and Mobile: For CPU and Mobile, using int4 quantization via RTN. There are two versions: Acc=1 for improved accuracy and Acc=4 for improved performance. For mobile devices, we recommend using the model with acc-level-4.
More updates on AMD, and additional optimizations on CPU and Mobile will be added with the official ORT 1.18 release in early May.
Hardware Supported
The models are tested on:
- GPU SKU: RTX 4090 (DirectML)
- GPU SKU: 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
- CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)
- Mobile SKU: Samsung Galaxy S21
Minimum Configuration Required:
- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
- CUDA: NVIDIA GPU with Compute Capability >= 7.0
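The configurations and minimum requirements listed above can be read as a simple decision rule. The sketch below is one possible ordering, not an official selection procedure: the function name, its parameters, and the preference for CUDA over DirectML when both are available are assumptions made for illustration. The returned strings are just the variant names from the list of ONNX models.

```python
from typing import Optional

# Minimum CUDA compute capability stated in the requirements above.
MIN_CUDA_COMPUTE_CAPABILITY = 7.0


def choose_variant(os_name: str,
                   has_nvidia_gpu: bool = False,
                   compute_capability: Optional[float] = None,
                   has_dx12_gpu: bool = False,
                   is_mobile: bool = False) -> str:
    """Return the name of a model variant that fits the described hardware."""
    if is_mobile:
        # The README recommends acc-level-4 on mobile devices.
        return "int4 CPU and Mobile (acc-level-4)"
    if (has_nvidia_gpu and compute_capability is not None
            and compute_capability >= MIN_CUDA_COMPUTE_CAPABILITY):
        # Assumption: prefer the int4 CUDA build; fp16 CUDA is the alternative.
        return "int4 CUDA"
    if os_name.lower() == "windows" and has_dx12_gpu:
        return "int4 DML"
    # Fall back to the CPU build when no supported GPU is described.
    return "int4 CPU and Mobile"


print(choose_variant("Windows", has_dx12_gpu=True))
print(choose_variant("Linux", has_nvidia_gpu=True, compute_capability=8.0))
```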
Model Description

| Property | Details |
|---|---|
| Developed by | Microsoft |
| Model Type | ONNX |
| Language(s) (NLP) | Python, C, C++ |
| License | MIT |
| Model Description | This is a conversion of the Phi-3 Mini-128K-Instruct model for ONNX Runtime inference. |

Additional Details
How to Get Started with the Model
To run the Phi-3 models across various devices and platforms with different execution provider backends, we introduce a new API. For running the early versions of these models with ONNX Runtime, follow the steps here.
Performance Metrics
Phi-3 Mini-128K-Instruct performs better in ONNX Runtime than in PyTorch for all batch size and prompt length combinations. For FP16 CUDA, ORT is up to 5X faster than PyTorch, and for INT4 CUDA it is up to 9X faster.
The tables below show the average throughput, in tokens per second (tps), of the first 256 tokens generated for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4.

| Batch Size, Prompt Length | ORT FP16 CUDA | PyTorch Eager FP16 CUDA | FP16 CUDA Speed Up (ORT/PyTorch) |
|---|---|---|---|
| 1, 16 | 134.46 | 25.35 | 5.30 |
| 1, 64 | 132.21 | 25.69 | 5.15 |
| 1, 256 | 124.51 | 25.77 | 4.83 |
| 1, 1024 | 110.03 | 25.73 | 4.28 |
| 1, 2048 | 96.93 | 25.72 | 3.77 |
| 1, 4096 | 62.12 | 25.66 | 2.42 |
| 4, 16 | 521.10 | 101.31 | 5.14 |
| 4, 64 | 507.03 | 101.66 | 4.99 |
| 4, 256 | 459.47 | 101.15 | 4.54 |
| 4, 1024 | 343.60 | 101.09 | 3.40 |
| 4, 2048 | 264.81 | 100.78 | 2.63 |
| 4, 4096 | 158.00 | 77.98 | 2.03 |
| 16, 16 | 1689.08 | 394.19 | 4.28 |
| 16, 64 | 1567.13 | 394.29 | 3.97 |
| 16, 256 | 1232.10 | 405.30 | 3.04 |
| 16, 1024 | 680.61 | 294.79 | 2.31 |
| 16, 2048 | 350.77 | 203.02 | 1.73 |
| 16, 4096 | 192.36 | OOM | |


| Batch Size, Prompt Length | PyTorch Eager INT4 CUDA | INT4 CUDA Speed Up (ORT/PyTorch) |
|---|---|---|
| 1, 16 | 25.35 | 8.89 |
| 1, 64 | 25.69 | 8.58 |
| 1, 256 | 25.77 | 7.69 |
| 1, 1024 | 25.73 | 6.34 |
| 1, 2048 | 25.72 | 5.24 |
| 1, 4096 | 25.66 | 2.97 |
| 4, 16 | 101.31 | 2.82 |
| 4, 64 | 101.66 | 2.77 |
| 4, 256 | 101.15 | 2.64 |
| 4, 1024 | 101.09 | 2.20 |
| 4, 2048 | 100.78 | 1.84 |
| 4, 4096 | 77.98 | 1.62 |
| 16, 16 | 394.19 | 2.52 |
| 16, 64 | 394.29 | 2.41 |
| 16, 256 | 405.30 | 2.00 |
| 16, 1024 | 294.79 | 1.79 |
| 16, 2048 | 203.02 | 1.81 |
| 16, 4096 | OOM | |

Note: PyTorch compile and Llama.cpp currently do not support the Phi-3 Mini-128K-Instruct model.
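The speed-up columns are the ratio of ORT throughput to PyTorch throughput. The sketch below recomputes that ratio for a few FP16 rows copied from the table above. The dictionary layout and the function name are illustrative choices, and an OOM entry is represented as `None`.

```python
from typing import Optional

# (batch size, prompt length) -> (ORT FP16 tps, PyTorch Eager FP16 tps).
# Values are copied from the FP16 table; None marks an out-of-memory run.
FP16_THROUGHPUT = {
    (1, 16): (134.46, 25.35),
    (4, 16): (521.10, 101.31),
    (16, 16): (1689.08, 394.19),
    (16, 4096): (192.36, None),
}


def speed_up(ort_tps: float, pytorch_tps: Optional[float]) -> Optional[float]:
    """Return ORT throughput divided by PyTorch throughput, rounded to 2 places.

    Returns None when the PyTorch run has no measurement (OOM).
    """
    if pytorch_tps is None:
        return None
    return round(ort_tps / pytorch_tps, 2)


for (batch, prompt_len), (ort, torch_tps) in FP16_THROUGHPUT.items():
    print(batch, prompt_len, speed_up(ort, torch_tps))
```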
Package Versions

| Pip package name | Version |
|---|---|
| torch | 2.2.0 |
| triton | 2.2.0 |
| onnxruntime-gpu | 1.18.0 |
| onnxruntime-genai | 0.2.0 |
| onnxruntime-genai-cuda | 0.2.0 |
| onnxruntime-genai-directml | 0.2.0 |
| transformers | 4.39.0 |
| bitsandbytes | 0.42.0 |

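To compare a local environment against the versions listed above, a small check built on the standard library's `importlib.metadata` may help. This is a minimal sketch: the `PINNED` dictionary and the `check_versions` helper are hypothetical names, and a missing package is reported rather than treated as an error, since not every package is needed on every platform.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table above.
PINNED = {
    "torch": "2.2.0",
    "triton": "2.2.0",
    "onnxruntime-gpu": "1.18.0",
    "onnxruntime-genai": "0.2.0",
    "onnxruntime-genai-cuda": "0.2.0",
    "onnxruntime-genai-directml": "0.2.0",
    "transformers": "4.39.0",
    "bitsandbytes": "0.42.0",
}


def check_versions(pinned: dict) -> dict:
    """Map each package name to (installed version or None, matches pinned)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu121" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for package, (installed, ok) in check_versions(PINNED).items():
    print(f"{package}: installed={installed} matches={ok}")
```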
Appendix - Activation Aware Quantization
AWQ works by identifying the top 1% most salient weights that are most important for maintaining accuracy and quantizing the remaining 99% of weights. This leads to less accuracy loss from quantization compared to many other quantization techniques. For more on AWQ, see here.
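As a rough illustration of the idea described above, the sketch below ranks input channels by average activation magnitude, leaves the weights of the most salient channels untouched, and applies round-to-nearest int4 quantization to the rest. It is a simplification under stated assumptions: all function names are hypothetical, the data is a toy example, and the published AWQ method protects salient weights through per-channel scaling rather than by storing them at higher precision.

```python
from typing import List, Set, Tuple


def rtn_int4(values: List[float]) -> List[float]:
    """Symmetric round-to-nearest int4 quantization, returned dequantized."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / 7.0  # map the largest magnitude to integer level 7
    return [round(v / scale) * scale for v in values]


def channel_salience(activations: List[List[float]]) -> List[float]:
    """Mean absolute activation per input channel over calibration samples."""
    n_samples = len(activations)
    n_channels = len(activations[0])
    return [sum(abs(sample[c]) for sample in activations) / n_samples
            for c in range(n_channels)]


def quantize_protecting_salient(
        weights: List[List[float]],
        activations: List[List[float]],
        keep_fraction: float = 0.01) -> Tuple[List[List[float]], Set[int]]:
    """Quantize each weight row, skipping the most salient input channels."""
    salience = channel_salience(activations)
    n_channels = len(salience)
    n_keep = max(1, round(n_channels * keep_fraction))
    ranked = sorted(range(n_channels), key=lambda c: salience[c], reverse=True)
    salient = set(ranked[:n_keep])

    result = []
    for row in weights:
        # Quantize only the non-salient weights of this row.
        rest = [row[c] for c in range(n_channels) if c not in salient]
        quantized = iter(rtn_int4(rest))
        result.append([row[c] if c in salient else next(quantized)
                       for c in range(n_channels)])
    return result, salient


# Toy example: channel 3 sees much larger activations than the others.
toy_weights = [[0.5, -1.0, 0.25, 3.0]]
toy_activations = [[0.1, 0.1, 0.1, 10.0], [0.2, 0.1, 0.3, 8.0]]
quantized, kept = quantize_protecting_salient(toy_weights, toy_activations, 0.25)
print(kept, quantized)
```

Because the salient channel is excluded before the scale is computed, its large weight does not stretch the quantization range of the remaining weights.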
Model Card Contact
parinitarahi, kvaishnavi, natke
Contributors
Kunal Vaishnavi, Sunghoon Choi, Yufeng Li, Akshay Sonawane, Sheetal Arun Kadam, Rui Ren, Edward Chen, Scott McKay, Ryan Hill, Emma Ning, Natalie Kershaw, Parinita Rahi, Patrice Vignola, Chai Chaoweeraprasit, Logan Iyer, Vicente Rivera, Jacques Van Rhyn
License
This project is licensed under the MIT license.