Phi-3 Small-8K-Instruct ONNX CUDA models
This repository hosts optimized versions of Phi-3-small-8k-instruct. It aims to accelerate inference with ONNX Runtime on machines equipped with NVIDIA GPUs.
Phi-3 Small is a 7B parameter, lightweight, state-of-the-art open model. It's trained with Phi-3 datasets, which consist of synthetic data and filtered publicly available website data, emphasizing high-quality and reasoning-dense properties. The model belongs to the Phi-3 family, with the small version having two variants: 8K and 128K, representing the context lengths (in tokens) they can support.
The base model has undergone a post-training process that includes supervised fine-tuning and direct preference optimization for instruction following and safety measures. When evaluated against benchmarks for common sense, language understanding, math, code, long context, and logical reasoning, Phi-3-Small-8K-Instruct demonstrated robust and state-of-the-art performance among models of the same size and the next size up.
The optimized variants of the Phi-3 Small models are published in ONNX format and can run with ONNX Runtime on GPUs across various devices, including server platforms, Windows, and Linux.
Quick Start
To support the Phi-3 models across a range of devices, platforms, and execution provider (EP) backends, we introduce a new API to wrap several aspects of generative AI inferencing. This API makes it easy to integrate LLMs into your app. To run the early version of these models with ONNX, follow the steps here. You can also test the models with this chat app.
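As a rough illustration of that API, the sketch below uses the onnxruntime-genai Python package to load a downloaded model folder and generate a reply. The folder name, prompt template, and search options are assumptions, and the exact method names can differ between onnxruntime-genai releases, so treat this as a starting point rather than the definitive flow.

```python
# Minimal generation sketch with onnxruntime-genai (API details vary by release).
import onnxruntime_genai as og

model = og.Model("cuda-fp16")          # path to the downloaded ONNX model folder (assumed)
tokenizer = og.Tokenizer(model)

# Phi-3-style chat prompt; the exact template for this model is an assumption here.
prompt = "<|user|>\nWhat is the golden ratio?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```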
Features
ONNX Models
Here are some of the optimized configurations we've added:
- ONNX model for FP16 CUDA: Designed for NVIDIA GPUs.
- ONNX model for INT4 CUDA: An ONNX model for NVIDIA GPUs using int4 quantization via RTN (round-to-nearest); a minimal sketch of the idea follows this list.
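To illustrate what RTN quantization does, the sketch below applies block-wise round-to-nearest int4 quantization to a weight array with NumPy. It is only a demonstration of the idea: the block size, symmetric scaling, and absence of zero points are assumptions, not the exact scheme used to produce the published model.

```python
# Illustrative block-wise round-to-nearest (RTN) int4 quantization.
# Assumptions: block size 32, symmetric scales, no zero points; the published
# model may use a different configuration.
import numpy as np

def rtn_int4_quantize(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array to int4 values per block; return ints and scales."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    scales[scales == 0] = 1.0                             # guard against all-zero blocks
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def rtn_int4_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int4 values and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = rtn_int4_quantize(w)
print("max abs error:", np.abs(w - rtn_int4_dequantize(q, s)).max())
```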
Note: If you're short on disk space, you can use the Hugging Face CLI to download sub-folders instead of all models. The FP16 model is recommended for larger batch sizes, while the INT4 model optimizes performance for lower batch sizes.
Example:
# Download just the FP16 model
$ huggingface-cli download microsoft/Phi-3-small-8k-instruct-onnx-cuda --include cuda-fp16/* --local-dir . --local-dir-use-symlinks False
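A similar command can fetch only the INT4 model. The sub-folder name below is an assumption and may differ, so check the repository's file listing first.
# Download just the INT4 model (sub-folder name assumed)
$ huggingface-cli download microsoft/Phi-3-small-8k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir . --local-dir-use-symlinks False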
Documentation
Hardware Supported
The ONNX models are tested on:
- 1 A100 GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
Minimum Configuration Required:
Model Description

| Property | Details |
|----------|---------|
| Model Type | ONNX |
| Developed by | Microsoft |
| Language(s) (NLP) | Python, C, C++ |
| License | MIT |
| Model Description | This is a conversion of the Phi-3 Small-8K-Instruct model for ONNX Runtime inference. |
Additional Details
Performance Metrics
Phi-3 Small-8K-Instruct performs better with ONNX Runtime than with PyTorch for all batch size and prompt length combinations. For FP16 CUDA, ORT is up to 4X faster than PyTorch, while for INT4 CUDA it is up to 10.9X faster.
The table below shows the average throughput of the first 256 tokens generated (tps) for FP16 and INT4 precisions on CUDA as measured on 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4.

| Batch Size, Prompt Length | ORT FP16 CUDA | PyTorch Eager FP16 CUDA | Speed Up ORT/PyTorch |
|---------------------------|---------------|-------------------------|----------------------|
| 1, 16 | 74.62 | 16.81 | 4.44 |
| 4, 16 | 290.36 | 65.56 | 4.43 |
| 16, 16 | 1036.93 | 267.33 | 3.88 |

| Batch Size, Prompt Length | ORT INT4 CUDA | PyTorch Eager INT4 CUDA | Speed Up ORT/PyTorch |
|---------------------------|---------------|-------------------------|----------------------|
| 1, 16 | 140.68 | 12.93 | 10.88 |
| 4, 16 | 152.90 | 44.04 | 3.47 |
| 16, 16 | 582.07 | 160.57 | 3.62 |
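For context on how a number like this can be measured, the sketch below times generation of 256 new tokens from a short prompt and reports tokens per second with onnxruntime-genai. It is not the harness used to produce the tables above; the prompt construction, search options, and API details are assumptions.

```python
# Rough throughput measurement: average tokens/sec over the first 256 generated tokens.
# Not the benchmark harness behind the tables above; API details vary by release.
import time
import onnxruntime_genai as og

model = og.Model("cuda-fp16")                        # downloaded ONNX model folder (assumed)
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(min_length=16 + 256, max_length=16 + 256)
params.input_ids = tokenizer.encode("Hello " * 16)   # roughly a 16-token prompt

generator = og.Generator(model, params)
start = time.perf_counter()
new_tokens = 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_tokens += 1
elapsed = time.perf_counter() - start
print(f"{new_tokens / elapsed:.2f} tokens per second")
```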
Package Versions

| Pip package name | Version |
|------------------|---------|
| torch | 2.3.0 |
| triton | 2.3.0 |
| onnxruntime-gpu | 1.18.0 |
| transformers | 4.40.2 |
| bitsandbytes | 0.43.1 |
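To match this environment, the listed versions can be pinned with pip. This is a plain install command; CUDA driver setup and any extra package indexes needed on a given system are left out, and the onnxruntime-genai package used in the sketches above is installed separately.
# Install the package versions listed above
$ pip install torch==2.3.0 triton==2.3.0 onnxruntime-gpu==1.18.0 transformers==4.40.2 bitsandbytes==0.43.1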
License
The model is released under the MIT license.
Appendix
Model Card Contact
parinitarahi, kvaishnavi, natke
Contributors
Kunal Vaishnavi, Sunghoon Choi, Yufeng Li, Tianlei Wu, Sheetal Arun Kadam, Rui Ren, Baiju Meswani, Natalie Kershaw, Parinita Rahi