🚀 Experimental layer-wise quantization of Salesforce/Llama-xLAM-2-8b-fc-r
This project focuses on the experimental layer-wise quantization of the Salesforce/Llama-xLAM-2-8b-fc-r model, aiming to optimize the inference performance of large language models in resource-constrained environments.
Model Information
Property | Details |
---|---|
Base Model | Salesforce/Llama-xLAM-2-8b-fc-r |
Datasets | eaddario/imatrix-calibration |
Language | en |
License | cc-by-nc-4.0 |
Pipeline Tag | text-generation |
Tags | gguf, quant, experimental |
🚀 Quick Start
The experimental versions of the model are generated using a custom quantization method. Here is a high-level overview of the process (a command-level sketch follows the list):
- Convert the original model's tensors to GGUF F16*.
- Estimate the Perplexity score for the F16 model (baseline) using the wikitext-2-raw-v1 dataset, and save the logits.
- Generate an imatrix from selected calibration datasets.
- Determine tensor and layer Importance Score contributions using the modified version of `llama-imatrix`.
- Select an appropriate quant level for each tensor and quantize the model using `llama-quantize`.
- Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model.
- Keep versions with the best scores.
- Repeat until all desired quants are created.
*BF16 would be preferred, but Apple's GPUs don't support it yet, so any BF16 operations are executed on the CPU, making inference unacceptably slow. This is expected to change in the near term, but until then, if you are using Apple hardware, avoid models tagged BF16.
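For readers who want to reproduce a similar workflow with stock llama.cpp tooling, the sketch below mirrors those steps. It is a hedged outline, not the exact recipe used for this repo: file names are placeholders, and the flags (the KL-divergence options in particular) should be verified against the llama.cpp build you are using.

```bash
# 1. Convert the downloaded HF checkpoint to GGUF F16 (script ships with llama.cpp)
python convert_hf_to_gguf.py ./Llama-xLAM-2-8b-fc-r \
  --outtype f16 --outfile Llama-xLAM-2-8b-fc-r-F16.gguf

# 2. Baseline Perplexity on wikitext-2-raw-v1, saving the F16 logits for later KL Divergence runs
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-F16.gguf \
  -f wikitext-2-raw-v1.txt --kl-divergence-base logits-F16.kld

# 3. Generate an importance matrix from a calibration dataset
./llama-imatrix -m Llama-xLAM-2-8b-fc-r-F16.gguf -f calibration.txt -o imatrix.dat

# 4. Quantize (a single-type, "naive" pass shown here; see the LWQ example further down)
./llama-quantize --imatrix imatrix.dat \
  Llama-xLAM-2-8b-fc-r-F16.gguf Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf Q4_K_M

# 5. Score the quantized model against the saved F16 logits
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf \
  --kl-divergence-base logits-F16.kld --kl-divergence
```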
✨ Features
- Layer-wise Quantization: Inspired by Dumitru et al.'s Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels, different quantization types are applied to different tensors/layers.
- Optimized Inference: Aims to improve the inference performance of LLMs in resource-constrained environments.
- Custom Tools: Custom versions of `llama-imatrix` and `llama-quantize` are used to identify influential tensors and perform the quantization.
📚 Documentation
Original Model Introduction
Large Action Models (LAMs) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the brains of AI agents, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains.
This model release is for research purposes only.
The new xLAM-2 series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in multi-turn conversation and tool usage. Trained using our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions, our models achieve state-of-the-art performance on the BFCL and τ-bench benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials.
Experimental Version Production
An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but the focus has been primarily on quantization and pruning.
The method used to produce these experimental versions is covered in Squeezing Tensor Bits: the quest for smaller LLMs. At a high level, it involves using a custom version of `llama-imatrix` and `llama-quantize` to identify influential tensors, and quantizing the most important layers to higher bit precision and the less important ones to lower bits.
As of version b5125, `llama-quantize` can perform tensor-wide quantization (TWQ), whereby user-defined tensors are quantized at a specific level, or perform layer-wise quantization (LWQ) by selecting different quantization types per tensor/layer.
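As an illustration only, an LWQ invocation might look like the sketch below. The `--tensor-type` option name and its pattern=type syntax should be verified against your `llama-quantize` build, and the tensor/type choices shown are hypothetical rather than the recipe used for the models in this repo.

```bash
# Hypothetical layer-wise quantization: hold attention V and FFN down projections
# at higher precision while the rest of the model falls back to the base Q4_K_M mix.
./llama-quantize --imatrix imatrix.dat \
  --tensor-type attn_v=q6_k \
  --tensor-type ffn_down=q5_k \
  Llama-xLAM-2-8b-fc-r-F16.gguf Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf Q4_K_M
```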
The modified version of `llama-imatrix` generates useful statistics to guide the tensor selection process. `--show-statistics` will display:
- Σ(Bias): the sum of all activations over the tensor (i.e. the Importance Scores)
- Min & Max: minimum and maximum activation values
- μ & σ: activations' mean and standard deviation
- % Active: proportion of elements whose average activation exceeds a very small threshold (1e-6). Helpful to determine how alive/dormant the tensor is during inference
- N: number of activations in the tensor
- Entropy: entropy of the activation distribution, in bits (standard Shannon entropy measurement)
- E (norm): Normalized entropy.
- ZD Score: z-score distribution as described in 3.1 Layer Importance Scores in the Layer-Wise Quantization paper
- CosSim: cosine similarity between same type tensors with respect to the previous layer (i.e. blk.7.attn_k and blk.6.attn_k)
Please note that statistics are calculated for each individual tensor and should be used to compare between tensors of the same type only.
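For reference, Entropy and E (norm) follow the standard Shannon formulation. Assuming the per-element activation sums $a_i$ are normalised into a probability distribution over the tensor's $N$ elements, they can be read as:

$$
p_i = \frac{a_i}{\sum_{j=1}^{N} a_j}, \qquad
H = -\sum_{i=1}^{N} p_i \log_2 p_i, \qquad
E_{\text{norm}} = \frac{H}{\log_2 N}
$$

An $E_{\text{norm}}$ close to 1 indicates activations spread evenly across the tensor, while values near 0 indicate a few dominant elements.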
There's a pull request to merge these changes back into the core llama.cpp project. This may or may not ever happen, so until then, the modified version will be available on GitHub.
Testing and Comparison
For testing and comparison, models produced by Unsloth and Bartowski are normally used, but since neither provides GGUF versions of this model, all tests and comparisons are done against naive quantizations obtained by simply running `llama-quantize` with no further optimization.
All experimental versions were generated using an appropriate imatrix created from calibration datasets available at eaddario/imatrix-calibration.
🔧 Technical Details
Model Sizes
Model | Naive (GB) | Repo (GB) | Shrinkage |
---|---|---|---|
Llama-xLAM-2-8b-fc-r-IQ3_M | 3.78 | 3.69 | 2.4% |
Llama-xLAM-2-8b-fc-r-IQ3_S | 3.68 | 3.43 | 6.8% |
Llama-xLAM-2-8b-fc-r-IQ4_NL | 4.71 | 4.39 | 6.2% |
Llama-xLAM-2-8b-fc-r-Q3_K_L | 4.32 | 3.76 | 13.0% |
Llama-xLAM-2-8b-fc-r-Q3_K_M | 4.02 | 3.56 | 11.4% |
Llama-xLAM-2-8b-fc-r-Q3_K_S | 3.66 | 3.31 | 9.6% |
Llama-xLAM-2-8b-fc-r-Q4_K_M | 4.92 | 4.41 | 10.4% |
Llama-xLAM-2-8b-fc-r-Q4_K_S | 4.69 | 4.28 | 8.7% |
Llama-xLAM-2-8b-fc-r-Q5_K_M | 5.73 | 5.38 | 6.1% |
Llama-xLAM-2-8b-fc-r-Q5_K_S | 5.60 | 5.24 | 6.4% |
Llama-xLAM-2-8b-fc-r-Q6_K | 6.60 | 6.57 | 0.5% |
Llama-xLAM-2-8b-fc-r-Q8_0 | 8.54 | 7.73 | 9.5% |
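Shrinkage is the relative size reduction of the repo quant versus the naive one; for example, for the Q3_K_L variant:

$$
\text{Shrinkage} = \frac{4.32 - 3.76}{4.32} \times 100\% \approx 13.0\%
$$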
Perplexity and KL Divergence scores
Model | μPPL | 𝜌PPL | μKLD | RMS Δp |
---|---|---|---|---|
Llama-xLAM-2-8b-fc-r-IQ3_M | 8.471225 ±0.059374 | 98.14% | 0.096730 ±0.000436 | 9.339 ±0.048 |
Llama-xLAM-2-8b-fc-r-IQ3_S | 8.675839 ±0.060418 | 97.37% | 0.137925 ±0.000554 | 11.245 ±0.051 |
Llama-xLAM-2-8b-fc-r-IQ4_NL | 8.337503 ±0.060156 | 99.09% | 0.047455 ±0.000243 | 6.280 ±0.039 |
Llama-xLAM-2-8b-fc-r-Q3_K_L | 8.894129 ±0.063417 | 97.22% | 0.136754 ±0.000659 | 11.276 ±0.057 |
Llama-xLAM-2-8b-fc-r-Q3_K_M | 8.991141 ±0.063906 | 96.89% | 0.152094 ±0.000706 | 11.870 ±0.058 |
Llama-xLAM-2-8b-fc-r-Q3_K_S | 9.352260 ±0.066573 | 95.91% | 0.198689 ±0.000870 | 13.526 ±0.061 |
Llama-xLAM-2-8b-fc-r-Q4_K_M | 8.230419 ±0.058263 | 99.18% | 0.041808 ±0.000219 | 5.988 ±0.037 |
Llama-xLAM-2-8b-fc-r-Q4_K_M (naive) | 8.072811 ±0.057158 | 99.60% | 0.019868 ±0.000110 | 4.044 ±0.029 |
Llama-xLAM-2-8b-fc-r-Q4_K_S | 8.239495 ±0.058176 | 99.10% | 0.045691 ±0.000240 | 6.262 ±0.039 |
Llama-xLAM-2-8b-fc-r-Q5_K_M | 8.062572 ±0.057549 | 99.77% | 0.011576 ±0.000073 | 3.136 ±0.025 |
Llama-xLAM-2-8b-fc-r-Q5_K_S | 8.057947 ±0.057474 | 99.75% | 0.012330 ±0.000075 | 3.210 ±0.026 |
Llama-xLAM-2-8b-fc-r-Q6_K | 7.983587 ±0.056711 | 99.91% | 0.004239 ±0.000034 | 1.912 ±0.018 |
Llama-xLAM-2-8b-fc-r-Q8_0 | 7.982215 ±0.056796 | 99.94% | 0.002365 ±0.000026 | 1.449 ±0.019 |
Llama-xLAM-2-8b-fc-r-F16 | 7.968796 ±0.056714 | 100% | N/A | N/A |
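For context, μKLD is the mean token-level Kullback–Leibler divergence between the quantized model's output distribution and the F16 baseline's (the standard definition is shown below; the exact averaging performed by llama-perplexity may differ in detail), so lower values mean the quant tracks the baseline more closely:

$$
D_{\mathrm{KL}}\!\left(P_{\text{F16}} \,\|\, Q_{\text{quant}}\right)
= \sum_{t \in \mathcal{V}} P_{\text{F16}}(t) \, \log \frac{P_{\text{F16}}(t)}{Q_{\text{quant}}(t)}
$$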
ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores
Scores generated using `llama-perplexity` with 750 tasks per test, and a context size of 768 tokens.
For the test data used in the generation of these scores, follow the appropriate links: HellaSwag, ARC, MMLU, Truthful QA and WinoGrande
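A hedged sketch of how such runs might be invoked is shown below; the benchmark flags exist in `llama-perplexity`, but the dataset file names are placeholders, so fetch the actual files from the links above.

```bash
# HellaSwag: 750 tasks, 768-token context
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf -c 768 \
  --hellaswag --hellaswag-tasks 750 -f hellaswag_val_full.txt

# WinoGrande: 750 tasks, 768-token context
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf -c 768 \
  --winogrande --winogrande-tasks 750 -f winogrande-debiased-eval.csv

# ARC, MMLU and Truthful QA use the generic multiple-choice mode
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf -c 768 \
  --multiple-choice --multiple-choice-tasks 750 -f arc-challenge-validation.bin
```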
Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
---|---|---|---|---|---|---|
Llama-xLAM-2-8b-fc-r-IQ3_M | 64.6667 ±1.7466 | 76.67 | 38.5333 ±1.7783 | 29.6000 ±1.6680 | 74.5333 ±1.5919 | 56.80 |
Llama-xLAM-2-8b-fc-r-IQ3_S | 60.8000 ±1.7838 | 72.40 | 38.0000 ±1.7736 | 30.9333 ±1.6889 | 72.5333 ±1.6309 | 54.93 |
Llama-xLAM-2-8b-fc-r-IQ4_NL | 66.0000 ±1.7309 | 77.73 | 39.0667 ±1.7827 | 30.8000 ±1.6869 | 73.7333 ±1.6080 | 57.47 |
Llama-xLAM-2-8b-fc-r-Q3_K_L | 65.0667 ±1.7420 | 76.67 | 38.6667 ±1.7794 | 29.6000 ±1.6680 | 71.6000 ±1.6477 | 56.32 |
📄 License
This project is licensed under the cc-by-nc-4.0 license.

