🚀 Experimental layer-wise quantization of Salesforce/Llama-xLAM-2-8b-fc-r
This project focuses on the experimental layer-wise quantization of the Salesforce/Llama-xLAM-2-8b-fc-r model, aiming to optimize the inference performance of large language models in resource-constrained environments.
Model Information
Property | Details |
---|---|
Base Model | Salesforce/Llama-xLAM-2-8b-fc-r |
Datasets | eaddario/imatrix-calibration |
Language | en |
License | cc-by-nc-4.0 |
Pipeline Tag | text-generation |
Tags | gguf, quant, experimental |
🚀 Quick Start
The experimental versions of the model are generated using a custom quantization method. Here is a high-level overview of the process (a command-level sketch follows the list):
- Convert the original model's tensors to GGUF F16*.
- Estimate the Perplexity score for the F16 model (baseline) using the wikitext-2-raw-v1 dataset, and save the logits.
- Generate an imatrix from selected calibration datasets.
- Determine tensor and layer Importance Score contributions using the modified version of `llama-imatrix`.
- Select an appropriate quant level for each tensor and quantize the model using `llama-quantize`.
- Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model.
- Keep versions with the best scores.
- Repeat until all desired quants are created.
*BF16 would be preferred, but Apple's GPUs don't support it yet, so any BF16 operations are executed on the CPU, making inference unacceptably slow. This is expected to change in the near term, but until then, if you are using Apple hardware, avoid models tagged BF16.
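For readers who want to reproduce a similar workflow with stock llama.cpp tooling, the sketch below mirrors those steps. It is a hedged outline, not the exact recipe used for this repo: file names are placeholders, and the flags (the KL-divergence options in particular) should be verified against the llama.cpp build you are using.

```bash
# 1. Convert the downloaded HF checkpoint to GGUF F16 (script ships with llama.cpp)
python convert_hf_to_gguf.py ./Llama-xLAM-2-8b-fc-r \
  --outtype f16 --outfile Llama-xLAM-2-8b-fc-r-F16.gguf

# 2. Baseline Perplexity on wikitext-2-raw-v1, saving the F16 logits for later KL Divergence runs
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-F16.gguf \
  -f wikitext-2-raw-v1.txt --kl-divergence-base logits-F16.kld

# 3. Generate an importance matrix from a calibration dataset
./llama-imatrix -m Llama-xLAM-2-8b-fc-r-F16.gguf -f calibration.txt -o imatrix.dat

# 4. Quantize (a single-type, "naive" pass shown here; see the LWQ example further down)
./llama-quantize --imatrix imatrix.dat \
  Llama-xLAM-2-8b-fc-r-F16.gguf Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf Q4_K_M

# 5. Score the quantized model against the saved F16 logits
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf \
  --kl-divergence-base logits-F16.kld --kl-divergence
```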
✨ Features
- Layer-wise Quantization: Inspired by Dumitru et al.'s Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels, different quantization types are applied to different tensors/layers.
- Optimized Inference: Aims to improve the inference performance of LLMs in resource-constrained environments.
- Custom Tools: Custom versions of `llama-imatrix` and `llama-quantize` are used to identify influential tensors and perform the quantization.
📚 Documentation
Original Model Introduction
Large Action Models (LAMs) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the brains of AI agents, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains.
This model release is for research purposes only.
The new xLAM-2 series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in multi-turn conversation and tool usage. Trained using our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions, our models achieve state-of-the-art performance on the BFCL and τ-bench benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials.
Experimental Version Production
An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but the focus has been primarily on quantization and pruning.
The method used to produce these experimental versions is covered in Squeezing Tensor Bits: the quest for smaller LLMs. At a high level, it involves using a custom version of `llama-imatrix` and `llama-quantize` to identify influential tensors, and quantizing the most important layers to higher bit precision and the less important ones to lower bits.
As of version b5125, `llama-quantize` can perform tensor-wide quantization (TWQ), whereby user-defined tensors are quantized at a specific level, or perform layer-wise quantization (LWQ) by selecting different quantization types per tensor/layer.
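As an illustration only, an LWQ invocation might look like the sketch below. The `--tensor-type` option name and its pattern=type syntax should be verified against your `llama-quantize` build, and the tensor/type choices shown are hypothetical rather than the recipe used for the models in this repo.

```bash
# Hypothetical layer-wise quantization: hold attention V and FFN down projections
# at higher precision while the rest of the model falls back to the base Q4_K_M mix.
./llama-quantize --imatrix imatrix.dat \
  --tensor-type attn_v=q6_k \
  --tensor-type ffn_down=q5_k \
  Llama-xLAM-2-8b-fc-r-F16.gguf Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf Q4_K_M
```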
The modified version of `llama-imatrix` generates useful statistics to guide the tensor selection process. `--show-statistics` will display:
- Σ(Bias): the sum of all activations over the tensor (i.e. the Importance Scores)
- Min & Max: minimum and maximum activation values
- μ & σ: activations' mean and standard deviation
- % Active: proportion of elements whose average activation exceeds a very small threshold (1e-6). Helpful to determine how alive/dormant the tensor is during inference
- N: number of activations in the tensor
- Entropy: entropy of the activation distribution, in bits (standard Shannon entropy measurement)
- E (norm): Normalized entropy.
- ZD Score: z-score distribution as described in 3.1 Layer Importance Scores in the Layer-Wise Quantization paper
- CosSim: cosine similarity between same type tensors with respect to the previous layer (i.e. blk.7.attn_k and blk.6.attn_k)
Please note that statistics are calculated for each individual tensor and should be used to compare between tensors of the same type only.
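For reference, Entropy and E (norm) follow the standard Shannon formulation. Assuming the per-element activation sums $a_i$ are normalised into a probability distribution over the tensor's $N$ elements, they can be read as:

$$
p_i = \frac{a_i}{\sum_{j=1}^{N} a_j}, \qquad
H = -\sum_{i=1}^{N} p_i \log_2 p_i, \qquad
E_{\text{norm}} = \frac{H}{\log_2 N}
$$

An $E_{\text{norm}}$ close to 1 indicates activations spread evenly across the tensor, while values near 0 indicate a few dominant elements.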
There's a pull request to merge these changes back into the core llama.cpp project. This may or may not ever happen, so until then, the modified version will be available on GitHub.
Testing and Comparison
For testing and comparison, models produced by Unsloth and Bartowski are normally used, but since neither provides GGUF versions of this model, all tests and comparisons are done against naive quantizations obtained by simply running `llama-quantize` with no further optimization.
All experimental versions were generated using an appropriate imatrix created from calibration datasets available at eaddario/imatrix-calibration.
🔧 Technical Details
Model Sizes
Model | Naive (GB) | Repo (GB) | Shrinkage |
---|---|---|---|
Llama-xLAM-2-8b-fc-r-IQ3_M | 3.78 | 3.69 | 2.4% |
Llama-xLAM-2-8b-fc-r-IQ3_S | 3.68 | 3.43 | 6.8% |
Llama-xLAM-2-8b-fc-r-IQ4_NL | 4.71 | 4.39 | 6.2% |
Llama-xLAM-2-8b-fc-r-Q3_K_L | 4.32 | 3.76 | 13.0% |
Llama-xLAM-2-8b-fc-r-Q3_K_M | 4.02 | 3.56 | 11.4% |
Llama-xLAM-2-8b-fc-r-Q3_K_S | 3.66 | 3.31 | 9.6% |
Llama-xLAM-2-8b-fc-r-Q4_K_M | 4.92 | 4.41 | 10.4% |
Llama-xLAM-2-8b-fc-r-Q4_K_S | 4.69 | 4.28 | 8.7% |
Llama-xLAM-2-8b-fc-r-Q5_K_M | 5.73 | 5.38 | 6.1% |
Llama-xLAM-2-8b-fc-r-Q5_K_S | 5.60 | 5.24 | 6.4% |
Llama-xLAM-2-8b-fc-r-Q6_K | 6.60 | 6.57 | 0.5% |
Llama-xLAM-2-8b-fc-r-Q8_0 | 8.54 | 7.73 | 9.5% |
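Shrinkage is the relative size reduction of the repo quant versus the naive one; for example, for the Q3_K_L variant:

$$
\text{Shrinkage} = \frac{4.32 - 3.76}{4.32} \times 100\% \approx 13.0\%
$$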
Perplexity and KL Divergence scores
Model | μPPL | 𝜌PPL | μKLD | RMS Δp |
---|---|---|---|---|
Llama-xLAM-2-8b-fc-r-IQ3_M | 8.471225 ±0.059374 | 98.14% | 0.096730 ±0.000436 | 9.339 ±0.048 |
Llama-xLAM-2-8b-fc-r-IQ3_S | 8.675839 ±0.060418 | 97.37% | 0.137925 ±0.000554 | 11.245 ±0.051 |
Llama-xLAM-2-8b-fc-r-IQ4_NL | 8.337503 ±0.060156 | 99.09% | 0.047455 ±0.000243 | 6.280 ±0.039 |
Llama-xLAM-2-8b-fc-r-Q3_K_L | 8.894129 ±0.063417 | 97.22% | 0.136754 ±0.000659 | 11.276 ±0.057 |
Llama-xLAM-2-8b-fc-r-Q3_K_M | 8.991141 ±0.063906 | 96.89% | 0.152094 ±0.000706 | 11.870 ±0.058 |
Llama-xLAM-2-8b-fc-r-Q3_K_S | 9.352260 ±0.066573 | 95.91% | 0.198689 ±0.000870 | 13.526 ±0.061 |
Llama-xLAM-2-8b-fc-r-Q4_K_M | 8.230419 ±0.058263 | 99.18% | 0.041808 ±0.000219 | 5.988 ±0.037 |
Llama-xLAM-2-8b-fc-r-Q4_K_M (naive) | 8.072811 ±0.057158 | 99.60% | 0.019868 ±0.000110 | 4.044 ±0.029 |
Llama-xLAM-2-8b-fc-r-Q4_K_S | 8.239495 ±0.058176 | 99.10% | 0.045691 ±0.000240 | 6.262 ±0.039 |
Llama-xLAM-2-8b-fc-r-Q5_K_M | 8.062572 ±0.057549 | 99.77% | 0.011576 ±0.000073 | 3.136 ±0.025 |
Llama-xLAM-2-8b-fc-r-Q5_K_S | 8.057947 ±0.057474 | 99.75% | 0.012330 ±0.000075 | 3.210 ±0.026 |
Llama-xLAM-2-8b-fc-r-Q6_K | 7.983587 ±0.056711 | 99.91% | 0.004239 ±0.000034 | 1.912 ±0.018 |
Llama-xLAM-2-8b-fc-r-Q8_0 | 7.982215 ±0.056796 | 99.94% | 0.002365 ±0.000026 | 1.449 ±0.019 |
Llama-xLAM-2-8b-fc-r-F16 | 7.968796 ±0.056714 | 100% | N/A | N/A |
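For context, μKLD is the mean token-level Kullback–Leibler divergence between the quantized model's output distribution and the F16 baseline's (the standard definition is shown below; the exact averaging performed by llama-perplexity may differ in detail), so lower values mean the quant tracks the baseline more closely:

$$
D_{\mathrm{KL}}\!\left(P_{\text{F16}} \,\|\, Q_{\text{quant}}\right)
= \sum_{t \in \mathcal{V}} P_{\text{F16}}(t) \, \log \frac{P_{\text{F16}}(t)}{Q_{\text{quant}}(t)}
$$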
ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores
Scores generated using `llama-perplexity` with 750 tasks per test, and a context size of 768 tokens.
For the test data used in the generation of these scores, follow the appropriate links: HellaSwag, ARC, MMLU, Truthful QA and WinoGrande
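A hedged sketch of how such runs might be invoked is shown below; the benchmark flags exist in `llama-perplexity`, but the dataset file names are placeholders, so fetch the actual files from the links above.

```bash
# HellaSwag: 750 tasks, 768-token context
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf -c 768 \
  --hellaswag --hellaswag-tasks 750 -f hellaswag_val_full.txt

# WinoGrande: 750 tasks, 768-token context
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf -c 768 \
  --winogrande --winogrande-tasks 750 -f winogrande-debiased-eval.csv

# ARC, MMLU and Truthful QA use the generic multiple-choice mode
./llama-perplexity -m Llama-xLAM-2-8b-fc-r-Q4_K_M.gguf -c 768 \
  --multiple-choice --multiple-choice-tasks 750 -f arc-challenge-validation.bin
```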
Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
---|---|---|---|---|---|---|
Llama-xLAM-2-8b-fc-r-IQ3_M | 64.6667 ±1.7466 | 76.67 | 38.5333 ±1.7783 | 29.6000 ±1.6680 | 74.5333 ±1.5919 | 56.80 |
Llama-xLAM-2-8b-fc-r-IQ3_S | 60.8000 ±1.7838 | 72.40 | 38.0000 ±1.7736 | 30.9333 ±1.6889 | 72.5333 ±1.6309 | 54.93 |
Llama-xLAM-2-8b-fc-r-IQ4_NL | 66.0000 ±1.7309 | 77.73 | 39.0667 ±1.7827 | 30.8000 ±1.6869 | 73.7333 ±1.6080 | 57.47 |
Llama-xLAM-2-8b-fc-r-Q3_K_L | 65.0667 ±1.7420 | 76.67 | 38.6667 ±1.7794 | 29.6000 ±1.6680 | 71.6000 ±1.6477 | 56.32 |
📄 License
This project is licensed under the cc-by-nc-4.0 license.

