🚀 DFloat11 Compressed Model: black-forest-labs/FLUX.1-Canny-dev
This is a losslessly compressed version of the black-forest-labs/FLUX.1-Canny-dev model. By leveraging the custom DFloat11 format, it significantly reduces GPU memory usage while producing outputs identical to the original model.
✨ Features
- Lossless Compression: The outputs of the compressed model are bit-for-bit identical to the original BFloat16 model.
- Reduced Memory Consumption: Cuts GPU memory consumption by approximately 30%.
- Efficient Decompression: Uses Huffman coding and hardware-aware algorithmic design for on-the-fly decompression directly on the GPU.
- GPU-Only Operation: All operations are handled entirely on the GPU, with no CPU decompression and no host-device data transfer.
📦 Installation
- Install or upgrade the DFloat11 pip package (this installs the CUDA kernel automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

```bash
pip install -U dfloat11[cuda12]
```

- Install or upgrade the diffusers and controlnet_aux packages:

```bash
pip install -U diffusers controlnet_aux
```
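After installing, a quick optional sanity check (a minimal sketch, not an official setup step) can confirm that the packages import and that PyTorch sees a CUDA device, which DFloat11 requires:

```python
# Optional sanity check (sketch): confirm imports and CUDA visibility.
import torch
import diffusers
import controlnet_aux
import dfloat11

assert torch.cuda.is_available(), "DFloat11 decompression runs on the GPU and needs a CUDA device"
print("torch", torch.__version__, "| diffusers", diffusers.__version__, "| GPU:", torch.cuda.get_device_name(0))
```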
💻 Usage Examples
Basic Usage
To use the DFloat11 model, run the following example code in Python:
```python
import torch
from controlnet_aux import CannyDetector
from diffusers import FluxControlPipeline
from diffusers.utils import load_image
from dfloat11 import DFloat11Model

# Load the BFloat16 pipeline and offload idle components to the CPU.
pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Swap the transformer's weights for the losslessly compressed DFloat11 weights.
DFloat11Model.from_pretrained('DFloat11/FLUX.1-Canny-dev-DF11', device='cpu', bfloat16_model=pipe.transformer)

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

# Turn the reference image into a Canny edge map that conditions generation.
processor = CannyDetector()
control_image = processor(control_image, low_threshold=50, high_threshold=200, detect_resolution=1024, image_resolution=1024)

image = pipe(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=30.0,
).images[0]
image.save("output.png")
```
📚 Documentation
How It Works
DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.
Key benefits:
- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.
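To make the exponent-coding idea concrete, the following minimal sketch (not part of the DFloat11 package; the helper name and the random stand-in tensor are illustrative) estimates how many bits per value would remain if the 8-bit BFloat16 exponent field were entropy-coded while the sign and mantissa bits are stored unchanged. An average near 11 bits per value corresponds to the roughly 30% reduction quoted above; the real figure depends on the actual FLUX weight distribution, not on random data.

```python
import torch

def exponent_entropy_bits(weight: torch.Tensor) -> float:
    """Empirical entropy (bits/value) of the 8-bit exponent field of a bfloat16 tensor."""
    assert weight.dtype == torch.bfloat16
    bits = weight.flatten().view(torch.int16)           # reinterpret the raw 16 bits
    exponents = torch.bitwise_and(bits, 0x7F80) >> 7    # bits 7..14 hold the exponent
    counts = torch.bincount(exponents.to(torch.int64), minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * probs.log2()).sum())

# Random stand-in for a weight matrix; actual model weights determine the real ratio.
w = torch.randn(4096, 4096).to(torch.bfloat16)
h = exponent_entropy_bits(w)
est_bits_per_value = 1 + 7 + h                          # sign + mantissa kept as-is, exponent entropy-coded
print(f"exponent entropy ≈ {h:.2f} bits -> ≈ {est_bits_per_value:.1f} bits/value vs 16 for BF16")
```

Huffman coding approaches this entropy bound in practice, which is why compressing only the exponent field already yields the reported savings without touching the sign or mantissa bits.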
Learn More
📄 Information Table
| Property | Details |
|----------|---------|
| Base Model | black-forest-labs/FLUX.1-Canny-dev |
| Base Model Relation | quantized |
| Pipeline Tag | text-to-image |
| Tags | dfloat11, df11, lossless compression, 70% size, 100% accuracy |