LibreFLUX: A free, de-distilled FLUX model
LibreFLUX is an Apache 2.0 licensed model derived from FLUX.1-schnell, offering the full T5 context length, attention masking, restored classifier-free guidance, and reduced aesthetic fine-tuning. It's slower but more adaptable for finetuning.
Quick Start
LibreFLUX is an Apache 2.0 version of FLUX.1-schnell. It provides the full T5 context length, uses attention masking, has classifier-free guidance restored, and has had most of the FLUX aesthetic fine-tuning/DPO fully removed.
The image features a man standing confidently, wearing a simple t-shirt with a humorous and quirky message printed across the front. The t-shirt reads: "I de-distilled FLUX schnell into a slow, ugly model and all I got was this stupid t-shirt." The man's expression suggests a mix of pride and irony, as if he's aware of the complexity behind the statement, yet amused by the underwhelming reward. The background is neutral, keeping the focus on the man and his t-shirt, which pokes fun at the frustrating and often anticlimactic nature of technical processes or complex problem-solving, distilled into a comically understated punchline.
Usage Examples
Basic Usage
# ! pip install diffusers==0.30.3
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "jimmycarter/LibreFLUX",
    custom_pipeline="jimmycarter/LibreFLUX",
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# High VRAM
prompt = "Photograph of a chalk board on which is written: 'I thought what I'd do was, I'd pretend I was one of those deaf - mutes.'"
negative_prompt = "blurry"
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    return_dict=False,
    # guidance_scale=3.5,
    # num_inference_steps=28,
    # generator=torch.Generator().manual_seed(42),
    # no_cfg_until_timestep=0,
)
images[0][0].save('chalkboard.png')
# If you have <=24 GB VRAM, try:
# !pip install optimum-quanto
# Then
from optimum.quanto import freeze, quantize, qint8
# quantize and freeze will take a short amount of time, so be patient.
quantize(
    pipe.transformer,
    weights=qint8,
    exclude=[
        "*.norm", "*.norm1", "*.norm2", "*.norm2_context",
        "proj_out", "x_embedder", "norm_out", "context_embedder",
    ],
)
freeze(pipe.transformer)
pipe.enable_model_cpu_offload()
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    device=None,
    return_dict=False,
    do_batch_cfg=False,  # https://github.com/huggingface/optimum-quanto/issues/327
    # guidance_scale=3.5,
    # num_inference_steps=28,
    # generator=torch.Generator().manual_seed(42),
    # no_cfg_until_timestep=0,
)
images[0][0].save('chalkboard.png')
Advanced Usage
import torch
from diffusers import DiffusionPipeline
from lycoris import create_lycoris_from_weights

pipe = DiffusionPipeline.from_pretrained(
    "jimmycarter/LibreFLUX",
    custom_pipeline="jimmycarter/LibreFLUX",
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
lycoris_safetensors_path = 'pytorch_lora_weights.safetensors'
wrapper, _ = create_lycoris_from_weights(1.0, lycoris_safetensors_path, pipe.transformer)
wrapper.merge_to()
del wrapper
prompt = "Photograph of a chalk board on which is written: 'I thought what I'd do was, I'd pretend I was one of those deaf - mutes.'"
negative_prompt = "blurry"
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    return_dict=False,
)
images[0][0].save('chalkboard.png')
# optionally, save a merged pipeline with the LyCORIS baked in:
# pipe.save_pretrained('/path/to/output/pipeline')
Documentation
Non-technical Report on Schnell De-distillation
Welcome to my non-technical report on de-distilling FLUX.1-schnell in the most unscientific way possible with extremely limited resources. I'm not going to claim I made a good model, but I did make a model. It was trained on about 1,500 H100 hour equivalents.
Everyone is ~~an artist~~ a machine learning researcher.
Why
FLUX is a good text-to-image model, but the only versions of it that are out are distilled. FLUX.1-dev is distilled so that you don't need to use CFG (classifier-free guidance), so instead of making one sample for conditional (your prompt) and one for unconditional (negative prompt), you only have to make the sample for conditional. This means that FLUX.1-dev is twice as fast as the model without distillation.
FLUX.1-schnell (German for "fast") is further distilled so that you only need 4 steps of conditional generation to get an image. Importantly, FLUX.1-schnell has an Apache-2.0 license, so you can use it freely without having to obtain a commercial license from Black Forest Labs. Out of the box, schnell is pretty bad when you use CFG unless you skip the first couple of steps.
The distilled FLUX models are created from their non-distilled base models by training the student model (distilled) on outputs from the teacher model (non-distilled), along with some tricks like an adversarial network.
For de-distilled models, image generation takes a little less than twice as long because you need to compute a sample for both the conditional and unconditional images at each step. The benefit is that you can use them commercially for free, training is a little easier, and they may be more creative.
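For reference, the minimal sketch below shows what that extra conditional/unconditional work looks like at a single sampling step. It is only an illustration: `model`, `latents`, `t`, and the embedding arguments are placeholders, not the pipeline's actual internals.

def cfg_step(model, latents, t, cond_emb, uncond_emb, guidance_scale=3.5):
    # Two forward passes per step: one with the prompt embedding, one with the
    # negative/empty prompt embedding.
    pred_cond = model(latents, t, cond_emb)
    pred_uncond = model(latents, t, uncond_emb)
    # Blend them, pushing the prediction away from the unconditional direction.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)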
Restoring the original training objective
This part is actually really easy. You just train it on the normal flow-matching objective with MSE loss and the model starts learning how to do it again. That being said, I don't think either LibreFLUX or OpenFLUX.1 managed to fully de-distill the model. The evidence I see for that is that both models will either get strange shadows that overwhelm the image or blurriness when using CFG scale values greater than 4.0. Neither of us trained very long in comparison to the training for the original model (assumed to be around 0.5-2.0M H100 hours), so it's not particularly surprising.
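To make that concrete, here is a rough, simplified sketch of a flow-matching training step with MSE loss; `transformer`, `latents`, and `text_emb` are stand-ins rather than the actual training code used here.

import torch
import torch.nn.functional as F

def flow_matching_loss(transformer, latents, text_emb):
    # Sample one random timestep per example and interpolate between the clean
    # latents and pure Gaussian noise.
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)
    t_ = t.view(-1, 1, 1, 1)
    noisy = (1.0 - t_) * latents + t_ * noise
    # Train the model to predict the velocity (noise - latents) at time t.
    pred = transformer(noisy, t, text_emb)
    return F.mse_loss(pred, noise - latents)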
FLUX and attention masking
FLUX models use a text model called T5-XXL to get most of their conditioning for the text-to-image task. Importantly, they pad the text out to either 256 (schnell) or 512 (dev) tokens; 512 tokens is the maximum trained length for the model. By padding, I mean they repeat the last token until the sequence reaches this length.
This results in the model using these padding tokens to store information. When you visualize the attention maps of the tokens in the padding segment of the text encoder, you can see that about 10-40 tokens shortly after the last token of the text and about 10-40 tokens at the end of the padding contain information which the model uses to make images. Because these padding tokens are normally used to store information, any prompt long enough to leave few of them free ends up with degraded performance.
It's easy to prevent this by masking out these padding tokens during attention. BFL and their engineers know this, but they probably decided against it because the model works as is, and most fast attention implementations only support causal (LLM) style padding, so skipping the mask let them train faster.
I already [implemented attention masking](https://github.com/bghira/SimpleTuner/resolve/main/helpers/models/flux/transformer.py#L404-L406) and I would like to be able to use all 512 tokens without degradation, so I did my finetune with it on. Small-scale finetunes with it on tend to damage the model, but since I needed to train schnell so heavily to undo the distillation anyway, I figured adding it probably didn't matter.
Note that FLUX.1-schnell was only trained on 256 tokens, so my finetune allows users to use the whole 512 token sequence length.
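The sketch below illustrates the idea with the standard T5 tokenizer from transformers; the linked SimpleTuner code is the real implementation, and the LibreFLUX pipeline applies the mask for you. The tokenizer's attention mask marks which tokens are real and which are padding, and passing it through to the transformer's attention keeps padding tokens from storing information.

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
enc = tokenizer(
    "Photograph of a chalk board",
    padding="max_length",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
# 1 for real prompt tokens, 0 for padding; only the real tokens should be
# attended to by the image stream.
attention_mask = enc.attention_mask
print(attention_mask.sum().item(), "non-padding tokens out of", attention_mask.shape[-1])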
Make de-distillation go fast and fit in small GPUs
I avoided doing any full-rank (normal, all-parameter) fine-tuning at all, since FLUX is big. I trained initially with the model in int8 precision using [quanto](https://github.com/huggingface/optimum-quanto). I started with a 600 million parameter LoKr, since LoKr tends to approximate full-rank fine-tuning better than LoRA. The loss was really slow to go down at first, so after poking around the code that initializes the matrices the LoKr applies, I settled on this function, which injects noise at a fraction of the magnitude of the layers it applies to.
def approximate_normal_tensor(inp, target, scale=1.0):
    # Fill `target` in place with Gaussian noise whose norm, std, and mean are
    # matched to `inp`, then scale it down.
    tensor = torch.randn_like(target)
    desired_norm = inp.norm()
    desired_mean = inp.mean()
    desired_std = inp.std()

    current_norm = tensor.norm()
    tensor = tensor * (desired_norm / current_norm)
    current_std = tensor.std()
    tensor = tensor * (desired_std / current_std)
    tensor = tensor - tensor.mean() + desired_mean
    tensor.mul_(scale)

    target.copy_(tensor)


def init_lokr_network_with_perturbed_normal(lycoris, scale=1e-3):
    # Initialize the LoKr factors so the initial update is small noise shaped
    # like the original weights instead of exactly zero.
    with torch.no_grad():
        for lora in lycoris.loras:
            lora.lokr_w1.fill_(1.0)
            approximate_normal_tensor(lora.org_weight, lora.lokr_w2, scale=scale)
This isn't normal PEFT (parameter-efficient fine-tuning) anymore, because it perturbs all of the model's weights slightly at the start. After testing, it doesn't seem to cause any performance degradation, and it made the loss for my LoKr fall twice as fast, so I used it with scale=1e-3. The LoKr weights were trained in bfloat16, with the adamw_bf16 optimizer that I ~~plagiarized~~ wrote with the magic of open source software.
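For context, wiring this up looks roughly like the hypothetical snippet below; the dim/factor values are illustrative and not the exact configuration I trained with.

from lycoris import create_lycoris

lycoris_net = create_lycoris(
    pipe.transformer,   # the FLUX transformer from the pipeline
    1.0,                # multiplier
    linear_dim=100000,  # oversized dim so LoKr keeps full-size second factors
    linear_alpha=1,
    algo="lokr",
    factor=4,           # controls the Kronecker split, and therefore parameter count
)
lycoris_net.apply_to()

# Apply the perturbed-normal init from above before training starts.
init_lokr_network_with_perturbed_normal(lycoris_net, scale=1e-3)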
Selecting better layers to train with LoKr
FLUX is a pretty standard transformer model aside from some peculiarities. One of these peculiarities is in its "norm" layers, which contain non-linearities, so they don't act like norms except for a single normalization that is applied in the layer with...
License
This project is licensed under the Apache-2.0 license.







