LibreFLUX: A free, de-distilled FLUX model
LibreFLUX is an Apache 2.0 licensed model derived from FLUX.1-schnell, offering the full T5 context length, attention masking, restored classifier-free guidance, and reduced aesthetic fine-tuning. It's slower but more adaptable for finetuning.
Quick Start
LibreFLUX is an Apache 2.0 version of FLUX.1-schnell. It provides the full T5 context length, uses attention masking, has classifier-free guidance restored, and has had most of the FLUX aesthetic fine-tuning/DPO fully removed.
The image features a man standing confidently, wearing a simple t-shirt with a humorous and quirky message printed across the front. The t-shirt reads: "I de-distilled FLUX schnell into a slow, ugly model and all I got was this stupid t-shirt." The man's expression suggests a mix of pride and irony, as if he's aware of the complexity behind the statement, yet amused by the underwhelming reward. The background is neutral, keeping the focus on the man and his t-shirt, which pokes fun at the frustrating and often anticlimactic nature of technical processes or complex problem-solving, distilled into a comically understated punchline.
Usage Examples
Basic Usage
# ! pip install diffusers==0.30.3
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "jimmycarter/LibreFLUX",
    custom_pipeline="jimmycarter/LibreFLUX",
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# High VRAM
prompt = "Photograph of a chalk board on which is written: 'I thought what I'd do was, I'd pretend I was one of those deaf - mutes.'"
negative_prompt = "blurry"
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    return_dict=False,
    # guidance_scale=3.5,
    # num_inference_steps=28,
    # generator=torch.Generator().manual_seed(42),
    # no_cfg_until_timestep=0,
)
images[0][0].save('chalkboard.png')
# If you have <=24 GB VRAM, try:
# !pip install optimum-quanto
# Then
from optimum.quanto import freeze, quantize, qint8
# quantize and freeze will take a short amount of time, so be patient.
quantize(
    pipe.transformer,
    weights=qint8,
    exclude=[
        "*.norm", "*.norm1", "*.norm2", "*.norm2_context",
        "proj_out", "x_embedder", "norm_out", "context_embedder",
    ],
)
freeze(pipe.transformer)
pipe.enable_model_cpu_offload()
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    device=None,
    return_dict=False,
    do_batch_cfg=False,  # https://github.com/huggingface/optimum-quanto/issues/327
    # guidance_scale=3.5,
    # num_inference_steps=28,
    # generator=torch.Generator().manual_seed(42),
    # no_cfg_until_timestep=0,
)
images[0][0].save('chalkboard.png')
Advanced Usage
import torch
from diffusers import DiffusionPipeline
from lycoris import create_lycoris_from_weights

pipe = DiffusionPipeline.from_pretrained(
    "jimmycarter/LibreFLUX",
    custom_pipeline="jimmycarter/LibreFLUX",
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
lycoris_safetensors_path = 'pytorch_lora_weights.safetensors'
wrapper, _ = create_lycoris_from_weights(1.0, lycoris_safetensors_path, pipe.transformer)
wrapper.merge_to()
del wrapper
prompt = "Photograph of a chalk board on which is written: 'I thought what I'd do was, I'd pretend I was one of those deaf - mutes.'"
negative_prompt = "blurry"
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    return_dict=False,
)
images[0][0].save('chalkboard.png')
# optionally, save a merged pipeline with the LyCORIS baked in:
# pipe.save_pretrained('/path/to/output/pipeline')
Documentation
Non-technical Report on Schnell De-distillation
Welcome to my non-technical report on de-distilling FLUX.1-schnell in the most unscientific way possible with extremely limited resources. I'm not going to claim I made a good model, but I did make a model. It was trained on about 1,500 H100 hour equivalents.
Everyone is ~~an artist~~ a machine learning researcher.
Why
FLUX is a good text-to-image model, but the only versions of it that are out are distilled. FLUX.1-dev is distilled so that you don't need to use CFG (classifier-free guidance), so instead of making one sample for conditional (your prompt) and one for unconditional (negative prompt), you only have to make the sample for conditional. This means that FLUX.1-dev is twice as fast as the model without distillation.
FLUX.1-schnell (German for "fast") is further distilled so that you only need 4 steps of conditional generation to get an image. Importantly, FLUX.1-schnell has an Apache-2.0 license, so you can use it freely without having to obtain a commercial license from Black Forest Labs. Out of the box, schnell is pretty bad when you use CFG unless you skip the first couple of steps.
The distilled FLUX models are created from their non-distilled base models by training the student model (distilled) on outputs from the teacher model (non-distilled), along with some tricks like an adversarial network.
For de-distilled models, image generation takes a little less than twice as long because you need to compute a sample for both the conditional and unconditional images at each step. The benefit is that you can use them commercially for free, training is a little easier, and they may be more creative.
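For reference, the minimal sketch below shows what that extra conditional/unconditional work looks like at a single sampling step. It is only an illustration: `model`, `latents`, `t`, and the embedding arguments are placeholders, not the pipeline's actual internals.

def cfg_step(model, latents, t, cond_emb, uncond_emb, guidance_scale=3.5):
    # Two forward passes per step: one with the prompt embedding, one with the
    # negative/empty prompt embedding.
    pred_cond = model(latents, t, cond_emb)
    pred_uncond = model(latents, t, uncond_emb)
    # Blend them, pushing the prediction away from the unconditional direction.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)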
Restoring the original training objective
This part is actually really easy. You just train it on the normal flow-matching objective with MSE loss and the model starts learning how to do it again. That being said, I don't think either LibreFLUX or OpenFLUX.1 managed to fully de-distill the model. The evidence I see for that is that both models will either get strange shadows that overwhelm the image or blurriness when using CFG scale values greater than 4.0. Neither of us trained very long in comparison to the training for the original model (assumed to be around 0.5-2.0M H100 hours), so it's not particularly surprising.
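To make that concrete, here is a rough, simplified sketch of a flow-matching training step with MSE loss; `transformer`, `latents`, and `text_emb` are stand-ins rather than the actual training code used here.

import torch
import torch.nn.functional as F

def flow_matching_loss(transformer, latents, text_emb):
    # Sample one random timestep per example and interpolate between the clean
    # latents and pure Gaussian noise.
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)
    t_ = t.view(-1, 1, 1, 1)
    noisy = (1.0 - t_) * latents + t_ * noise
    # Train the model to predict the velocity (noise - latents) at time t.
    pred = transformer(noisy, t, text_emb)
    return F.mse_loss(pred, noise - latents)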
FLUX and attention masking
FLUX models use a text model called T5-XXL to get most of their conditioning for the text-to-image task. Importantly, they pad the text out to either 256 (schnell) or 512 (dev) tokens; 512 tokens is the maximum trained length for the model. By padding, I mean they repeat the last token until the sequence reaches this length.
This results in the model using these padding tokens to store information. When you visualize the attention maps of the tokens in the padding segment of the text encoder, you can see that about 10-40 tokens shortly after the last token of the text and about 10-40 tokens at the end of the padding contain information which the model uses to make images. Because these padding tokens are normally used to store information, any prompt long enough to leave few of them free ends up with degraded performance.
It's easy to prevent this by masking out these padding tokens during attention. BFL and their engineers know this, but they probably decided against it because the model works as is, and most fast attention implementations only support causal (LLM) style padding, so skipping the mask let them train faster.
I already [implemented attention masking](https://github.com/bghira/SimpleTuner/resolve/main/helpers/models/flux/transformer.py#L404-L406) and I would like to be able to use all 512 tokens without degradation, so I did my finetune with it on. Small-scale finetunes with it on tend to damage the model, but since I needed to train schnell so heavily to undo the distillation anyway, I figured adding it probably didn't matter.
Note that FLUX.1-schnell was only trained on 256 tokens, so my finetune allows users to use the whole 512 token sequence length.
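The sketch below illustrates the idea with the standard T5 tokenizer from transformers; the linked SimpleTuner code is the real implementation, and the LibreFLUX pipeline applies the mask for you. The tokenizer's attention mask marks which tokens are real and which are padding, and passing it through to the transformer's attention keeps padding tokens from storing information.

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
enc = tokenizer(
    "Photograph of a chalk board",
    padding="max_length",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
# 1 for real prompt tokens, 0 for padding; only the real tokens should be
# attended to by the image stream.
attention_mask = enc.attention_mask
print(attention_mask.sum().item(), "non-padding tokens out of", attention_mask.shape[-1])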
Make de-distillation go fast and fit in small GPUs
I avoided doing any full-rank (normal, all-parameter) fine-tuning at all, since FLUX is big. I trained initially with the model in int8 precision using [quanto](https://github.com/huggingface/optimum-quanto). I started with a 600 million parameter LoKr, since LoKr tends to approximate full-rank fine-tuning better than LoRA. The loss was really slow to go down at first, so after poking around the code that initializes the matrices the LoKr applies, I settled on this function, which injects noise at a fraction of the magnitude of the layers it applies to.
def approximate_normal_tensor(inp, target, scale=1.0):
    # Fill `target` in place with Gaussian noise whose norm, std, and mean are
    # matched to `inp`, then scale it down.
    tensor = torch.randn_like(target)
    desired_norm = inp.norm()
    desired_mean = inp.mean()
    desired_std = inp.std()

    current_norm = tensor.norm()
    tensor = tensor * (desired_norm / current_norm)
    current_std = tensor.std()
    tensor = tensor * (desired_std / current_std)
    tensor = tensor - tensor.mean() + desired_mean
    tensor.mul_(scale)

    target.copy_(tensor)


def init_lokr_network_with_perturbed_normal(lycoris, scale=1e-3):
    # Initialize the LoKr factors so the initial update is small noise shaped
    # like the original weights instead of exactly zero.
    with torch.no_grad():
        for lora in lycoris.loras:
            lora.lokr_w1.fill_(1.0)
            approximate_normal_tensor(lora.org_weight, lora.lokr_w2, scale=scale)
This isn't normal PEFT (parameter-efficient fine-tuning) anymore, because it perturbs all of the model's weights slightly at the start. After testing, it doesn't seem to cause any performance degradation, and it made the loss for my LoKr fall twice as fast, so I used it with scale=1e-3. The LoKr weights were trained in bfloat16, with the adamw_bf16 optimizer that I ~~plagiarized~~ wrote with the magic of open source software.
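For context, wiring this up looks roughly like the hypothetical snippet below; the dim/factor values are illustrative and not the exact configuration I trained with.

from lycoris import create_lycoris

lycoris_net = create_lycoris(
    pipe.transformer,   # the FLUX transformer from the pipeline
    1.0,                # multiplier
    linear_dim=100000,  # oversized dim so LoKr keeps full-size second factors
    linear_alpha=1,
    algo="lokr",
    factor=4,           # controls the Kronecker split, and therefore parameter count
)
lycoris_net.apply_to()

# Apply the perturbed-normal init from above before training starts.
init_lokr_network_with_perturbed_normal(lycoris_net, scale=1e-3)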
Selecting better layers to train with LoKr
FLUX is a pretty standard transformer model aside from some peculiarities. One of these peculiarities is in its "norm" layers, which contain non-linearities, so they don't act like norms except for a single normalization that is applied in the layer with...
License
This project is licensed under the Apache-2.0 license.







