Flux/Pixart Image Variation Experiments
Quick summary: We trained Flux to generate image variations, like this:

Code and models are on huggingface.
This post covers a number of experiments I ran to replicate the work by Justin Pinkney, which reconditions Stable Diffusion v1.5 to generate image variations. To quickly recap: most image diffusion models are trained with text conditioning, so they take text as input to generate an image. Image variation models work a bit differently: they take an image as input and generate similar-looking images (as demonstrated in the example image above).
Since diffusion models work via an iterative denoising process, it’s possible to use a standard text-conditioned diffusion model to generate image variations by adding noise to your image prompt and then re-denoising it. This approach is easy, but it limits the flexibility of the model. Re-conditioning is much more flexible; for example, the Zero123 paper combined image conditioning with camera position/rotation conditioning to allow for novel view synthesis.
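For concreteness, here's what that noise-and-redenoise approach looks like as a minimal sketch using diffusers' img2img pipeline (the model ID, prompt, and strength value are illustrative choices, not anything from Justin's setup):

```python
# Image variations by noising an image prompt and re-denoising it with a
# standard text-conditioned model. Higher `strength` adds more noise and
# gives looser variations.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("prompt.jpg").resize((512, 512))
variation = pipe(
    prompt="a photo",  # img2img still needs a text prompt; keep it generic
    image=init_image,
    strength=0.6,
).images[0]
```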
In this post I’ll be working to re-condition two different diffusion models, Flux and PixArt, by swapping out the text conditioning for image conditioning. I’ll cover a number of different conditioning strategies and my exploration in replicating Justin’s work.
Flux
Stable Diffusion by default accepts CLIP-encoded text conditioning. Like Justin, I was initially surprised that it’s necessary to recondition Stable Diffusion to accept image conditioning: all I knew about CLIP was that it unifies text and image embeddings, so I assumed they would be interchangeable. Unfortunately, it’s not that simple. CLIP trains an image encoder and a text encoder, each of which produces intermediate results that are then projected into the unified CLIP embedding space. Neither Stable Diffusion nor Flux directly accepts the final unified CLIP embedding; both use the intermediate text encoder embeddings. It wasn’t explicitly stated in the original Stable Diffusion paper, but I believe Stable Diffusion uses the CLIP intermediate text encoder embeddings to capture as much information as possible from the prompt.
Even though Stable Diffusion accepts the CLIP text encoder’s intermediate results, their per-token dimension (N × 768) matches that of the final unified CLIP embedding (1 × 768), so the unified embedding can be plugged directly into Stable Diffusion as if it were a one-token sequence. Despite this, Justin found that using the unified embeddings did not produce meaningful results, requiring him to turn to fine-tuning instead. My plan was to do something similar, but with Flux.
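To make the shapes concrete, here's a small sketch using Hugging Face transformers and the ViT-L/14 checkpoint that Stable Diffusion v1.5 uses (the printed shapes are for that checkpoint):

```python
# Contrast CLIP's intermediate per-token text embeddings (what SD actually
# conditions on) with the final unified embedding from the projection head.
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
inputs = tokenizer(["a photo of a butterfly"], padding="max_length", return_tensors="pt")

# Intermediate text encoder output: one 768-dim vector per token.
hidden_states = model.text_model(**inputs).last_hidden_state
# Unified embedding: pooled and projected into the shared image/text space.
unified = model.get_text_features(**inputs)

print(hidden_states.shape)  # torch.Size([1, 77, 768])
print(unified.shape)        # torch.Size([1, 768])
```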
Flux is a newer image generation model based on the diffusion transformer (DiT) architecture. It was developed by Black Forest Labs, a company founded by former employees of Stability AI (the company that created Stable Diffusion). In contrast to Stable Diffusion, Flux accepts two different forms of conditioning. One is a slightly different form of the CLIP text conditioning (Flux accepts just the pooled CLIP text embedding, whereas Stable Diffusion uses the per-token text embeddings), and the other is the embeddings from a T5 text encoder.
In order to replicate Justin’s work as directly as possible, I wanted to swap out only the CLIP embeddings. My plan for the T5 embeddings was to leave them empty (by providing an empty string as the prompt), to force the model to focus on CLIP.
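At inference time the swap looks roughly like this: a hedged sketch against diffusers' Flux pipeline, where the pooled CLIP text embedding is replaced with a projected CLIP image embedding. Note that with the base model this produces nothing meaningful; it only becomes useful after fine-tuning.

```python
# Conditioning swap: empty T5 prompt, CLIP image embedding in place of the
# pooled CLIP text embedding. Plumbing is illustrative, not my training code.
import torch
from diffusers import FluxPipeline
from diffusers.utils import load_image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to("cuda")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# T5 path: empty prompt. The CLIP pooled embedding it returns is discarded.
prompt_embeds, _, _ = pipe.encode_prompt(prompt="", prompt_2="")

# CLIP path: projected image embedding, (1, 768), same shape as the pooled text embedding.
pixels = processor(images=load_image("prompt.jpg"), return_tensors="pt").pixel_values
image_embeds = image_encoder(pixels.to("cuda")).image_embeds.to(torch.bfloat16)

out = pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=image_embeds).images[0]
```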
Here’s where I hit a roadblock:

(https://web.archive.org/web/20240822025109/https://github.com/black-forest-labs/flux/issues/9)
Flux-Schnell has undergone a timestep distillation process, meaning it was distilled from a larger (unreleased) Flux model to generate images in 1-4 steps. The consensus in the Flux GitHub issues (which, when I went back to cite them for this post, I found had all been removed) was that this would make the Flux models difficult or impossible to fine-tune. I also found the paper Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization, which attempts to improve fine-tuning of timestep-distilled models, although that technique didn’t seem to allow for reconditioning the model.
To start, I decided to confirm for myself that a standard fine-tune of Flux would be problematic. Before trying to swap out the embeddings, I started with a quick traditional (text-to-image) fine-tune on the Smithsonian Butterflies dataset, just to try out fine-tuning Flux. Here’s one image I liked a lot.

Prompt: “Billboard that says all3d”. Now Flux wants to generate butterflies 🦋.
It looked to me like Flux was fine-tuning just fine, so I moved on to image variations. I swapped out Flux’s pooled CLIP text embeddings for projected CLIP image embeddings, fixed the T5 prompt to “”, and started training. For these early training runs I was using the PD12M dataset. This is where I started to see degraded model performance. Here are some images from those training runs to give you an idea of what I encountered:



Overall, the images became blurry and noisy, and started losing clear form. I was pretty confident that this was the “representation collapse” mentioned in the GitHub issue above, and that if I wanted to get anywhere, I couldn’t just naively train Flux and expect it to work.
I went looking for a solution and I found a number of models that claimed to remove the distillation from Flux-Schnell. I ended up choosing OpenFlux.1.
My theory was that if distillation was the problem, then using one of these de-distilled models with essentially the same training should avoid the representational collapse. One downside to this approach is that the de-distillation process seems to reduce the quality of the Flux model. Take a look at this writeup for some great commentary on the de-distillation process.
For training, the only thing that had to change was adding conditioning dropout, since the de-distilled Flux models require classifier-free guidance (CFG). I implemented a simple form of conditioning dropout for the image embedding by setting the image prompt to solid black.
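Here's a minimal sketch of that dropout, applied to the raw image prompts before they're encoded (the function name and drop probability are mine, not the actual training code):

```python
import torch

def drop_image_conditioning(images: torch.Tensor, drop_prob: float = 0.1) -> torch.Tensor:
    """Replace a random subset of image prompts in the batch with solid black,
    so the model also learns the unconditional mode that CFG needs.
    Assumes pixel values in [0, 1], where 0.0 is black."""
    mask = torch.rand(images.shape[0], device=images.device) < drop_prob
    images = images.clone()
    images[mask] = 0.0
    return images
```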
Training OpenFlux
This training run was really fun. As Flux started to learn its way around the CLIP embedding space, many of the images generated while training were psychedelic and artistic.
Although I feel like showing off AI generated images is basically like talking about your dreams, I hope that you enjoy some of these. Here’s a selection of my favorite intermediate generations:
The leftmost image is the image prompt, the rest are variations.



This beetle generated weird metallic circuit board-y things.

This flower generated black and white images.

This beetle generated strange faces.
As you can see, even though the model hasn’t been trained enough to reproduce the concepts in the image prompt, the same prompts still result in outputs along a consistent theme. I gave it another 5000 steps of training and started getting some reasonable results.
Top image is the image prompt, bottom row are generations.



My theory for this one is that the PD12M dataset has many photos of museum objects. The chair on the white background evoked this class of objects in the generations.

And here are some strange glass flowers, and a… bug?
My theory was that the fine-tuning dataset was limiting the generations: PD12M is full of photos of plants/nature and public domain photos like museum photos. So I decided to take this model and train it further on data from the SA1B/SAM dataset, a dataset of 11M high-quality images collected by Meta. This dataset came to our attention because it was used to train PixArt-Alpha.
So I wrote a simple streaming dataloader for SA1B, debugged some network issues, discovered that SA1B sort of (but not really) abides by the WebDataset standard, debugged some more network issues, quickly cleaned up PixArt’s AI-generated SAM captions, and was ready to go.
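For reference, the loader looked something like this sketch built on the webdataset library; the shard URL pattern is a placeholder, and the lenient handler is there because of SA1B's loose take on the format:

```python
import webdataset as wds

# Hypothetical shard pattern; SA1B is distributed as tar files of jpg + json.
urls = "https://example.com/sa1b/sa_{000000..000999}.tar"

dataset = (
    wds.WebDataset(urls, handler=wds.warn_and_continue)  # skip malformed samples
    .decode("pil")
    .to_tuple("jpg", "json")
)
loader = wds.WebLoader(dataset, batch_size=32, num_workers=4)
```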
41k steps later, here were the results:
Top image is the image prompt, bottom row are generations (512×512).



Some issues with anatomy.

It seems Flux has forgotten a lot of its text knowledge.

The faces in the SA1B dataset are blurred, which is most likely the cause of the blurred faces here.
So, a decent result, but the model definitely seems to have degraded in performance. I’m not sure how much of that is the result of using a de-distilled Flux model and how much is a consequence of the reconditioning.
To get a sense of how much my reconditioning harmed the model, I decided to do a few test generations with the base OpenFlux.1 model:
All images generated with 28 steps and a CFG scale of 2. Only text conditioning was used.

Prompt: “billboard that says all3d”

Prompt: “company logo that says ‘all3d’”

Prompt: “green midcentury modern chair”
These generations show that OpenFlux.1 definitely still has decent text generation capabilities, so I think my reconditioning did harm the model somewhat.
PixArt
While I was attempting to recondition Flux for image variations, it turned out that someone else had been trying the same thing: Black Forest Labs. They released a number of tools for Flux, including an ‘adapter network’ for image conditioning called Flux-Redux. Redux translates images from SigLIP (an improved CLIP model) embeddings into T5 embeddings.
Top image: Image prompt. Bottom row: Flux (Flux-Schnell) Redux image variations.

Flux Redux works really well.
Redux caught my interest because instead of reconditioning the whole model, it was just a small translation layer. I wanted to experiment with this technique, but at this point I decided that I didn’t want to keep working with Flux. The model was big and unwieldy, took a long time to train, and the de-distillation process added a layer of uncertainty to any experiments I made. I chose to instead run my next experiments with the PixArt-Alpha model, which is much smaller and also uses T5 embeddings.
My first thought was that maybe I could use the Redux model as-is with the PixArt model. After all, Redux is compatible with both Flux-Dev and Flux-Schnell. Maybe it learned to translate SigLIP into tokens that were compatible with any T5-conditioned model.
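The naive version of that experiment is simple to sketch with diffusers (the checkpoints are the public ones; the zeroed negative embeddings are my assumption for satisfying PixArt's CFG inputs):

```python
# Feed Flux-Redux's T5-space image tokens straight into PixArt-Alpha,
# bypassing PixArt's own T5 text encoder.
import torch
from diffusers import FluxPriorReduxPipeline, PixArtAlphaPipeline
from diffusers.utils import load_image

redux = FluxPriorReduxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Redux-dev", torch_dtype=torch.bfloat16
).to("cuda")
pixart = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-512x512", torch_dtype=torch.bfloat16
).to("cuda")

embeds = redux(load_image("prompt.jpg")).prompt_embeds  # (1, N, 4096), T5-sized
mask = torch.ones(embeds.shape[:2], dtype=torch.long, device="cuda")

out = pixart(
    prompt_embeds=embeds,
    prompt_attention_mask=mask,
    negative_prompt_embeds=torch.zeros_like(embeds),
    negative_prompt_attention_mask=mask,
).images[0]
```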
So I tried plugging Redux into PixArt-Alpha:

Something seems wrong here, can’t quite put my finger on it.
Instead of image variations, I just got noise. Pretty quickly I realized that the distributions of the T5 embeddings and the Redux embeddings were very different: T5 embeddings typically had a standard deviation of ~0.1, whereas the Redux embeddings had a standard deviation of ~2.5.
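A quick way to see the mismatch, continuing with the pipelines from the sketch above:

```python
# T5 embeddings vs. Redux outputs (values are approximate).
t5_embeds, *_ = pixart.encode_prompt("Cool dog wearing a cool hat")
print(t5_embeds.std())                                      # ~0.1
print(redux(load_image("prompt.jpg")).prompt_embeds.std())  # ~2.5
```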
This surprised me. Why were the Redux tokens working with Flux? Shouldn’t Flux also have been trained to accept the T5 embedding distribution? Well, it turns out that Flux is much more resilient than PixArt to tokens scaled outside of the T5 distribution. Here are some experiments:
Prompt: “Cool dog wearing a cool hat” multiplied by some scaling factor. Top row: PixArt generations. Bottom row: Flux generations (CLIP set to “”).

Scaled by 5. PixArt is mostly noise.

Scaled by 3. Here the PixArt results are still noisy but in a way that almost looks like modern art.

Scaled by 0.5. PixArt is basically perfect here, whereas Flux is moving into abstract art territory for some reason.

Scaled by 0.1. Flux seems to have entered the abstract phone wallpaper latent space, PixArt doesn’t know what to do.
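The scaling in these experiments amounts to something like the following, shown for the PixArt side and continuing with the pipeline from above (the Flux side is analogous via its own encode_prompt):

```python
# Generate from a prompt embedding multiplied by a constant scaling factor.
scale = 3.0
p, p_mask, n, n_mask = pixart.encode_prompt("Cool dog wearing a cool hat")
image = pixart(
    prompt_embeds=p * scale,
    prompt_attention_mask=p_mask,
    negative_prompt_embeds=n * scale,
    negative_prompt_attention_mask=n_mask,
).images[0]
```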
So Flux is much more resilient to scaled T5 embeddings, and Redux outputs embeddings that take advantage of this. This left me with a few questions:
- Would scaling down the Redux embeddings make them work with PixArt?
- Can I train a Redux-like model from scratch that would work with PixArt?
- Would that model be compatible with Flux?
The first question was answered easily: no, simply scaling down the Redux embeddings does not work. Here are some generations using scaled down Redux embeddings:
Top image: Image prompt. Bottom row: PixArt image variations.


Judging by the lack of noise, we have an embedding that PixArt can tolerate, but it’s not similar enough to standard T5 embeddings for PixArt to make anything of it.
To answer the other two questions, I started two experiments and left them to run over the holidays.
Training PixArt to use Redux embeddings
Maybe it would be possible to recondition PixArt to accept these embeddings, letting it learn to use Redux outputs directly. I started a training run to investigate.
In order to help PixArt along, I scaled the Redux model outputs by 0.04 and sliced them to the first 120 tokens (as PixArt was trained with 120 T5 tokens).
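In code, that preprocessing is just (continuing with the Redux pipeline from earlier):

```python
# Rescale Redux outputs toward T5's typical std and truncate to PixArt's
# 120-token context before using them as conditioning.
redux_tokens = redux(load_image("prompt.jpg")).prompt_embeds
cond = redux_tokens[:, :120] * 0.04
```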
I trained for ~186k steps, at a learning rate of 1e-4, and here are the results:
Top image: Image prompt. Bottom row: PixArt image variations.






I think it’s clear that PixArt has learned how to use Redux embeddings, although it’s experiencing some of the same degradation that Flux showed when I reconditioned it to use CLIP embeddings. I’m not sure if there’s a good way to avoid this when reconditioning a model.
Training a Redux adapter network from scratch for PixArt
To answer the other two questions (can I train a Redux-like model from scratch that works with PixArt, and would that model be compatible with Flux?), I started another training run to train a fresh Redux-style network.
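For reference, here's roughly what such an adapter looks like, assuming the same simple up/down projection structure as BFL's Redux (SigLIP's 1152-dim tokens projected into T5's 4096-dim space); treat the sizes as illustrative rather than my exact configuration:

```python
import torch.nn as nn

class ReduxStyleAdapter(nn.Module):
    """Map SigLIP image tokens into T5 embedding space."""
    def __init__(self, siglip_dim: int = 1152, t5_dim: int = 4096):
        super().__init__()
        self.up = nn.Linear(siglip_dim, t5_dim * 3)
        self.down = nn.Linear(t5_dim * 3, t5_dim)

    def forward(self, siglip_tokens):  # (B, 729, siglip_dim)
        return self.down(nn.functional.silu(self.up(siglip_tokens)))  # (B, 729, t5_dim)
```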
I trained for ~91k steps: at a learning rate of 2e-5 for the first ~35k steps, then 1e-5 for the remaining ~56k. Here are the results:
Top image: Image prompt. Bottom row: PixArt image variations.






I didn’t get to train this model as long as the other experiment, but I think the results are also successful. Since I’m only training the Redux embedding network, I think I’m getting less degradation, although it’s hard for me to say.
Overall I think this training run shows that it’s definitely possible to replicate the Redux technique for PixArt.
Does the Redux adapter for PixArt work with Flux?
To answer my final question, I substituted my newly trained Redux model back into Flux to see if it would work. Here are some generations:
Top image: Image prompt. Middle row: PixArt image variations. Bottom row: Flux image variations.



I think it’s pretty clear that the PixArt Redux model does not work with Flux. This isn’t that surprising, since the Flux Redux model didn’t work with PixArt. It looks like a Redux model ends up depending on the specifics of how its diffusion model interacts with the T5 latent space, rather than learning to output standard T5 tokens.
Conclusions/Cost breakdown
All models were trained using GPUs from Lambda Labs:
- Flux (fine-tune to use CLIP embeddings): 8xA100 (80GB), 357 hr, $5129
- PixArt (fine-tune to use Redux embeddings): 1xH100, 290 hr, $722
- PixArt (train new Redux adapter network from scratch): 1xH100, 345 hr, $861
As you can see, fine-tuning Flux was a lot more expensive than fine-tuning PixArt.
You can find the fine-tuned models and inference scripts here on huggingface: https://huggingface.co/ivand-all3d/image-variation-experiments.
Appendix
Training Run Info
Flux training runs:
Flux-1:
– Dataset: PD12M
– Trained: 3722 steps
– LR: Constant warmup to 1e-5
– Batch size: 24 (500 steps) -> 32 (1500 steps) -> 48 (1722 steps)
– Finetune layers: all
– Additional info:
– Added all prompt dropout of 0.1 to prevent forgetting unconditional denoising.
– Warmed up mix between text pooled and clip projected image embeddings, 500 steps, from 50/50 to 0/100.
– Similarly, warmed up prompt dropout from 0.2 to 0.5 for the same steps.
– Results:
– Inconclusive, different image prompts seemed to map to different concepts in an unclear way.
Flux-2:
– Continued fine tuning flux-1
– Trained: 1300 steps
– LR: Constant warmup to 1e-4 (1000 steps)
– Batch size: 48
– Additional info:
– ~6 epochs of 10000 images
– Increased prompt dropout to 0.75
– Instead of using an empty string for CLIP for all prompt dropout, switched to using black image. (Still using empty string for T5).
– Results:
– Both flowers, trees and beetle image prompts seemed to elicit similar-ish images. Painting image prompt seems to elicit paintings.
Flux-3:
– Continued fine tuning flux-2
– Trained: 3200 steps
– Batch size: 64
– Additional info:
– Epochs of 100000 images
– Increased prompt dropout to 0.8
– Results:
– Even more similar-ish results; seems better than output-model-4 despite unclear improvements in loss. Prompting with a painting gives paintings, a photo gives photos, etc. Prompting with a cat doesn’t work.
Flux-4:
– Continued fine tuning flux-3
– Trained: 1600 steps
– Batch size: 96
– Additional info:
– Prompted with different section of PD12M, [200000:300000]
– Results:
– Pretty good this time. Guidance scale of 2-3 works best with image conditioning, guidance scale of 7-8 with only text.
– Some combination of color scheme information, layout information, and subject is conveyed through the image prompt. For example:
– Stock photography of a chair on white background results in random stock photography of various objects.
– Photo of geese in water results in birds in nature.
– Photo of woman with colorful jewelry results in colorful flowers.
– Oil painting portrait results in more oil paintings.
Flux-5:
– Continued fine tuning flux-4
– Trained: 41k steps
– Batch size: 256
– Used deepspeed, 8 A100s, 8 (gpus) x 32 (batch size) x 1 (gradient accumulation)
– Additional info:
– Used SA1B dataset, first 2 million images
– Results:
– SA1B blurred faces so many faces in photos/paintings are blurred.
– Definitely captures the composition + colors of the photo. Sometimes it is very very close (geese) other times it is more ‘along the same lines’ (bosco).
Pixart training runs:
Pixart (fine tuning to use pre-existing Flux redux network):
– Dataset: SA1B
– Trained: 186k steps
– LR: 1e-4
– Batch size: 192
– 1 (gpu) x 32 (batch size) x 6 (gradient accumulation)
Pixart (training Redux-style network from scratch):
– Dataset: SA1B
– Trained: 91k steps
– LR: 2e-5 (35k steps) -> 1e-5
– Batch size: 160
– 1 (gpu) x 32 (batch size) x 5 (gradient accumulation)