


In the simplest form, Stable Diffusion is a text-to-image model: give it a text prompt, and it will return an image matching the text. In other words, Stable Diffusion turns text prompts into images.

Stable Diffusion belongs to a class of deep learning models called diffusion models. They are generative models, meaning they are designed to generate new data similar to what they have seen in training. In the case of Stable Diffusion, the data are images. Why is it called a diffusion model? Because its math looks very much like diffusion in physics.

Let's say I trained a diffusion model with only two kinds of images: cats and dogs. In the figure below, the two peaks on the left represent the groups of cat and dog images.

A forward diffusion process turns a photo into noise: it adds noise to a training image, gradually turning it into an uncharacteristic noise image (figure modified from this article). The forward process will turn any cat or dog image into a noise image. Eventually, you won't be able to tell whether it was initially a dog or a cat.

It's like a drop of ink that fell into a glass of water. After a few minutes, it randomly distributes itself throughout the water, and you can no longer tell whether it initially fell at the center or near the rim. Below is an example of an image undergoing forward diffusion.
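To make the forward process concrete, here is a minimal sketch of DDPM-style noising in PyTorch. The linear beta schedule, the 1000-step count, and the 3×64×64 image shape are illustrative assumptions, not Stable Diffusion's actual settings.

```python
import torch

# Assumed linear noise schedule over 1000 steps (illustrative values)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative fraction of signal kept

def forward_diffusion(x0, t):
    """Corrupt a clean image x0 with t steps of Gaussian noise in one shot.

    DDPM-style closed form:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    """
    noise = torch.randn_like(x0)                 # Gaussian noise, same shape as x0
    a = alpha_bars[t]
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    return x_t, noise                            # the noise becomes the training target later

# Example: a stand-in 3x64x64 "cat photo" pushed 500 of 1000 steps toward pure noise
x0 = torch.rand(3, 64, 64)
x_t, added_noise = forward_diffusion(x0, t=500)
```

The further t goes, the more the image is dominated by noise, which is exactly the ink-spreading-through-water picture above.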

What if we can reverse the diffusion? Like playing a video backward, we will see where the ink drop was initially added. The reverse diffusion process recovers an image: starting from a noisy, meaningless image, reverse diffusion recovers a cat OR a dog image. Technically, every diffusion process has two parts: (1) drift, or directed motion, and (2) random motion. The reverse diffusion drifts towards either cat or dog images, but nothing in between. That's why the result can be either a cat or a dog.

The idea of reverse diffusion is undoubtedly clever and elegant. But the million-dollar question is, "How can it be done?" To reverse the diffusion, we need to know how much noise was added to an image. The answer is teaching a neural network model to predict the added noise. It is called the noise predictor in Stable Diffusion. Training goes as follows (a toy sketch of the loop appears below):

1. Pick a training image, like a photo of a cat.
2. Generate a random noise image.
3. Corrupt the training image by adding this noise image up to a certain number of steps.
4. Teach the noise predictor to tell us how much noise was added. This is done by tuning its weights and showing it the correct answer.

Noise is sequentially added at each step, and the noise predictor learns to estimate the total noise added up to each step. After training, we have a noise predictor capable of estimating the noise added to an image.

To generate an image, we first generate a completely random image and ask the noise predictor to tell us the noise. We then subtract this estimated noise from the image and repeat: reverse diffusion works by subtracting the predicted noise from the image successively (also sketched below). You will get an image of either a cat or a dog.

You may notice we have no control over whether we get a cat or a dog image. We will address this when we talk about conditioning. For now, image generation is unconditioned. You can read more about reverse diffusion sampling and samplers in this article.
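Here is a toy PyTorch version of that training loop. The `NoisePredictor` class is a hypothetical stand-in for Stable Diffusion's actual U-Net, the batches of random tensors stand in for real training photos, and the noise schedule matches the forward-diffusion sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictor(nn.Module):
    """Hypothetical stand-in for the real U-Net noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x_t, t):
        # A real model also conditions on the timestep t; omitted here for brevity.
        return self.net(x_t)

model = NoisePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

for step in range(100):                                  # toy number of training steps
    x0 = torch.rand(8, 3, 64, 64)                        # stand-in batch of training images
    t = torch.randint(0, T, (8,))                        # a random noise level for each image
    noise = torch.randn_like(x0)
    a = alpha_bars[t][:, None, None, None]               # broadcast over channels and pixels
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise         # corrupted images (steps 2-3 above)

    pred = model(x_t, t)                                 # ask the model how much noise was added
    loss = F.mse_loss(pred, noise)                       # "show it the correct answer"

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # tune the weights (step 4 above)
```

The mean-squared error between the predicted and the actual noise is the standard training objective for this kind of model.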

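And here is what "subtracting the predicted noise successively" looks like as code, using the standard DDPM-style update as one concrete formulation. Stable Diffusion itself supports more sophisticated samplers, and the schedule below matches the toy sketches above rather than any real checkpoint.

```python
import torch

@torch.no_grad()
def sample(model, T=1000, shape=(1, 3, 64, 64)):
    """Generate an image by repeatedly removing the predicted noise (DDPM-style)."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                                # start from a completely random image
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t)              # the current step, one per image
        eps = model(x, t_batch)                           # predicted noise at this step
        # Subtract the predicted noise (the "drift" of the reverse step)
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)  # re-inject a little random motion
    return x

image = sample(model)   # `model` is the toy noise predictor trained above
```

The subtraction of predicted noise is the drift that pulls the image towards a cat or a dog; the small amount of noise re-injected at each step is the random-motion part of the diffusion.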
Now I need to tell you some bad news: what we just talked about is NOT how Stable Diffusion works! The reason is that the above diffusion process is in image space. You won't be able to run it on any single GPU, let alone the crappy GPU on your laptop. Think about it: a 512×512 image with three color channels (red, green, and blue) is a 786,432-dimensional space! (You need to specify that many values for ONE image.) Diffusion models like Google's Imagen and OpenAI's DALL-E are in pixel space. They have used some tricks to make the model faster, but still not enough.

Stable Diffusion is designed to solve the speed problem. Stable Diffusion is a latent diffusion model: instead of operating in the high-dimensional image space, it first compresses the image into a latent space. The latent space is 48 times smaller, so it reaps the benefit of crunching a lot fewer numbers. (A quick check of the arithmetic follows below.)
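The "48 times smaller" figure follows directly from the tensor shapes. The 4×64×64 latent shape for a 512×512 image is Stable Diffusion's commonly cited setting; treat the exact numbers here as a back-of-the-envelope check rather than a specification.

```python
# Values per image in pixel space vs. latent space
image_dims  = 512 * 512 * 3       # 786,432 numbers for one 512x512 RGB image
latent_dims = 64 * 64 * 4         # 16,384 numbers in the compressed latent

print(image_dims)                 # 786432
print(image_dims // latent_dims)  # 48  -> the latent space is 48 times smaller
```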
