[COMPSCI 180] Diffusion Models!
Jerry Xiao Two

Part A: The Power of Diffusion Models!

Part 0: Setup

Results Comparison: Different Resolutions and Inference Steps

The following table shows the generated images for 5 different prompts, comparing low resolution (after stage 1 of the network) vs high resolution (after stage 2 of the network), and 20 inference steps vs 50 inference steps.

Prompt | Low Resolution, 20 Steps | Low Resolution, 50 Steps | High Resolution, 20 Steps | High Resolution, 50 Steps
1. An oil painting of a snowy mountain village | [low res, 20 steps] | [low res, 50 steps] | [high res, 20 steps] | [high res, 50 steps]
2. An oil painting of an old man | [low res, 20 steps] | [low res, 50 steps] | [high res, 20 steps] | [high res, 50 steps]
3. An oil painting of a young lady | [low res, 20 steps] | [low res, 50 steps] | [high res, 20 steps] | [high res, 50 steps]
4. A lithograph of waterfalls | [low res, 20 steps] | [low res, 50 steps] | [high res, 20 steps] | [high res, 50 steps]
5. A lithograph of a skull | [low res, 20 steps] | [low res, 50 steps] | [high res, 20 steps] | [high res, 50 steps]

Analysis

This comparison demonstrates the effects of:

  • Resolution: High resolution images provide more detail and clarity compared to low resolution versions
  • Inference Steps: More inference steps (50 vs 20) generally produce more refined and detailed outputs, though the improvement varies by prompt. For example, for the two scenery prompts, the 50-step results do not look much better than the 20-step results; both still look somewhat unrealistic.

Part 1: Sampling Loops

Part 1.1 Implementing the Forward Process

The implementation of the forward function is as follows:

def forward(im, t):
    """
    Args:
        im : torch tensor of size (1, 3, 64, 64) representing the clean image
        t  : integer timestep

    Returns:
        im_noisy : torch tensor of size (1, 3, 64, 64) representing the noisy image at timestep t
    """
    with torch.no_grad():
        noise = torch.randn_like(im)
        alphas_t = alphas_cumprod[t]
        im_noisy = im * torch.sqrt(alphas_t) + torch.sqrt(1 - alphas_t) * noise
    return im_noisy
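
To produce the images below, the function can simply be called at the three noise levels (a usage sketch; `im` is assumed to be the preprocessed Campanile tensor):

noisy_250 = forward(im, 250)   # light noise
noisy_500 = forward(im, 500)   # moderate noise
noisy_750 = forward(im, 750)   # heavy noise
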
Campanile at noise level 250

Noise Level: 250

Campanile at noise level 500

Noise Level: 500

Campanile at noise level 750

Noise Level: 750

Original Campanile

Original

Part 1.2 Classical Denoising

This section demonstrates the effect of traditional Gaussian blur denoising on noisy images. The images below show a comparison between the noisy images (before denoising) and the results after applying Gaussian blur denoising.

Noise Level | Before Denoising (Noisy Image) | After Gaussian Blur (Denoised Image)
Noise Level: 250 | [noisy image at noise level 250] | [denoised image at noise level 250]
Noise Level: 500 | [noisy image at noise level 500] | [denoised image at noise level 500]
Noise Level: 750 | [noisy image at noise level 750] | [denoised image at noise level 750]
Original | [original image] | [original image after Gaussian blur]

The comparison above demonstrates the effect of Gaussian blur denoising:

  • Noise Reduction: Gaussian blur effectively reduces high-frequency noise, making the images appear smoother; however, it also blurs fine details and edges, so the denoised results lose sharpness
  • Limitations: Traditional Gaussian blur is a simple denoising method that does not preserve image structure as well as more advanced techniques such as diffusion models (see the sketch below)
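
For reference, a minimal sketch of the Gaussian-blur baseline, assuming torchvision is available; the kernel size and sigma below are illustrative choices, not necessarily the exact values used for the results above:

import torchvision.transforms.functional as TF

def gaussian_denoise(im_noisy, kernel_size=5, sigma=2.0):
    # Low-pass the image: suppresses high-frequency noise but also softens edges
    return TF.gaussian_blur(im_noisy, kernel_size=kernel_size, sigma=sigma)

# Example: blur the noisy Campanile from Part 1.1
denoised_250 = gaussian_denoise(forward(im, 250))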

Part 1.3 Implementing One Step Denoising

One-step denoising consists of the following steps (sketched in code after the list):

  1. Use forward to get the noisy image at timestep t
  2. Estimate the noise at timestep t
  3. Remove the noise to get an estimate of the original image
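
A minimal sketch of these three steps, reusing `stage_1`, `prompt_embeds`, and `alphas_cumprod` from earlier (details such as the half-precision cast may differ slightly from my actual notebook code):

def one_step_denoise(im, t):
    with torch.no_grad():
        # 1. Forward process: add noise at timestep t
        im_noisy = forward(im, t)

        # 2. Estimate the noise with the pretrained UNet
        model_output = stage_1.unet(
            im_noisy.half().cuda(), t,
            encoder_hidden_states=prompt_embeds, return_dict=False
        )[0]
        noise_est, _ = torch.split(model_output, im_noisy.shape[1], dim=1)

        # 3. Invert the forward equation to recover an estimate of the clean image
        alpha_bar = alphas_cumprod[t]
        x0_est = (im_noisy - torch.sqrt(1 - alpha_bar) * noise_est) / torch.sqrt(alpha_bar)
    return im_noisy, x0_est
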
Noise Level | Original Image | Before Denoising (Noisy Image) | After One-Step Denoising (Denoised Image)
Noise Level: 250 | [original image] | [noisy image at noise level 250] | [denoised image at noise level 250]
Noise Level: 500 | [original image] | [noisy image at noise level 500] | [denoised image at noise level 500]
Noise Level: 750 | [original image] | [noisy image at noise level 750] | [denoised image at noise level 750]

Part 1.4 Implementing Iterative Denoising

The iterative denoising loop repeatedly applies our learned denoiser while walking backward through the noise schedule. Each pass removes the predicted noise for the current timestep and re-injects the correct level of randomness, giving the next, slightly cleaner image. Repeating this across many steps (rather than performing a single giant leap) preserves structure while gradually restoring finer details. The implementation of the iterative_denoise function is as follows:

def iterative_denoise(im_noisy, i_start, prompt_embeds, timesteps, display=True):
    image = im_noisy
    images_to_display = []

    with torch.no_grad():
        for i in range(i_start, len(timesteps) - 1):
            # Get timesteps
            t = timesteps[i]
            prev_t = timesteps[i + 1]

            # Get `alpha_cumprod` and `alpha_cumprod_prev` for timestep t from `alphas_cumprod`,
            # then compute the per-step `alpha` and `beta`
            alpha_cumprod = alphas_cumprod[t]
            alpha_cumprod_prev = alphas_cumprod[prev_t]
            alpha_t_step = alpha_cumprod / alpha_cumprod_prev
            beta_t_step = 1 - alpha_t_step

            # Get noise estimate
            model_output = stage_1.unet(
                image.half().cuda(),
                t,
                encoder_hidden_states=prompt_embeds,
                return_dict=False
            )[0]

            # Split estimate into noise and variance estimate
            noise_est, predicted_variance = torch.split(model_output, image.shape[1], dim=1)

            # Compute `pred_prev_image` (x_{t'}), the DDPM estimate for the image at the
            # next timestep, which is slightly less noisy (equation 3, the core of DDPM)
            x0_pred = (image - torch.sqrt(1 - alpha_cumprod) * noise_est) / torch.sqrt(alpha_cumprod)

            # Apply equation 3
            pred_prev_image = (
                (torch.sqrt(alpha_cumprod_prev) * beta_t_step / (1 - alpha_cumprod)) * x0_pred +
                (torch.sqrt(alpha_t_step) * (1 - alpha_cumprod_prev) / (1 - alpha_cumprod)) * image
            )
            pred_prev_image = add_variance(predicted_variance, t, pred_prev_image)

            # Save intermediate results at selected timesteps for visualization
            if t in [90, 240, 390, 540, 690]:
                iterative_images[t] = (pred_prev_image.squeeze(0) * 0.5 + 0.5).clamp(0, 1).cpu().numpy().transpose(1, 2, 0)

            image = pred_prev_image

    clean = image.cpu().detach().numpy()
    return clean
Original | Gaussian Blur | One-Step Denoising | Iterative Denoising
[original image] | [Gaussian blur denoised] | [one-step denoised] | [iteratively denoised]
Iterative denoising step

Noise Level: 690

The slider highlights how the sample becomes progressively clearer as we move from heavy noise (690) toward the final reconstruction. The gradual refinement with multiple steps avoids the over-smoothing artifacts seen in the Gaussian blur and produces noticeably sharper edges than the single-step approach.

Part 1.5 Diffusion Model Sampling

The full sampling loop produces high-quality generations from pure noise. Below are five samples (using the same prompt embeddings from ‘a high quality photo’) captured at the final timestep of the iterative procedure:
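
Sampling from scratch is just the same iterative loop started from pure Gaussian noise at the first timestep of the schedule (a short sketch using the function above):

# Generate one sample of 'a high quality photo' from pure noise
im_noisy = torch.randn(1, 3, 64, 64).half().cuda()
sample = iterative_denoise(im_noisy, i_start=0, prompt_embeds=prompt_embeds, timesteps=timesteps)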

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Part 1.6 Classifier-Free Guidance (CFG)

I implement the iterative_denoise_cfg function to add classifier-free guidance to the iterative denoising process. The implementation is as follows:

def iterative_denoise_cfg(im_noisy, i_start, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7):
    image = im_noisy
    images_to_display = []

    with torch.no_grad():
        for i in range(i_start, len(timesteps) - 1):
            # Get timesteps
            t = timesteps[i]
            prev_t = timesteps[i + 1]

            # Get `alpha_cumprod`, `alpha_cumprod_prev`, `alpha`, `beta`
            alpha_cumprod = alphas_cumprod[t]
            alpha_cumprod_prev = alphas_cumprod[prev_t]
            alpha_t_step = alpha_cumprod / alpha_cumprod_prev  # alpha_t in equation 3
            beta_t_step = 1 - alpha_t_step                     # beta_t in equation 3

            # Get conditional noise estimate
            model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=prompt_embeds,
                return_dict=False
            )[0]

            # Get unconditional noise estimate
            uncond_model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=uncond_prompt_embeds,
                return_dict=False
            )[0]

            # Split each estimate into a noise estimate and a variance estimate
            noise_est, predicted_variance = torch.split(model_output, image.shape[1], dim=1)
            uncond_noise_est, _ = torch.split(uncond_model_output, image.shape[1], dim=1)

            # Compute the CFG noise estimate (equation 4)
            noise_est_cfg = uncond_noise_est + scale * (noise_est - uncond_noise_est)

            # Predict x0 using the CFG noise estimate, then apply equation 3 to get
            # `pred_prev_image`, the next, less noisy image
            x0_pred = (image - torch.sqrt(1 - alpha_cumprod) * noise_est_cfg) / torch.sqrt(alpha_cumprod)

            pred_prev_image = (
                (torch.sqrt(alpha_cumprod_prev) * beta_t_step / (1 - alpha_cumprod)) * x0_pred +
                (torch.sqrt(alpha_t_step) * (1 - alpha_cumprod_prev) / (1 - alpha_cumprod)) * image
            )
            pred_prev_image = add_variance(predicted_variance, t, pred_prev_image)

            image = pred_prev_image

    clean = image.cpu().detach().numpy()
    return clean

The following images show the results of the iterative denoising with CFG:

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Part 1.7 Image-to-image Translation

We now explore how strongly the diffusion process can pull a corrupted real image back to the learned image manifold when we start the reverse process at different points in the noise schedule. I noise the Campanile photo to the level of the chosen start timestep, then jump into the sampler at indices [1, 3, 5, 7, 10, 20] (the indices correspond to positions within the noise schedule; smaller values mean we restart from a noisier state and therefore denoise over a longer trajectory).
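
Concretely, each start index corresponds to noising the original image to that point in the schedule and then resuming CFG denoising from there (a sketch; `campanile_im` is a placeholder name for the preprocessed photo):

edits = {}
for i_start in [1, 3, 5, 7, 10, 20]:
    # Noise the clean photo to the level of timestep `timesteps[i_start]`...
    im_noisy = forward(campanile_im, timesteps[i_start])
    # ...then let CFG-guided iterative denoising pull it back toward the image manifold
    edits[i_start] = iterative_denoise_cfg(
        im_noisy, i_start, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7
    )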

Start @ 1

Start @ 3

Start @ 5

Start @ 7

Start @ 10

Start @ 20

Original

Part 1.7.1 Editing Hand-Drawn and Web Images

In this section, we will use pictures from the web and hand-drawn images to test the editing ability of the diffusion model.

Start @ 1

Start @ 3

Start @ 5

Start @ 7

Start @ 10

Start @ 20

Original

Part 1.7.2 Inpainting

In this section, we use inpainting to fill in missing parts of an image. The idea is to let the model denoise freely only inside the masked region while, at every step, forcing the rest of the image to stay consistent with the original.
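
In code this is a single extra line inside the denoising loop: after each update, pixels outside the mask are overwritten with an appropriately noised copy of the original image (a sketch; `mask` is 1 where new content should be generated and 0 elsewhere, and `x_orig` is the clean source image; both names are placeholders):

# Inside the iterative denoising loop, right after computing `pred_prev_image`:
# keep the generated content only inside the mask, and force everything else
# back to the original image noised to the current (less noisy) timestep
pred_prev_image = mask * pred_prev_image + (1 - mask) * forward(x_orig, prev_t)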

Original Image

Mask Image

Replace Area

Result Image

Part 1.7.3 Text-Conditioned Image-to-image Translation

In this section, we replace the prompt 'a high quality photo' with my own prompts and look at the translation results. It works best to use source images whose structure is similar to the target prompt; otherwise the translation is not smooth.

Start @ 1

Start @ 3

Start @ 5

Start @ 7

Start @ 10

Start @ 20

Original

Part 1.8 Visual Anagrams

In this section, we use the visual anagrams technique to create an image that shows one subject when viewed upright and a different subject when flipped upside down. The combined noise estimate is computed using the following steps:

  1. $\epsilon_1 = \text{CFG of UNet}(x_t, t, p_1)$

  2. $\epsilon_2 = \text{flip}(\text{CFG of UNet}(\text{flip}(x_t), t, p_2))$

  3. $\epsilon = (\epsilon_1 + \epsilon_2) / 2$

where UNet is the diffusion model UNet from before, $\text{flip}(\cdot)$ is a function that flips the image, and $p_1$ and $p_2$ are two different text prompt embeddings.
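
A sketch of the combined estimate, assuming a helper `cfg_noise_est(image, t, embeds)` that wraps the conditional/unconditional UNet calls and the CFG mix from Part 1.6 (the helper name is mine, not from the notebook):

def anagram_noise_est(image, t, prompt_embeds_1, prompt_embeds_2):
    # Noise estimate for the upright image under prompt p1
    eps_1 = cfg_noise_est(image, t, prompt_embeds_1)
    # Noise estimate for the upside-down image under prompt p2, flipped back upright
    flipped = torch.flip(image, dims=[2])  # dim 2 is the height axis of (B, C, H, W)
    eps_2 = torch.flip(cfg_noise_est(flipped, t, prompt_embeds_2), dims=[2])
    # Average the two estimates; this epsilon replaces the usual CFG estimate in the loop
    return (eps_1 + eps_2) / 2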

Old Man ↔ Young Lady

An oil painting of an old man (upright) / An oil painting of a young lady (flipped upside down)

Snowy Village ↔ Campfire

An oil painting of a snowy mountain village (upright) / An oil painting of people around a campfire (flipped upside down)

Part 1.9 Hybrid Images

For hybrid images, we take the low-pass component of the noise estimate for the first prompt and the high-pass component of the noise estimate for the second prompt, and add them to form the combined noise estimate. The algorithm is as follows:

  1. $\epsilon_1 = \text{CFG of UNet}(x_t, t, p_1)$

  2. $\epsilon_2 = \text{CFG of UNet}(x_t, t, p_2)$

  3. $\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$

where UNet is the diffusion model UNet, $f_\text{lowpass}$ is a low-pass filter, $f_\text{highpass}$ is a high-pass filter, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$.
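
A sketch of the combined estimate, reusing the same hypothetical `cfg_noise_est` helper and a Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative choices):

import torchvision.transforms.functional as TF

def hybrid_noise_est(image, t, prompt_embeds_1, prompt_embeds_2, kernel_size=33, sigma=2.0):
    eps_1 = cfg_noise_est(image, t, prompt_embeds_1)  # dominates from far away (low frequencies)
    eps_2 = cfg_noise_est(image, t, prompt_embeds_2)  # dominates up close (high frequencies)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high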

The following are the results of the hybrid images:

Hybrid Image 1 - Low Pass / High Pass (Normal Picture)

A lithograph of a skull (low pass) + A lithograph of waterfalls (high pass)

Hybrid Image 2 - Low Pass / High Pass (Normal Picture)

A painting of a red panda (low pass) + A painting of a houseplant (high pass)

Part B: Flow Matching from Scratch

Part 1: Training a Single-Step Denoising UNet

Visualization of the Noising process

After implementing the UNet, we can visualize the noising process by selecting different noise levels $\sigma$. The following are the results of the noising process:

Noising process with different noise levels

Training Process Visualization

I trained the UNet for 5 epochs; each epoch processes 253 batches with a batch size of 64. I select some of the images in a batch to produce a visualization of the training process (a minimal sketch of the training loop appears after the figures). The following are the results before training (epoch 0) and after training (epoch 5):

Before Training (Epoch 0)
Training process before training
After Training (Epoch 5)
Training process after training
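
For reference, a minimal sketch of the training loop described above, under the assumption that the denoiser was trained on MNIST at a fixed noise level of σ = 0.5 with an L2 loss; `unet` and `train_loader` stand in for my actual model and data loader, and the learning rate is illustrative:

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)  # illustrative learning rate

for epoch in range(5):
    for x, _ in train_loader:                  # 253 batches of 64 clean MNIST digits per epoch
        x = x.cuda()
        z = x + 0.5 * torch.randn_like(x)      # noisy input at sigma = 0.5 (assumed training level)
        loss = F.mse_loss(unet(z), x)          # predict the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()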

Training Loss Visualization

The training loss is as follows:

Training loss

Out-of-Distribution Testing

I sample results on the test set with out-of-distribution noise levels after the model is trained. The following are the results:

Out-of-distribution testing

You can see that for noise levels below 0.5 the denoised images are quite good. For noise levels above 0.5, however, the denoiser performs poorly, because the model was not trained on noise that strong and does not generalize to it.

Denoising Pure Noise Visualization

I used the trained model to denoise pure noise (pure noise is fed in as the input while the loss is still computed against the clean images), and the visualization of the training process is as follows:

Before Training (Epoch 0)
Training process before training
After Training (Epoch 5)
Training process after training

You can see that the model can barely denoise the pure noise.

Denoising Pure Noise Training Loss Visualization

The training loss is as follows:

Training loss

The training loss decreases; however, the results for different labels are essentially the same. When the model is trained with pure noise as input, it cannot distinguish inputs belonging to different labels, so the predictions for all labels end up very similar. The output looks like a blend of all the digits: you can see '3'-, '6'-, and '8'-like shapes because these digits have overlapping strokes. In other words, the result is roughly the centroid of all the digit shapes.

Part 2: Training a Flow Matching Model

Training Loss for Time-Conditioned UNet

After implementing the UNet as described in the notebook and the tutorial, I was able to train the flow matching model with a converging loss. The learning rate I use is 1e-2, and the $\gamma$ for the scheduler is 0.99.

Training Loss for Time Conditioned UNet
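
A sketch of the training setup with these hyperparameters, assuming an exponential learning-rate decay (torch.optim.lr_scheduler.ExponentialLR, stepped once per epoch) and the common linear flow-matching formulation where the UNet predicts the velocity from noise to data; `unet` and `train_loader` are placeholder names:

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(10):
    for x1, _ in train_loader:                          # x1: clean images
        x1 = x1.cuda()
        x0 = torch.randn_like(x1)                       # x0: pure noise
        t = torch.rand(x1.shape[0], device=x1.device)   # one timestep per sample in [0, 1)
        xt = (1 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * x1  # point on the linear path
        target = x1 - x0                                # velocity along the straight path
        loss = F.mse_loss(unet(xt, t), target)          # time-conditioned UNet predicts velocity
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                    # decay the learning rate each epoch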

Sampling results from the Time-Conditioned UNet

I sampled results after epochs 1, 5, and 10, and we can see that the results are quite good.

Before Training (Epoch 0)
Before epoch 1
After Epoch 1
After epoch 1
After Epoch 5
After epoch 5
After Epoch 10
After epoch 10

Training Loss for Class-Conditioned UNet

Using the same parameters as for the Time-Conditioned UNet, we get the following training curve.

Training Loss for Class-Conditioned UNet

Sampling results from the Class-Conditioned UNet

The visualization results are as follows:

Before Training (Epoch 0)
Before epoch 1
After Epoch 1
After epoch 1
After Epoch 5
After epoch 5
After Epoch 10
After epoch 10

We can see that the Class-Conditioned UNet generates more accurate and clearer results than the time-conditioned network.

Training without a Scheduler

We observe that training without a learning-rate scheduler does not directly lead to bad results; however, it does converge more slowly, so it is better to use the scheduler.

Results with Time-Conditioned Network:

Before Training (Epoch 0)
Before epoch 1
After Epoch 1
After epoch 1
After Epoch 5
After epoch 5
After Epoch 10
After epoch 10

Results with Class-Conditioned Network:

Before Training (Epoch 0)
Before epoch 1
After Epoch 1
After epoch 1
After Epoch 5
After epoch 5
After Epoch 10
After epoch 10

We can see that the differences between using the scheduler and not using it are not huge. However, using the scheduler does lead to faster convergence.
