[COMPSCI 180] Diffusion Models!
Jerry Xiao Two

Part A: The Power of Diffusion Models!

Part 0: Setup

Results Comparison: Different Resolutions and Inference Steps

The following table shows the generated images for five prompts, comparing low resolution (after stage 1 of the network) vs. high resolution (after stage 2), and 20 vs. 50 inference steps.

| Prompt | Low Res, 20 Steps | Low Res, 50 Steps | High Res, 20 Steps | High Res, 50 Steps |
| --- | --- | --- | --- | --- |
| 1. An oil painting of a snowy mountain village | (image) | (image) | (image) | (image) |
| 2. An oil painting of an old man | (image) | (image) | (image) | (image) |
| 3. An oil painting of a young lady | (image) | (image) | (image) | (image) |
| 4. A lithograph of waterfalls | (image) | (image) | (image) | (image) |
| 5. A lithograph of a skull | (image) | (image) | (image) | (image) |

Analysis

This comparison demonstrates the effects of:

  • Resolution: High-resolution images provide more detail and clarity than their low-resolution counterparts.
  • Inference Steps: More inference steps (50 vs. 20) generally produce more refined and detailed outputs, though the improvement varies by prompt. For example, for the two scenery prompts, the 50-step results do not look much better than the 20-step results; both remain somewhat unrealistic.

Part 1: Sampling Loops

Part 1.1 Implementing the Forward Process

The forward process produces a noisy image $x_t$ from a clean image $x_0$ via $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar\alpha_t$ is the cumulative product of the noise schedule (`alphas_cumprod[t]` below). The implementation of the forward function is as follows:

```python
def forward(im, t):
    """
    Args:
        im : torch tensor of size (1, 3, 64, 64) representing the clean image
        t  : integer timestep

    Returns:
        im_noisy : torch tensor of size (1, 3, 64, 64) representing the noisy
                   image at timestep t
    """
    with torch.no_grad():
        noise = torch.randn_like(im)           # epsilon ~ N(0, I)
        alphas_t = alphas_cumprod[t]           # cumulative product alpha-bar_t
        im_noisy = im * torch.sqrt(alphas_t) + torch.sqrt(1 - alphas_t) * noise
    return im_noisy
```
(Images: the Campanile noised to levels 250, 500, and 750, alongside the original photo.)

Part 1.2 Classical Denoising

This section demonstrates the effect of traditional Gaussian blur denoising on noisy images. The images below show a comparison between the noisy images (before denoising) and the results after applying Gaussian blur denoising.
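For reference, the classical baseline is only a few lines. A minimal sketch, assuming `im_noisy` is one of the noisy tensors produced by `forward` in Part 1.1 (the kernel size and sigma are illustrative choices, not the exact values used for the results below):

```python
import torchvision.transforms.functional as TF

# Gaussian blur as a classical denoiser; larger sigma removes more noise
# but also more detail.
im_denoised = TF.gaussian_blur(im_noisy, kernel_size=5, sigma=1.0)
```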

| Noise Level | Before Denoising (Noisy Image) | After Gaussian Blur (Denoised Image) |
| --- | --- | --- |
| 250 | (image) | (image) |
| 500 | (image) | (image) |
| 750 | (image) | (image) |
| Original | (image) | (image) |

The comparison above demonstrates the effect of Gaussian blur denoising:

  • Noise Reduction: Gaussian blur effectively reduces high-frequency noise, making the images appear smoother.
  • Detail Loss: While noise is reduced, Gaussian blur also blurs fine details and edges, resulting in a loss of sharpness.
  • Limitations: Traditional Gaussian blur is a simple denoising method that does not preserve image structure as well as more advanced techniques like diffusion models.

Part 1.3 Implementing One Step Denoising

One-step denoising proceeds in three steps (a short sketch follows the list):

  1. Use forward to get the noisy image at timestep t
  2. Estimate the noise at timestep t
  3. Remove the noise to get an estimate of the original image
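A minimal sketch of these three steps, reusing the notebook globals (`stage_1`, `alphas_cumprod`, `prompt_embeds`) and the `forward` function from Part 1.1; the helper name `one_step_denoise` is my own:

```python
def one_step_denoise(im, t, prompt_embeds):
    with torch.no_grad():
        # 1. Add noise to the clean image at timestep t
        im_noisy = forward(im, t)

        # 2. Estimate the noise with the UNet
        model_output = stage_1.unet(
            im_noisy.half().cuda(), t,
            encoder_hidden_states=prompt_embeds,
            return_dict=False,
        )[0]
        noise_est, _ = torch.split(model_output, im_noisy.shape[1], dim=1)

        # 3. Invert the forward equation to estimate the clean image
        alpha_bar = alphas_cumprod[t]
        x0_est = (im_noisy.half().cuda() - torch.sqrt(1 - alpha_bar) * noise_est) / torch.sqrt(alpha_bar)
    return im_noisy, x0_est
```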
| Noise Level | Original Image | Before Denoising (Noisy Image) | After One-Step Denoising |
| --- | --- | --- | --- |
| 250 | (image) | (image) | (image) |
| 500 | (image) | (image) | (image) |
| 750 | (image) | (image) | (image) |

Part 1.4 Implementing Iterative Denoising

The iterative denoising loop repeatedly applies our learned denoiser while walking backward through the noise schedule. Each pass removes the predicted noise for the current timestep and re-injects the correct level of randomness, giving the next, slightly cleaner image. Repeating this across many steps (rather than performing a single giant leap) preserves structure while gradually restoring finer details. The implementation of the iterative_denoising function is as follows:

```python
def iterative_denoise(im_noisy, i_start, prompt_embeds, timesteps, display=True):
    image = im_noisy

    with torch.no_grad():
        for i in range(i_start, len(timesteps) - 1):
            # Current and next (less noisy) timesteps in the schedule
            t = timesteps[i]
            prev_t = timesteps[i + 1]

            # Cumulative alpha products at t and t', and the per-step alpha/beta
            alpha_cumprod = alphas_cumprod[t]
            alpha_cumprod_prev = alphas_cumprod[prev_t]
            alpha_t_step = alpha_cumprod / alpha_cumprod_prev
            beta_t_step = 1 - alpha_t_step

            # Noise estimate from the stage-1 UNet
            model_output = stage_1.unet(
                image.half().cuda(),
                t,
                encoder_hidden_states=prompt_embeds,
                return_dict=False,
            )[0]

            # The UNet stacks noise and variance estimates along the channel dim
            noise_est, predicted_variance = torch.split(model_output, image.shape[1], dim=1)

            # Estimate the clean image x_0 by inverting the forward process
            x0_pred = (image - torch.sqrt(1 - alpha_cumprod) * noise_est) / torch.sqrt(alpha_cumprod)

            # Equation 3 (the core of DDPM): blend the x_0 estimate with the
            # current image to get x_{t'}, the slightly less noisy image
            pred_prev_image = (
                (torch.sqrt(alpha_cumprod_prev) * beta_t_step / (1 - alpha_cumprod)) * x0_pred
                + (torch.sqrt(alpha_t_step) * (1 - alpha_cumprod_prev) / (1 - alpha_cumprod)) * image
            )
            pred_prev_image = add_variance(predicted_variance, t, pred_prev_image)

            # Save intermediates at selected noise levels for visualization
            # (`iterative_images` is a dict defined elsewhere in the notebook)
            if t in [90, 240, 390, 540, 690]:
                iterative_images[t] = (
                    (pred_prev_image.squeeze(0) * 0.5 + 0.5)
                    .clamp(0, 1).cpu().numpy().transpose(1, 2, 0)
                )

            image = pred_prev_image

    clean = image.cpu().detach().numpy()
    return clean
```
| Original | Gaussian Blur | One-Step Denoising | Iterative Denoising |
| --- | --- | --- | --- |
| (image) | (image) | (image) | (image) |
(Slider: intermediate iterative denoising results, from noise level 690 down to 90.)

The slider highlights how the sample becomes progressively clearer as we move from heavy noise (690) toward the final reconstruction. The gradual refinement with multiple steps avoids the over-smoothing artifacts seen in the Gaussian blur and produces noticeably sharper edges than the single-step approach.

Part 1.5 Diffusion Model Sampling

The full sampling loop produces images from pure noise. The five samples below use the prompt embedding for 'a high quality photo' and are taken at the final timestep of the iterative procedure.
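Concretely, sampling just runs `iterative_denoise` from pure Gaussian noise at the top of the schedule. A minimal sketch:

```python
# Sample five images from scratch: start at i_start = 0 with pure noise.
samples = []
for _ in range(5):
    x_T = torch.randn(1, 3, 64, 64).half().cuda()
    samples.append(iterative_denoise(x_T, 0, prompt_embeds, timesteps, display=False))
```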

(Images: Samples 1–5.)

Part 1.6 Classifier-Free Guidance (CFG)

I implement the iterative_denoise_cfg function to add classifier-free guidance (CFG) to the iterative denoising process. CFG runs the UNet twice per step, once with the text conditioning and once unconditionally, and extrapolates from the unconditional estimate toward the conditional one by a guidance scale. The implementation is as follows:

```python
def iterative_denoise_cfg(im_noisy, i_start, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7):
    image = im_noisy

    with torch.no_grad():
        for i in range(i_start, len(timesteps) - 1):
            # Current and next (less noisy) timesteps in the schedule
            t = timesteps[i]
            prev_t = timesteps[i + 1]

            alpha_cumprod = alphas_cumprod[t]
            alpha_cumprod_prev = alphas_cumprod[prev_t]
            alpha_t_step = alpha_cumprod / alpha_cumprod_prev  # alpha_t in equation 3
            beta_t_step = 1 - alpha_t_step                     # beta_t in equation 3

            # Conditional noise estimate
            model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=prompt_embeds,
                return_dict=False,
            )[0]

            # Unconditional noise estimate
            uncond_model_output = stage_1.unet(
                image,
                t,
                encoder_hidden_states=uncond_prompt_embeds,
                return_dict=False,
            )[0]

            # Split each output into noise and variance estimates
            noise_est, predicted_variance = torch.split(model_output, image.shape[1], dim=1)
            uncond_noise_est, _ = torch.split(uncond_model_output, image.shape[1], dim=1)

            # CFG (equation 4): extrapolate past the unconditional estimate
            # toward the conditional one
            noise_est_cfg = uncond_noise_est + scale * (noise_est - uncond_noise_est)

            # Predict x_0 using the CFG noise estimate, then apply equation 3
            x0_pred = (image - torch.sqrt(1 - alpha_cumprod) * noise_est_cfg) / torch.sqrt(alpha_cumprod)
            pred_prev_image = (
                (torch.sqrt(alpha_cumprod_prev) * beta_t_step / (1 - alpha_cumprod)) * x0_pred
                + (torch.sqrt(alpha_t_step) * (1 - alpha_cumprod_prev) / (1 - alpha_cumprod)) * image
            )
            pred_prev_image = add_variance(predicted_variance, t, pred_prev_image)

            image = pred_prev_image

    clean = image.cpu().detach().numpy()
    return clean
```
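Example usage: a CFG sample from pure noise with guidance scale 7, where `uncond_prompt_embeds` holds the embedding used for the unconditional branch:

```python
# One classifier-free-guided sample from scratch.
x_T = torch.randn(1, 3, 64, 64).half().cuda()
sample = iterative_denoise_cfg(x_T, 0, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7)
```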

The following images show the results of the iterative denoising with CFG:

(Images: Samples 1–5, generated with CFG.)

Part 1.7 Image-to-image Translation

We now explore how strongly the diffusion process can pull a corrupted real image back to the learned image manifold when we start the reverse process at different points in the noise schedule. I noise the Campanile photo to the level matching each start index, then resume the sampler at indices [1, 3, 5, 7, 10, 20] (indices into the noise schedule; smaller values restart from a noisier state and therefore denoise over a longer trajectory).
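A sketch of the loop that produced the results below; `campanile_im` is a hypothetical name for the loaded photo:

```python
# SDEdit-style editing: noise the photo to the level matching each start
# index, then resume the CFG denoiser from that index.
edits = {}
for i_start in [1, 3, 5, 7, 10, 20]:
    im_noisy = forward(campanile_im, timesteps[i_start])
    edits[i_start] = iterative_denoise_cfg(
        im_noisy, i_start, prompt_embeds, uncond_prompt_embeds, timesteps, scale=7
    )
```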

(Images: edits of the Campanile starting at indices 1, 3, 5, 7, 10, and 20, followed by the original photo.)

Part 1.7.1 Editing Hand-Drawn and Web Images

In this section, we will use pictures from the web and hand-drawn images to test the editing ability of the diffusion model.

(Images: edits of the web and hand-drawn test images starting at indices 1, 3, 5, 7, 10, and 20, followed by the originals.)

Part 1.7.2 Inpainting

In this section, we use the inpainting method to fill in a selected region of an image. The technique is to denoise only inside the masked region while, at every step, resetting the rest of the image to the original content noised to the current timestep.
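In code this is one extra line inside the denoising loop from Part 1.6. A sketch, where `mask` (1 inside the region to regenerate, 0 elsewhere) and `original_im` are assumed names:

```python
# After computing pred_prev_image for timestep prev_t, pin the unmasked
# region to the original image noised to that same level.
pred_prev_image = mask * pred_prev_image + (1 - mask) * forward(original_im, prev_t)
```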

(Images: the original image, the mask, the area to replace, and the inpainted result.)

Part 1.7.3 Text-Conditioned Image-to-image Translation

In this section, we replace the prompt 'a high quality photo' with prompts of my own and observe the translation results. The edits work best when the source picture has a structure similar to the target prompt; otherwise the translation is not smooth.
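Mechanically this is the same loop as in Part 1.7, only conditioned on embeddings of the new prompt; `custom_prompt_embeds` and `source_im` are hypothetical names:

```python
# Text-conditioned image-to-image: same SDEdit loop, custom conditioning.
im_noisy = forward(source_im, timesteps[i_start])
edited = iterative_denoise_cfg(
    im_noisy, i_start, custom_prompt_embeds, uncond_prompt_embeds, timesteps, scale=7
)
```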

(Images: text-conditioned edits starting at indices 1, 3, 5, 7, 10, and 20, followed by the original.)

Part 1.8 Visual Anagrams

In this section, we use the visual anagrams technique to create a single image that reads as one subject right-side up and as a different subject when flipped upside down. The anagram noise estimate is computed with the following steps:

  1. $\epsilon_1 = \text{CFG of UNet}(x_t, t, p_1)$

  2. $\epsilon_2 = \text{flip}(\text{CFG of UNet}(\text{flip}(x_t), t, p_2))$

  3. $\epsilon = (\epsilon_1 + \epsilon_2) / 2$

where UNet is the diffusion model UNet from before, $\text{flip}(\cdot)$ is a function that flips the image upside down, and $p_1$ and $p_2$ are two different text prompt embeddings.
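Inside the denoising loop, this combined estimate replaces the single CFG estimate. A sketch, where `cfg_noise_est` is a hypothetical helper wrapping the two UNet calls and equation 4 from Part 1.6:

```python
# The flip is over the height dimension (dim 2 of an NCHW tensor), so the
# second prompt is denoised on the upside-down image.
eps1 = cfg_noise_est(x_t, t, prompt_embeds_1)
eps2 = torch.flip(cfg_noise_est(torch.flip(x_t, dims=[2]), t, prompt_embeds_2), dims=[2])
noise_est_cfg = (eps1 + eps2) / 2
```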

(Images: anagram 1 flips between "An oil painting of an old man" and "An oil painting of a young lady"; anagram 2 flips between "An oil painting of a snowy mountain village" and "An oil painting of people around a campfire".)

Part 1.9 Hybrid Images

For hybrid images, we take the low-pass component of the noise estimate for the first prompt and the high-pass component of the noise estimate for the second prompt, then add them to form the combined estimate. The algorithm is as follows:

  1. $\epsilon_1 = \text{CFG of UNet}(x_t, t, p_1)$

  2. $\epsilon_2 = \text{CFG of UNet}(x_t, t, p_2)$

  3. $\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$

where UNet is the diffusion model UNet, $f_\text{lowpass}$ is a low-pass filter, $f_\text{highpass}$ is a high-pass filter, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$.
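A sketch of the combined estimate, again using the hypothetical `cfg_noise_est` helper; here the low-pass filter is a Gaussian blur and the high pass is the residual (kernel size and sigma are illustrative):

```python
import torchvision.transforms.functional as TF

eps1 = cfg_noise_est(x_t, t, prompt_embeds_1)   # prompt supplying low frequencies
eps2 = cfg_noise_est(x_t, t, prompt_embeds_2)   # prompt supplying high frequencies
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
noise_est_cfg = low + high
```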

The following are the results of the hybrid images:

(Images: hybrid 1 pairs the low frequencies of "A lithograph of a skull" with the high frequencies of "A lithograph of waterfalls"; hybrid 2 pairs the low frequencies of "A painting of a red panda" with the high frequencies of "A painting of a houseplant".)
