
Image by Author | Ideogram
Generative AI models have emerged as a rising star in recent years, particularly with the introduction of large language model (LLM) products like ChatGPT. These models take input written in natural language and return a suitable output. Following the success of products like ChatGPT, other forms of generative AI have also become popular and mainstream.
Products such as DALL-E and Midjourney have become popular amid the generative AI boom due to their ability to generate images solely from natural language input. These popular products do not create images from nothing; instead, they rely on a model known as a diffusion model.
In this article, we will demystify the diffusion model to gain a deeper understanding of the technology behind it. We will discuss the fundamental concept, how the model works, and how it is trained.
Curious? Let’s get into it.
# Diffusion Model Fundamentals
 
Diffusion models are a class of AI algorithms that fall under the category of generative models, designed to generate new data based on training data. In the case of diffusion models, this means they can create new images from given inputs.
However, diffusion models generate images differently from most generative models: noise is added to the data and then removed. In simpler terms, the diffusion model corrupts an image and then refines it to create the final product. You can think of it as a denoising model, as it learns to remove noise from images.
Formally, the diffusion model first emerged in the paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015). The paper introduces the idea of gradually converting data into noise through a controlled forward diffusion process, then training a model to reverse that process and reconstruct the data, which is the denoising step.
Building upon this foundation, the paper Denoising Diffusion Probabilistic Models by Ho et al. (2020) introduced the modern diffusion framework, which produces high-quality images that rival previously dominant models such as generative adversarial networks (GANs). In general, a diffusion model consists of two critical stages:
- Forward (diffusion) process: Data is corrupted by incrementally adding noise until it becomes indistinguishable from random static
- Reverse (denoising) process: A neural network is trained to iteratively remove noise, learning how to reconstruct image data from complete randomness
Let’s try to understand the diffusion model components better to have a clearer picture.
// Forward Process
The forward process is the first phase, where an image is systematically degraded by adding noise until it becomes random static.
The forward process is controlled and iterative, which we can summarize in the following steps:
- Start with an image from the dataset
- Add a small amount of noise to the image
- Repeat this process many times (potentially hundreds or thousands), each time further corrupting the image
After enough steps, the original image will appear as pure noise.
The process above is often modeled mathematically as a Markov chain, as each noisy version depends only on the one immediately preceding it, not on the entire sequence of steps.
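The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names `forward_step` and `forward_chain`, the toy 8x8 "image", and the noise amounts in `betas` (a range commonly used in the literature) are all assumptions made for this example.

```python
import numpy as np

def forward_step(x_prev, beta, rng):
    """One forward step: shrink the image slightly and mix in Gaussian noise.

    beta is the (assumed) noise variance added at this step.
    """
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

def forward_chain(x0, betas, seed=0):
    """Apply forward_step once per entry in betas, keeping every intermediate image."""
    rng = np.random.default_rng(seed)
    xs = [x0]
    for beta in betas:
        # Markov property: the next image depends only on the previous one.
        xs.append(forward_step(xs[-1], beta, rng))
    return xs

def corr(a, b):
    """Correlation between two images, used here as a rough 'how much survives' score."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# Toy 8x8 "image" with values in [-1, 1], standing in for a real picture.
x0 = np.linspace(-1.0, 1.0, 64).reshape(8, 8)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed per-step noise amounts
xs = forward_chain(x0, betas)

print("correlation with original after 10 steps:  ", corr(x0, xs[10]))
print("correlation with original after 1000 steps:", corr(x0, xs[-1]))
```

After a handful of steps the image is still almost identical to the original, while after the full chain it carries essentially no trace of it.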
But why should we gradually turn the image into noise instead of converting it straight into noise in one step? The goal is to enable the model to gradually learn how to reverse the corruption. Small, incremental steps allow the model to learn the transition from noisy to less-noisy data, which helps it reconstruct the image step-by-step from pure noise.
To determine how much noise is added at each step, a noise schedule is used. For example, a linear schedule increases the per-step noise at a constant rate, whereas a cosine schedule adds noise more gradually, so useful image features survive for longer.
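To make the difference between schedules concrete, here is a hedged sketch that computes how much of the original signal remains at each step (often written as "alpha bar") under each one. The linear range of 1e-4 to 0.02 and the cosine formula with offset `s=0.008` follow values commonly cited in the diffusion literature; they are assumptions for this example, not something specified in this article.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal remaining after each step under a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008):
    """Fraction of signal remaining after each step under a cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    # Recover per-step betas and clip them so the last steps stay numerically stable.
    betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
    return np.cumprod(1.0 - betas)

linear = linear_alpha_bar(1000)
cosine = cosine_alpha_bar(1000)

# Compare how much signal is left halfway through the chain.
print("signal remaining at step 500 (linear):", linear[499])
print("signal remaining at step 500 (cosine):", cosine[499])
```

Under these assumed settings, the cosine schedule keeps noticeably more of the image at the midpoint, which matches the intuition that it preserves features for longer.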
That’s a quick summary of the forward process. Let’s learn about the reverse process.
// Reverse Process
The reverse process is what turns the model into a generator: it learns to turn noise back into image data. Through many small iterative steps, the model can produce images that did not previously exist.
In general, the reverse process is the inverse of the forward process:
- Begin with pure noise: an entirely random image composed of Gaussian noise
- Iteratively remove noise using a trained model that approximates the reverse of each forward step. At each step, the model takes the current noisy image and the corresponding timestep as input and predicts how to reduce the noise, based on what it learned during training
- Step-by-step, the image becomes progressively clearer, resulting in the final image data
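The loop above can be sketched as follows, using the update rule from the DDPM formulation. Since we have no trained network here, the example substitutes a hypothetical `oracle_predict_noise` that already knows the single target image; a real system would use a trained neural network in its place. The names `sample`, `target`, and the `betas` values are assumptions for illustration.

```python
import numpy as np

def sample(predict_noise, shape, betas, seed=0):
    """Run the reverse process: start from noise and denoise one step at a time."""
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # begin with pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)  # model's guess of the noise in x
        # Remove the predicted noise contribution for this step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a small amount of fresh noise.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

betas = np.linspace(1e-4, 0.02, 1000)  # assumed noise schedule
alpha_bars = np.cumprod(1.0 - betas)
target = np.linspace(-1.0, 1.0, 64).reshape(8, 8)  # toy "dataset" of one image

def oracle_predict_noise(x_t, t):
    """Stand-in for a trained network: computes the exact noise relative to target."""
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

generated = sample(oracle_predict_noise, target.shape, betas)
print("max difference from target:", np.abs(generated - target).max())
```

Because the stand-in predictor is perfect for a one-image dataset, the loop recovers that image from random noise. A trained model generalizes this idea across a whole dataset, which is why its outputs are new images rather than copies.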
This reverse process requires a model trained to denoise noisy images. Diffusion models often employ a neural network architecture such as a U-Net, a convolutional encoder-decoder network with skip connections between matching encoder and decoder layers. During training, the model learns to predict the noise added during the forward process. The timestep is also given as an input, allowing the model to adjust its predictions to the level of noise.
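One common way to hand the timestep to the network is to turn the integer step into a vector first. The sketch below uses a sinusoidal embedding, a technique borrowed from Transformer position encodings; it is shown here as an assumption about how the timestep could be encoded, and the function name and dimensions are hypothetical.

```python
import numpy as np

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Map an integer timestep to a vector of sines and cosines at several frequencies."""
    half = dim // 2
    # Frequencies fall off geometrically from 1 toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

early = timestep_embedding(10)
late = timestep_embedding(500)
print("embedding for step 10: ", np.round(early, 3))
print("embedding for step 500:", np.round(late, 3))
```

Each timestep gets a distinct vector, which the network can combine with its image features to tell a lightly corrupted input from a heavily corrupted one.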
The model is typically trained using a loss function such as mean squared error (MSE), which measures the difference between the predicted and actual noise. By minimizing this loss across many examples, the model gradually becomes proficient at reversing the diffusion process.
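As a rough sketch of that objective, the code below picks a random timestep for each image, noises the image in a single jump using the closed-form shortcut from the DDPM formulation, and scores a predictor with MSE against the true noise. No real network is trained here: `zero_predictor` and `oracle` are hypothetical stand-ins that show what a poor and a perfect prediction look like, and the schedule values are assumed.

```python
import numpy as np

def noisy_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t: mix the clean image with freshly drawn noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_prediction_loss(predict_noise, x0_batch, alpha_bars, rng):
    """Mean squared error between predicted and actual noise over a batch."""
    losses = []
    for x0 in x0_batch:
        t = int(rng.integers(len(alpha_bars)))  # random timestep per image
        x_t, eps = noisy_sample(x0, t, alpha_bars, rng)
        eps_hat = predict_noise(x_t, t)
        losses.append(np.mean((eps_hat - eps) ** 2))
    return float(np.mean(losses))

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = np.linspace(-1.0, 1.0, 64).reshape(8, 8)
batch = np.stack([x0] * 16)  # toy batch: the same image repeated

def zero_predictor(x_t, t):
    """Untrained baseline that always predicts 'no noise'."""
    return np.zeros_like(x_t)

def oracle(x_t, t):
    """Perfect predictor for this one-image toy dataset."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

baseline_loss = noise_prediction_loss(zero_predictor, batch, alpha_bars, np.random.default_rng(0))
oracle_loss = noise_prediction_loss(oracle, batch, alpha_bars, np.random.default_rng(0))
print("loss of zero predictor:", baseline_loss)
print("loss of oracle:        ", oracle_loss)
```

Training a real model amounts to adjusting its weights so that this loss moves from the baseline value toward the oracle's.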
Compared to alternatives like GANs, diffusion models tend to train more stably. Instead of balancing a generator against a discriminator, the model minimizes a simple regression loss at each noise level, which makes training more reliable and easier to diagnose.
Once the model is fully trained, generating a new image follows the reverse process we have summarized above.
// Text Conditioning
Text-to-image products such as DALL-E and Midjourney guide the reverse process with text prompts, a technique referred to as text conditioning. By integrating natural language, the model produces a scene that matches the prompt rather than random visuals.
The process works by using a pre-trained text encoder, such as CLIP (Contrastive Language–Image Pre-training), which converts the text prompt into a vector embedding. This embedding is then fed into the diffusion model through a mechanism such as cross-attention, which lets the model focus on specific parts of the text while it denoises. At each step of the reverse process, the model looks at both the current image state and the text prompt, and cross-attention aligns the image with the semantics of the prompt.
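To illustrate the idea, here is a minimal single-head cross-attention sketch in NumPy. Image features act as queries and text embeddings act as keys and values. All shapes, the random weight matrices, and the function names are assumptions for this example; real models learn these weights and typically use many attention heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_embeds, W_q, W_k, W_v):
    """Each image location attends over the text tokens."""
    Q = image_feats @ W_q  # queries come from the image
    K = text_embeds @ W_k  # keys come from the text
    V = text_embeds @ W_v  # values come from the text
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
n_pixels, n_tokens, d_img, d_txt, d = 16, 5, 32, 24, 8
image_feats = rng.standard_normal((n_pixels, d_img))  # stand-in for U-Net features
text_embeds = rng.standard_normal((n_tokens, d_txt))  # stand-in for text encoder output
W_q = rng.standard_normal((d_img, d))
W_k = rng.standard_normal((d_txt, d))
W_v = rng.standard_normal((d_txt, d))

out, weights = cross_attention(image_feats, text_embeds, W_q, W_k, W_v)
print("output shape:", out.shape)
print("attention over tokens for first pixel:", np.round(weights[0], 3))
```

Each row of `weights` sums to one and describes how strongly one image location draws on each text token, which is how the prompt steers what appears where.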
This is the core mechanism that allows DALL-E and Midjourney to generate images from prompts.
# How Do DALL-E and Midjourney Differ?
 
Both products utilize diffusion models as their foundation but differ slightly in their technical applications.
For instance, DALL-E (as of DALL-E 2) employs a diffusion model guided by CLIP-based embeddings for text conditioning. In contrast, Midjourney uses its own proprietary diffusion model architecture, which reportedly includes a fine-tuned image decoder optimized for high realism.
Both are understood to rely on cross-attention, but their guidance styles differ. DALL-E emphasizes adhering to the prompt through classifier-free guidance, which blends the model's unconditioned and text-conditioned predictions. In contrast, Midjourney tends to prioritize stylistic interpretation, possibly employing a higher default guidance scale for classifier-free guidance.
DALL-E and Midjourney also differ in how they handle prompt length and complexity. DALL-E can manage longer prompts by processing them before they enter the diffusion pipeline, while Midjourney tends to perform better with concise prompts.
There are more differences, but these are the ones most relevant to the underlying diffusion models.
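Classifier-free guidance itself comes down to a one-line combination of two noise predictions. The sketch below shows the standard formula; the function name and the toy numbers are assumptions, and the actual guidance scales used by DALL-E or Midjourney are not public details we can confirm.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Push the prediction away from the unconditioned one, toward the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions for two pixels.
eps_uncond = np.array([0.0, 1.0])  # prediction without the prompt
eps_cond = np.array([1.0, 3.0])    # prediction with the prompt

print("scale 0:", classifier_free_guidance(eps_uncond, eps_cond, 0.0))
print("scale 1:", classifier_free_guidance(eps_uncond, eps_cond, 1.0))
print("scale 3:", classifier_free_guidance(eps_uncond, eps_cond, 3.0))
```

A scale of 0 ignores the prompt, a scale of 1 uses the text-conditioned prediction as is, and larger values exaggerate the prompt's influence, usually at some cost to variety.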
# Conclusion
 
Diffusion models have become a foundation of modern text-to-image systems such as DALL-E and Midjourney. By utilizing the foundational processes of forward and reverse diffusion, these models can generate entirely new images from randomness. Additionally, these models can use natural language to guide the results through mechanisms such as text conditioning and cross-attention.
I hope this has helped!
 
 
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.