Machine Learning Review Notes
Jerry Xiao One

Pattern Recognition and Machine Learning (PRML) focuses on traditional machine learning algorithms and their probabilistic interpretations.

Clustering

K-means (Hard Clustering)

The objective (distortion measure) is
$$J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2,$$
where $r_{nk} \in \{0,1\}$ indicates whether data point $\mathbf{x}_n$ is assigned to cluster $k$, and $\boldsymbol{\mu}_k$ is the centroid of cluster $k$.
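A minimal NumPy sketch of the two alternating steps that minimize this objective (assign each point to its nearest centroid, then recompute the centroids); the function and variable names are illustrative, not from the original notes:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate hard assignments (update r_nk) and centroid updates (update mu_k)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]           # initialize centroids from the data
    for _ in range(n_iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # squared distances, shape (N, K)
        r = d2.argmin(axis=1)                                   # hard assignment of each point
        mu = np.stack([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])                      # new centroid = mean of assigned points
    return mu, r
```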

Probabilistic Clustering (Soft Clustering)

The EM updates for the two standard mixture models:

  • GMM:
    • Expectation (responsibilities): $\gamma(z_{nk}) = \dfrac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
    • Maximization: with $N_k = \sum_n \gamma(z_{nk})$, $\;\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})\,\mathbf{x}_n$, $\;\boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^\top$, $\;\pi_k = N_k/N$
  • Bernoulli mixture:
    • Expectation (responsibilities): $\gamma(z_{nk}) = \dfrac{\pi_k\,p(\mathbf{x}_n \mid \boldsymbol{\mu}_k)}{\sum_{j=1}^{K} \pi_j\,p(\mathbf{x}_n \mid \boldsymbol{\mu}_j)}$, where $p(\mathbf{x} \mid \boldsymbol{\mu}_k) = \prod_i \mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}$
    • Maximization: $\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})\,\mathbf{x}_n$, $\;\pi_k = N_k/N$


When a mean $\boldsymbol{\mu}_k$ collapses onto a specific data point, the corresponding variance $\sigma_k$ is driven to 0 and a singularity occurs. This leads to an unbounded value of the likelihood function.
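A sketch of one EM iteration for a GMM, assuming a data matrix `X` of shape (N, D) and a covariance array `Sigma` of shape (K, D, D); the small ridge added to each covariance is one common safeguard against the singularity just described (all names here are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, pi, mu, Sigma, ridge=1e-6):
    """One EM iteration for a Gaussian mixture model."""
    N, D = X.shape
    K = len(pi)
    # E-step: responsibilities gamma(z_nk) proportional to pi_k * N(x_n | mu_k, Sigma_k)
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi, mu, Sigma from the responsibilities
    Nk = gamma.sum(axis=0)
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        Sigma_new[k] += ridge * np.eye(D)   # keep a covariance from collapsing onto a single point
    return pi_new, mu_new, Sigma_new, gamma
```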

The connection between GMM and K-means

We can consider a GMM with covariances $\boldsymbol{\Sigma}_k = \epsilon\,\mathbf{I}$. Each component distribution will be
$$p(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi\epsilon)^{D/2}} \exp\!\left\{-\frac{1}{2\epsilon}\lVert \mathbf{x} - \boldsymbol{\mu}_k \rVert^2\right\}.$$

We see $\epsilon$ as a constant, then for a specific data point $\mathbf{x}_n$, the responsibility will be
$$\gamma(z_{nk}) = \frac{\pi_k \exp\{-\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 / 2\epsilon\}}{\sum_{j} \pi_j \exp\{-\lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 / 2\epsilon\}}.$$

If we consider the limit $\epsilon \to 0$, then the responsibility will be
$$\gamma(z_{nk}) \to r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise,} \end{cases}$$

which is the same as K-means.
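A quick numeric check (an assumed toy example, not from the notes) shows the responsibilities hardening into one-hot assignments as $\epsilon$ shrinks:

```python
import numpy as np

x = np.array([1.0, 0.5])                      # a single data point
mu = np.array([[0.0, 0.0], [1.2, 0.4]])       # two component means
pi = np.array([0.5, 0.5])                     # equal mixing coefficients

for eps in (1.0, 0.1, 0.01):
    d2 = ((x - mu) ** 2).sum(axis=1)          # squared distance to each mean
    w = pi * np.exp(-d2 / (2 * eps))
    gamma = w / w.sum()                       # responsibilities of the two components
    print(eps, gamma)                         # tends to (0, 1): the hard K-means assignment
```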

The generalization of the EM algorithm

Consider a probabilistic model where:

  • Observed variable: $\mathbf{X}$
  • Latent variable: $\mathbf{Z}$
  • Model parameters: $\boldsymbol{\theta}$

The goal is to maximize the likelihood function:
$$p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}).$$

Directly maximizing $p(\mathbf{X} \mid \boldsymbol{\theta})$ is more difficult than maximizing the complete-data joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, because the sum over $\mathbf{Z}$ sits inside the logarithm.

We can decompose the log-likelihood function as:
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p),$$

where:

  • $q(\mathbf{Z})$ is the distribution over the latent variables.
  • Lower bound: $\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \dfrac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$

  • KL divergence: $\mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \dfrac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}$

Since the KL divergence satisfies $\mathrm{KL}(q \,\|\, p) \geq 0$, it implies that $\mathcal{L}(q, \boldsymbol{\theta})$ serves as a lower bound for $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.
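The decomposition itself follows from the product rule $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})\, p(\mathbf{X} \mid \boldsymbol{\theta})$; substituting the two definitions above makes this explicit:

$$
\mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)
= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}
= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X} \mid \boldsymbol{\theta})
= \ln p(\mathbf{X} \mid \boldsymbol{\theta}).
$$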

Based on the above observations, the EM algorithm consists of two steps:

  1. E-step:

    • Keep $\boldsymbol{\theta}^{\text{old}}$ fixed.
    • Maximize $\mathcal{L}(q, \boldsymbol{\theta}^{\text{old}})$ with respect to $q(\mathbf{Z})$ by minimizing $\mathrm{KL}(q \,\|\, p)$.
    • At the maximum, $\mathrm{KL}(q \,\|\, p) = 0$, which implies $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$.
  2. M-step:

    • Maximize the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$, keeping $q(\mathbf{Z})$ fixed.
    • Since $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)$ and $\mathrm{KL}(q \,\|\, p) \geq 0$, any increase in $\mathcal{L}(q, \boldsymbol{\theta})$ ensures that $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$ also increases, as made explicit just after this list.
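Writing the two steps together makes the monotonic improvement explicit: with $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ fixed from the E-step,

$$
\ln p(\mathbf{X} \mid \boldsymbol{\theta}^{\text{new}})
= \mathcal{L}(q, \boldsymbol{\theta}^{\text{new}}) + \mathrm{KL}\!\left(q \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{new}})\right)
\geq \mathcal{L}(q, \boldsymbol{\theta}^{\text{new}})
\geq \mathcal{L}(q, \boldsymbol{\theta}^{\text{old}})
= \ln p(\mathbf{X} \mid \boldsymbol{\theta}^{\text{old}}).
$$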

Practical Reformulation

From the E-step, we use:
$$q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$$

and derive:
$$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) - \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}).$$

Let:
$$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}).$$

The remaining term is constant with respect to $\boldsymbol{\theta}$ (the entropy of $q$), so the M-step effectively maximizes $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$.

  • The EM algorithm alternates between estimating the latent-variable distribution ($q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$) and maximizing the lower bound with respect to the parameters ($\boldsymbol{\theta}$); a generic skeleton of this loop follows the list.
  • The non-negativity of the KL divergence guarantees that each iteration never decreases the likelihood.
  • The reformulated objective simplifies computation by focusing on the expected complete-data log-likelihood ($\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$).
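A generic Python skeleton of this alternation; `e_step`, `m_step`, and `log_likelihood` are hypothetical callbacks supplied by the user (the posterior over $\mathbf{Z}$, the maximizer of $\mathcal{Q}$, and $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$, respectively), so this is a sketch rather than a complete implementation:

```python
def em(theta, e_step, m_step, log_likelihood, tol=1e-6, max_iters=100):
    """Generic EM loop: alternate posterior estimation and Q maximization."""
    prev = log_likelihood(theta)
    for _ in range(max_iters):
        q = e_step(theta)              # E-step: q(Z) = p(Z | X, theta_old)
        theta = m_step(q)              # M-step: theta = argmax_theta Q(theta, theta_old)
        curr = log_likelihood(theta)
        if curr - prev < tol:          # the likelihood never decreases, so this terminates
            break
        prev = curr
    return theta
```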

Hidden Markov Models

Why Do We Need Hidden Markov Models (HMMs)?

The complexity of a Markov chain model is tied to how many previous observations each conditional probability depends on: the number of parameters grows exponentially as the order of the dependency increases.

To address this, Hidden Markov Models (HMMs) introduce a chain of latent states, which lets the observed sequence capture dependencies of effectively arbitrary order without being constrained by a low-order Markov assumption on the observations themselves, while requiring far fewer parameters.

  • Transition Probabilities: $A_{jk} = p(z_{n,k} = 1 \mid z_{n-1,j} = 1)$
    • Initial state probabilities: $\pi_k = p(z_{1k} = 1)$
  • Emission Probabilities: $p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi})$
    • Example with Gaussian emissions: $p(\mathbf{x}_n \mid z_{nk} = 1) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
  • Joint Probability of Observations and Hidden States (see the sketch after this list): $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = p(\mathbf{z}_1 \mid \boldsymbol{\pi}) \left[ \prod_{n=2}^{N} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A}) \right] \prod_{m=1}^{N} p(\mathbf{x}_m \mid \mathbf{z}_m, \boldsymbol{\phi})$
    • Thus, the parameters controlling the model can be represented as $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$.
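A small sketch of the complete-data log-probability implied by this factorization, assuming discrete emissions; `pi` (initial probabilities), `A` (transition matrix), `B` (emission matrix), and the index sequences `states`/`obs` are illustrative names:

```python
import numpy as np

def hmm_joint_log_prob(pi, A, B, states, obs):
    """log p(X, Z | theta) = log p(z_1) + sum_n log p(z_n | z_{n-1}) + sum_n log p(x_n | z_n)."""
    logp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])   # initial state and first emission
    for n in range(1, len(obs)):
        logp += np.log(A[states[n - 1], states[n]])               # transition term
        logp += np.log(B[states[n], obs[n]])                      # emission term
    return logp
```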

Three Key Problems for HMMs

  1. Learning (Parameter Estimation):

    • Goal: What is the most likely HMM model ($\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$) for a given observation sequence?
    • Solution: Use the Expectation-Maximization (EM) strategy.
  2. Evaluation (Likelihood Computation):

    • Goal: Given an HMM model, what is the likelihood $p(\mathbf{X} \mid \boldsymbol{\theta})$ of the observation sequence $\mathbf{X}$ being generated by this model?
    • Solution: Use the Forward-Backward Algorithm to efficiently compute probabilities.
  3. Decoding (Hidden State Inference):

    • Goal: Given an HMM model, what is the most likely sequence of latent states $\mathbf{Z}$ for an observation sequence $\mathbf{X}$?
    • Solution: Use the Viterbi Algorithm to determine the most probable hidden state path.

Expectation-Maximization (EM) for HMMs

  • E-step equations (a sketch of these updates follows this list): compute the posterior statistics
    • $\gamma(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ and $\xi(\mathbf{z}_{n-1}, \mathbf{z}_n) = p(\mathbf{z}_{n-1}, \mathbf{z}_n \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$, obtained from the forward-backward algorithm below.
  • M-step equations:
    • $\pi_k = \dfrac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}$, $\quad A_{jk} = \dfrac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}$
  • Gaussian emissions: $\boldsymbol{\mu}_k = \dfrac{\sum_n \gamma(z_{nk})\,\mathbf{x}_n}{\sum_n \gamma(z_{nk})}$, $\quad \boldsymbol{\Sigma}_k = \dfrac{\sum_n \gamma(z_{nk})(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top}{\sum_n \gamma(z_{nk})}$
  • Discrete emissions: $\mu_{ik} = \dfrac{\sum_n \gamma(z_{nk})\,x_{ni}}{\sum_n \gamma(z_{nk})}$
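A NumPy sketch of these updates for discrete emissions, assuming unscaled `alpha` and `beta` arrays of shape (N, K) from the forward-backward recursions in the next subsection (all names are illustrative):

```python
import numpy as np

def hmm_e_step(alpha, beta, A, B, obs):
    """Baum-Welch E-step: gamma[n, k] = p(z_n = k | X), xi[n-1, j, k] = p(z_{n-1} = j, z_n = k | X)."""
    px = alpha[-1].sum()                                   # p(X | theta) from the forward pass
    gamma = alpha * beta / px
    N, K = alpha.shape
    xi = np.empty((N - 1, K, K))
    for n in range(1, N):
        xi[n - 1] = (alpha[n - 1][:, None] * A
                     * B[:, obs[n]][None, :] * beta[n][None, :]) / px
    return gamma, xi

def hmm_m_step_transitions(gamma, xi):
    """Baum-Welch M-step for the initial-state and transition probabilities."""
    pi_new = gamma[0] / gamma[0].sum()
    counts = xi.sum(axis=0)                                # expected transition counts
    A_new = counts / counts.sum(axis=1, keepdims=True)
    return pi_new, A_new
```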

Forward-Backward Algorithm

  • Forward Algorithm (a NumPy transcription of both passes follows this list):
    • Initialization: $\alpha(\mathbf{z}_1) = p(\mathbf{x}_1 \mid \mathbf{z}_1)\, p(\mathbf{z}_1)$
    • Recursion: $\alpha(\mathbf{z}_n) = p(\mathbf{x}_n \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} \alpha(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})$
    • Termination: $p(\mathbf{X}) = \sum_{\mathbf{z}_N} \alpha(\mathbf{z}_N)$
  • Backward Algorithm:
    • Initialization: $\beta(\mathbf{z}_N) = 1$
    • Recursion: $\beta(\mathbf{z}_n) = \sum_{\mathbf{z}_{n+1}} \beta(\mathbf{z}_{n+1})\, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1})\, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n)$
    • Termination: $p(\mathbf{X}) = \sum_{\mathbf{z}_1} \beta(\mathbf{z}_1)\, p(\mathbf{x}_1 \mid \mathbf{z}_1)\, p(\mathbf{z}_1)$
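A direct (unscaled) NumPy transcription of these recursions for discrete emissions; in practice the messages are rescaled or kept in log space to avoid underflow, and `pi`, `A`, `B`, `obs` are illustrative names:

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[n, k] = p(x_1, ..., x_n, z_n = k)."""
    N, K = len(obs), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, obs[0]]                         # initialization
    for n in range(1, N):
        alpha[n] = B[:, obs[n]] * (alpha[n - 1] @ A)     # recursion
    return alpha                                         # termination: p(X) = alpha[-1].sum()

def backward(A, B, obs):
    """beta[n, k] = p(x_{n+1}, ..., x_N | z_n = k)."""
    N, K = len(obs), A.shape[0]
    beta = np.ones((N, K))                               # initialization: beta[N-1] = 1
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])   # recursion
    return beta
```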