# Relation of mutual information to divergence

Mutual information (MI) and Kullback-Leibler divergence (KL) are closely related because mutual information can be expressed as the KL divergence between the joint distribution and the product of the marginal distributions.

**Definition:** Let \(X\) and \(Y\) be two random variables with joint distribution \(p(x,y)\) and marginals \(p(x)\) and \(p(y)\). The mutual information \(I(X;Y)\) between \(X\) and \(Y\) is defined as

\[I(X; Y) = D_{\text{KL}}\big(p(x, y) \parallel p(x)p(y)\big)\]

The definition above measures how different the joint distribution is from what it would be if \(X\) and \(Y\) were independent.
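As a concrete check, the definition can be evaluated directly for discrete variables. The following is a minimal sketch; the 2x2 joint distribution is a made-up example, not taken from the article:

```python
import numpy as np

# Hypothetical 2x2 joint distribution of two binary variables (illustrative values).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)
p_indep = p_x * p_y                    # product of marginals p(x)p(y)

# I(X;Y) = D_KL(p(x,y) || p(x)p(y)), computed as a sum (in nats)
mi = np.sum(p_xy * np.log(p_xy / p_indep))
print(mi)  # ~0.193 nats: the joint differs from the product, so MI > 0
```

Here the mass concentrates on the diagonal, so the joint is far from the product of its marginals and the mutual information is positive.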

## KL Divergence Definition

KL divergence between two distributions \(P\) and \(Q\) is given by

\[D_{\text{KL}}(P\parallel Q) = \int_{\mathcal{X}} p(x) \log \left( \frac{p(x)}{q(x)} \right) dx\]

In the case of mutual information, \(q(x, y) = p(x)p(y)\), the product of the marginal distributions, so that

\[I(X; Y) = \int p(x, y) \log \left( \frac{p(x, y)}{p(x)p(y)} \right) \, dx \, dy\]

- \(p(x, y)\): the joint distribution of \(X\) and \(Y\), describing their dependence.
- \(p(x)p(y)\): the product of marginals, describing what their distribution would look like if \(X\) and \(Y\) were independent.

The KL divergence \(D_{\text{KL}}(p(x, y) \parallel p(x)p(y))\) quantifies how much the joint distribution \(p(x, y)\) differs from the case where \(X\) and \(Y\) are independent (i.e., \(p(x)p(y)\)). For dependent variables, \(p(x, y) \neq p(x)p(y)\), so the KL divergence is non-zero, and mutual information is positive, indicating that there is shared information (i.e., dependence) between the variables.
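Conversely, when the joint factorizes exactly, the divergence vanishes. A minimal sketch with made-up marginals:

```python
import numpy as np

# Hypothetical marginals; construct a joint where X and Y are independent by design.
p_x = np.array([0.3, 0.7])
p_y = np.array([0.6, 0.4])
p_xy = np.outer(p_x, p_y)  # p(x, y) = p(x)p(y) exactly

# KL divergence of the joint from the product of marginals
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mi)  # 0.0 -- independence means zero divergence, zero mutual information
```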

## Why mutual information estimation uses divergences

In the context of MINE (or any mutual information estimator), the relationship between mutual information and KL divergence lets us estimate how much knowing one variable \(X\) reduces uncertainty about the other, \(Y\). The Donsker-Varadhan representation used in MINE maximizes the gap between an expectation under the joint distribution and a log-expectation under the product of the marginals; this gap is a lower bound on the KL divergence, and hence on the mutual information.
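As a hedged illustration of the Donsker-Varadhan representation \(I(X;Y) = \sup_T \, \mathbb{E}_{p(x,y)}[T] - \log \mathbb{E}_{p(x)p(y)}[e^{T}]\), the sketch below evaluates the bound with exact sums over a small made-up discrete joint. MINE instead parameterizes the critic \(T\) with a neural network and maximizes this bound by gradient ascent on samples:

```python
import numpy as np

# Hypothetical 2x2 joint distribution (illustrative values) and product of marginals.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_indep = np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))

def dv_bound(T):
    """Donsker-Varadhan lower bound: E_joint[T] - log E_marginals[exp(T)]."""
    return np.sum(p_xy * T) - np.log(np.sum(p_indep * np.exp(T)))

mi_exact = np.sum(p_xy * np.log(p_xy / p_indep))

# The optimal critic T*(x,y) = log[p(x,y) / (p(x)p(y))] attains the supremum:
T_star = np.log(p_xy / p_indep)
print(dv_bound(T_star))        # equals mi_exact (~0.193 nats)

# Any other critic yields a smaller value, i.e., a strict lower bound:
print(dv_bound(0.5 * T_star))  # ~0.149 < mi_exact
```

The key point is that every critic \(T\) gives a valid lower bound on the mutual information, so maximizing over critics tightens the estimate toward the true value.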

**Mutual information** measures how much two variables share information. It does this by using the **KL divergence** to compare the actual joint distribution \(p(x, y)\) to the case where \(X\) and \(Y\) are independent (i.e., \(p(x)p(y)\)).

**KL divergence** provides the core mathematical framework to quantify how far the joint distribution is from independence, which is the basis for mutual information estimation.
