Relation of mutual information to divergence

Mutual information and divergences

Mutual information (MI) and the Kullback-Leibler (KL) divergence are closely related: mutual information can be expressed as the KL divergence between the joint distribution and the product of the marginal distributions.

Definition: Let \(X\) and \(Y\) be two random variables with joint distribution \(p(x, y)\) and marginals \(p(x)\) and \(p(y)\). The mutual information \(I(X; Y)\) between \(X\) and \(Y\) is defined as

\[I(X; Y) = D_{\text{KL}}(p(x, y) \parallel p(x)p(y))\]

The definition above measures how different the joint distribution is from what it would be if \(X\) and \(Y\) were independent.
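
As a minimal numeric illustration of this identity (the 2x2 joint distribution below is made up for the example), mutual information can be computed directly as the KL divergence between a discrete joint and the product of its marginals:

```python
import numpy as np

# A made-up 2x2 joint distribution for two dependent binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) )
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mi)  # ~0.193 nats, positive because X and Y are dependent
```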

KL Divergence Definition

The KL divergence between two distributions \(P\) and \(Q\), with densities \(p(x)\) and \(q(x)\), is given by

\[D_{\text{KL}}(P\parallel Q) = \int_{\mathcal{X}} p(x) \log \left( \frac{p(x)}{q(x)} \right) dx\]
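
A direct discrete counterpart of this integral is the sum \(\sum_x p(x) \log(p(x)/q(x))\). The sketch below implements it, assuming strictly positive probability vectors (the helper name `kl_divergence` is just illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors.

    Assumes all entries of p and q are strictly positive.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # > 0: the distributions differ
```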

In the case of mutual information, the first distribution is the joint \(p(x, y)\) and the second is the product of the marginals, \(q(x, y) = p(x)p(y)\). Substituting these into the definition gives

\[I(X; Y) = \int p(x, y) \log \left( \frac{p(x, y)}{p(x)p(y)} \right) dx dy\]

\(p(x, y)\): The joint distribution of \(X\) and \(Y\), describing their dependence.

\(p(x)p(y)\): The product of marginals, describing what their distribution would look like if \(X\) and \(Y\) were independent.

The KL divergence \(D_{\text{KL}}(p(x, y) \parallel p(x)p(y))\) quantifies how much the joint distribution \(p(x, y)\) differs from the independent case \(p(x)p(y)\). If \(X\) and \(Y\) are independent, the two distributions coincide and the divergence, and hence the mutual information, is exactly zero. If they are dependent, \(p(x, y) \neq p(x)p(y)\), so the divergence is strictly positive and the mutual information is positive, indicating that the variables share information.
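
To make the independent vs. dependent contrast concrete, here is a small sketch (the two joint distributions and the helper `mutual_information` are made up for illustration) that evaluates the sum form of the integral above:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) as the sum of p(x,y) * log( p(x,y) / (p(x)p(y)) ) over the joint."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    return np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# Independent case: the joint factorizes into p(x)p(y), so I(X;Y) = 0.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
print(mutual_information(independent))   # ~0.0

# Dependent case: probability mass concentrated on the diagonal, so I(X;Y) > 0.
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])
print(mutual_information(dependent))     # ~0.37 nats
```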

Why mutual information estimation uses divergences

In the context of MINE (or any neural mutual information estimator), this relationship is what makes estimation tractable: estimating the mutual information tells us how much knowing one variable \(X\) reduces uncertainty about the other \(Y\). MINE relies on the Donsker-Varadhan representation of the KL divergence,

\[D_{\text{KL}}(P \parallel Q) = \sup_{T} \; \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\left[e^{T}\right],\]

where the supremum is taken over functions \(T\). With \(P = p(x, y)\) and \(Q = p(x)p(y)\), mutual information is estimated by parameterizing \(T\) with a neural network and maximizing the difference between the two expectations (one under the joint distribution and one under the product of the marginals).
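
Below is a minimal PyTorch sketch of this objective under toy assumptions: `StatisticsNetwork`, `mine_lower_bound`, and the training settings are illustrative rather than the original MINE implementation (in particular, the bias-corrected gradient from the paper is omitted), and the data are correlated 1-D Gaussians for which the true mutual information is \(-\tfrac{1}{2}\log(1-\rho^2)\).

```python
import math
import torch
import torch.nn as nn

# Illustrative statistics network T_theta(x, y); architecture and sizes are arbitrary.
class StatisticsNetwork(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def mine_lower_bound(T, x, y):
    """Donsker-Varadhan bound: E_p(x,y)[T] - log E_p(x)p(y)[exp(T)]."""
    t_joint = T(x, y).mean()                      # expectation under the joint
    y_shuffled = y[torch.randperm(y.size(0))]     # shuffled pairs approximate p(x)p(y)
    t_marginal = T(x, y_shuffled)
    log_mean_exp = torch.logsumexp(t_marginal, dim=0) - math.log(y.size(0))
    return (t_joint - log_mean_exp).squeeze()

# Toy data: correlated 1-D Gaussians with known MI = -0.5 * log(1 - rho^2).
rho, n = 0.8, 2000
x = torch.randn(n, 1)
y = rho * x + math.sqrt(1 - rho**2) * torch.randn(n, 1)

T = StatisticsNetwork(1, 1)
optimizer = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(2000):
    optimizer.zero_grad()
    loss = -mine_lower_bound(T, x, y)   # maximize the bound = minimize its negative
    loss.backward()
    optimizer.step()

print("MINE estimate:", mine_lower_bound(T, x, y).item())
print("True MI:      ", -0.5 * math.log(1 - rho**2))
```

Shuffling \(y\) within the batch is a common way to obtain approximate samples from the product of the marginals while the paired samples represent the joint distribution.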

Mutual information measures how much two variables share information. It does this by using the KL divergence to compare the actual joint distribution \(p(x, y)\) to the case where \(X\) and \(Y\) are independent (i.e., \(p(x)p(y)\)).

KL divergence provides the core mathematical framework to quantify how far the joint distribution is from independence, which is the basis for mutual information estimation.



