Differential Entropy of a Multivariate Gaussian

This should be a short post (but I'm quite bad at estimating these things, so it could be huge, who knows?). I just want to show the derivation of the differential entropy of a multivariate Gaussian distribution. The reason for this is two-fold: first, it comes in handy when I'm debugging density estimation models, because I can compare the average log likelihood they report against the (negative) differential entropy to see whether the model works, as explained in my previous post. Second, there are a few linear algebra identities (or at least one) that make the derivation easier, and they come up all over the place in machine learning, so seeing an example of them in use is a great way to become more familiar with them.

Differential Entropy?

So, first off, we need to define what the differential entropy is. The entropy of a distribution \(p(x)\) over a random variable \(x\in \mathbb{X}\), where \(\mathbb{X}\) is a discrete space, is defined as $$ H(p) = \mathbb{E}_p\left[\log \frac{1}{p(x)}\right] = \sum_{x} p(x)\log\frac{1}{p(x)} $$ and it is basically the expected surprise that we get when we see values sampled from the distribution. The quantity \(-\log p(x)\) is defined as the information content of the value \(x\), and this makes sense: values that have high probability are less informative than values that have low probability. For example, suppose I tell you that tomorrow I will wake up in the morning; you won't really be surprised, because most people do that every day, so with high probability I will too. However, suppose I then tell you that after I wake up I will take a plane to Brazil, go to the Amazon rainforest, search for a tribe that has never seen an outsider before and live with them for a year... well, that's not very likely, so you would be really surprised.
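Just to make this concrete, here is a tiny numerical sketch of that expectation; the four-symbol distribution below is just something I made up for illustration.

```python
import numpy as np

# A made-up categorical distribution over four symbols (arbitrary example).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Information content (surprise) of each symbol: -log p(x).
# Using log base 2 so the units are bits.
surprise = -np.log2(p)

# Entropy is the expected surprise under p.
entropy = np.sum(p * surprise)
print(surprise)  # rarer symbols carry more information
print(entropy)   # 1.75 bits for this particular distribution
```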
If we use logarithms with base two and we assume that each value that \(x\) takes is a symbol from an alphabet of some size, then the entropy corresponds to the smallest average number of bits per symbol required to represent a string of symbols from that alphabet. Anyway, there are lots of great resources online and in books that cover this, so I won't go over the details here.
So what is the differential entropy then? Well, as you may have noticed, the entropy above is a summation and each value that \(x\) takes is discrete. In the differential entropy case, \(x\) can take on continuous values and therefore we get an integral, i.e. $$ H(p) = \mathbb{E}_p\left[\log \frac{1}{p(x)}\right] = \int p(x)\log\frac{1}{p(x)} dx $$ The differential entropy cannot be interpreted as the number of bits per symbol required to represent some string of symbols. The reason is that for a continuous random variable the number of bits required to represent an arbitrary value is infinite: the real line (or any continuous space) has an uncountably infinite number of values between any two numbers, so there is no mapping from the natural numbers to them (that's the definition of uncountably infinite, if I am not mistaken), and therefore no mapping from any finite number of bits to them.
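To make the integral form concrete, here is a small numerical sketch: I pick an arbitrary univariate Gaussian as \(p(x)\) (my choice, purely for illustration) and estimate \(\mathbb{E}_p[-\log p(x)]\) both by quadrature and by Monte Carlo; the two estimates should agree.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)
p = stats.norm(loc=0.0, scale=2.0)  # an arbitrary univariate density for illustration

# Differential entropy as an integral: -∫ p(x) log p(x) dx.
h_quad, _ = integrate.quad(lambda x: -p.pdf(x) * p.logpdf(x), -np.inf, np.inf)

# The same quantity as an expectation: E_p[-log p(x)], estimated by Monte Carlo.
samples = p.rvs(size=100_000, random_state=rng)
h_mc = -p.logpdf(samples).mean()

print(h_quad, h_mc)  # the two estimates should be close (up to Monte Carlo noise)
```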
Anyway, that was a bit too theoretical and out of my area of expertise as well. The basic thing to remember is that the differential entropy is a measure of surprise, just as the entropy is. The higher the entropy of a distribution, the more surprised we will be on average to see values sampled from it.

Differential Entropy of a Gaussian (kind of the point of this post...)

So, now that some introductions have been made, let's derive the differential entropy of a \(D\)-dimensional Gaussian \(p(x) = \mathcal{N}(x \mid \mu, \Sigma)\) with mean \(\mu\) and covariance matrix \(\Sigma\). $$ \begin{align} H(p) &= \int p(x) \log \frac{1}{p(x)} dx \\ &= \int p(x) \left[\frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx \\ &= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2} \int p(x) (x-\mu)^T\Sigma^{-1}(x-\mu)dx \\ &\stackrel{(a)}{=} \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2} \mathbb{E}_p\left[ \text{Tr}\left[(x-\mu)^T\Sigma^{-1}(x-\mu)\right]\right] \\ &\stackrel{(b)}{=} \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2} \mathbb{E}_p\left[ \text{Tr}\left[\Sigma^{-1}(x-\mu)(x-\mu)^T\right]\right] \\ &\stackrel{(c)}{=} \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2} \text{Tr}\left[\Sigma^{-1}\mathbb{E}_p\left[ (x-\mu)(x-\mu)^T \right]\right] \\ &= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2} \text{Tr}\left[\Sigma^{-1}\Sigma\right] \\ &= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{1}{2} \text{Tr}\left[I\right] \\ &= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|} + \frac{D}{2} \\ &= \frac{D}{2}(\log (2\pi) + 1) + \frac{1}{2}\log{|\Sigma|} \end{align} $$ Here, \((a)\) is true because the trace (the sum of the diagonal terms) of a scalar is the scalar itself and \((x-\mu)^T \Sigma^{-1}(x-\mu)\) is a scalar. \((b)\) is true because \(\text{Tr}(AB) = \text{Tr}(BA)\), with \(A = (x-\mu)^T\) and \(B = \Sigma^{-1}(x-\mu)\) (see the proof below). Finally, \((c)\) is true because the expectation is a linear operator, so we can push it inside the trace, and the last steps use \(\mathbb{E}_p\left[(x-\mu)(x-\mu)^T\right] = \Sigma\), which is just the definition of the covariance matrix.
Here we will show that \(\text{Tr}(AB) = \text{Tr}(BA)\) (it is a standard linear algebra proof, but it is nice to have it here to keep this page pretty self-contained). We start with the definition of the trace for the \(D\times D\) matrix \(AB\), where \(A\) has dimension \(D\times K\) and \(B\) has dimension \(K \times D\): $$ \text{Tr}(AB) = \sum_{i=1}^D [AB]_{i i}. $$ Now, what is \([AB]_{i i}\)? Expanding it we get $$ \begin{align} \text{Tr}(AB) &= \sum_{i=1}^D \sum_{j=1}^K A_{ij} B_{ji} \\ &= \sum_{j=1}^K \sum_{i=1}^D B_{ji} A_{ij} \\ &= \sum_{j=1}^K [ B A]_{jj} \\ &= \text{Tr}(BA), \end{align} $$ which concludes the proof. This actually extends cyclically, e.g. \(\text{Tr}(ABC) = \text{Tr}(CAB)\), and it is quite a useful identity in machine learning, so getting used to recognising it when it appears can save time.
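A quick numerical sanity check of the identity (and its cyclic extension) with some random matrices, the dimensions chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 6
A = rng.standard_normal((D, K))
B = rng.standard_normal((K, D))
C = rng.standard_normal((D, D))

# Tr(AB) = Tr(BA), even though AB is D x D while BA is K x K.
print(np.trace(A @ B), np.trace(B @ A))

# The cyclic extension: Tr(ABC) = Tr(CAB) = Tr(BCA).
print(np.trace(A @ B @ C), np.trace(C @ A @ B), np.trace(B @ C @ A))
```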
As we can see, the entropy of a Gaussian distribution depends only on the covariance matrix, not on its mean. This is intuitive: if I know the parameters of a Gaussian, I know that most samples will be concentrated around the mean, no matter where that mean is. The covariance matrix, however, determines the ellipsoid around the mean that most samples will fall in, and this is what determines how surprised I will be to see them.
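To convince ourselves, here is a quick numerical sketch (the covariance below is just an arbitrary positive definite matrix I made up) that checks the closed-form expression against scipy's built-in entropy and a Monte Carlo estimate, and that shows shifting the mean changes nothing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

D = 3
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)   # an arbitrary positive definite covariance
mu = np.array([1.0, -2.0, 0.5])   # an arbitrary mean

# Closed form derived above: D/2 * (log(2*pi) + 1) + 1/2 * log|Sigma|.
_, logdet = np.linalg.slogdet(Sigma)
h_closed = 0.5 * D * (np.log(2 * np.pi) + 1) + 0.5 * logdet

# Cross-check against scipy and against a Monte Carlo estimate of E_p[-log p(x)].
p = stats.multivariate_normal(mean=mu, cov=Sigma)
h_scipy = p.entropy()
h_mc = -p.logpdf(p.rvs(size=200_000, random_state=rng)).mean()
print(h_closed, h_scipy, h_mc)  # all three should agree (the last up to MC noise)

# Shifting the mean leaves the entropy unchanged: it only depends on Sigma.
p_shifted = stats.multivariate_normal(mean=mu + 1000.0, cov=Sigma)
print(p.entropy(), p_shifted.entropy())  # identical
```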
For instance, let's focus on a univariate Gaussian, i.e. the covariance is a scalar, the variance. In this case, if we take the limit of the variance going to infinity, what we are basically saying is that any number (or, to put it in better-defined terms, any \(\epsilon\)-wide bucket) is pretty much equally likely, so it would be hard for me to predict which bucket the next sample will fall in. Therefore, we have a high entropy. On the other hand, taking the limit of the variance to zero, most samples will be essentially at the mean, so our surprise is very low. If you were to ask me what the next sample from such a distribution would be, I would say the mean plus or minus some tiny delta and I would essentially always be correct. This would be an example of a low entropy distribution.
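Plugging the univariate case (\(D = 1\), \(|\Sigma| = \sigma^2\)) into the formula we derived, the entropy is \(\frac{1}{2}(\log(2\pi) + 1) + \log\sigma\), and a quick sketch shows how it behaves in the two limits:

```python
import numpy as np

# Univariate special case of the formula above:
# H = 0.5 * (log(2*pi) + 1) + 0.5 * log(sigma^2) = 0.5 * (log(2*pi) + 1) + log(sigma).
def gaussian_entropy(sigma):
    return 0.5 * (np.log(2 * np.pi) + 1) + np.log(sigma)

for sigma in [1e-3, 1e-1, 1.0, 10.0, 1e3]:
    print(sigma, gaussian_entropy(sigma))
# The entropy goes to -inf as sigma -> 0 and to +inf as sigma -> inf.
```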
And that's pretty much it. As promised, a short post on the differential entropy of a multivariate Gaussian. As I mentioned in my previous post, the differential entropy can be useful for debugging density estimation models: the average log likelihood a model achieves on data drawn from a known distribution can never (in expectation) exceed the negative differential entropy of that data distribution, but if the model is good at modelling continuous data, it should be able to get very close to it.
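For completeness, here is a rough sketch of that debugging recipe (not the exact code from my previous post): sample synthetic data from a known Gaussian, fit some density model to it (here just another Gaussian fit by maximum likelihood, standing in for whatever model you are actually debugging), and compare its average log likelihood on held-out data to the negative differential entropy of the data-generating distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Known data-generating distribution (synthetic debugging data); parameters are arbitrary.
true_mu = np.array([0.0, 3.0])
true_Sigma = np.array([[1.0, 0.8],
                       [0.8, 2.0]])
p_true = stats.multivariate_normal(mean=true_mu, cov=true_Sigma)

train = p_true.rvs(size=50_000, random_state=rng)
test = p_true.rvs(size=50_000, random_state=rng)

# Stand-in "density model": a Gaussian fit by maximum likelihood.
# (Any density estimator exposing an average log likelihood would do here.)
model = stats.multivariate_normal(mean=train.mean(axis=0),
                                  cov=np.cov(train, rowvar=False))

avg_loglik = model.logpdf(test).mean()
bound = -p_true.entropy()  # negative differential entropy of the data distribution

# The average log likelihood should approach, but not exceed, the bound.
print(avg_loglik, bound)
```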
Last edited 09/04/2016