This should be a short post (but I'm quite bad at estimating these
things, so it could be huge, who knows?). I just wanted to show the
derivation of the differential entropy of a Multivariate
Gaussian distribution. The reason for this is two-fold: first, it
comes in handy when I'm debugging density estimation models, because
I can look at the average log likelihood they report and compare
it against the (negative) entropy to see whether the model works, as explained in my previous post. The second reason is
that there are a few linear algebra identities (or at least one) that
are used to make the derivation easier and they come up all over the
place in machine learning, so seeing an example of them in use is a
great way to become more familiar with them.
Differential Entropy?
So, first off, we need to define what the differential entropy
is.
The entropy of a distribution \(p(x)\) over a random variable
\(x\in \mathbb{X}\), where \(\mathbb{X}\) is a discrete space,
is basically defined as
$$
H(p) = \mathbb{E}_p\left[\log \frac{1}{p(x)}\right]
= \sum_{x} p(x)\log\frac{1}{p(x)}
$$
and it is basically the expected surprise that we get when we
see elements that have been sampled from the distribution. The
quantity \(-\log p(x)\) is defined as the information content of
the value \(x\), and this makes sense. Values that have high probability
are less informative than values that have low probability. For example,
suppose I tell you that tomorrow I will wake up in the morning, you
won't really be surprised because most people do that every day, so
with high probability I will also do that. However, suppose I then tell
you that after I wake up I will take a plane to Brazil, go to the Amazon
rainforest, search for a tribe that has never seen an outsider before
and live with them for a year... well, that's not that likely, so you
would be really surprised.
If we use logarithms with base two and we assume that each value that
\(x\) takes is from an alphabet of some size, i.e. each value is a
symbol, then the entropy corresponds to the smallest average number of
bits per symbol that are required to represent a string of symbols from
the alphabet.
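To make that concrete, here is a minimal sketch in Python (the
four-symbol distribution is just a made-up example of mine) that
computes the entropy in bits, both by hand and with SciPy:
```python
import numpy as np
from scipy.stats import entropy

# A made-up distribution over a four-symbol alphabet.
p = np.array([0.5, 0.25, 0.125, 0.125])

# Entropy in bits: sum_x p(x) * log2(1 / p(x)).
H_manual = np.sum(p * np.log2(1.0 / p))
H_scipy = entropy(p, base=2)

print(H_manual, H_scipy)  # both give 1.75 bits per symbol
```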
Anyway, there are lots of great resources
online and in books that you can read, so I won't go over the details of
that here.
So what is the differential entropy then? Well, as you may have noticed,
the entropy above is basically a summation and each value that \(x\)
takes is discrete. In the differential entropy case, \(x\) can
take on continuous values and therefore we get an integral, i.e.
$$
H(p) = \mathbb{E}_p\left[\log \frac{1}{p(x)}\right]
= \int p(x)\log\frac{1}{p(x)} dx
$$
The differential entropy cannot be interpreted as the number of bits per
symbol required to represent some string of symbols. The reason is that
for any continuous random variable the number of bits required to
represent an arbitrary value is infinite: the real line (or any
continuous space) has an uncountably infinite number of values between
any two numbers, so there is no mapping from the natural numbers to them
(the definition of uncountably infinite, if I am not mistaken), and
therefore there is no mapping from any finite number of bits to them.
Anyway, that was a bit too theoretical and out of my area of expertise
as well. The basic thing to remember is that the differential entropy
is a measure of surprise, just as the entropy is. The higher the entropy of a
distribution, the more surprised we will be on average to see values
that are sampled from it.
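Since the differential entropy is just an expectation, we can
approximate it by Monte Carlo: sample from \(p\) and average
\(-\log p(x)\). A small sketch of that idea (the exponential
distribution here is an arbitrary choice of mine, with SciPy's own
entropy() as the reference):
```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)
dist = expon(scale=2.0)  # an arbitrary continuous distribution

# Monte Carlo estimate of H(p) = E_p[-log p(x)].
samples = dist.rvs(size=100_000, random_state=rng)
H_mc = -np.mean(dist.logpdf(samples))

print(H_mc, dist.entropy())  # both close to 1 + log(2), in nats
```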
Differential Entropy of a Gaussian
(kind of the point of this post...)
So, now that some introductions have been made, let's derive the
differential entropy of a Gaussian. Recall that the density of a
\(D\)-dimensional Gaussian with mean \(\mu\) and covariance \(\Sigma\) is
$$
p(x) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}
\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right),
$$
so that
$$
\log \frac{1}{p(x)} = \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu).
$$
Plugging this into the definition above gives
$$
\begin{align}
H(p) &= \int p(x) \log \frac{1}{p(x)} dx
\\
&= \int p(x) \left[\frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx
\\
&= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2} \int p(x) (x-\mu)^T\Sigma^{-1}(x-\mu)dx
\\
&\stackrel{(a)}{=} \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2} \mathbb{E}_p\left[
\text{Tr}\left[(x-\mu)^T\Sigma^{-1}(x-\mu)\right]\right]
\\
&\stackrel{(b)}{=} \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2} \mathbb{E}_p\left[
\text{Tr}\left[\Sigma^{-1}(x-\mu)(x-\mu)^T\right]\right]
\\
&\stackrel{(c)}{=} \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2}
\text{Tr}\left[\Sigma^{-1}\mathbb{E}_p\left[ (x-\mu)(x-\mu)^T \right]\right]
\\
&= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2}
\text{Tr}\left[\Sigma^{-1}\Sigma\right]
\\
&= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{1}{2}
\text{Tr}\left[I\right]
\\
&= \frac{D}{2}\log (2\pi) + \frac{1}{2}\log{|\Sigma|}
+ \frac{D}{2}
\\
&= \frac{D}{2}(\log (2\pi) + 1) + \frac{1}{2}\log{|\Sigma|}
\end{align}
$$
Here, \((a)\) is true because the trace (the sum of the diagonal
terms) of a scalar is the scalar itself and \((x-\mu)^T
\Sigma^{-1}(x-\mu)\) is a scalar. \((b)\) is true because
\(\text{Tr}(AB) = \text{Tr}(BA)\), where \(A = (x-\mu)^T\) and
\(B = \Sigma^{-1}(x-\mu)\) (the proof is given below). Finally, \((c)\)
is true because the expectation is a linear operator, so we can push it
inside the trace; the expectation that remains is, by definition, the
covariance matrix \(\Sigma\), which gives the next line.
Here we will show that \(\text{Tr}(AB) = \text{Tr}(BA)\) is true (it is
a standard linear algebra proof, but it is nice to have it here to keep
this page pretty self-contained). We start with the definition of the
trace of the \(D\times D\) matrix \(AB\), where \(A\) has dimension
\(D\times K\) and \(B\) has dimension \(K \times D\):
$$
\text{Tr}(AB) = \sum_{i=1}^D [AB]_{i i}.
$$
Now, what is \([AB]_{i i}\) though? Well, if we expand this we get
$$
\begin{align}
\text{Tr}(AB) &= \sum_{i=1}^D \sum_{j=1}^K A_{ij} B_{ji}
\\
&= \sum_{j=1}^K \sum_{i=1}^D B_{ji} A_{ij}
\\
&= \sum_{j=1}^K [ B A]_{jj}
\\
&= \text{Tr}(BA),
\end{align}
$$
which concludes the proof. This can actually be extended to
\(\text{Tr}(ABC) = \text{Tr}(CAB)\) and so on, and it is quite a useful
identity in machine learning, so getting used to at least recognising it
when it comes up can save time.
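As a quick sanity check of the identity, here is a throwaway NumPy
snippet (with random matrix shapes of my own choosing):
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))  # D x K
B = rng.standard_normal((5, 3))  # K x D

# Tr(AB) and Tr(BA) agree, even though AB is 3x3 and BA is 5x5.
print(np.trace(A @ B), np.trace(B @ A))
```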
As we can see, the entropy of a Gaussian distribution depends only on
the covariance matrix, not on the mean. This makes intuitive sense: if
I know the parameters of a Gaussian, I know that most samples will be
concentrated around the mean, no matter where the mean is. The
covariance matrix, however, gives us the ellipsoid around the mean that
most samples will fall in, and this is what determines how surprised I
will be by those samples.
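We can check the closed-form result numerically, and also confirm that
shifting the mean leaves the entropy unchanged. Below is a minimal
sketch (the particular mean and covariance are arbitrary examples of
mine; SciPy's entropy() and a Monte Carlo estimate of
\(\mathbb{E}_p[-\log p(x)]\) serve as references):
```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 3
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

# Closed form: D/2 * (log(2*pi) + 1) + 1/2 * log|Sigma|.
H_closed = 0.5 * D * (np.log(2 * np.pi) + 1) + 0.5 * np.linalg.slogdet(Sigma)[1]

dist = multivariate_normal(mean=mu, cov=Sigma)
H_scipy = dist.entropy()

# Monte Carlo estimate of E_p[-log p(x)].
samples = dist.rvs(size=200_000, random_state=rng)
H_mc = -np.mean(dist.logpdf(samples))

# Shifting the mean does not change the entropy.
H_shifted = multivariate_normal(mean=mu + 10.0, cov=Sigma).entropy()

print(H_closed, H_scipy, H_mc, H_shifted)  # all approximately equal
```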
For instance, let's focus only on a univariate Gaussian, i.e. the
covariance is a scalar, the variance. In this case, if we take the limit
of the variance going to infinity, what we are basically saying is that
any number (or, to be better defined, any \(\epsilon\)-wide bucket) is
pretty much equally likely, so it would be hard for me to predict which
bucket the next sample would fall in. Therefore, we have a high entropy.
On the other hand, taking the limit of the variance to zero, most
samples will be essentially at the mean, so our surprise is very low. If
you were to ask me what the next sample from such a distribution would
be, I would basically say the mean plus or minus some delta and I would
always be correct. This would be an example of a low entropy
distribution.
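The univariate case is just the formula above with \(D = 1\), i.e.
\(H = \frac{1}{2}(\log(2\pi) + 1) + \frac{1}{2}\log\sigma^2\), which
grows with the variance. A quick illustration (the particular \(\sigma\)
values are arbitrary):
```python
import numpy as np
from scipy.stats import norm

# H = 0.5 * (log(2*pi) + 1) + 0.5 * log(sigma^2), the D = 1 case.
# Note: differential entropy can be negative, unlike discrete entropy.
for sigma in [0.1, 1.0, 10.0, 100.0]:
    H_closed = 0.5 * (np.log(2 * np.pi) + 1) + 0.5 * np.log(sigma**2)
    print(sigma, H_closed, norm(loc=0.0, scale=sigma).entropy())
```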
And that's pretty much it. As promised, a short post on the differential
entropy of a Multivariate Gaussian. As I mentioned in my
previous post, the
differential entropy can be useful for debugging density estimation
models: if we fit a model to data sampled from a known distribution,
its average log likelihood can never (in expectation) be greater than
the negative (differential) entropy of that data-generating
distribution, but if the model is good at modelling continuous data, it
should be able to get very close to it.
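As a toy illustration of that check (there is no real density model
here; I just compare the true Gaussian against a deliberately
mis-specified one, both of my own choosing):
```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
true_dist = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 2.0]])
data = true_dist.rvs(size=100_000, random_state=rng)

# Upper bound on the average log likelihood of any model on this data.
bound = -true_dist.entropy()

# A "perfect" model hits the bound; a mis-specified one falls below it.
perfect = np.mean(true_dist.logpdf(data))
wrong = np.mean(multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)).logpdf(data))

print(bound, perfect, wrong)  # perfect ≈ bound, wrong < bound
```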