If we use logarithms with base two and we assume that each value that
\(x\) takes is a symbol from an alphabet of some finite size, then the
entropy corresponds to the smallest average number of bits per symbol
required to represent a string of symbols from the alphabet.
Anyway, there are lots of great resources
online and in books that you can read, so I won't go over the details of
that here.
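Still, just to make the bits-per-symbol reading concrete, here is a tiny
sketch in Python (the four-symbol distribution is just a made-up
example):

```python
import numpy as np

# A made-up distribution over a four-symbol alphabet.
p = np.array([0.5, 0.25, 0.125, 0.125])

# H(p) = sum_x p(x) * log2(1 / p(x)), measured in bits per symbol.
entropy_bits = np.sum(p * np.log2(1.0 / p))
print(entropy_bits)  # 1.75 bits per symbol
```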

So what is the differential entropy then? Well, as you may have noticed,
the entropy above is a summation over the discrete values that \(x\)
can take. In the differential entropy case, \(x\) takes on continuous
values, so the sum becomes an integral, i.e.
$$
H(p) = \mathbb{E}_p\left[\log \frac{1}{p(x)}\right]
= \int p(x)\log\frac{1}{p(x)} dx
$$
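Since the differential entropy is just an expectation, a quick way to
play with it is a Monte Carlo estimate: sample from \(p\) and average
\(-\log p(x)\). A small sketch with SciPy, using the standard normal
purely as an example whose entropy is known in closed form:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Sample from p and average -log p(x); this estimates H(p) in nats.
dist = stats.norm(loc=0.0, scale=1.0)
samples = dist.rvs(size=100_000, random_state=rng)
h_estimate = -np.mean(dist.logpdf(samples))

# Closed form for a univariate Gaussian: 0.5 * log(2 * pi * e * sigma^2).
h_exact = 0.5 * np.log(2 * np.pi * np.e)
print(h_estimate, h_exact)  # both roughly 1.4189
```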
The differential entropy cannot be interpreted as the number of bits per
symbol required to represent some string of symbols, because for a
continuous random variable the number of bits required to represent an
arbitrary value is infinite. The real line (or any continuous space) has
an uncountably infinite number of values between any two numbers, so
there is no mapping from the natural numbers to the set of possible
symbols (that is essentially the definition of uncountable, if I am not
mistaken), and therefore no mapping from any finite number of bits to
them.

Anyway, that was a bit too theoretical and out of my area of expertise
as well. The basic thing to remember is that the differential entropy
is a measure of surprise, just as the entropy is. The higher the entropy
of a distribution, the more surprised we will be, on average, by values
sampled from it.

Here we will show that \(\text{Tr}(AB) = \text{Tr}(BA)\) (it is a
standard linear algebra proof, but it is nice to have it here to keep
this page pretty self-contained). We start off with the definition of
the trace operator for the \(D\times D\) matrix \(AB\), where \(A\) has
dimension \(D\times K\) and \(B\) has dimension \(K \times D\).
$$
\text{Tr}(AB) = \sum_{i=1}^D [AB]_{i i}.
$$
Now, what is \([AB]_{i i}\) though? By the definition of matrix
multiplication, \([AB]_{ii} = \sum_{j=1}^K A_{ij} B_{ji}\), so if we
substitute this in we get
$$
\begin{align}
\text{Tr}(AB) &= \sum_{i=1}^D \sum_{j=1}^K A_{ij} B_{ji}
\\
&= \sum_{j=1}^K \sum_{i=1}^D B_{ji} A_{ij}
\\
&= \sum_{j=1}^K [ B A]_{jj}
\\
&= \text{Tr}(BA),
\end{align}
$$
which concludes the proof. This actually extends to the cyclic property
\(\text{Tr}(ABC) = \text{Tr}(CAB)\) and so on, and it is quite a useful
identity in machine learning, so getting used to at least recognising it
when it comes up can save time.
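If you don't feel like redoing the algebra, a quick numerical sanity
check with NumPy (random matrices with arbitrarily chosen shapes) is
enough to convince yourself:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 6
A = rng.standard_normal((D, K))
B = rng.standard_normal((K, D))
C = rng.standard_normal((D, D))  # square, so ABC and CAB are defined

# Tr(AB) == Tr(BA), even though AB is D x D while BA is K x K.
print(np.allclose(np.trace(A @ B), np.trace(B @ A)))  # True

# The cyclic extension: Tr(ABC) == Tr(CAB).
print(np.allclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True
```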

As we can see, the entropy of a Gaussian distribution depends only on
the covariance matrix, not on the mean. This is intuitive: if I know the
parameters of a Gaussian, I know that most samples will be concentrated
around the mean, no matter where that mean is. The covariance matrix,
however, gives us the ellipsoid around the mean within which most
samples will fall, and this is what determines how surprised I will be
by a typical sample.
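Concretely, the closed form is
\(H = \frac{1}{2}\ln\left((2\pi e)^D |\Sigma|\right)\), which contains
no \(\mu\) anywhere. A quick check with SciPy (the means and the
covariance below are arbitrary):

```python
import numpy as np
from scipy import stats

cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])

# Same covariance, wildly different means: identical entropies.
h1 = stats.multivariate_normal(mean=[0.0, 0.0], cov=cov).entropy()
h2 = stats.multivariate_normal(mean=[100.0, -50.0], cov=cov).entropy()
print(np.isclose(h1, h2))  # True

# Matches the closed form 0.5 * log((2*pi*e)^D * det(Sigma)).
D = 2
h_closed = 0.5 * np.log((2 * np.pi * np.e) ** D * np.linalg.det(cov))
print(np.isclose(h1, h_closed))  # True
```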

For instance, let's focus only on a univariate Gaussian, i.e. the
covariance is a scalar, the variance. If we take the limit of the
variance going to infinity, we are basically saying that any number (or,
to be better defined, any \(\epsilon\)-wide bucket) is pretty much
equally likely, so it would be hard for me to predict which bucket the
next sample will fall in. Therefore, we have high entropy. On the other
hand, taking the limit of the variance to zero, most samples will be
essentially at the mean, so our surprise is very low. If you were to ask
me what the next sample from such a distribution would be, I would
basically say the mean plus or minus some small delta and I would always
be correct. This would be an example of a low entropy distribution.
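In the univariate case the entropy is
\(H = \frac{1}{2}\ln(2\pi e\sigma^2)\), which grows without bound as the
variance grows and goes to \(-\infty\) as the variance shrinks
(differential entropy, unlike the discrete entropy, can be negative). A
tiny illustration:

```python
import numpy as np

# H = 0.5 * log(2 * pi * e * sigma^2) for a univariate Gaussian.
for var in [1e-6, 1e-2, 1.0, 1e2, 1e6]:
    h = 0.5 * np.log(2 * np.pi * np.e * var)
    print(f"variance = {var:g} -> entropy = {h:.3f} nats")
```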

And that's pretty much it. As promised, a short post on the differential
entropy of a Multivariate Gaussian. As I mentioned in my
previous post, the
differential entropy can be useful for debugging density estimation
models: the average log likelihood of a model on data drawn from some
distribution can never be greater than the negative (differential)
entropy of that data distribution, and if the model is good at modelling
the data, it should be able to get very close to that bound.
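As a sketch of that debugging trick (everything here is illustrative: I
generate data from a known Gaussian so the entropy of the data
distribution is available in closed form, and compare a well-fitted
model against a deliberately miscalibrated one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data from a known Gaussian, so the (differential) entropy of
# the data distribution is available in closed form.
true_dist = stats.norm(loc=3.0, scale=2.0)
x = true_dist.rvs(size=50_000, random_state=rng)

neg_entropy = -true_dist.entropy()

# A well-fitted model should sit very close to the bound...
good_model = stats.norm(loc=x.mean(), scale=x.std())
# ...while a model with a deliberately wrong scale falls clearly below.
bad_model = stats.norm(loc=x.mean(), scale=3.0 * x.std())

print("bound (negative entropy):", neg_entropy)
print("good model avg log lik: ", good_model.logpdf(x).mean())
print("bad model avg log lik:  ", bad_model.logpdf(x).mean())
```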

Last edited 09/04/2016