Let us illustrate this by deriving the cost function for maximum
likelihood (ML) for a general supervised learning task. Again, we
suppose we have the dataset \(\mathcal{D}\) defined above. We assume
the data points are independent and identically distributed (i.i.d.):
each point is drawn from the same distribution, independently of every
other point. In ML learning, we seek to maximise the likelihood of the
data, that is, the probability of the data given the parameters,
\(p(\mathcal{D}|\theta)\). Because the data are i.i.d., we can write
this as
$$
\begin{align}
p(\mathcal{D}|\theta)
&= \prod_{n=1}^N p(x^{(n)}, y^{(n)}|\theta)\\
&= \prod_{n=1}^N p(y^{(n)}|x^{(n)}, \theta) p(x^{(n)}|\theta).
\end{align}
$$
Now, since we only care about modelling \(p(y^{(n)}|x^{(n)}, \theta)\),
we don't need to model \(p(x^{(n)}|\theta)\), i.e. we don't care how
the input data is distributed, only how the output data is distributed
*given the input data*. Therefore, we can assume the inputs do not
depend on the parameters and write \(p(x^{(n)}|\theta) = p(x^{(n)})\).
This means that
$$
\begin{align}
p(\mathcal{D}|\theta) &\propto \prod_{n=1}^N p(y^{(n)}|x^{(n)}, \theta)
= l(\theta).
\end{align}
$$
Maximising the likelihood is equivalent to maximising its logarithm,
because log is a strictly monotonically increasing function: the
parameters that maximise the log of the likelihood are also the
parameters that maximise the likelihood. So, by taking logs we get
$$
\begin{align}
L(\theta) &= \sum_{n=1}^N \log p(y^{(n)}|x^{(n)}, \theta)
\end{align}
$$
and by making different modelling assumptions for the parametric family
of distributions that \(p(y^{(n)}|x^{(n)}, \theta)\) is modelled by, we get
different well-known supervised models (e.g. logistic regression,
linear regression, neural networks, etc.).
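To make this concrete, here is a minimal sketch (with made-up data and a hypothetical `log_likelihood` helper) of the fact that assuming a Gaussian \(p(y^{(n)}|x^{(n)}, \theta)\) with fixed noise variance turns maximising \(L(\theta)\) into ordinary least-squares linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = theta^T x + Gaussian noise.
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=200)

def log_likelihood(theta, X, y, sigma2=0.01):
    """L(theta) = sum_n log N(y_n | theta^T x_n, sigma2)."""
    resid = y - X @ theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

# With a fixed Gaussian noise variance, L(theta) is, up to a constant,
# minus the sum of squared errors, so the ML solution is exactly the
# least-squares solution.
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Swapping the Gaussian for a Bernoulli likelihood in the same recipe gives logistic regression instead.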

Another reason this can be tough is that each parametric model chosen
to model a probability density has its own *inductive bias*: the set
of assumptions the model inherently makes, which make it hard for the
model to fit data that does not satisfy them.
For example, imagine trying
to model a uniform distribution with a Gaussian. This would always
be a bad fit. For one, the Gaussian has
infinite support (the set on which the density is
non-zero), whereas the uniform distribution has finite support. Also,
the Gaussian has a single mode and decays exponentially in the squared
distance away from it, whereas the uniform density takes only two
distinct values: 1 on its support and 0 everywhere else. The best-fit
Gaussian can be found by minimising the KL divergence (I'll probably do
another post on this at some point), and the derivation below shows how
this is done.
So, to model arbitrary data, we need to have
models that are flexible enough so as to not make very stringent
assumptions. One common assumption made by many models is that the
density is *smooth* everywhere i.e. the gradient at any point
has some maximum magnitude. This is not always true (the uniform
distribution is again an example where this is not the case, as the
gradient at 0 and 1 does not even exist), but we
seem to believe that it is the case for many densities.

To derive the best-fit Gaussian to a uniform distribution, we can
minimise the KL divergence between the two distributions. Let \(p\) be
the probability density function of the uniform distribution and \(q\)
be the probability density function of the Gaussian. Then we have
$$
\begin{align}
\text{KL}(p || q) &= \int p(x) \log \frac{p(x)}{q(x)} dx
\\
&= \int_{0}^{1} \log \frac{1}{q(x)} dx
\\
&= \int_{0}^{1} \frac{1}{2} \log 2\pi \sigma^2 dx +
\int_{0}^{1} \frac{1}{2\sigma^2} (x - \mu)^2dx
\\
&= \frac{1}{2} \log 2\pi \sigma^2 +
\frac{1}{2\sigma^2} \int_{0}^{1} x^2 - 2\mu x + \mu^2 dx
\\
&= \frac{1}{2} \log 2\pi \sigma^2 +
\frac{1}{2\sigma^2} \left[\frac{x^3}{3} - \mu x^2 + \mu^2 x \right]\Big|_0^1
\\
&= \frac{1}{2} \log 2\pi \sigma^2 +
\frac{1}{2\sigma^2} \left[\frac{1}{3} - \mu + \mu^2 \right].
\end{align}
$$
Now, to minimise this KL divergence, we take the derivatives with
respect to the parameters (\(\mu, \sigma^2\)), set them to zero and
solve for the parameters, as such
$$
\begin{align}
\frac{\partial \text{KL}(p || q)}{\partial \mu} &=
\frac{1}{2\sigma^2_*}\left(2\mu_* - 1\right) = 0,\\
\frac{\partial \text{KL}(p || q)}{\partial \sigma^2} &=
\frac{1}{2\sigma^{2}_*} - \frac{1}{2(\sigma^{2}_*)^2}
\left[\frac{1}{3} - \mu_* + \mu_*^2\right] = 0.
\end{align}
$$
$$
Solving for the parameters, we get that \(\mu_* = \frac{1}{2}\) and
\(\sigma^2_* = \frac{1}{12}\) i.e. the mean is equal to the mean of the
uniform and the variance is equal to the variance of the uniform. No
surprise there.
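As a sanity check, we can plug the closed-form KL expression derived above into a few lines of NumPy (a hypothetical `kl_uniform_gaussian` helper) and confirm that perturbing either parameter away from \(\mu_* = \frac{1}{2}\), \(\sigma^2_* = \frac{1}{12}\) only increases the divergence.

```python
import numpy as np

def kl_uniform_gaussian(mu, sigma2):
    # Closed form derived above: KL(p || q) for p = Uniform(0, 1)
    # and q = N(mu, sigma2).
    return 0.5 * np.log(2 * np.pi * sigma2) + (1/3 - mu + mu**2) / (2 * sigma2)

mu_star, sigma2_star = 0.5, 1/12
best = kl_uniform_gaussian(mu_star, sigma2_star)

# Moving either parameter away from the optimum increases the KL.
worse = [
    kl_uniform_gaussian(mu_star - 0.1, sigma2_star),
    kl_uniform_gaussian(mu_star + 0.1, sigma2_star),
    kl_uniform_gaussian(mu_star, sigma2_star - 0.02),
    kl_uniform_gaussian(mu_star, sigma2_star + 0.02),
]
```

Note that `best` is strictly positive: even the optimal Gaussian cannot match the uniform exactly, which is the inductive-bias point from earlier.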

It is also required that the
model gives a valid density, i.e. it must integrate to one and it must
be non-negative everywhere. This means that models like the Restricted
Boltzmann Machine (RBM) and other undirected graphical models are not
easy to use, because the *partition function* needs to be
calculated, and in a lot of cases this is simply intractable.
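To see why the partition function matters, here is a small hypothetical sketch in one dimension (with a made-up double-well energy), where normalising an unnormalised density is still tractable by numerical integration; for RBMs the analogous sum ranges over exponentially many configurations, which is what makes it intractable.

```python
import numpy as np

def energy(x):
    # A made-up one-dimensional double-well energy, defining the
    # unnormalised density exp(-E(x)).
    return x**4 - 2 * x**2

# In 1D the partition function Z is a simple numerical integral
# (here a Riemann sum); in high dimensions this is what blows up.
xs = np.linspace(-5.0, 5.0, 200_001)
dx = xs[1] - xs[0]
unnorm = np.exp(-energy(xs))
Z = np.sum(unnorm) * dx

# Only after dividing by Z do we get a valid density that
# integrates to one.
density = unnorm / Z
total = np.sum(density) * dx
```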

Other models include Mixture of Factor Analysers (MoFAs), Mixture of
Bernoullis (MoBs) which are for multivariate binary data and more recent
models like the Neural Autoregressive Distribution Estimator
(NADE) and its variants (which my supervisor Iain Murray
was involved in developing). There are actually plenty of models
out there, and the purpose of this post is not to list them all
(because that would be incredibly boring for me, and I want to keep this
informative, yet fun for me as well). If you are interested in knowing
more about the models that are available, you can send me an email and
I'll send you some papers to read.

The next thing to note is that we can't actually evaluate the
expectation (since that would require knowing \(p(x)\), which is
exactly what we are trying to model, so you see the problem, right?).
Instead, we make a Monte Carlo approximation to this expectation using
the samples we have from the distribution, i.e.
$$
\mathbb{E}_p\left[\log {q_\theta(x)}\right] \approx
\frac{1}{N} \sum_{n=1}^N \log q_\theta(x^{(n)}).
$$
Therefore, to minimise the KL divergence (and thus fit the model), we
can instead maximise the average log likelihood (the expectation under
the empirical distribution) of the model with respect to the parameters.
This is a much easier problem, because we can use standard gradient
descent procedures (assuming we can get gradients from our model).
Therefore, we have that
$$
\arg\min_\theta \text{KL}(p || q_\theta) \approx
\arg\max_\theta \frac{1}{N} \sum_{n=1}^N \log q_\theta(x^{(n)}),
$$
so to fit a density estimation model to some data set, we just need to
maximise the likelihood that the data set would be generated by the
model.
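A small sketch of this in action, using the uniform/Gaussian example from above (with a hypothetical `avg_log_lik` helper): draw samples from \(p\), form the Monte Carlo average of \(\log q_\theta\), and check that the ML parameters (the sample mean and variance, for a Gaussian) beat an arbitrary alternative setting.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.uniform(0.0, 1.0, size=100_000)  # draws from p = Uniform(0, 1)

def avg_log_lik(mu, sigma2, x):
    # Monte Carlo estimate of E_p[log q_theta(x)] for a Gaussian q_theta.
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2)
                   - (x - mu)**2 / (2 * sigma2))

# For a Gaussian, the ML parameters are the sample mean and variance;
# they maximise the average log likelihood over all (mu, sigma2).
mu_hat, sigma2_hat = samples.mean(), samples.var()
```

Note that `mu_hat` and `sigma2_hat` land close to \(\frac{1}{2}\) and \(\frac{1}{12}\), matching the KL-minimising Gaussian derived earlier, which is exactly what the approximation above says should happen.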

Another thing worth mentioning is that the KL divergence is always
non-negative (or you should know this once you read my
soon-to-come-out post).
If we take this into account and re-arrange the original equation, we
can see
that our average log likelihood can never be greater than the negative
entropy of \(p\).
This can sometimes be useful for debugging: fit the model to
data from a distribution whose entropy is known (e.g. a
Gaussian) and check whether the log likelihood gets close to the
negative entropy (assuming the model has the capacity to represent the
distribution). Note that because we only
have an approximation to the expected log likelihood, the estimate may
actually come out slightly higher than the negative entropy, but with
enough data it shouldn't exceed it by much.
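Here is a hedged sketch of that debugging trick (made-up data, standard NumPy only) for data from a known Gaussian, whose differential entropy has the closed form \(H = \frac{1}{2}\log(2\pi e \sigma^2)\): the fitted average log likelihood should sit right at the negative entropy, up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0
x = rng.normal(0.0, sigma, size=50_000)  # data from a known Gaussian

# Differential entropy of N(0, sigma^2) is 0.5 * log(2*pi*e*sigma^2),
# so the average log likelihood is bounded above by its negative.
neg_entropy = -0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Fit a Gaussian by maximum likelihood and compute the average
# log likelihood of the data under the fit.
mu_hat, var_hat = x.mean(), x.var()
avg_ll = np.mean(-0.5 * np.log(2 * np.pi * var_hat)
                 - (x - mu_hat)**2 / (2 * var_hat))
```

As the post notes, `avg_ll` can land slightly above `neg_entropy` because the sample variance fluctuates around the true one, but the gap shrinks as the sample grows.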

Last edited 02/04/2016