Unsupervised Learning and Density Estimation

Before introducing density estimation, I would like to focus a bit on supervised learning, because I will use it to explain the difference between supervised methods and unsupervised methods like density estimation.

Supervised Learning

In supervised learning, the general setup is that we have some data in a \(D\)-dimensional space, e.g. \(x \in \mathbb{R}^D\), and we want to build a model that can make predictions for some target variable \(y \in \mathbb{R}^M\), where typically \(M \ll D\). We call it supervised learning when our data set consists of both the inputs and the outputs, i.e. we have some dataset \(\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^N\), where \(N\) is the number of data points in our data set. One way to do supervised learning is to model \(p(y|x, \theta)\) with some parametric model with parameters \(\theta\) and then to fit those parameters so as to minimise a cost function. Both maximum likelihood (ML) and maximum a posteriori (MAP) estimation can be formulated in this way.
Let us illustrate this by deriving the cost function for ML for a general supervised learning task. Again, suppose we have the dataset \(\mathcal{D}\) defined above. We assume that each point has been drawn from the same distribution as every other point, but independently of any other data point (known as i.i.d.). In ML learning, we seek to maximise the likelihood of the data, that is, the probability of the data given the parameters, \(p(\mathcal{D}|\theta)\). Because the data is i.i.d. we can write this as $$ \begin{align} p(\mathcal{D}|\theta) &= \prod_{n=1}^N p(x^{(n)}, y^{(n)}|\theta)\\ &= \prod_{n=1}^N p(y^{(n)}|x^{(n)}, \theta)\, p(x^{(n)}|\theta). \end{align} $$ Now, since we only care about modelling \(p(y^{(n)}|x^{(n)}, \theta)\), we don't really need to model \(p(x^{(n)}|\theta)\), i.e. we don't care how the input data is distributed, only how the output data is distributed given the input data. Therefore, we can say that \(p(x^{(n)}|\theta) = p(x^{(n)})\). This means that $$ \begin{align} p(\mathcal{D}|\theta) &\propto \prod_{n=1}^N p(y^{(n)}|x^{(n)}, \theta) = l(\theta). \end{align} $$ Maximising \(l(\theta)\) is equivalent to maximising its logarithm, because log is a strictly monotonically increasing function, so the parameters that maximise the log of the likelihood are also the parameters that maximise the likelihood itself. So, by taking logs we get $$ \begin{align} L(\theta) &= \sum_{n=1}^N \log p(y^{(n)}|x^{(n)}, \theta) \end{align} $$ and by making different modelling assumptions for the parametric family of distributions that \(p(y^{(n)}|x^{(n)}, \theta)\) belongs to, we recover different well known supervised models (e.g. logistic regression, linear regression, neural networks, etc.).
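To make this concrete, here is a minimal sketch (my own addition, not from the post) of ML for the simplest supervised model: if we assume \(p(y|x, \theta) = \mathcal{N}(y \,|\, w^\top x + b, \sigma^2)\), maximising \(L(\theta)\) reduces to ordinary least squares, and the ML noise variance is just the mean squared residual. The synthetic data and parameter names are purely illustrative.

```python
import numpy as np

# A minimal sketch: maximum likelihood for a linear-Gaussian model
# p(y | x, theta) = N(y | w^T x + b, sigma^2). Maximising the log likelihood
# in (w, b) is equivalent to minimising the squared error.
rng = np.random.default_rng(0)

N, D = 200, 3
X = rng.normal(size=(N, D))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.1 * rng.normal(size=N)

# Closed-form maximiser of L(theta) = sum_n log N(y_n | w^T x_n + b, sigma^2):
# pad X with a column of ones for the bias and solve the least-squares problem.
X_pad = np.hstack([X, np.ones((N, 1))])
theta_ml, *_ = np.linalg.lstsq(X_pad, y, rcond=None)
w_ml, b_ml = theta_ml[:-1], theta_ml[-1]

# The ML estimate of the noise variance is the mean squared residual.
sigma2_ml = np.mean((y - X_pad @ theta_ml) ** 2)
print(w_ml, b_ml, sigma2_ml)
```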

Density Estimation

In unsupervised learning, we generally do not have a split of the data into input-output pairs \((x^{(n)}, y^{(n)})\); instead, we care about modelling the data set as a whole. We look for interesting patterns in the data, or seek models that explain the data well and are interpretable. Density estimation is an unsupervised learning task in which the goal is to model the probability density (or distribution, for discrete data) over our input, i.e. \(p(x)\) for some \(x \in \mathbb{R}^D\).

So, that's pretty straightforward, right?

No, actually, this can be a really difficult task, especially when \(D\) is large, because of the curse of dimensionality. The curse of dimensionality refers to the fact that as the dimensionality of the space increases, the amount of data needed to get a good idea of how the data is distributed in that space grows exponentially. This is really unfortunate because, even though we are in the big data era, for most data sets of interest (images, audio, etc.) the available data is still far too small to get accurate models of the density.
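As a toy illustration of this (my own sketch, not from the post), consider a histogram density estimator with 10 bins per axis: the number of cells is \(10^D\), so with a fixed sample budget the fraction of cells we ever get to observe collapses as \(D\) grows.

```python
import numpy as np

# Curse of dimensionality, histogram edition: with 10 bins per axis a
# D-dimensional histogram has 10**D cells, so a fixed budget of N samples
# covers a vanishing fraction of them as D grows.
rng = np.random.default_rng(0)
N, bins = 10_000, 10

for D in [1, 2, 3, 5, 8]:
    x = rng.uniform(size=(N, D))                  # samples from Uniform([0, 1]^D)
    cells = np.floor(x * bins).astype(int)        # which cell each sample falls in
    occupied = len({tuple(c) for c in cells})     # number of distinct occupied cells
    total = bins ** D
    print(f"D={D}: {occupied}/{total} cells occupied ({occupied / total:.2%})")
```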
Another reason this can be tough is that each parametric model chosen to model a probability density has its own inductive bias. The inductive bias of a model is the set of assumptions it inherently makes, which make it hard to model data that does not satisfy these assumptions. For example, imagine trying to model a uniform distribution with a Gaussian. This would always be a bad fit. For one, the Gaussian has infinite support (the support being the set on which the density is non-zero), whereas the uniform distribution has finite support. Also, the Gaussian has a single mode and its density decays as the exponential of a negative quadratic away from it, whereas the density of the uniform distribution on \([0, 1]\) takes only two distinct values: 0 outside the unit interval and 1 inside it. The Gaussian can be fit by minimising the KL divergence (I'll probably do another post on this at some point), and the derivation below shows how this is done. So, to model arbitrary data, we need models that are flexible enough not to make very stringent assumptions. One common assumption made by many models is that the density is smooth everywhere, i.e. the gradient at any point has some maximum magnitude. This is not always true (the uniform distribution is again an example where this fails, as the gradient at 0 and 1 does not even exist), but it seems a reasonable belief for many densities.
To derive the best fit Gaussian to a uniform distribution, we can minimise the KL divergence between the two distributions. Let \(p\) be the probability density function of the uniform distribution on \([0, 1]\) and \(q\) be the probability density function of the Gaussian. Then we have $$ \begin{align} \text{KL}(p || q) &= \int p(x) \log \frac{p(x)}{q(x)} dx \\ &= \int_{0}^{1} \log \frac{1}{q(x)} dx \\ &= \int_{0}^{1} \frac{1}{2} \log 2\pi \sigma^2 \, dx + \int_{0}^{1} \frac{1}{2\sigma^2} (x - \mu)^2 \, dx \\ &= \frac{1}{2} \log 2\pi \sigma^2 + \frac{1}{2\sigma^2} \int_{0}^{1} x^2 - 2\mu x + \mu^2 \, dx \\ &= \frac{1}{2} \log 2\pi \sigma^2 + \frac{1}{2\sigma^2} \left[\frac{x^3}{3} - \mu x^2 + \mu^2 x \right]\Big|_0^1 \\ &= \frac{1}{2} \log 2\pi \sigma^2 + \frac{1}{2\sigma^2} \left[\frac{1}{3} - \mu + \mu^2 \right]. \end{align} $$ Now, to minimise this KL divergence, we take the derivatives with respect to the parameters (\(\mu, \sigma^2\)), set them to zero and solve for the parameters, as such $$ \begin{align} \frac{\partial \text{KL}(p || q)}{\partial \mu} &= \frac{1}{2\sigma^{2}_*}\left(2\mu_* - 1\right) = 0,\\ \frac{\partial \text{KL}(p || q)}{\partial \sigma^2} &= \frac{1}{2\sigma^{2}_*} - \frac{1}{2(\sigma^{2}_*)^2}\left[\frac{1}{3} - \mu_* + \mu_*^2\right] = 0. \end{align} $$ Solving for the parameters, we get \(\mu_* = \frac{1}{2}\) and \(\sigma^2_* = \frac{1}{3} - \frac{1}{2} + \frac{1}{4} = \frac{1}{12}\), i.e. the mean is equal to the mean of the uniform and the variance is equal to the variance of the uniform. No surprise there.
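As a quick numerical sanity check of this result (my own addition), we can fit a Gaussian by maximum likelihood to samples from \(\text{Uniform}(0, 1)\), which amounts to minimising a Monte Carlo estimate of \(\text{KL}(p || q)\) (see the section on the cost function below), and verify that the fitted mean and variance approach \(\frac{1}{2}\) and \(\frac{1}{12}\).

```python
import numpy as np

# Sanity check: the ML Gaussian fit to Uniform(0, 1) samples (equivalently, the
# minimiser of a Monte Carlo estimate of KL(p || q)) should recover
# mu* = 1/2 and sigma^2* = 1/12.
rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)

mu_hat = x.mean()            # ML estimate of the Gaussian mean
var_hat = x.var()            # ML estimate of the Gaussian variance
print(mu_hat, 0.5)           # ~0.5
print(var_hat, 1 / 12)       # ~0.0833
```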
It is also required that the model gives a valid density, i.e. it must integrate to one and it must be non-negative everywhere. This means that models like the Restricted Boltzmann Machine (RBM) and other undirected graphical models are not easy to use, because the partition function needs to be calculated, which in many cases is simply intractable.

What's at our disposal?

Well, there are a bunch of models out there that are used, with varying success. Probably the most familiar (after a regular Gaussian) is the Mixture of Gaussians (MoG). This model is exactly what its name suggests (trust me, in machine learning, that is not always the case). Its pdf is $$ p(x|\theta) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \,|\, \mu_k, \Sigma_k), $$ parameterised by \(\theta = \{\pi_k, \mu_k, \Sigma_k \ |\ \forall k \in [K]\}\). Here \(K\) is the number of mixture components (i.e. the number of Gaussians), \(\pi_k\) is the probability of a data point coming from the \(k\)th Gaussian and \(\mu_k, \ \Sigma_k\) are the associated mean vectors and covariance matrices respectively. This model is simple enough, but it is also quite powerful (for modelling continuous data) in many cases. It is usually fit using the Expectation Maximisation (EM) algorithm (which I will go through in another post, hopefully).
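For illustration, here is a short sketch of fitting a MoG with EM using scikit-learn (which the post doesn't mention; the toy data and the choice of \(K = 2\) are assumptions made for the example). `GaussianMixture.fit` runs EM under the hood and `score_samples` evaluates the log density \(\log p(x|\theta)\).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a Mixture of Gaussians by EM to toy 2D data from two separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[2.0, 1.0], scale=0.8, size=(500, 2)),
])

# K = 2 components; fit() runs the EM algorithm.
mog = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(mog.weights_)                  # the mixing proportions pi_k
print(mog.means_)                    # the component means mu_k
log_density = mog.score_samples(X)   # log p(x | theta) at each data point
print(log_density.mean())            # average log likelihood of the data
```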
Other models include Mixtures of Factor Analysers (MoFAs), Mixtures of Bernoullis (MoBs), which are for multivariate binary data, and more recent models like the Neural Autoregressive Distribution Estimator (NADE) and its variants (which my supervisor Iain Murray was involved in developing). There are actually plenty of models out there, and the purpose of this post is not to list them all (because that would be incredibly boring for me, and I want to keep this informative, yet fun for me as well). If you are interested in knowing more about the models that are available, you can send me an email and I'll send you some papers to read.

The cost function?

So, when we are fitting these models, what we actually want to do is to minimise some cost. A cost that usually comes to mind (both because it makes sense and because it is quite nice to work with) is the KL divergence (again, a post on this will come soon, hopefully). In this case, the cost is \(\text{KL}(p || q_\theta)\), where \(q_\theta\) is the density that our model (parameterised by \(\theta\)) gives and \(p\) is the actual density function of the distribution that produced the data. So, how do we do this practically? Well, let's take a look at this cost in more detail. We have $$ \begin{align} \text{KL}(p || q_\theta) &= \int p(x) \log \frac{p(x)}{q_\theta(x)} dx \\ &= \int p(x) \log p(x) dx + \int p(x) \log \frac{1}{q_\theta(x)} dx \\ &= -\int p(x) \log \frac{1}{p(x)} dx - \int p(x) \log {q_\theta(x)} dx \\ &= -H(p(x)) - \mathbb{E}_p\left[\log {q_\theta(x)}\right]. \\ \end{align} $$ Here, \(H(p(x))\) is the (differential) entropy of the true distribution and the final term is the expectation of the log likelihood of the model taken under the true data distribution. Notice that the entropy of the true distribution does not depend on the parameters of our model at all (which makes sense, because it is a property of the true distribution). This means that minimising the KL divergence with respect to the parameters is equivalent to minimising the expected negative log likelihood of the data under the model (or maximising the expected log likelihood, if that's more your thing).
The next thing to note is that we can't actually evaluate the expectation (since that would require knowing \(p(x)\), which is what we are trying to model, so you see the problem, right?). Instead, we make a Monte Carlo approximation to this expectation, since we have samples from this distribution, i.e. $$ \mathbb{E}_p\left[\log {q_\theta(x)}\right] \approx \frac{1}{N} \sum_{n=1}^N \log q_\theta(x^{(n)}). $$ Therefore, to minimise the KL divergence (and thus fit the model), we can instead maximise the average log likelihood (the expectation under the empirical distribution) of the model with respect to the parameters, or equivalently minimise the average negative log likelihood. This is a much easier problem, because we can use standard gradient descent procedures (assuming we can get gradients from our model) to minimise this loss. Therefore, we have that $$ \arg\min_\theta \text{KL}(p || q_\theta) \approx \arg\max_\theta \frac{1}{N} \sum_{n=1}^N \log q_\theta(x^{(n)}), $$ so to fit a density estimation model to some data set, we just need to maximise the likelihood that that data set would be generated by the model.
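As an illustration (my own sketch, not from the post), here is this recipe applied to the simplest possible density model, a univariate Gaussian \(q_\theta\) with \(\theta = (\mu, \log\sigma^2)\), fit by plain gradient descent on the average negative log likelihood; the closed-form gradients and learning rate are assumptions made for the example.

```python
import numpy as np

# Fit a univariate Gaussian q_theta, theta = (mu, log sigma^2), by gradient
# descent on the average negative log likelihood of the data.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # data from the "true" distribution

mu, log_var = 0.0, 0.0                          # initial parameters
lr = 0.1
for _ in range(2000):
    var = np.exp(log_var)
    # Gradients of (1/N) sum_n -log q_theta(x_n) w.r.t. mu and log sigma^2.
    grad_mu = (mu - x.mean()) / var
    grad_log_var = 0.5 - np.mean((x - mu) ** 2) / (2 * var)
    mu -= lr * grad_mu
    log_var -= lr * grad_log_var

print(mu, np.sqrt(np.exp(log_var)))             # should approach 3.0 and 2.0
```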
Another thing worth mentioning is that the KL divergence is always non-negative (or so you should know once you read my soon-to-come-out post). If we take this into account and rearrange the equation above, we can see that the expected log likelihood can never be greater than the negative entropy, i.e. \(\mathbb{E}_p[\log q_\theta(x)] \le -H(p(x))\). This can sometimes be useful for debugging: fit the model to data from a distribution whose entropy is known (e.g. a Gaussian) and check whether the average log likelihood gets close enough to the negative entropy (assuming the model has the capacity to represent the distribution). Note that, because we only have an approximation to the expected log likelihood, it may actually end up slightly higher than the negative entropy, but with enough data it shouldn't get much higher.
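A sketch of that debugging check (again my own, with a unit Gaussian as an arbitrary choice of known distribution): fit the model to samples from \(\mathcal{N}(0, 1)\), whose differential entropy is \(\frac{1}{2}\log(2\pi e)\), and verify that the average log likelihood stays roughly at or below the negative entropy.

```python
import numpy as np

# Debugging check: the average log likelihood under the fitted model should not
# (noticeably) exceed the negative entropy of the true distribution.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # true p = N(0, 1)

# ML Gaussian fit, q = N(mu_hat, var_hat).
mu_hat, var_hat = x.mean(), x.var()

# Average log likelihood of the data under the fitted model.
avg_log_lik = np.mean(-0.5 * np.log(2 * np.pi * var_hat)
                      - (x - mu_hat) ** 2 / (2 * var_hat))

neg_entropy = -0.5 * np.log(2 * np.pi * np.e * 1.0)  # -H(p) for a unit Gaussian
print(avg_log_lik, neg_entropy)                      # avg_log_lik <= -H(p), roughly
```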
Last edited 02/04/2016