This post will be a more technical than my previous post; I will assume familiarity with how MCMC sampling techniques for sampling from arbitrary distributions work (an overview starts on page 24, this introduction is more detailed). This post is about a specific class of MCMC algorithms: derivative based MCMC algorithms. I have two goals here: 1) to convince people that derivative based MCMC algorithms will have a profound effect on statistics and 2) to convince MCMC researchers that they should work on such algorithms. The goal of my previous post was to provide motivation for why good MCMC algorithms are so exciting.
A friend of mine suggested that this post would make the basis of a good grant application for statistics or applied math research. I can only hope that he is correct and someone picks up that idea. I’d do anything I could to help someone doing so.
In my last post, I mentioned that one of the things holding Bayesian statistics back is the curse of dimensionality
Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason for this is simple. If P is the number of parameters in the model, the posterior is a distribution in P dimensional space. In many models the number of parameters is quite large so computing summary statistics for the posterior distribution (mean, variance etc.) suffers from the curse of dimensionality. Naive methods are O(N^P).
The posterior distribution is a probability distribution function over the whole space of possible parameter values. If you want to integrate numerically over a distribution with 10 parameters to calculate some statistic, say the mean, and you split up the space into 20 bins along each parameter-dimension, you will need a 10 trillion element array. Working with a 10 trillion element array is very expensive in terms of both memory and computer time. Since many models involve many more than 10 parameters and we’d like to have higher resolution than 1 in 20, this is a serious problem.
Instead of integrating directly over the space, we can use Monte Carlo integration: sample from this probability distribution and use the samples calculate our statistic (for example, averaging the points to calculate the mean). Markov Chain Monte Carlo (MCMC) can be used to sample from any probability distribution. MCMC works by starting from an arbitrary point in the space and then picking a random point that’s near by, if that point is more likely than the current point, then that point is adopted as the current point. If it’s less likely than the current point, then it may still be adopted with a probability depending on the ratio of the likelihoods. If certain criteria are met (the detailed balance), this process will eventually randomly move around the whole distribution in a way that is proportional to the likelihood; the process will sample from the distribution (though each successive point is not statistically independent from the previous one).
Sounds great, but unfortunately naive MCMC does not solve our problem completely; in a high dimensional space, many more directions have decreasing probability than higher probability than have increasing probability. If we pick a direction at random, we have to move slowly or wait a long time for a good direction. Assuming an n-dimensional, approximately normal distribution, naive MCMC algorithms are O(n) in the number of steps it takes to get an independent sample. Now O(n) doesn’t sound that bad, but if you take into account the fact that calculating the likelihood is often already O(n), it means that fitting many models takes O(n**2) time. This drastically limits the models which can be fit without having a lot of MCMC expertise, or integrating over the distribution analytically.
Derivative based MCMC algorithms
MCMC sampling has many similarities with optimization. In both applications, we have an often multi-dimensional function and we are most interested in the maxima. In optimization, we want to find the highest point on the function; in MCMC sampling we want to find regions of high probability and sample in those regions. In optimization, many functions are approximately quadratic near the optima; in MCMC sampling, many distributions are near normally distributed and the log of a normal distribution is quadratic (taking the log of the distribution is something you have to do anyway). In optimization, if you have a quadratic function, many algorithms will find a maxima in 1 step or very few steps regardless of the dimensionality of the function. They do this by using the first and second derivatives of the function to find a productive direction and magnitude to move in.
There is a class of MCMC algorithms which solve the curse of dimensionality by taking a lesson from optimization and use the derivatives of the posterior distribution to inform the step direction and size. This lets them preferentially consider the directions where probability is increasing using 1st derivative information and get a measure of the shape of the distribution using 2nd derivative information. Such algorithms perform much better than naive algorithms. They take larger step sizes, mix and converge faster. With respect to the number of parameters, Langevin MCMC algorithms (which use 1st derivative information) are O(n**1/3) (link), and Stochastic-Newton algorithms (which use 1st and 2nd derivative information and are analogous to Newtons Method) are O(1) (link). A Stochastic-Newton method will independently sample an approximately normal distribution in approximately one step, regardless of the number of parameters. This opens up a huge swath of the space of possible models for fitting without needing to do much math or needing much MCMC knowledge.
Derivative based MCMC algorithms have other advantages as well.
First, both 1st and 2nd derivative methods take much larger steps than naive methods. This means it is much easier to tell whether the distribution is converging or not in normal ways. The downside of this is that such algorithms probably have different failure modes than naive algorithms and might need different kinds of convergence diagnostics.
Second, 2nd derivative algorithms are self tuning to a large extent. Because the inverse hessian of the posterior distribution represents the variance of the normal distribution which locally approximates the function, such algorithms do not need a covariance tuning parameter in order to work well.
The future of MCMC
The obvious problem with these methods is that they require derivatives which can be time consuming to calculate analytically and expensive to calculate numerically (at least O(n)). However there is an elegant solution: automatic differentiation. If you have analytic derivatives for the different component parts of a function and the analytic derivatives of the operations used to put them together, you can calculate the derivatives for the whole function using the chain rule. The components of the posterior distribution are usually well known distributions and algebraic transformations, so automatic differentiation is well suited to the task.
This approach fits in remarkably well with existing MCMC software, such as PyMC, which allow users to build complex models by combining common distributions and algebraic transformations and then allow users to select an MCMC algorithm to sample from the posterior distribution. Derivative information can be added to existing distributions so that derivative based MCMC algorithms can function.
I have taken exactly this approach for first derivative information in a PyMC branch used by my package multichain_mcmc which contains an Adaptive Langevine MCMC algorithm. I graduated a year ago with an engineering degree, and I have never formally studied MCMC or even taken a stochastic processes class; I am an amateur, and yet, I was able to put together such an algorithm for very general use; creating truly powerful algorithms for general use should pose little problem for professionals who put their mind to it.
There is a lot of low hanging fruit research fruit in this area. For example, the most popular optimization algorithms are not pure newton’s method because it is a bit fragile; the the same is likely true in MCMC, for the same reasons. Thus it is very attractive to look at popular optimization algorithms for ideas on how to create robust MCMC algorithms. There’s also the issue of combining derivative based MCMC algorithms with other algorithms with desirable properties. For example, DREAM (also available in multichain_mcmc) has excellent mode jumping characteristics; figuring out when to take DREAM-like steps for best performance is an important question.
Given its potential to make statistics dramatically more productive, I’ve seen surprisingly little research in this area. There is a huge volume of MCMC research, and as far as I can tell, not very much of it is focused on derivative based algorithms. There is some interesting work on Langevin MCMC; for example an adaptive Langevin algorithm, some convergence results, and an implicit Langevin scheme, and also some good work on 2nd derivative based methods; for example, optimization based work, some numerical work, and some recent work. But considering that Langevin MCMC came out 10 years ago much more focus is needed.
I’m not sure why this approach seems neglected. It might be that research incentives don’t reward such generally applicable research, or that MCMC researchers do not see how simplified MCMC could dramatically improve the productivity of statistics, or perhaps researchers haven’t realized how automatic differentiation can democratize these algorithms.
Whatever the issue is, I hope that it can be overcome and MCMC researchers focus more on derivative based MCMC methods in the near future. MCMC sampling will become more reliable, and troubleshooting chains when they do have problems will become easier. This means that even people who are only vaguely aware of how MCMC works can use these algorithms, bringing us closer to the promise of Bayesian statistics.