This post will be more technical than my previous post; I will assume familiarity with how MCMC techniques for sampling from arbitrary distributions work (an overview starts on page 24; this introduction is more detailed). This post is about a specific class of MCMC algorithms: derivative based MCMC algorithms. I have two goals here: 1) to convince people that derivative based MCMC algorithms will have a profound effect on statistics and 2) to convince MCMC researchers that they should work on such algorithms. The goal of my previous post was to provide motivation for why good MCMC algorithms are so exciting.
A friend of mine suggested that this post would make a good basis for a grant application in statistics or applied math. I can only hope that he is correct and someone picks up the idea; I’d do anything I could to help someone doing so.
Some background
In my last post, I mentioned that one of the things holding Bayesian statistics back is the curse of dimensionality:
Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason is simple: if P is the number of parameters in the model, the posterior is a distribution over a P-dimensional space. In many models the number of parameters is quite large, so computing summary statistics for the posterior distribution (mean, variance, etc.) suffers from the curse of dimensionality. Naive grid-based methods are O(N^P), where N is the number of grid points per dimension.
The posterior distribution is a probability distribution over the whole space of possible parameter values. If you want to integrate numerically over a distribution with 10 parameters to calculate some statistic, say the mean, and you split the space into 20 bins along each parameter dimension, you will need a 10 trillion element array. Working with a 10 trillion element array is very expensive in terms of both memory and computer time. Since many models involve many more than 10 parameters and we’d like higher resolution than 1 in 20, this is a serious problem.
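To make the blow-up concrete, here is a quick back-of-the-envelope calculation in Python (the bin count and parameter count are just the illustrative numbers from above):

```python
bins, params = 20, 10
cells = bins ** params                    # 10,240,000,000,000 grid cells (~10 trillion)
bytes_needed = cells * 8                  # one 8-byte float per cell
print(cells, bytes_needed / 1e12, "TB")   # roughly 82 TB just to store the grid
```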
Instead of integrating directly over the space, we can use Monte Carlo integration: sample from the probability distribution and use the samples to calculate our statistic (for example, averaging the points to calculate the mean). Markov Chain Monte Carlo (MCMC) can be used to sample from any probability distribution. MCMC works by starting from an arbitrary point in the space and then picking a random point nearby; if that point is more likely than the current point, it is adopted as the new current point. If it is less likely than the current point, it may still be adopted, with a probability depending on the ratio of the likelihoods. If certain criteria are met (detailed balance), this process will eventually move around the whole distribution, visiting regions in proportion to their probability; that is, the process will sample from the distribution (though each successive point is not statistically independent of the previous one).
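A minimal random-walk Metropolis sampler makes this concrete. This is only a sketch, not any particular package's implementation; the target here is a toy 2-d normal, and names like `log_post` and `step` are my own:

```python
import numpy as np

def metropolis(log_post, x0, n_samples, step=0.5, rng=None):
    """Naive random-walk Metropolis: propose a nearby point, accept or reject."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)      # random nearby point
        # accept with probability min(1, posterior ratio), computed on the log scale
        if np.log(rng.uniform()) < log_post(proposal) - log_post(x):
            x = proposal
        samples[i] = x
    return samples

# Example: sample a 2-d standard normal and estimate its mean
draws = metropolis(lambda x: -0.5 * np.dot(x, x), np.zeros(2), 5000)
print(draws.mean(axis=0))   # should be close to [0, 0]
```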
Sounds great, but unfortunately naive MCMC does not solve our problem completely: in a high dimensional space, many more directions have decreasing probability than have increasing probability. If we pick a direction at random, we either have to move slowly or wait a long time for a good direction. Assuming an n-dimensional, approximately normal distribution, naive MCMC algorithms are O(n) in the number of steps it takes to get an independent sample. Now O(n) doesn’t sound that bad, but if you take into account the fact that calculating the likelihood is often already O(n), it means that fitting many models takes O(n**2) time. This drastically limits the models that can be fit without a lot of MCMC expertise or without integrating over parts of the distribution analytically.
Derivative based MCMC algorithms
MCMC sampling has many similarities with optimization. In both applications, we have a function that is often high dimensional, and we are most interested in its maxima. In optimization, we want to find the highest point of the function; in MCMC sampling, we want to find regions of high probability and sample in those regions. In optimization, many functions are approximately quadratic near their optima; in MCMC sampling, many distributions are approximately normal, and the log of a normal distribution is quadratic (taking the log of the distribution is something you have to do anyway). In optimization, if you have a quadratic function, many algorithms will find a maximum in one step or very few steps regardless of the dimensionality of the function. They do this by using the first and second derivatives of the function to find a productive direction and magnitude to move in.
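For illustration, here is that one-step behavior on a toy quadratic (the matrix and vector are arbitrary example values): a single Newton step lands exactly on the maximizer, whatever the dimension.

```python
import numpy as np

# Quadratic "log posterior": f(x) = -0.5 x^T A x + b^T x, maximised at A^{-1} b
A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, -2.0])

x = np.zeros(2)                              # arbitrary starting point
grad = b - A @ x                             # first derivative of f at x
hess = -A                                    # second derivative of f (constant here)
x_new = x - np.linalg.solve(hess, grad)      # one Newton step

print(x_new, np.linalg.solve(A, b))          # identical: the exact maximiser
```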
There is a class of MCMC algorithms which address the curse of dimensionality by taking a lesson from optimization and using the derivatives of the posterior distribution to inform the step direction and size. This lets them preferentially consider the directions where probability is increasing (using 1st derivative information) and get a measure of the shape of the distribution (using 2nd derivative information). Such algorithms perform much better than naive algorithms: they take larger steps, and they mix and converge faster. With respect to the number of parameters, Langevin MCMC algorithms (which use 1st derivative information) are O(n**1/3) (link), and Stochastic-Newton algorithms (which use 1st and 2nd derivative information and are analogous to Newton’s method) are O(1) (link). A Stochastic-Newton method will draw an approximately independent sample from an approximately normal distribution in roughly one step, regardless of the number of parameters. This opens up a huge swath of the space of possible models for fitting without needing to do much math or needing much MCMC knowledge.
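To show how 1st derivative information enters, here is a sketch of one step of the Metropolis-adjusted Langevin algorithm (MALA): the gradient pushes the proposal toward higher probability, and the acceptance step corrects for the asymmetric proposal. The names `log_post`, `grad_log_post`, and `eps` are mine, not any particular package's API:

```python
import numpy as np

def mala_step(x, log_post, grad_log_post, eps, rng):
    """One Metropolis-adjusted Langevin step (sketch)."""
    def drift(y):
        # proposal mean: move along the gradient of the log posterior
        return y + 0.5 * eps**2 * grad_log_post(y)

    x_new = drift(x) + eps * rng.standard_normal(x.shape)

    def log_q(to, frm):
        # log density of the Gaussian proposal N(drift(frm), eps^2 I), up to a constant
        d = to - drift(frm)
        return -np.dot(d, d) / (2 * eps**2)

    log_accept = (log_post(x_new) + log_q(x, x_new)
                  - log_post(x) - log_q(x_new, x))
    return x_new if np.log(rng.uniform()) < log_accept else x
```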
Derivative based MCMC algorithms have other advantages as well.
First, both 1st and 2nd derivative methods take much larger steps than naive methods. This means it is much easier to tell, using the usual diagnostics, whether the chain has converged. The downside is that such algorithms probably have different failure modes than naive algorithms and might need different kinds of convergence diagnostics.
Second, 2nd derivative algorithms are self-tuning to a large extent. Because the inverse of the (negative) Hessian of the log posterior is the covariance of the normal distribution which locally approximates it, such algorithms do not need a covariance tuning parameter in order to work well.
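A sketch of what such a proposal might look like (the helper names are hypothetical, and a full algorithm would also need a Metropolis correction for the asymmetric proposal, as in the MALA sketch above):

```python
import numpy as np

def stochastic_newton_proposal(x, grad_log_post, hess_log_post, rng):
    """Propose from a local Gaussian approximation of the posterior (sketch)."""
    H = -hess_log_post(x)               # local precision: negative Hessian of the log posterior
    cov = np.linalg.inv(H)              # local covariance -- no tuning parameter needed
    mean = x + cov @ grad_log_post(x)   # Newton step toward the local mode
    return rng.multivariate_normal(mean, cov)
```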
The future of MCMC
The obvious problem with these methods is that they require derivatives, which can be time consuming to calculate analytically and expensive to calculate numerically (at least O(n)). However, there is an elegant solution: automatic differentiation. If you have analytic derivatives for the component parts of a function and analytic derivatives of the operations used to put them together, you can calculate the derivative of the whole function using the chain rule. The components of the posterior distribution are usually well known distributions and algebraic transformations, so automatic differentiation is well suited to the task.
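A toy forward-mode automatic differentiation class shows the idea: each value carries its derivative along, and every operation applies the chain rule. (This is only an illustration of the principle; real AD packages are far more capable.)

```python
class Dual:
    """A value paired with its derivative; arithmetic applies the chain rule."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

# d/dx of x*x + 3*x at x = 2 is 2*2 + 3 = 7
x = Dual(2.0, 1.0)          # deriv=1.0 marks x as the variable we differentiate with respect to
y = x * x + 3 * x
print(y.value, y.deriv)     # 10.0 7.0
```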
This approach fits in remarkably well with existing MCMC software, such as PyMC, which lets users build complex models by combining common distributions and algebraic transformations and then select an MCMC algorithm to sample from the posterior distribution. Derivative information can be added to the existing distributions so that derivative based MCMC algorithms can function.
I have taken exactly this approach for first derivative information in a PyMC branch used by my package multichain_mcmc, which contains an adaptive Langevin MCMC algorithm. I graduated a year ago with an engineering degree, and I have never formally studied MCMC or even taken a stochastic processes class. I am an amateur, and yet I was able to put together such an algorithm for very general use; creating truly powerful algorithms for general use should pose little problem for professionals who put their minds to it.
There is a lot of low hanging research fruit in this area. For example, the most popular optimization algorithms are not pure Newton’s method, because pure Newton’s method is a bit fragile; the same is likely true in MCMC, for the same reasons. Thus it is very attractive to look at popular optimization algorithms for ideas on how to create robust MCMC algorithms. There’s also the issue of combining derivative based MCMC algorithms with other algorithms that have desirable properties. For example, DREAM (also available in multichain_mcmc) has excellent mode jumping characteristics; figuring out when to take DREAM-like steps for best performance is an important question.
Given its potential to make statistics dramatically more productive, I’ve seen surprisingly little research in this area. There is a huge volume of MCMC research, and as far as I can tell, not very much of it is focused on derivative based algorithms. There is some interesting work on Langevin MCMC, for example an adaptive Langevin algorithm, some convergence results, and an implicit Langevin scheme, and also some good work on 2nd derivative based methods, for example optimization based work, some numerical work, and some recent work. But considering that Langevin MCMC came out 10 years ago, much more focus is needed.
I’m not sure why this approach seems neglected. It might be that research incentives don’t reward such generally applicable research, or that MCMC researchers do not see how simplified MCMC could dramatically improve the productivity of statistics, or perhaps researchers haven’t realized how automatic differentiation can democratize these algorithms.
Whatever the issue is, I hope that it can be overcome and that MCMC researchers will focus more on derivative based MCMC methods in the near future. MCMC sampling will become more reliable, and troubleshooting chains when they do have problems will become easier. This means that even people who are only vaguely aware of how MCMC works will be able to use these algorithms, bringing us closer to the promise of Bayesian statistics.
19 comments
November 18, 2010 at 9:50 am
Justin
Very interesting article. A minor comment: The links immediately following “O(n**1/3)” and “O(1)” are presently broken.
November 18, 2010 at 9:56 am
jsalvati
Thanks. Should be fixed now.
November 18, 2010 at 2:57 pm
Morgan
So what exactly is the promise? What real-world problems can we solve with statistics?
November 18, 2010 at 3:56 pm
jsalvati
Statistics is used in a lot of areas. It’s used in manufacturing, engineering, finance, all the hard sciences, social science, etc. Anytime you want to learn something from noisy data, you’ll probably use statistics. The promise of Bayesian statistics in particular is that it will make learning, understanding and using statistics much easier than it is presently.
November 18, 2010 at 10:08 pm
Cyan
If a part 3 is in the works, I hope you’ll have something to say about prior distributions.
November 19, 2010 at 9:14 am
jsalvati
@Cyan, I’m not sure what you mean. In practice I like Gelman’s “weakly informative prior” approach.
November 20, 2010 at 12:26 am
Cyan
@jsalvati, In the first part, you wrote, “If Bayesian statistics is so great, why isn’t it more widely used?” and then, to my surprise, didn’t mention priors. If you were to ask many statisticians (Bayesian or otherwise) that question, I expect that more often than not the answer would be, “The use of priors is controversial.” This is especially the case for complicated, high dimensionality models where it’s hard to do thorough sensitivity analyses — exactly the kinds of models where advanced MCMC methods are necessary. Making Bayesian inference more practical in these kinds of models won’t make professional statisticians more (or less) likely to jump to Bayes.
November 20, 2010 at 5:56 am
Mark Girolami
Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Girolami, M. & Calderhead, B. Journal of the Royal Statistical Society (Methodology), with discussion, 73, 2, 2011.
You might find the above paper, which was read before the Royal Statistical Society on 13th October this year, of some related interest to this post. It will be published in the Journal of the Royal Statistical Society – Series B, with Discussion. Please feel free to get in touch with me about the paper.
All the codes (in Matlab) that were used in the experiments reported in the paper are also available – currently – at http://www.dcs.gla.ac.uk/inference/rmhmc/index.html – included there are the slides from the presentations at the RSS meeting. The preprints of the paper are available at http://www.rss.org.uk/pdf/Girolami_13_Oct._2010.pdf
November 24, 2010 at 4:23 pm
The use of priors in Bayesian statistics « Good Morning, Economics
[…] my last post, Cyan brought up the issue that many practitioners of statistics might object to using prior information in Bayesian […]
December 13, 2010 at 1:48 am
Colin Fox
Thanks for your interesting article. I agree with (most of) your points, and am a little surprised and pleased to find that _other_ people are also cottoning onto the idea that MCMC needs a kick in the pants, and gradient based optimization looks a promising source of ideas.
One point I don’t agree with is that the problem is the curse of dimensionality. Currently I think the _curse of reversibility_ is the bigger issue. Statisticians seem enamored by detailed balance because it makes life easy, but the flip side is that it is equivalent to reversibility and that is a killer. Applying a fixed reversible transition kernel gives geometric convergence. We’ve got to do better than that.
I hope you are heartened to hear that there are indeed active research groups working hard on getting gradient based methods into MCMC — including the ones you listed work from. A few others you might be interested in are the delayed acceptance method that can use Jacobians (http://ba.stat.cmu.edu/journal/2010/vol05/issue02/christen.pdf), and a conjugate direction sampler that uses gradients, big-time (http://www.physics.otago.ac.nz/reports/electronics/ETR2008-1.pdf).
My guess is that the real gains are yet to be made, and will come when provably convergent non-reversible MCMC samplers make effective use of gradient information. We have interesting times ahead.
December 13, 2010 at 2:29 am
Colin Fox
Oops, the first link in my reply should have been http://www.physics.otago.ac.nz/people/fox/publications/ApproxMCMCPreprint.pdf
January 24, 2011 at 4:49 pm
Bob Carpenter
I’m trying to jump on board with Hamiltonian Monte Carlo and AD, but I’m finding it very hard to sort through the AD systems that are out there and what they are capable of.
Our high level goal is widely shared: sampling from the posterior of a large-ish multilevel generalized linear model. For example, in Gelman et al.’s voting models, we have a few thousand predictors (mostly interaction terms) arranged into a few dozen levels, with parameters at each level getting multivariate normal (or t) priors which are themselves given scaled inverse Wishart hyperpriors (or maybe some other things we’ve been playing around with).
We’re looking at Hamiltonian Monte Carlo and would like to use AD to compute the gradients if it’s feasible. For AD to be less trouble than it’s worth, I need AD to work with matrix libs (inverses, eigenvalues, etc.), stat libs (multivariate normals, inverse Wisharts, etc.) and math libs (log gamma functions, etc.)
I’ve spent a few days poring over autodiff.org and the refs in the Wikipedia, but can’t seem to find anything that exactly fits what we need. I’ve installed packages and gotten the basics to run, but I can’t find anything that does reverse mode and has extensive math libs linkable.
Oh, we’d like to do this in C, so it’s efficient at the scale we’re looking at, and we’d also like to use a package with a license that lets us redistribute.
January 24, 2011 at 5:40 pm
jsalvati
@Bob Carpenter
Unfortunately, I don’t have much constructive advice for you. My C kung-fu is weak, but I have looked around at the AD packages for both C and Python, and haven’t found anything that really seems easy to learn.
Our aims diverge a bit because I am interested in making tools for general use, while you sound like you want to make something for a specific use. Because I want something for general use, I have even more requirements: I want the AD package to work well with NumPy arrays, supporting efficient broadcasting, the numpy lin-alg functions, etc., all for at least the 2nd derivative. AlgoPy (http://pypi.python.org/pypi/algopy/0.3.0) may come close to this, but I haven’t had a chance to look into it.
January 24, 2011 at 6:08 pm
Bob Carpenter
We’re starting with some specific models to fit, but the goal’s a general tool for fitting a large class of models. Specifically, I’m imagining something like BUGS, but based on Hamiltonian MC rather than Gibbs (though we’ll still need Gibbs for the discrete params). I’d like to allow arbitrary code in the model specifications, and thought I could get away with C for that if we could get autodiff running.
For the kinds of models we have with lots of arbitrary designs in the data matrix, it’s hard to vectorize in something like numpy or R. If we have to evaluate our functions in Python, we’re sunk speed-wise.
January 24, 2011 at 6:23 pm
jsalvati
What is the class of models you want to work with?
Let me try to sell you on NumPy and PyMC: it’s actually not very difficult to write arbitrary C extensions for numpy. Cython is easy to use and compiles directly into C while interfacing nicely with NumPy, giving you the speed where it counts (I do this at work sometimes). This allows you to have all the benefits of numpy and the ability to use the existing PyMC framework for doing MCMC. The downside is that derivatives would have to be custom written for the C extensions you create.
January 25, 2011 at 11:31 am
Bob Carpenter
I really like PyMC’s design. I’m only just starting to learn Python and we don’t have an in-house expert around. We are in fact planning to provide a Python (and R) front end to whatever we do. Python’s big in computer science and R in stats and the rest of the social sciences.
I’m also checking out Theano, which has C code generators and even a CUDA (Nvidia GPU) linker. I can’t see that it has a full LAPACK-like matrix lib, though, and we’re going to need to do some hairy matrix ops for hyperprior covariance matrices.
We can’t really do what we want easily directly in numpy because the functions aren’t easily vectorizable. Matt Hoffman’s been using Cython to speed up the basic Python implementation of HMC; I don’t know if he’s tried the type defs, because they appear to be critical for Cython speed if you want to achieve “almost the speed of C” (and by “almost”, it looks like a factor of 2 or so slower, which we could probably live with).
January 25, 2011 at 12:02 pm
jsalvatier
I am curious: are the nonvectorizable parts of your models modular or is it something where you have to write a whole custom model each time?
January 26, 2011 at 2:41 pm
Bob Carpenter
If I understand what you mean by “modular”, the answer is “no”. There’s no repeatable submodel with the same structure across all the applications we’re looking at (which include Bayesian spline fitting, autoregression and time series, latent parameter IRT-like models, and deep multilevel regressions with arbitrary data designs). If that were the case, we could code it up once and be done with it.
Specifically, we’d like to be able to fit any kind of model you could write in BUGS (i.e., any directed acyclic graphical model expressible with side arithmetic and a suite of known distributions).
October 13, 2013 at 6:16 am
Backpropogation is Just Steepest Descent with Automatic Differentiation | Idontgetoutmuch’s Weblog
[…] Good Morning, Economics Blog […]