You are currently browsing the monthly archive for November 2010.

In my last post, Cyan brought up the issue that many practitioners of statistics might object to using prior information in Bayesian statistics. The philosophical case for using prior information is very strong, and I think most people intuitively agree that using prior information is legitimate, at the very least in selecting what kinds of models to consider. I think most statistics users would be OK with using prior information when there is some kind of objective prior distribution. However, people justifiably worry about bias or overconfidence on the part of the statistician; people don’t want the results of statistics to depend much on the identity of the statistician.

In practice, this problem is not too hard to sidestep. There are at least two approaches:

The first is to include significantly less prior information than is available, to make make statistical inference robust to bias and overconfidence. The two common approaches to this are to use weakly informative priors or non-informative/maximum entropy priors. Weakly informative priors are very broad distributions that still include some prior information that almost no one would object to. For example, if you’re estimating the strength of a metal alloy, you might choose a prior distribution that expresses your belief that the strength will probably be stronger than that of tissue paper but weaker than a hundred times as strong as the strongest known material. Maximum entropy priors represent the minimum physically possible to know about the parameters of interest.

The second is to do the calculations using several different prior distributions that different consumers of the statistics might think are relevant. This accomplishes something like a sensitivity analysis for the prior distribution. For example, you might include a non-informative distribution, a weakly informative distribution and a very concentrated prior distribution. This allows people with different prior opinions to choose the result that makes the most sense to them.

This post will be a more technical than my previous post; I will assume familiarity with how MCMC sampling techniques for sampling from arbitrary distributions work (an overview starts on page 24, this introduction is more detailed). This post is about a specific class of MCMC algorithms: derivative based MCMC algorithms. I have two goals here: 1) to convince people that derivative based MCMC algorithms will have a profound effect on statistics and 2) to convince MCMC researchers that they should work on such algorithms. The goal of my previous post was to provide motivation for why good MCMC algorithms are so exciting.

A friend of mine suggested that this post would make the basis of a good grant application for statistics or applied math research. I can only hope that he is correct and someone picks up that idea. I’d do anything I could to help someone doing so.

### Some background

In my last post, I mentioned that one of the things holding Bayesian statistics back is the curse of dimensionality

Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason for this is simple. If P is the number of parameters in the model, the posterior is a distribution in P dimensional space. In many models the number of parameters is quite large so computing summary statistics for the posterior distribution (mean, variance etc.) suffers from the curse of dimensionality. Naive methods are O(N^P).

The posterior distribution is a probability distribution function over the whole space of possible parameter values. If you want to integrate numerically over a distribution with 10 parameters to calculate some statistic, say the mean, and you split up the space into 20 bins along each parameter-dimension, you will need a 10 trillion element array. Working with a 10 trillion element array is very expensive in terms of both memory and computer time. Since many models involve many more than 10 parameters and we’d like to have higher resolution than 1 in 20, this is a serious problem.

Instead of integrating directly over the space, we can use Monte Carlo integration: sample from this probability distribution and use the samples calculate our statistic (for example, averaging the points to calculate the mean). Markov Chain Monte Carlo (MCMC) can be used to sample from any probability distribution. MCMC works by starting from an arbitrary point in the space and then picking a random point that’s near by, if that point is more likely than the current point, then that point is adopted as the current point. If it’s less likely than the current point, then it may still be adopted with a probability depending on the ratio of the likelihoods. If certain criteria are met (the detailed balance), this process will eventually randomly move around the whole distribution in a way that is proportional to the likelihood; the process will sample from the distribution (though each successive point is not statistically independent from the previous one).

Sounds great, but unfortunately naive MCMC does not solve our problem completely; in a high dimensional space, many more directions have decreasing probability than higher probability than have increasing probability. If we pick a direction at random, we have to move slowly or wait a long time for a good direction. Assuming an n-dimensional, approximately normal distribution, naive MCMC algorithms are O(n) in the number of steps it takes to get an independent sample. Now O(n) doesn’t sound that bad, but if you take into account the fact that calculating the likelihood is often already O(n), it means that fitting many models takes O(n**2) time. This drastically limits the models which can be fit without having a lot of MCMC expertise, or integrating over the distribution analytically.

### Derivative based MCMC algorithms

MCMC sampling has many similarities with optimization. In both applications, we have an often multi-dimensional function and we are most interested in the maxima. In optimization, we want to find the highest point on the function; in MCMC sampling we want to find regions of high probability and sample in those regions. In optimization, many functions are approximately quadratic near the optima; in MCMC sampling, many distributions are near normally distributed and the log of a normal distribution is quadratic (taking the log of the distribution is something you have to do anyway). In optimization, if you have a quadratic function, many algorithms will find a maxima in 1 step or very few steps regardless of the dimensionality of the function. They do this by using the first and second derivatives of the function to find a productive direction and magnitude to move in.

There is a class of MCMC algorithms which solve the curse of dimensionality by taking a lesson from optimization and use the derivatives of the posterior distribution to inform the step direction and size. This lets them preferentially consider the directions where probability is increasing using 1st derivative information and get a measure of the shape of the distribution using 2nd derivative information. Such algorithms perform much better than naive algorithms. They take larger step sizes, mix and converge faster. With respect to the number of parameters, Langevin MCMC algorithms (which use 1st derivative information) are O(n**1/3) (link), and Stochastic-Newton algorithms (which use 1st and 2nd derivative information and are analogous to Newtons Method) are O(1) (link). *A Stochastic-Newton method will independently sample an approximately normal distribution in approximately one step, regardless of the number of parameters.* This opens up a huge swath of the space of possible models for fitting without needing to do much math or needing much MCMC knowledge.

Derivative based MCMC algorithms have other advantages as well.

First, both 1st and 2nd derivative methods take much larger steps than naive methods. This means it is much easier to tell whether the distribution is converging or not in normal ways. The downside of this is that such algorithms probably have different failure modes than naive algorithms and might need different kinds of convergence diagnostics.

Second, 2nd derivative algorithms are self tuning to a large extent. Because the inverse hessian of the posterior distribution represents the variance of the normal distribution which locally approximates the function, such algorithms do not need a covariance tuning parameter in order to work well.

### The future of MCMC

The obvious problem with these methods is that they require derivatives which can be time consuming to calculate analytically and expensive to calculate numerically (at least O(n)). However there is an elegant solution: automatic differentiation. If you have analytic derivatives for the different component parts of a function and the analytic derivatives of the operations used to put them together, you can calculate the derivatives for the whole function using the chain rule. The components of the posterior distribution are usually well known distributions and algebraic transformations, so automatic differentiation is well suited to the task.

This approach fits in remarkably well with existing MCMC software, such as PyMC, which allow users to build complex models by combining common distributions and algebraic transformations and then allow users to select an MCMC algorithm to sample from the posterior distribution. Derivative information can be added to existing distributions so that derivative based MCMC algorithms can function.

I have taken exactly this approach for first derivative information in a PyMC branch used by my package multichain_mcmc which contains an Adaptive Langevine MCMC algorithm. I graduated a year ago with an engineering degree, and I have never formally studied MCMC or even taken a stochastic processes class; I am an amateur, and yet, I was able to put together such an algorithm for very general use; creating truly powerful algorithms for general use should pose little problem for professionals who put their mind to it.

There is a lot of low hanging fruit research fruit in this area. For example, the most popular optimization algorithms are not pure newton’s method because it is a bit fragile; the the same is likely true in MCMC, for the same reasons. Thus it is very attractive to look at popular optimization algorithms for ideas on how to create robust MCMC algorithms. There’s also the issue of combining derivative based MCMC algorithms with other algorithms with desirable properties. For example, DREAM (also available in multichain_mcmc) has excellent mode jumping characteristics; figuring out when to take DREAM-like steps for best performance is an important question.

Given its potential to make statistics dramatically more productive, I’ve seen surprisingly little research in this area. There is a huge volume of MCMC research, and as far as I can tell, not very much of it is focused on derivative based algorithms. There is some interesting work on Langevin MCMC; for example an adaptive Langevin algorithm, some convergence results, and an implicit Langevin scheme, and also some good work on 2nd derivative based methods; for example, optimization based work, some numerical work, and some recent work. But considering that Langevin MCMC came out 10 years ago much more focus is needed.

I’m not sure why this approach seems neglected. It might be that research incentives don’t reward such generally applicable research, or that MCMC researchers do not see how simplified MCMC could dramatically improve the productivity of statistics, or perhaps researchers haven’t realized how automatic differentiation can democratize these algorithms.

Whatever the issue is, I hope that it can be overcome and MCMC researchers focus more on derivative based MCMC methods in the near future. MCMC sampling will become more reliable, and troubleshooting chains when they do have problems will become easier. This means that even people who are only vaguely aware of how MCMC works can use these algorithms, bringing us closer to the promise of Bayesian statistics.

Cox showed that if you want to represent ‘degree’s of belief’ both using real numbers and consistent with classical logic, you must use probability theory (link). Bayes theorem is the theoretically correct way to update probabilities based on evidence. Bayesian statistics is the natural combination of these two facts.

Bayesian statistics two chief advantages over other kinds of statistics:

- Bayesian statistics is conceptually simple.
- This excellent book introduces statistics, some history and the whole of the theoretical foundations of Bayesian statistics in a mere 12 pages; the rest of the book is examples and methods.
- Users of classical statistics very frequently misunderstand what p-values and confidence intervals are. In contrast, posterior distributions are exactly what you’d expect them to be.
- After learning the basics, students can easily derive their own methods and set up their own problems. This is not at all true in classical statistics.

- Bayesian statistics is almost always explicitly model centric. It requires people to come up with a model which describes their problem. This has several advantages:
- It’s often very easy to build a model that’s very closely tailored to your problem and know immediately how to solve it conceptually if not practically.
- It makes it harder to be confused about what you’re doing.
- It’s easier to recognize when your assumptions are bad.

Here, I expand a little on the advantages of Bayesian statistics.

The promise of Bayesian statistics is that with Bayesian statistics, actually conducting statistical inference will only be as difficult as coming up with a good model. Statistics education will focus on modeling techniques, graphical display of data and results and checking your assumptions, rather than on tests and calculations. People with only a college class or two worth of statistics will be able to fit any kind of model they can write down. Fitting a model will be a non-event.

If Bayesian statistics is so great, why isn’t it more widely used? Why isn’t the promise of Bayesian statistics the reality of statistics? Two reasons:

- Bayesian statistics has historically not been very widely used. Scientists and engineers have grown up using classical statistics, so in order for new scientists and engineers to communicate with their peers and elders, they must know classical statistics.
- Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason for this is simple. If P is the number of parameters in the model, the posterior is a distribution in P dimensional space. In many models the number of parameters is quite large so computing summary statistics for the posterior distribution (mean, variance etc.) suffers from the curse of dimensionality. Naive methods are O(N^P).

In a future post I will explain why I think a new kind of MCMC algorithm is largely going to resolve problem #2 in the near future.

If you’d like to learn Bayesian statistics and you remember your basic calculus and basic probability reasonably well, I recommend Data Analysis by Sivia. Bayesian Statistics by Bolstad has an intro to probability theory and the useful calculus, but isn’t as useful.