You are currently browsing the category archive for the ‘statistics’ category.
STAN is looking for someone to write a Python wrapper for STAN. In the comments, Bob Carpenter has some suggestions for PyMC3:
They should include Matt’s no-U-turn sampler. And they might be interested in how we handled constrained variables and vectorizations and cumulative distributions for truncation.
Bob’s comment made me realize we haven’t talked nearly enough, so I really appreciate Bob’s comment. I have some questions and comments for Bob and the rest of the STAN team:
- Handling constrained variables by transforming them (STAN manual, p.6): PyMC3 does not currently implement sampling on a transformed space, but that feature was actually one of the motivations for the design of PyMC3. It would be quite simple to add it, given that you have the log Jacobian determinant for the transformation. Of course, the Jacobian is non-trivial part. Any practical advice on when this has been most beneficial and by how much?
- Vectorization: Theano does pretty sophisticated vectorization which works the same way NumPy does.
- Cumulative distributions for truncation (STAN manual, p.145): this is a very cool idea, and should fit quite naturally in the PyMC3 framework.
- Neal Radford had some critcism for NUTS, saying it was didn’t provide strong benefits. His criticisms seem good to me, but I haven’t thought very deeply about it. Are there good responses to his criticism? Perhaps good examples of where NUTS works significantly better than a well tuned HMC sampler? I do not usually tune HMC by changing the mass matrix rather than the step size or trajectory length.
- Basing the scaling matrix on hessian matrix at a representative point (for example, the MAP) has often worked quite well for me. PyMC3 currently finds the hessian by differentiating the gradient numerically, but it’s also possible to calculate it analytically. Often, just the diagonal of the hessian is good enough. As I understand it, STAN is or perhaps was simply using the identity matrix as the mass matrix. Perhaps this accounts for why NUTS is so useful for STAN? This approach also allows you to supply either a covariance or hessian matrix; full or diagonal or sparse for either.
There are a couple of cool papers on Hamiltonian Monte Carlo (HMC) that I’ve found recently.
- Quasi-Newton Methods for Markov Chain Monte Carlo – The author, Y. Zhang, adapts the classic optimization algorithm BFGS to yield an algorithm for automatically building a local mass matrix for the HMC sampling from the history of posterior gradients (which are already required for HMC). There is also a limited memory (and computation cost) version of the algorithm analogous to L-BFGS which only stores an approximation of a full rank hessian matrix. This has the potential to make HMC require significantly less tuning. Currently, I usually use the hessian at the Maximum A Posteriori point has the mass matrix for HMC, but this will fail for distributions that are much more flat or peaked at the MAP than they are elsewhere in the distribution and this may be more robust.
- Split Hamiltonian Monte Carlo – The authors, Shababa, Lan, Johnson and R. Neal, develop a technique for splitting the Hamiltonian into a fast part and a slow part. In Algorithm 1, they split the Hamiltonian into a normal approximation of the posterior (centered at the MAP and with covariance based on the inverse hessian at the MAP) and a residual component. The normal approximation can be simulated exactly and will contribute nothing to the error term, hopefully decreasing the overall error. When the distribution is close to a normal distribution, this technique should have improved efficiency (per sample) which should stay relatively constant as the number of dimensions increases. In the paper, the authors show pretty significant speed increases when doing logistic regression.
I wonder whether this type of split HMC will work very poorly on distributions that have pretty hard edges (like scale parameters), or more generally, thin tails. I think it would be fairly easy to get into a situation where the exact part of the simulation wants to move into a part of the space that is of really low probability.
I would love to see a synthesis between and Zhang’s Quasi-Newton technique. A synthesis should allow both the computation (hessian related operations normally scale O(n**2)) and the acceptance probability to scale well as the number of dimensions increases.
Neal Radford and others had some interesting responses to my question about why Hamiltonian MCMC (HMC) might be better than Langevin MCMC (MALA). The gist of it seems to be that HMC is less random-walk like and thus mixes faster and has better scaling with number of dimensions.
Radford points to a survey paper of his (link) which discusses how the momentum distribution should be adjusted for changes in the scaling of the probability distribution (p. 22). This is something which I didn’t see last time I looked at HMC, and it’s necessary for an adaptive HMC algorithm. General use sampling algorithms can benefit a lot from being adaptive.
It also discusses tuning the step-count and step-size. This sounds rather difficult and non-linear.
I am going to try to implement an adaptive HMC algorithm in my multichain_mcmc package. I’d like to make this algorithm adaptive as I’ve done for my MALA implementation, though in general, this needs to be done carefully (see Atchade and Rosenthall 2005).
I’m interested in RM-HMC as it promises automatic scale tuning and better efficiency scaling with high dimensions, but it looks like understanding it requires differential geometry, which I haven’t yet worked through. I believe it also requires 2nd derivatives (which provide scale information), which I haven’t yet figured out how to implement in an efficient and generic manner for PyMC. I suspect that would require a fork and redesign of PyMC.
In my last post, Cyan brought up the issue that many practitioners of statistics might object to using prior information in Bayesian statistics. The philosophical case for using prior information is very strong, and I think most people intuitively agree that using prior information is legitimate, at the very least in selecting what kinds of models to consider. I think most statistics users would be OK with using prior information when there is some kind of objective prior distribution. However, people justifiably worry about bias or overconfidence on the part of the statistician; people don’t want the results of statistics to depend much on the identity of the statistician.
In practice, this problem is not too hard to sidestep. There are at least two approaches:
The first is to include significantly less prior information than is available, to make make statistical inference robust to bias and overconfidence. The two common approaches to this are to use weakly informative priors or non-informative/maximum entropy priors. Weakly informative priors are very broad distributions that still include some prior information that almost no one would object to. For example, if you’re estimating the strength of a metal alloy, you might choose a prior distribution that expresses your belief that the strength will probably be stronger than that of tissue paper but weaker than a hundred times as strong as the strongest known material. Maximum entropy priors represent the minimum physically possible to know about the parameters of interest.
The second is to do the calculations using several different prior distributions that different consumers of the statistics might think are relevant. This accomplishes something like a sensitivity analysis for the prior distribution. For example, you might include a non-informative distribution, a weakly informative distribution and a very concentrated prior distribution. This allows people with different prior opinions to choose the result that makes the most sense to them.
This post will be a more technical than my previous post; I will assume familiarity with how MCMC sampling techniques for sampling from arbitrary distributions work (an overview starts on page 24, this introduction is more detailed). This post is about a specific class of MCMC algorithms: derivative based MCMC algorithms. I have two goals here: 1) to convince people that derivative based MCMC algorithms will have a profound effect on statistics and 2) to convince MCMC researchers that they should work on such algorithms. The goal of my previous post was to provide motivation for why good MCMC algorithms are so exciting.
A friend of mine suggested that this post would make the basis of a good grant application for statistics or applied math research. I can only hope that he is correct and someone picks up that idea. I’d do anything I could to help someone doing so.
In my last post, I mentioned that one of the things holding Bayesian statistics back is the curse of dimensionality
Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason for this is simple. If P is the number of parameters in the model, the posterior is a distribution in P dimensional space. In many models the number of parameters is quite large so computing summary statistics for the posterior distribution (mean, variance etc.) suffers from the curse of dimensionality. Naive methods are O(N^P).
The posterior distribution is a probability distribution function over the whole space of possible parameter values. If you want to integrate numerically over a distribution with 10 parameters to calculate some statistic, say the mean, and you split up the space into 20 bins along each parameter-dimension, you will need a 10 trillion element array. Working with a 10 trillion element array is very expensive in terms of both memory and computer time. Since many models involve many more than 10 parameters and we’d like to have higher resolution than 1 in 20, this is a serious problem.
Instead of integrating directly over the space, we can use Monte Carlo integration: sample from this probability distribution and use the samples calculate our statistic (for example, averaging the points to calculate the mean). Markov Chain Monte Carlo (MCMC) can be used to sample from any probability distribution. MCMC works by starting from an arbitrary point in the space and then picking a random point that’s near by, if that point is more likely than the current point, then that point is adopted as the current point. If it’s less likely than the current point, then it may still be adopted with a probability depending on the ratio of the likelihoods. If certain criteria are met (the detailed balance), this process will eventually randomly move around the whole distribution in a way that is proportional to the likelihood; the process will sample from the distribution (though each successive point is not statistically independent from the previous one).
Sounds great, but unfortunately naive MCMC does not solve our problem completely; in a high dimensional space, many more directions have decreasing probability than higher probability than have increasing probability. If we pick a direction at random, we have to move slowly or wait a long time for a good direction. Assuming an n-dimensional, approximately normal distribution, naive MCMC algorithms are O(n) in the number of steps it takes to get an independent sample. Now O(n) doesn’t sound that bad, but if you take into account the fact that calculating the likelihood is often already O(n), it means that fitting many models takes O(n**2) time. This drastically limits the models which can be fit without having a lot of MCMC expertise, or integrating over the distribution analytically.
Derivative based MCMC algorithms
MCMC sampling has many similarities with optimization. In both applications, we have an often multi-dimensional function and we are most interested in the maxima. In optimization, we want to find the highest point on the function; in MCMC sampling we want to find regions of high probability and sample in those regions. In optimization, many functions are approximately quadratic near the optima; in MCMC sampling, many distributions are near normally distributed and the log of a normal distribution is quadratic (taking the log of the distribution is something you have to do anyway). In optimization, if you have a quadratic function, many algorithms will find a maxima in 1 step or very few steps regardless of the dimensionality of the function. They do this by using the first and second derivatives of the function to find a productive direction and magnitude to move in.
There is a class of MCMC algorithms which solve the curse of dimensionality by taking a lesson from optimization and use the derivatives of the posterior distribution to inform the step direction and size. This lets them preferentially consider the directions where probability is increasing using 1st derivative information and get a measure of the shape of the distribution using 2nd derivative information. Such algorithms perform much better than naive algorithms. They take larger step sizes, mix and converge faster. With respect to the number of parameters, Langevin MCMC algorithms (which use 1st derivative information) are O(n**1/3) (link), and Stochastic-Newton algorithms (which use 1st and 2nd derivative information and are analogous to Newtons Method) are O(1) (link). A Stochastic-Newton method will independently sample an approximately normal distribution in approximately one step, regardless of the number of parameters. This opens up a huge swath of the space of possible models for fitting without needing to do much math or needing much MCMC knowledge.
Derivative based MCMC algorithms have other advantages as well.
First, both 1st and 2nd derivative methods take much larger steps than naive methods. This means it is much easier to tell whether the distribution is converging or not in normal ways. The downside of this is that such algorithms probably have different failure modes than naive algorithms and might need different kinds of convergence diagnostics.
Second, 2nd derivative algorithms are self tuning to a large extent. Because the inverse hessian of the posterior distribution represents the variance of the normal distribution which locally approximates the function, such algorithms do not need a covariance tuning parameter in order to work well.
The future of MCMC
The obvious problem with these methods is that they require derivatives which can be time consuming to calculate analytically and expensive to calculate numerically (at least O(n)). However there is an elegant solution: automatic differentiation. If you have analytic derivatives for the different component parts of a function and the analytic derivatives of the operations used to put them together, you can calculate the derivatives for the whole function using the chain rule. The components of the posterior distribution are usually well known distributions and algebraic transformations, so automatic differentiation is well suited to the task.
This approach fits in remarkably well with existing MCMC software, such as PyMC, which allow users to build complex models by combining common distributions and algebraic transformations and then allow users to select an MCMC algorithm to sample from the posterior distribution. Derivative information can be added to existing distributions so that derivative based MCMC algorithms can function.
I have taken exactly this approach for first derivative information in a PyMC branch used by my package multichain_mcmc which contains an Adaptive Langevine MCMC algorithm. I graduated a year ago with an engineering degree, and I have never formally studied MCMC or even taken a stochastic processes class; I am an amateur, and yet, I was able to put together such an algorithm for very general use; creating truly powerful algorithms for general use should pose little problem for professionals who put their mind to it.
There is a lot of low hanging fruit research fruit in this area. For example, the most popular optimization algorithms are not pure newton’s method because it is a bit fragile; the the same is likely true in MCMC, for the same reasons. Thus it is very attractive to look at popular optimization algorithms for ideas on how to create robust MCMC algorithms. There’s also the issue of combining derivative based MCMC algorithms with other algorithms with desirable properties. For example, DREAM (also available in multichain_mcmc) has excellent mode jumping characteristics; figuring out when to take DREAM-like steps for best performance is an important question.
Given its potential to make statistics dramatically more productive, I’ve seen surprisingly little research in this area. There is a huge volume of MCMC research, and as far as I can tell, not very much of it is focused on derivative based algorithms. There is some interesting work on Langevin MCMC; for example an adaptive Langevin algorithm, some convergence results, and an implicit Langevin scheme, and also some good work on 2nd derivative based methods; for example, optimization based work, some numerical work, and some recent work. But considering that Langevin MCMC came out 10 years ago much more focus is needed.
I’m not sure why this approach seems neglected. It might be that research incentives don’t reward such generally applicable research, or that MCMC researchers do not see how simplified MCMC could dramatically improve the productivity of statistics, or perhaps researchers haven’t realized how automatic differentiation can democratize these algorithms.
Whatever the issue is, I hope that it can be overcome and MCMC researchers focus more on derivative based MCMC methods in the near future. MCMC sampling will become more reliable, and troubleshooting chains when they do have problems will become easier. This means that even people who are only vaguely aware of how MCMC works can use these algorithms, bringing us closer to the promise of Bayesian statistics.
Cox showed that if you want to represent ‘degree’s of belief’ both using real numbers and consistent with classical logic, you must use probability theory (link). Bayes theorem is the theoretically correct way to update probabilities based on evidence. Bayesian statistics is the natural combination of these two facts.
Bayesian statistics two chief advantages over other kinds of statistics:
- Bayesian statistics is conceptually simple.
- This excellent book introduces statistics, some history and the whole of the theoretical foundations of Bayesian statistics in a mere 12 pages; the rest of the book is examples and methods.
- Users of classical statistics very frequently misunderstand what p-values and confidence intervals are. In contrast, posterior distributions are exactly what you’d expect them to be.
- After learning the basics, students can easily derive their own methods and set up their own problems. This is not at all true in classical statistics.
- Bayesian statistics is almost always explicitly model centric. It requires people to come up with a model which describes their problem. This has several advantages:
- It’s often very easy to build a model that’s very closely tailored to your problem and know immediately how to solve it conceptually if not practically.
- It makes it harder to be confused about what you’re doing.
- It’s easier to recognize when your assumptions are bad.
Here, I expand a little on the advantages of Bayesian statistics.
The promise of Bayesian statistics is that with Bayesian statistics, actually conducting statistical inference will only be as difficult as coming up with a good model. Statistics education will focus on modeling techniques, graphical display of data and results and checking your assumptions, rather than on tests and calculations. People with only a college class or two worth of statistics will be able to fit any kind of model they can write down. Fitting a model will be a non-event.
If Bayesian statistics is so great, why isn’t it more widely used? Why isn’t the promise of Bayesian statistics the reality of statistics? Two reasons:
- Bayesian statistics has historically not been very widely used. Scientists and engineers have grown up using classical statistics, so in order for new scientists and engineers to communicate with their peers and elders, they must know classical statistics.
- Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason for this is simple. If P is the number of parameters in the model, the posterior is a distribution in P dimensional space. In many models the number of parameters is quite large so computing summary statistics for the posterior distribution (mean, variance etc.) suffers from the curse of dimensionality. Naive methods are O(N^P).
In a future post I will explain why I think a new kind of MCMC algorithm is largely going to resolve problem #2 in the near future.
If you’d like to learn Bayesian statistics and you remember your basic calculus and basic probability reasonably well, I recommend Data Analysis by Sivia. Bayesian Statistics by Bolstad has an intro to probability theory and the useful calculus, but isn’t as useful.
Bayesian statistics is a simple and unified theory of statistical inference built directly from probability theory. It is generally treated as an advanced subject in statistics, but I think this is inappropriate. Introductory statistics courses in college should use Bayesian statistics in classes designed for engineers and other technical students.
I can think of four significant advantages to teaching Bayesian statistics instead of frequentist statistics to engineering and other technical students:
- Because Bayesian methods are derived directly from a simple and unified theory of inference technical students will find it easier to learn and retain Bayesian statistics than they would frequentist methods. This will allow students to learn and retain more.
- Bayesian statistics, generalizes more obviously than frequentist statistics. It is easy for even Bayesian novices to do inference using novel models. Because doing Bayesian statistical inference is conceptually simple, engineers and other practitioners will use statistical analysis where they would otherwise not have and be able to tackle more complex problems than they would otherwise be able to.
- Bayesian methods can sometimes squeeze more information out of data than frequentist statistics. Bayesian methods perform at least as well as, and sometimes better than, frequentist statistics, even when evaluated by frequentist criteria.
- The output in Bayesian methods is a subjective probability distribution about the parameters of interest. This makes coming up with errors extremely simple, which is valuable for engineers because it is common in engineering problems for uncertainties to be just as important as point estimates.
- Bayesian statistics leads naturally into normative decision theory. This makes it easy to give students a technical understanding of what it means to arrive at optimal decisions.
The one problem with Bayesian statistics is that solving problems using Bayesian statistics is frequently computationally expensive. However, there are relatively simple and general numerical techniques that can be used to solve problems. Additionally, engineering students are generally already familiar with using simple simple tools (Microsoft Excel and graphing calculators) to do solve problems numerically.
For historical reasons, despite Bayesian statistics’ advantages over frequentist statistics, most introductory statistics courses use frequentist methods. Until about 15 years ago, computation was expensive enough that it made Bayesian methods impractical for most appliations. Today computation is cheap enough that it is not a barrier to using Bayesian methods, so schools should no longer teach frequentist statistics first.
Happily, there are at least some undergraduate courses that do teach introductory statistics courses using Bayesian statistics, for example at Waikato University in New Zealand. Students will benefit from more schools will adopt the Bayesian perspective for their introductory courses, and I hope that schools do so quickly.
Resources for learning the Bayesian perspective on probability (the very very basics of Bayesian statistics):