There are a couple of cool papers on Hamiltonian Monte Carlo (HMC) that I’ve found recently.
- Quasi-Newton Methods for Markov Chain Monte Carlo – The author, Y. Zhang, adapts the classic optimization algorithm BFGS to yield an algorithm for automatically building a local mass matrix for the HMC sampling from the history of posterior gradients (which are already required for HMC). There is also a limited memory (and computation cost) version of the algorithm analogous to L-BFGS which only stores an approximation of a full rank hessian matrix. This has the potential to make HMC require significantly less tuning. Currently, I usually use the hessian at the Maximum A Posteriori point has the mass matrix for HMC, but this will fail for distributions that are much more flat or peaked at the MAP than they are elsewhere in the distribution and this may be more robust.
- Split Hamiltonian Monte Carlo – The authors, Shababa, Lan, Johnson and R. Neal, develop a technique for splitting the Hamiltonian into a fast part and a slow part. In Algorithm 1, they split the Hamiltonian into a normal approximation of the posterior (centered at the MAP and with covariance based on the inverse hessian at the MAP) and a residual component. The normal approximation can be simulated exactly and will contribute nothing to the error term, hopefully decreasing the overall error. When the distribution is close to a normal distribution, this technique should have improved efficiency (per sample) which should stay relatively constant as the number of dimensions increases. In the paper, the authors show pretty significant speed increases when doing logistic regression.
I wonder whether this type of split HMC will work very poorly on distributions that have pretty hard edges (like scale parameters), or more generally, thin tails. I think it would be fairly easy to get into a situation where the exact part of the simulation wants to move into a part of the space that is of really low probability.
I would love to see a synthesis between and Zhang’s Quasi-Newton technique. A synthesis should allow both the computation (hessian related operations normally scale O(n**2)) and the acceptance probability to scale well as the number of dimensions increases.