STAN is looking for someone to write a Python wrapper for STAN. In the comments, Bob Carpenter has some suggestions for PyMC3:

They should include Matt’s no-U-turn sampler. And they might be interested in how we handled constrained variables and vectorizations and cumulative distributions for truncation.

Bob’s comment made me realize we haven’t talked nearly enough, so I really appreciate Bob’s comment. I have some questions and comments for Bob and the rest of the STAN team:

- Handling constrained variables by transforming them (STAN manual, p.6): PyMC3 does not currently implement sampling on a transformed space, but that feature was actually one of the motivations for the design of PyMC3. It would be quite simple to add it, given that you have the log Jacobian determinant for the transformation. Of course, the Jacobian is non-trivial part. Any practical advice on when this has been most beneficial and by how much?
- Vectorization: Theano does pretty sophisticated vectorization which works the same way NumPy does.
- Cumulative distributions for truncation (STAN manual, p.145): this is a very cool idea, and should fit quite naturally in the PyMC3 framework.
- Neal Radford had some critcism for NUTS, saying it was didn’t provide strong benefits. His criticisms seem good to me, but I haven’t thought very deeply about it. Are there good responses to his criticism? Perhaps good examples of where NUTS works significantly better than a well tuned HMC sampler? I do not usually tune HMC by changing the mass matrix rather than the step size or trajectory length.
- Basing the scaling matrix on hessian matrix at a representative point (for example, the MAP) has often worked quite well for me. PyMC3 currently finds the hessian by differentiating the gradient numerically, but it’s also possible to calculate it analytically. Often, just the diagonal of the hessian is good enough. As I understand it, STAN is or perhaps was simply using the identity matrix as the mass matrix. Perhaps this accounts for why NUTS is so useful for STAN? This approach also allows you to supply either a covariance or hessian matrix; full or diagonal or sparse for either.

### Like this:

Like Loading...

*Related*

## 6 comments

Comments feed for this article

March 4, 2013 at 2:29 pm

Andrew GelmanHi–regarding your fourth point above: raw Nuts can (and has) been improved, but the alternative of hand-tuning HMC can be a real mess. If you’re an HMC expert and are working on a single big hard problem, then maybe it can make sense to use intuitions and experiments to hand-tune HMC. If you want to fit all sorts of models in a general setting, I think Nuts or some other autotuned method is the way to go.

March 4, 2013 at 4:48 pm

Michael Betancourt(4) When the distribution is sufficiently simple (i.e. the typical set closely resembles a gaussian) then the there will be a globally optimal integration time, and hence epsilon/L. In this case a well-tuned implementation of “vanilla” HMC will almost always outperform NUTS, but only by a factor of two or so.

Unfortunately, more-often-than-not the complex models where HMC is motivated don’t satisfy those conditions. The optimal integration time varies with the parameters, and no amount of expert-level tuning will ensure an effective HMC implementation. The dynamic NUTS algorithm, however, can respond to these spatial variations to maintain high-performance even in challenging distributions.

Not only does NUTS identify near-optimal integration times in simple models, its utility persists in complicated models.

(5) To be super-pedantic, the MAP is often highly unrepresentative in high-dimensions where the probability mass and probability density have very little intersection. Moreover, the Hessian is not always positive definite in the neighborhood of high probability mass, which can lead to serious numerical instabilities, nor is it constant across that neighborhood.

Regardless, a well-chosen, non-trivial mass matrix actually improves the performance of NUTS as it prevents the early termination that can come about from a large spectral gap in the Hessian. Right now we’re working on building up a mass matrix from a running-average of the covariance (we’ve played with using Hessian estimates as well, but these require regularization as noted above), as well as implementing Riemannian manifold HMC which utilizes a dynamic mass matrix.

We’re able to use autodiff to compute the Hessian (or its diagonal), as well as the higher-order derivatives required for RMHMC, so we don’t have to deal with finite diffs or require analytic user implementations.

March 5, 2013 at 9:37 am

Bob Carpenter(1) We doc the Jacobians we use in the manual chapter on variable transforms. It’s useful for constrained variables because simply rejecting them when the momentum wanders into their territory is terribly inefficient. An alternative is to “bounce” off simple constraints. It’s been useful for positive-constrained variables (e.g., precision, scale), ordered constraints (cutpoints in ordinal regression), simplexes (anything Dirichlet), and correlation and covariance matrices. I don’t really see hwo you’d do simplexes, correlation or covariance matrices otherwise because bouncing would be too complex and just failing outside support too slow.

(2) I mean vectorization of the auto-dif and the functions. We can partially evaluate the derivatives themselves rather than just passing them through to the arithmetic ops and functions. This led to a speedup of a factor of 10 or so. This along with figuring out which terms in the log prob are required for the posterior up to a proportion makes a huge difference. I’m not just talking about using matrix operations instead of loops.

(3) That’s Radford Neal, not Neal Radford :-)

(4) There are examples of NUTS vs. well-tuned samplers in the paper. And keep in mind that the tuning params are very sensitive in HMC — that means a lot of work finding the right step size and number of steps. See Michael’s comments for more on the geometric issues. You definitely need at least a diagonal mass matrix (or varying step sizes) or the scale issues are deadly. The problem is just a diagonal breaks the rotation invariance. And as Michael points out, is only a global model of curvature, where examples like Neal’s funnel require local adaptation (coming soon in Stan, as in hopefully in the next few months).

In addition to this theoretical argument, we have the evidence of dozens and dozens of models at this point.

(5) Is the Hessian calculated for the mass matrix? Numerical Hessians are very slow in high dimensions, as is any kind of finite differences compared to auto-diff. And finite diffs are also very inaccurate. Finally, you need to maintain positive definiteness, though there are ways to do this.

Stan originally just did a scaled unit mass matrix, then a diagonal mass matrix, and with 1.2.0, out in a day or two, we’ll have full mass matrices. The main problem with full matrices is it slows down the leapfrog a bit, can be unstable to estimate in high dimensions, and is slow in high dimensions due to the quadratic matrix ops per leapfrog step and the need to invert the matrix a few times during adaptation (log N in number of iterations).

March 6, 2013 at 11:40 am

jsalvati@Michael

(4) Local adaptation does seem like a very useful feature. Does NUTS adjust both epsilon and L? I’ll go back and re-read the paper.

(5) Yes, that’s true. I have typically gotten around this by holding some of the parameters constant and finding the map with respect to everything else. Typically it’s not too hard to see which variables are causing problems, but you’re right that the less specialized knowledge required the better.

@Bob

(1) I’ll check those out. I hope you don’t mind it if I steal them.

(2) I’m not sure if I understand what you’re saying. Let me describe what Theano does and we’ll see if that’s different. Theano does a number of automatic graph optimizations. For example, if a, b are multidimensional arrays, and you want to calculate d/da[ a ^b] it will figure out how to do this as a scalar operations and then do a single loop through the data. Another example, it will cancel out terms which cancel (if it can tell). Theano will also, precompute constants.

In the future, I would like to have Theano compute f(q,q0) = logp(q) – logp(q0) instead of logp(q) directly, this should lead to some additional canceling. You’re saying this lead to a big improvement for you guys?

(3) Oops! Sorry Radford!

(4) I’ll definitely take a look at some of those. Looks like I should take NUTS more seriously.

(5) Numerical hessians are indeed slow in high dimensions, and I’ve run into that problem a bit myself recently. PyMC3 can compute the hessians automatically, but I haven’t played with this very much yet. Computing the hessians automatically has seemed a bit slow to me too, but maybe not as slow as numerical.

In the future, I’d like to get a graph coloring algorithm for computing the hessian. Apparently there’s been some very elegant work done recently (http://www.stce.rwth-aachen.de/twiki/pub/Teaching/Summer10/SemCPSC/jacobian_color.pdf)

Accuracy hasn’t been an issue as far as I have seen. I use numdifftools (http://code.google.com/p/numdifftools/), which works adaptively.

To maintain positive definiteness I have mostly just manually changed the matrix. You guys are right that the more we can eliminate the need for this the better.

Have you guys tried using sparse matrixes for high dimensional mass matrixes? When the hessian is sparse (e.g. time series models), I have found this quite useful.

March 7, 2013 at 10:24 am

Michael BetancourtNUTS dynamically optimizes the total integration time, tau = epsilon * L, which is the more fundamental quantity for HMC. The algorithm is also paired with a dual-averaging adaptation to statically tune epsilon alone (often the two together are referred to as “NUTS” but I like to distinguish between the different algorithms).

March 8, 2013 at 12:35 pm

Bob Carpenter(1) No, go right ahead. They’re not original. We wrote them up hoping the writeups would be useful for others.

(2) For us, it’s about removing virtual function calls, removing entries on the reverse-mode auto-dif stack, and saving shared computations. For instance, if you have y ~ normal(0,sigma) and y is a vector and sigma a scalar, then you only need to compute and differentiate log(sigma) once.

It’s also about rolling in custom derivatives done on doubles for operations like matrix division or dot products. That is, we compute the derivatives directly in terms of scalars, rather than using auto-dif to differentiate the sum of products involved in a dot product.

I didn’t understand the f(q,q0) discussion.

We also make sure to only compute as much of the log prob function as we need by dropping any constant terms in the sum.

(5) We can compute Hessians in O(N^2) where N is the number of dimensions. It’s impossible to do better than that in O() terms because there are N^2 entries. Having said that, the computations aren’t super efficient because they stack forward-mode auto-dif on top of reverse mode and evaluate the log prob function on these variables N times.

We maintain positive definiteness for our covariance matrix and correlation types by transforming from unconstrained variables. The transforms themsevles are a bit hairy (more so for correlation matrices, which go through partial correlations, than for covariance matrices, for which we use Cholesky factorization). We still haven’t vectorized these properly yet.

We have not tried using sparse or low-rank or combined approaches to high-dimensional mass matrix estimation, but we’ve thought about it.

I’ll check out the graph coloring paper. We’ve coded all of our Jacobians by hand. For Hessians, it’d be nice to save some shared computations, because the overall computation seems very redundant.