STAN is looking for someone to write a Python wrapper for STAN. In the comments, Bob Carpenter has some suggestions for PyMC3:

They should include Matt’s no-U-turn sampler. And they might be interested in how we handled constrained variables and vectorizations and cumulative distributions for truncation.

Bob’s comment made me realize we haven’t talked nearly enough, so I really appreciate Bob’s comment. I have some questions and comments for Bob and the rest of the STAN team:

1. Handling constrained variables by transforming them (STAN manual, p.6): PyMC3 does not currently implement sampling on a transformed space, but that feature was actually one of the motivations for the design of PyMC3. It would be quite simple to add it, given that you have the log Jacobian determinant for the transformation. Of course, the Jacobian is non-trivial part. Any practical advice on when this has been most beneficial and by how much?
2. Vectorization: Theano does pretty sophisticated vectorization which works the same way NumPy does.
3. Cumulative distributions for truncation (STAN manual, p.145): this is a very cool idea, and should fit quite naturally in the PyMC3 framework.
4. Neal Radford had some critcism for NUTS, saying it was didn’t provide strong benefits. His criticisms seem good to me, but I haven’t thought very deeply about it. Are there good responses to his criticism? Perhaps good examples of where NUTS works significantly better than a well tuned HMC sampler? I do not usually tune HMC by changing the mass matrix rather than the step size or trajectory length.
5. Basing the scaling matrix on hessian matrix at a representative point (for example, the MAP) has often worked quite well for me. PyMC3 currently finds the hessian by differentiating the gradient numerically, but it’s also possible to calculate it analytically. Often, just the diagonal of the hessian is good enough. As I understand it, STAN is or perhaps was simply using the identity matrix as the mass matrix. Perhaps this accounts for why NUTS is so useful for STAN? This approach also allows you to supply either a covariance or hessian matrix; full or diagonal or sparse for either.