I would find it really useful to have a simple concrete mathematical model that demonstrates monetary disequilibrium. I could use it to troubleshoot my intuitions about monetary matters, develop new and better intuitions, and better explain the logic of monetary disequilibrium. Unfortunately, I haven’t run across such a model and it looks like my current math and modeling skills are insufficient to produce one myself. Does anyone have a paper, book or post that presents a mathematical model of monetary disequilibrium suited for at least one of these purposes?

Here’s an example of what I would expect such a model to look like:

An economy with a large number of agents of two types, each type producing a different good, and an infinite number of periods. Both types of agent have the same type of utility function, which has a term for how much of each good they consume in each period, a term for how much of their production good they produce, and a term for the utility of money which is proportional to the amount they spend in each period. There should be some set of prices that characterizes total equilibrium. We can investigate the effects of monetary disequilibrium by seeing how different price paths influence different agents’ utility, production and consumption over time.

Karl Smith claims

Money does not create anything. Value stored as money is value lost; lost because it represents resources not directed towards capital.

There is some truth to what he says, but this claim is false. It’s not that investing in money doesn’t actually cause an increase in the capital available; it’s that it happens to invest in a not very productive way.
As I’ve said before, I think it’s useful to think of central banks as private “producers of money” (who happen to have a monopoly and aren’t motivated primarily by profit). Think of dollars as a product built and sold by the Fed. What does the Fed use to produce dollars? They use government bonds. They take them and use them to make credible their promise that dollars will maintain their value. This isn’t the only way money can be produced; other financial assets could be used, such as a basket of stocks. Because of this, investing in money is effectively the same as investing in whatever asset the central bank uses to produce its money (albeit at a worse interest rate).
Assuming the Fed approximately expands the dollar supply when demand to hold dollars goes up (and vice versa), an increase in demand to hold money means the Fed buys whatever asset they use to produce money. This causes an increase in the demand for that asset. You might not think that an increase in demand for government bonds causes good investment on the margin, but it’s also not wasted completely. How much waste there is depends on: 1) the government bond supply curve 2) the elasticity of demand between government bonds and other assets 3) how good the government is at doing productive investments relative to private investment.

That said, it’s conceptually easy to make money a poor store of value: give it a large negative interest rate. This is necessary when the asset used to produce money (normally government bonds) has a low or negative interest rate, in order to avoid having the central bank subsidize people’s holding of money.

Radford Neal and others had some interesting responses to my question about why Hamiltonian MCMC (HMC) might be better than Langevin MCMC (MALA). The gist of it seems to be that HMC is less random-walk like and thus mixes faster and scales better with the number of dimensions.

Radford points to a survey paper of his (link) which discusses how the momentum distribution should be adjusted for changes in the scaling of the probability distribution (p. 22). This is something which I didn’t see last time I looked at HMC, and it’s necessary for an adaptive HMC algorithm. General use sampling algorithms can benefit a lot from being adaptive.

It also discusses tuning the step-count and step-size. This sounds rather difficult and non-linear.
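To make the two tuning knobs concrete, here is a minimal sketch of a single-parameter HMC sampler using the standard leapfrog integrator. It is not the algorithm from the survey paper or from multichain_mcmc; the standard normal target, the function names and the default step_size and n_steps values are all assumptions chosen for illustration.

```python
import math
import random

def log_p(x):
    # Log density of the assumed example target (standard normal), up to a constant.
    return -0.5 * x * x

def grad_log_p(x):
    # Derivative of log_p.
    return -x

def hmc_step(x, step_size, n_steps, rng):
    # Draw a fresh momentum, then follow the Hamiltonian dynamics with leapfrog.
    p0 = rng.gauss(0.0, 1.0)
    x_new, p = x, p0
    p += 0.5 * step_size * grad_log_p(x_new)       # initial half step for momentum
    for i in range(n_steps):
        x_new += step_size * p                      # full step for position
        if i < n_steps - 1:
            p += step_size * grad_log_p(x_new)      # full step for momentum
    p += 0.5 * step_size * grad_log_p(x_new)       # final half step for momentum
    # Metropolis correction for the integrator's discretization error.
    h_old = -log_p(x) + 0.5 * p0 * p0
    h_new = -log_p(x_new) + 0.5 * p * p
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x

def run_hmc(n_samples, step_size=0.2, n_steps=8, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        x = hmc_step(x, step_size, n_steps, rng)
        samples.append(x)
    return samples

samples = run_hmc(5000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The product step_size * n_steps sets how far each trajectory travels, which is why the two parameters interact and are awkward to tune separately.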

I am going to try to implement an adaptive HMC algorithm in my multichain_mcmc package. I’d like to make this algorithm adaptive as I’ve done for my MALA implementation, though in general, this needs to be done carefully (see Atchadé and Rosenthal 2005).

I’m interested in RM-HMC as it promises automatic scale tuning and better efficiency scaling with high dimensions, but it looks like understanding it requires differential geometry, which I haven’t yet worked through. I believe it also requires 2nd derivatives (which provide scale information), which I haven’t yet figured out how to implement in an efficient and generic manner for PyMC. I suspect that would require a fork and redesign of PyMC.

Economists frequently mention the idea of an Optimal Currency Area. Krugman does it. Barry Eichengreen does it. Even monetary equilibriumist Nick Rowe does it.

As I understand it, the idea is that monetary policy helps alleviate recessions. Because one area can be in a boom and another in a bust at the same time, it is useful to have small currency areas because then you can have more finely tuned monetary policy. This pushes the currency area that maximizes benefits (the optimal currency area) smaller. The fact that arranging trade with different currencies can be more expensive and that areas can have correlated business cycles pushes the optimal currency area bigger.

If you understand monetary economics from a monetary-equilibrium perspective, this should strike you as exceedingly odd.

First, let’s make some important distinctions. Let’s say a “recession” is a temporary decline in the production of market goods, without specifying its cause. The monetary equilibrium theorists note that a decrease in the quantity of money relative to the demand for money can cause such a temporary decline in production and has a negative effect on welfare (explanation). Any given recession might be due to monetary disequilibrium and/or other effects.

Monetary equilibrium theory implies that relieving monetary disequilibrium by adjusting the quantity of money to reflect changes in the demand for money is welfare enhancing because it avoids price adjustment costs as well as the costs of non-equilibrium production.

However, monetary equilibrium theory does not suggest that adjusting the quantity of money to respond to (temporary or non-temporary) changes in production for reasons other than monetary disequilibrium is welfare enhancing. If production of market goods falls because of a real productivity shock, increasing the quantity of money to compensate increases market good production but is welfare reducing because it adds adjustment costs and moves market good production away from its equilibrium level.

Thus, if Optimal Currency Areas are to make sense from a monetary disequilibrium perspective, it must be that different areas in the same currency zone can have monetary disequilibrium in the opposite directions.

The major purpose of the financial system is to move money (and other assets) from those who want them relatively less to those who want them relatively more. People who want to hold money relatively more than others borrow or sell assets and vice versa.

If the financial system is not doing this, then we already have two different currencies. Monetary policy conducted in the first area doesn’t have much of an effect on the second area and vice versa. The same bills in the first area may have a totally different price than in the second area. Making these two kinds of currency more readily distinguishable (by changing the “currency area”) would only make it harder for the whole economy to come to equilibrium.

I often see people express the idea that the production or destruction of money must necessarily cause problems for the economy because that money does not “represent new real wealth”. There are many variants of this notion, such as that “good money” must “represent” some real asset (like gold).

However, this notion is fundamentally confused.

First, notice that as a method of economic reasoning, “representation” is not great; there is no deep economic notion of “representation”. At best it could be a heuristic: you notice that money is not connected to particular real projects, think “huh, that’s weird” and decide to investigate further.

Next, notice that financial assets in general do not derive their value from “representing” some project or another. A financial asset derives its value from another party’s credible promise that the holder of the financial asset may receive something of value at some point in the future. For example, a corporation may issue bonds to undertake a new project and these bonds will have value, but the value is not derived from the project, the value is derived from the promise the corporation gives that the bonds will be honored. Such corporate bonds would have the same value whether the corporation issued them for a new profitable project or an unprofitable project or because of a clerical error, and they would cease to have value if the corporation’s promise went away.

Financial assets exist because they are useful to the issuer, the holder, or both. Bonds allow businesses to undertake projects or smooth out cash flows; stocks allow businesses to get initial capital and allow investors to store resources; money helps lower transaction costs for people.

Finally, note that a financial asset is an asset to one party (the holder) and a liability to another party (the issuer). The subjective value of the asset to the holder may be larger or smaller than the subjective value of the liability to the issuer.

The US dollar has value because there are implicit (but credible) promises that it can be exchanged for something of value. These promises come from two sources: 1) the general public because they currently accept money as payment for other things of value 2) the Federal Reserve because they implicitly promise that they will trade dollars for something else of value in order to make sure that dollars continue to be valuable. Like other financial assets, its value has nothing to do with whether it represents real assets or not, and whether the economy would be better off with more or less of it has nothing to do with whether it “represents” real projects.

This was an attempt to address a popular confusion. I’m not totally satisfied with it, so if you have suggestions on how to improve it or know an article that does it better, let me know.

Silas Barta and I have a long ongoing debate (parts 1, 2 and 3; part 2 actually comes before part 1) about monetary economics, 2008 recession policy and the views of mainstream macro-economists. This post summarizes the progress of the debate:

Theoretical issues

I think I’ve convinced Silas of a couple of things: I’ve clarified the mechanism by which monetary disequilibrium works. I’ve convinced Silas that the non-monetary impacts of conducting monetary policy, meaning buying and selling financial assets with newly created money, are not large. I have convinced Silas that having the Fed try to adjust the quantity of money to accommodate changes in the demand for money is not a terrible policy, though I don’t think I’ve convinced him it’s a good policy.

Silas has convinced me that the possibility of a decrease in the demand for money due to a decrease in market activity (for example, a shift towards consuming leisure instead of consumer goods) should be taken seriously. I think the evidence strongly indicates that’s not what’s going on right now, but a good monetary system should be able to handle such a change. I am not sure what kind of rule would deal well with this case as well as more conventional cases.

2008 recession policy

Silas and I still disagree about whether the evidence suggests that a high demand for money relative to the quantity of money has been a major problem over the last ~2 years. I haven’t convinced Silas that TARP and similar policies are basically independent of monetary policy, meaning not recommended (or disrecommended) by standard macro as well as implementable independently of monetary policy. Silas and I also disagree about how bad TARP and similar policies were. I claim that they were not terrible but not great. Silas seems to think they were terrible, but I am not clear on why.

Mainstream macro-economists’ view of the world

Silas and I still disagree about whether mainstream macro-economists see surface level economic statistics (inflation, GDP, spending, loans, interest rates, unemployment etc.) as ends in themselves, rather than as indicators of the state of the economy. I say it is obvious that mainstream macro-economists understand this distinction, while Silas maintains he doesn’t see any evidence they do. Silas and I do agree that many mainstream macro-economists have a poor understanding of monetary economics, so that even if they do understand the distinction between surface level statistics and actual welfare, much of their advice will be bad.

Previously, I discussed the features which distinguish money from other goods (Money as a good), why you should view most money as a branded product, and how that affects the perspective you should take on central bank actions (Money as a product). I showed that it makes sense to talk about the best quantity of a particular money in the economy. Now I want to discuss one important process that affects what the best quantity of money is.

This process is called “monetary disequilibrium”, “excess cash balances mechanism” and probably some other things as well. The Keynesian concept of the “Paradox Of Thrift” is related, though less well developed. I will first describe the process informally. In later posts I will describe it more formally. My intent here is to give an intuitive explanation of the basics of monetary disequilibrium.

The real quantity of money that people would like to hold in equilibrium can change over time. Because prices are sticky this can have real effects in the economy. To see how, consider an economy initially at equilibrium with a fixed quantity of money and prices that adjust to changes only after some time (sticky prices). Some people in the economy decide they want to hold higher money balances than they had in the past:

When people hold less money than they would like, they try to increase their holdings of money in two ways: 1) try to reduce their spending 2) try to increase their income. The quantity of money is fixed, so if one person holds a higher nominal quantity of money than before, all others must hold a lower quantity of money than before in aggregate. Prices are fixed, so this is also true for the real quantity of money. When one person reduces their spending, they reduce the income of all others in aggregate. Unless those others desire to hold less money than before, they now hold less money than they would like. Now those others also try to increase their money holdings by the same means. This is a vicious circle, and aggregate spending and incomes decline. The circle ends when people no longer want to cut their spending to achieve higher money balances.

There are two effects which determine how far this process proceeds. 1) The quantity of money that people want to hold is positively related to the quantity people expect to spend, so as people expect to spend less they will need to hold somewhat less money. 2) As people reduce their spending, those reductions become more painful, so people will be more reluctant to trade off consumption for increased money balances.

This process reduces the real quantity of market transactions below its equilibrium level. The real quantity of market transactions can only return to normal when prices have adjusted to the new equilibrium, so that people can hold higher real money balances given the fixed nominal quantity of money.
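The vicious circle described above can be sketched as a toy calculation. This is not a proper model, only an illustration under strong assumptions that are not in the argument itself: desired money balances are a fixed fraction k of aggregate spending, prices and the money stock stay fixed, and each period people cut spending by a fixed fraction (adjust) of the gap between desired and actual balances. It captures only the first of the two effects (desired balances fall as spending falls). The function name and all numbers are made up.

```python
def spending_path(money, k_old, k_new, adjust=0.5, periods=40):
    # Start at the old equilibrium, where desired balances (k_old * spending)
    # equal the fixed money stock.
    spending = money / k_old
    path = [spending]
    for _ in range(periods):
        shortfall = k_new * spending - money   # desired minus actual money balances
        spending -= adjust * shortfall          # cut spending to try to rebuild balances
        path.append(spending)
    return path

# Desired balances rise from 25% to 30% of spending; money stock fixed at 100.
path = spending_path(money=100.0, k_old=0.25, k_new=0.30)
new_resting_point = 100.0 / 0.30   # spending at which people stop cutting
print(path[0], path[-1], new_resting_point)
```

In this sketch aggregate spending falls from 400 toward roughly 333 and stays there until prices (which the sketch holds fixed) adjust.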

This is the foundational insight of money-based macroeconomics. For some reason this process is not explained in introductory macroeconomics classes, nor commonly discussed by mainstream macro-economists. I believe understanding this logic is critical for understanding the effect of money in the economy and for understanding macroeconomic fluctuations.

Arnold Kling constantly says things that give me the impression that he does not really grok the money-based macro theories he criticizes. For example, he once stated

Pretty much everything in AS/AD is riding on the hypothesis that labor supply is highly elastic at the nominal wage and labor demand is reasonably elastic at the real wage.

Depending on what exactly he meant, this is either false or very misleading. There are certainly people who think it works this way, macro-economists even, but as Nick Rowe has explained, explanations that rely on the first order effects of real prices do not make sense. The only foundation for AS/AD-like models that makes any sense is some kind of monetary-disequilibrium theory. In a monetary disequilibrium theory (Sumner calls it the excess cash balances mechanism), if people hold lower real money balances than they would like, they try to accumulate higher money balances by reducing their spending or trying to increase their sales. Since one person’s spending is another’s income, an overall increase in the demand for money without an increase in the supply of money will lead to a decrease in overall spending (you can also call this a decrease in AD, though I don’t see the use).

The latest example is here (#2) (I was a tad too rude in the comments, and I apologize for that)

Yesterday in my high school econ class, I found myself trying to explain why having a separate currency that could depreciate would enable the PIIGS to live happily ever after. I made the textbook argument, but I found myself not so convinced. OK, so maybe you can tell a story where one country that has a recession and a large fiscal deficit would be better off with devaluation. But there are so many countries in that position right now, and they cannot all devalue.

Speaking of “cannot all devalue,” doesn’t the impact of the PIIGS crisis completely nullify QE2? If the dollar appreciates 10 percent and the foreign sector is 10 percent of the economy, then that represents 1 percent disinflation, which probably more than wipes out any inflationary impact of the Fed’s new bond buying program.

To me this just screams “missing the point”. Exchange rate effects are not how coherent money-based macro works. Neither are the traditional income/substitution effects (unless you mean substitution towards holding money). It’s monetary disequilibrium.

In my last post, Cyan brought up the issue that many practitioners of statistics might object to using prior information in Bayesian statistics. The philosophical case for using prior information is very strong, and I think most people intuitively agree that using prior information is legitimate, at the very least in selecting what kinds of models to consider. I think most statistics users would be OK with using prior information when there is some kind of objective prior distribution. However, people justifiably worry about bias or overconfidence on the part of the statistician; people don’t want the results of statistics to depend much on the identity of the statistician.

In practice, this problem is not too hard to sidestep. There are at least two approaches:

The first is to include significantly less prior information than is available, to make statistical inference robust to bias and overconfidence. The two common approaches to this are to use weakly informative priors or non-informative/maximum entropy priors. Weakly informative priors are very broad distributions that still include some prior information that almost no one would object to. For example, if you’re estimating the strength of a metal alloy, you might choose a prior distribution that expresses your belief that the alloy will probably be stronger than tissue paper but weaker than a hundred times the strength of the strongest known material. Maximum entropy priors represent the minimum physically possible to know about the parameters of interest.

The second is to do the calculations using several different prior distributions that different consumers of the statistics might think are relevant. This accomplishes something like a sensitivity analysis for the prior distribution. For example, you might include a non-informative distribution, a weakly informative distribution and a very concentrated prior distribution. This allows people with different prior opinions to choose the result that makes the most sense to them.
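A minimal sketch of the second approach, assuming a very simple setting chosen only for illustration: estimating a success probability from binomial data with conjugate Beta priors. The three prior settings and the data (14 successes in 20 trials) are made up.

```python
def beta_binomial_posterior_mean(prior_a, prior_b, successes, trials):
    # Conjugate update: a Beta(a, b) prior with binomial data gives a
    # Beta(a + successes, b + failures) posterior.
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    return post_a / (post_a + post_b)

# Hypothetical priors that different consumers of the result might prefer.
priors = {
    "flat": (1.0, 1.0),
    "weakly informative": (2.0, 2.0),
    "very concentrated": (50.0, 50.0),
}
successes, trials = 14, 20   # made-up data
results = {
    name: beta_binomial_posterior_mean(a, b, successes, trials)
    for name, (a, b) in priors.items()
}
for name, mean in results.items():
    print(f"{name}: posterior mean = {mean:.3f}")
```

Reporting all three numbers side by side shows how much the conclusion depends on the prior: here the two broad priors nearly agree, while the concentrated prior pulls the estimate strongly toward 0.5.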

This post will be more technical than my previous post; I will assume familiarity with how MCMC sampling techniques for sampling from arbitrary distributions work (an overview starts on page 24, this introduction is more detailed). This post is about a specific class of MCMC algorithms: derivative based MCMC algorithms. I have two goals here: 1) to convince people that derivative based MCMC algorithms will have a profound effect on statistics and 2) to convince MCMC researchers that they should work on such algorithms. The goal of my previous post was to provide motivation for why good MCMC algorithms are so exciting.

A friend of mine suggested that this post would make the basis of a good grant application for statistics or applied math research. I can only hope that he is correct and someone picks up that idea. I’d do anything I could to help someone doing so.

Some background

In my last post, I mentioned that one of the things holding Bayesian statistics back is the curse of dimensionality:

Although Bayesian statistics is conceptually simple, solving for the posterior distribution is often computationally difficult. The reason for this is simple. If P is the number of parameters in the model, the posterior is a distribution in P dimensional space. In many models the number of parameters is quite large so computing summary statistics for the posterior distribution (mean, variance etc.) suffers from the curse of dimensionality. Naive methods are O(N^P).

The posterior distribution is a probability distribution function over the whole space of possible parameter values. If you want to integrate numerically over a distribution with 10 parameters to calculate some statistic, say the mean, and you split up the space into 20 bins along each parameter-dimension, you will need a 10 trillion element array. Working with a 10 trillion element array is very expensive in terms of both memory and computer time. Since many models involve many more than 10 parameters and we’d like to have higher resolution than 1 in 20, this is a serious problem.
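The arithmetic behind that figure is easy to check. The sketch below assumes one 8-byte floating point number per grid cell, which is an assumption for illustration and not something fixed by the problem.

```python
def grid_cells(n_params, bins_per_dim):
    # A full grid has bins_per_dim cells along each of n_params dimensions.
    return bins_per_dim ** n_params

cells = grid_cells(n_params=10, bins_per_dim=20)
bytes_needed = cells * 8   # assuming one 8-byte float per cell

print(cells)                      # 10240000000000, about 10 trillion
print(bytes_needed / 1e12, "TB")  # about 82 terabytes
```

Each additional parameter multiplies the array size by another factor of 20.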

Instead of integrating directly over the space, we can use Monte Carlo integration: sample from this probability distribution and use the samples to calculate our statistic (for example, averaging the points to calculate the mean). Markov Chain Monte Carlo (MCMC) can be used to sample from any probability distribution. MCMC works by starting from an arbitrary point in the space and then picking a random point that’s nearby. If that point is more likely than the current point, then that point is adopted as the current point. If it’s less likely than the current point, then it may still be adopted with a probability depending on the ratio of the likelihoods. If certain criteria are met (detailed balance), this process will eventually randomly move around the whole distribution in a way that is proportional to the likelihood; the process will sample from the distribution (though each successive point is not statistically independent from the previous one).
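The procedure just described can be sketched in a few lines. This is a bare-bones random-walk Metropolis sampler for a single parameter; the standard normal target, the step size and the function names are assumptions picked for illustration.

```python
import math
import random

def log_density(x):
    # Log of the assumed example target (standard normal), up to a constant.
    return -0.5 * x * x

def metropolis(n_samples, step=2.0, seed=0):
    rng = random.Random(seed)
    x = 0.0                      # arbitrary starting point
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)          # random nearby point
        log_ratio = log_density(proposal) - log_density(x)
        # Always accept a more likely point; accept a less likely one
        # with probability equal to the ratio of the densities.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)        # a rejected proposal repeats the current point
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)                   # Monte Carlo estimate of the mean
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Averaging the samples gives the Monte Carlo estimate of the mean, with no grid over the parameter space required.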

Sounds great, but unfortunately naive MCMC does not solve our problem completely; in a high dimensional space, many more directions have decreasing probability than have increasing probability. If we pick a direction at random, we have to move slowly or wait a long time for a good direction. Assuming an n-dimensional, approximately normal distribution, naive MCMC algorithms are O(n) in the number of steps it takes to get an independent sample. Now O(n) doesn’t sound that bad, but if you take into account the fact that calculating the likelihood is often already O(n), it means that fitting many models takes O(n**2) time. This drastically limits the models which can be fit without having a lot of MCMC expertise, or integrating over the distribution analytically.

Derivative based MCMC algorithms

MCMC sampling has many similarities with optimization. In both applications, we have an often multi-dimensional function and we are most interested in the maxima. In optimization, we want to find the highest point on the function; in MCMC sampling we want to find regions of high probability and sample in those regions. In optimization, many functions are approximately quadratic near the optimum; in MCMC sampling, many distributions are nearly normally distributed and the log of a normal distribution is quadratic (taking the log of the distribution is something you have to do anyway). In optimization, if you have a quadratic function, many algorithms will find the maximum in 1 step or very few steps regardless of the dimensionality of the function. They do this by using the first and second derivatives of the function to find a productive direction and magnitude to move in.
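As a small illustration of the one-step claim, here is a sketch of a single Newton step on a quadratic function. To keep it free of linear algebra it assumes the quadratic has a diagonal Hessian, so each coordinate can be handled separately; the function name and the test values are made up.

```python
def newton_step_diagonal(x, center, curvature):
    # Function being maximized: f(x) = -0.5 * sum(c_i * (x_i - m_i)**2)
    # with c_i > 0, so the gradient is -c_i * (x_i - m_i) and the
    # (diagonal) Hessian entry is -c_i.
    new_x = []
    for xi, mi, ci in zip(x, center, curvature):
        grad = -ci * (xi - mi)
        hess = -ci
        new_x.append(xi - grad / hess)   # Newton update, coordinate by coordinate
    return new_x

# The same single step reaches the maximum in 3 dimensions and in 1000.
for n in (3, 1000):
    center = [float(i) for i in range(n)]
    curvature = [1.0 + 0.5 * i for i in range(n)]
    start = [100.0] * n
    result = newton_step_diagonal(start, center, curvature)
    error = max(abs(r - m) for r, m in zip(result, center))
    print(n, error)
```

The curvature information rescales each coordinate's move, which is why the step lands on the maximum no matter how differently scaled the dimensions are.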

There is a class of MCMC algorithms which address the curse of dimensionality by taking a lesson from optimization and using the derivatives of the posterior distribution to inform the step direction and size. This lets them preferentially consider the directions where probability is increasing using 1st derivative information and get a measure of the shape of the distribution using 2nd derivative information. Such algorithms perform much better than naive algorithms. They take larger step sizes, mix faster and converge faster. With respect to the number of parameters, Langevin MCMC algorithms (which use 1st derivative information) are O(n**(1/3)) (link), and Stochastic-Newton algorithms (which use 1st and 2nd derivative information and are analogous to Newton’s method) are O(1) (link). A Stochastic-Newton method will independently sample an approximately normal distribution in approximately one step, regardless of the number of parameters. This opens up a huge swath of the space of possible models for fitting without needing to do much math or needing much MCMC knowledge.
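To show how 1st derivative information enters, here is a minimal single-parameter sketch of a Langevin (MALA) sampler. It is not the adaptive algorithm in multichain_mcmc; the standard normal target, the step size eps and the function names are assumptions for illustration.

```python
import math
import random

def log_p(x):
    # Log of the assumed example target (standard normal), up to a constant.
    return -0.5 * x * x

def grad_log_p(x):
    return -x

def log_q(to, frm, eps):
    # Log density (up to a constant) of proposing `to` when standing at `frm`.
    mean = frm + 0.5 * eps * eps * grad_log_p(frm)
    return -0.5 * ((to - mean) / eps) ** 2

def mala(n_samples, eps=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        # Drift the proposal uphill along the gradient, then add noise.
        proposal = x + 0.5 * eps * eps * grad_log_p(x) + eps * rng.gauss(0.0, 1.0)
        # The proposal is not symmetric, so the acceptance ratio needs
        # both the forward and the reverse proposal densities.
        log_alpha = (log_p(proposal) + log_q(x, proposal, eps)) \
                    - (log_p(x) + log_q(proposal, x, eps))
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = proposal
        samples.append(x)
    return samples

samples = mala(20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The only difference from a naive random-walk sampler is the gradient drift term in the proposal and the matching correction in the acceptance ratio.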

Derivative based MCMC algorithms have other advantages as well.

First, both 1st and 2nd derivative methods take much larger steps than naive methods. This means it is much easier to tell whether the chain is converging or not using the normal diagnostics. The downside of this is that such algorithms probably have different failure modes than naive algorithms and might need different kinds of convergence diagnostics.

Second, 2nd derivative algorithms are self tuning to a large extent. Because the inverse of the Hessian of the negative log posterior gives the covariance of the normal distribution which locally approximates the posterior, such algorithms do not need a covariance tuning parameter in order to work well.

The future of MCMC

The obvious problem with these methods is that they require derivatives, which can be time consuming to calculate analytically and expensive to calculate numerically (at least O(n)). However, there is an elegant solution: automatic differentiation. If you have analytic derivatives for the different component parts of a function and the analytic derivatives of the operations used to put them together, you can calculate the derivatives for the whole function using the chain rule. The components of the posterior distribution are usually well known distributions and algebraic transformations, so automatic differentiation is well suited to the task.
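A minimal sketch of the idea, using forward-mode automatic differentiation with "dual numbers" that carry a value and its derivative through each operation. This is not how the PyMC branch implements derivatives; the Dual class, the helper names and the example values are all hypothetical.

```python
import math

class Dual:
    """A value together with its derivative with respect to one input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants (derivative zero).
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        # Product rule.
        other = Dual.lift(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def normal_logpdf(x, mu, sigma):
    # Built only from +, - and *, so it works on floats and on Dual numbers.
    z = (x - mu) * (1.0 / sigma)
    return -0.5 * (z * z) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def derivative(f, x):
    # Seed the input with derivative 1 and read the derivative off the output.
    return f(Dual(x, 1.0)).deriv

slope = derivative(lambda x: normal_logpdf(x, mu=1.0, sigma=2.0), 1.5)
print(slope)   # analytic answer: -(x - mu) / sigma**2 = -0.125
```

Each operation only needs to know its own derivative rule; the chain rule assembles the derivative of the whole log density automatically.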

This approach fits in remarkably well with existing MCMC software, such as PyMC, which allow users to build complex models by combining common distributions and algebraic transformations and then allow users to select an MCMC algorithm to sample from the posterior distribution. Derivative information can be added to existing distributions so that derivative based MCMC algorithms can function.

I have taken exactly this approach for first derivative information in a PyMC branch used by my package multichain_mcmc, which contains an Adaptive Langevin MCMC algorithm. I graduated a year ago with an engineering degree, and I have never formally studied MCMC or even taken a stochastic processes class; I am an amateur, and yet I was able to put together such an algorithm for very general use. Creating truly powerful algorithms for general use should pose little problem for professionals who put their mind to it.

There is a lot of low hanging research fruit in this area. For example, the most popular optimization algorithms are not pure Newton’s method because it is a bit fragile; the same is likely true in MCMC, for the same reasons. Thus it is very attractive to look at popular optimization algorithms for ideas on how to create robust MCMC algorithms. There’s also the issue of combining derivative based MCMC algorithms with other algorithms with desirable properties. For example, DREAM (also available in multichain_mcmc) has excellent mode jumping characteristics; figuring out when to take DREAM-like steps for best performance is an important question.

Given its potential to make statistics dramatically more productive, I’ve seen surprisingly little research in this area. There is a huge volume of MCMC research, and as far as I can tell, not very much of it is focused on derivative based algorithms. There is some interesting work on Langevin MCMC (for example, an adaptive Langevin algorithm, some convergence results, and an implicit Langevin scheme) and also some good work on 2nd derivative based methods (for example, optimization based work, some numerical work, and some recent work). But considering that Langevin MCMC came out 10 years ago, much more focus is needed.

I’m not sure why this approach seems neglected. It might be that research incentives don’t reward such generally applicable research, or that MCMC researchers do not see how simplified MCMC could dramatically improve the productivity of statistics, or perhaps researchers haven’t realized how automatic differentiation can democratize these algorithms.

Whatever the issue is, I hope that it can be overcome and MCMC researchers focus more on derivative based MCMC methods in the near future. MCMC sampling will become more reliable, and troubleshooting chains when they do have problems will become easier. This means that even people who are only vaguely aware of how MCMC works can use these algorithms, bringing us closer to the promise of Bayesian statistics.
