81 – The cult of the asterisk

Researchers in many fields rely heavily on statistics, but they often apply statistical methods mechanically, and sometimes misrepresent what the results really mean. In particular, statistical “significance”, while a useful concept, can be a poor indicator of the importance of a variable in an economic or management sense.

Statistical analysis is a standard and essential tool of researchers in most scientific disciplines. Statistical methods can be things of beauty and power. They allow us to make rigorous statements about the probabilities of certain ideas being true, based on the evidence embedded within a set of data. Unfortunately, statistics are often applied in a mechanical way, and this can lead to problems.

For example, in the early days of statistics, someone decided that it would be reasonable to choose 5% as the cut-off point for uncertainty about the idea being tested. If the statistics showed that there was less than a 5% probability of being in error when concluding, for example, that there was a positive relationship between fertilizer input and crop yield, then we would accept that there probably is a relationship. If the probability of being in error was more than 5%, we would conclude that the idea was not supported. (Strictly speaking, we would merely fail to reject the assumption that there is no relationship, rather than proving the idea false, although in practice this is usually taken as evidence that it is not true.)

Of course, the 5% cut-off is just an arbitrary choice. Why not 10%, or 1%, or 3.3%? Recognising this arbitrariness, researchers often put asterisks next to their statistical results to indicate just how low the cut-off can be set and still conclude that the result is “significant”: e.g. one asterisk for 10%, two asterisks for 5%, three asterisks for 1%. The more asterisks, the better.
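The asterisk convention described above is simple enough to sketch in a few lines of code. This is just an illustration of the convention as stated in the text (one asterisk for 10%, two for 5%, three for 1%); the function name is mine.

```python
# A minimal sketch of the asterisk convention: map a p-value to the
# conventional significance stars, using the 10% / 5% / 1% cut-offs
# mentioned in the text.

def stars(p_value):
    """Return conventional significance asterisks for a p-value."""
    if p_value < 0.01:
        return "***"   # significant at the 1% level
    if p_value < 0.05:
        return "**"    # significant at the 5% level
    if p_value < 0.10:
        return "*"     # significant at the 10% level
    return ""          # not significant at any conventional level

print(stars(0.003))  # ***
print(stars(0.04))   # **
print(stars(0.08))   # *
print(stars(0.30))   # (no asterisks)
```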

While that strategy avoids some of the arbitrariness in the general approach, it doesn’t get away from another problem: that this approach to testing the truth of an idea is unbalanced in the way it deals with different sorts of potential errors.

To illustrate, return to the example of testing for a relationship between fertilizer input and crop yield. In the standard approach to statistics, we start by assuming that there is no relationship (that the slope of the relationship is zero) and test whether this appears wrong. A zero slope is established as the point of comparison.

[From this point, the way that standard statistics proceeds can be a bit hard to get your mind around. I’ll warn you that the next paragraph might be a bit of a brain twister. I can’t make it any simpler, because it is trying to represent the way statistics actually operates.]

We then ask ourselves the following: assuming that the slope actually is zero, what is the probability that a non-zero slope as big as the one we observe in the data set would occur just by chance, as a result of random fluctuations? The bigger the observed slope, the less likely it is that it could have occurred just by chance, and therefore the more likely it is that the slope really is non-zero. If the probability of getting the observed slope by sheer chance is less than 5%, we reject the starting assumption that the slope is zero.
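The test described in the paragraph above can be sketched numerically for the fertilizer example. The data below are simulated, the variable names are illustrative, and the p-value uses a normal approximation to the t distribution (a reasonable simplification for a sample of this size):

```python
import math
import random

def slope_test(x, y):
    """OLS slope, standard error, t-statistic, and an approximate
    two-sided p-value for the null hypothesis that the slope is zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                                # estimated slope
    a = my - b * mx                              # estimated intercept
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se = math.sqrt(s2 / sxx)                     # std. error of the slope
    t = b / se
    # Normal approximation to the t distribution (fine for n around 100).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return b, se, t, p

rng = random.Random(1)
fert = [rng.uniform(0, 100) for _ in range(100)]            # fertilizer input
crop = [2.0 + 0.05 * f + rng.gauss(0, 2) for f in fert]     # crop yield

b, se, t, p = slope_test(fert, crop)
print(f"slope = {b:.3f}, t = {t:.1f}, p = {p:.2g}")
```

Because the simulated slope (0.05) is large relative to the noise, the probability of seeing such a slope by sheer chance comes out far below 5%, so the zero-slope assumption is rejected.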

This implies that, if we looked at lots of examples where the slope really was zero, we would mistakenly reject the idea of a zero slope 5% of the time (and we’d correctly accept that there is a zero slope the other 95% of the time). Clearly, this approach is conservative in avoiding the error of concluding that there is a slope when there isn’t one. (This is the so-called Type-I error that we are taught about in statistics courses.)

On the other hand, if there actually is a positive slope, the approach has a tendency to lead you to a conclusion that there isn’t one (a Type-II error). If the slope isn’t big enough, we conclude that there is no slope, rather than concluding that there is a low slope. There is, in a sense, a bias towards accepting that there is no slope.

The approach effectively gives a high weight to avoiding Type-I errors, but pays little or no attention to Type-II errors. But who’s to say that Type-I errors are much more important than Type-II errors? In reality, Type-II errors could easily be more important in an economic sense.
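The asymmetry between the two error types can be demonstrated with a small simulation. The sample size, slope, and noise level below are arbitrary choices of mine, and the 5% critical value is the approximate t cut-off for 28 degrees of freedom:

```python
import math
import random

def rejects_zero_slope(x, y, crit=2.048):
    """True if OLS rejects a zero slope at ~5% (crit is roughly the
    5% two-sided t critical value for n = 30 observations)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return abs(b) / math.sqrt(s2 / sxx) > crit

rng = random.Random(42)
x = list(range(30))     # fixed design, n = 30
reps = 2000

# True slope zero: the fraction rejected is the Type-I error rate (~5%).
type1 = sum(
    rejects_zero_slope(x, [rng.gauss(0, 1) for _ in x]) for _ in range(reps)
) / reps

# True slope small but real (0.03): each failure to reject is a Type-II error.
type2 = sum(
    not rejects_zero_slope(x, [0.03 * xi + rng.gauss(0, 1) for xi in x])
    for _ in range(reps)
) / reps

print(f"Type-I error rate  ~ {type1:.3f} (held near 0.05 by construction)")
print(f"Type-II error rate ~ {type2:.3f} (the real slope is usually missed)")
```

The Type-I rate is pinned near 5% by design, while nothing constrains the Type-II rate, which here is many times larger.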

A related problem is that, just because a variable is statistically significant (at 5% or any other level), it does not necessarily follow that the variable is important, in the sense of having a major influence on the issue. Even if variable X has little effect on variable Y, its effect might be statistically significant, if the relationship is very tight, meaning that there is little random scatter in the data, or if the data set is large enough. Statistical significance indicates that the relationship is real, not that it is important.
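The point about sample size can also be made concrete. In the simulation below (all numbers are illustrative), variable X has a real but trivial effect on Y, yet with a large enough data set the t-statistic is comfortably “significant”:

```python
import math
import random

rng = random.Random(7)
n = 50_000
x = [rng.uniform(0, 10) for _ in range(n)]
# Tiny true slope (0.01) on a y that averages about 100: real but trivial.
y = [100 + 0.01 * xi + rng.gauss(0, 1) for xi in x]

# Ordinary least squares by hand.
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
t = b / math.sqrt(s2 / sxx)

effect = b * 10    # change in y as x sweeps its whole observed range
print(f"t = {t:.1f}  (clearly 'significant': |t| far above 2)")
print(f"but x moves y by only about {effect:.2f}, on a mean of {my:.0f}")
```

The relationship is real, and the large sample makes that easy to detect; its substantive importance is a separate question entirely.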

Various writers have pointed this out. For example, Dillon (1977) puts it beautifully:

“[Through] tests of statistical significance (the “cult of the asterisk”) involving mechanical application of arbitrary probabilities of accepting a false hypothesis, traditional procedures … have aimed at protecting the researcher from “scientific error”. In doing so, these procedures have led to a far greater error of research-resource waste. The farmer’s problem is not whether or not there is a 5 per cent or less chance that a crop-fertilizer response function exists. His problem is how much fertilizer to use. Even if the estimated function is only statistically significant at the 50 per cent level, it may still be exceedingly profitable … for the farmer to base his decisions on the estimated function.”

“In the mechanical fashion in which they are usually applied, significance levels have no economic relevance, except by chance, to farmer decisions about best operating conditions.” (Dillon, 1977, p. 164)

He was referring to the use of statistics in the analysis of agricultural experiments. Unfortunately, the problem is just as serious in economics. McCloskey and Ziliak (1996) went through all of the statistical papers published in the American Economic Review during the 1980s to check how many researchers were relying solely on statistical significance as their measure of real-world importance.

The answer was: most of them. Even in what is arguably the highest-prestige economics journal, only about 30% of articles looked beyond statistical significance when drawing conclusions about the real world, or made any distinction between statistical significance and substantive importance.

That was in the 1980s, but little has changed. The cult of the asterisk is alive and well. I find it particularly disappointing that it affects economics so deeply. One would have thought that, given their disciplinary interests in decision making, economists would have known better.

Traditional statistics is an important tool, but it can be useful to supplement tests of statistical significance with other indicators of the importance of the variables. For example, in Abadi et al. (2005) we used “importance” indicators, representing how much difference the variables make to the predicted results. Essentially, our importance indicators answered the following question: if we vary a variable over the range that is present in the data, how much difference does it make to the model’s output? Predictably, we found that not all statistically significant variables were important, and not all of the important variables were statistically significant.
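For a linear model, an indicator of this general kind is easy to sketch: multiply each coefficient by the observed range of its variable, giving the swing in the prediction as the variable sweeps its data range. To be clear, the function, coefficients, and data below are my own illustration of the idea, not the actual measure or results of Abadi et al. (2005):

```python
# Illustrative "importance indicator" for a linear model: how far does
# the prediction move as a variable sweeps its observed data range?

def importance(coef, values):
    """Change in predicted output over the observed range of a variable."""
    return abs(coef) * (max(values) - min(values))

# Hypothetical fitted coefficients and observed data ranges.
variables = {
    "fertilizer": (0.002, [0.0, 100.0]),    # could be significant yet minor
    "rainfall":   (0.080, [200.0, 600.0]),  # dominates the prediction
}

for name, (coef, values) in variables.items():
    print(f"{name}: importance = {importance(coef, values):.1f}")
```

Ranking variables this way can differ sharply from ranking them by t-statistic, which is exactly the distinction between significance and importance drawn above.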

David Pannell, The University of Western Australia

Further Reading

Abadi Ghadim, A.K., Pannell, D.J. and Burton, M.P. (2005). Risk, uncertainty and learning in adoption of a crop innovation, Agricultural Economics 33: 1-9.

Dillon, J.L. (1977). The Analysis of Response in Crop and Livestock Production, Pergamon, Oxford.

McCloskey, D.N. and Ziliak, S.T. (1996). The standard error of regressions, Journal of Economic Literature 34: 97-114.