## 159 – The cost of inaccurate data

How much does it matter if you use inaccurate data to inform a decision? It might matter less than you think. Here’s an example.

Not very long ago, I had a discussion with someone involved in prioritising environmental projects. In his view, the quality of the data for variables considered in the prioritisation process is particularly important. If there is no good independent data for a variable, he argued, the variable should just be excluded from the metric used to select projects. (A metric is a formula used to score and rank each project.)

In PD#158 I showed that if you leave out relevant variables like this, the cost, in terms of lost environmental benefits, can be huge. A better metric would deliver much greater environmental benefits, of the order of 100% greater, than one that omits key variables.

But that was assuming you have perfect data. What if the data you feed into those metrics is not accurate? Is it better to include a variable, even if the data for that variable is poor? Or is the cost of data inaccuracy so large that it's better to leave the variable out?

I did an analysis simulating millions of prioritisation decisions for different situations. Each simulation involved selecting the best projects from a list of 100 randomly generated projects.

The analysis shows that, although bad data might cause you to make individual poor decisions, overall you are far better off including weak data than no data.

In the simulations, I looked at four levels of data inaccuracy, applied to all five variables in the metric: perfect accuracy, small errors, medium errors and large errors. See Pannell (2009) for details of how I represented small, medium and large errors.
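To give a feel for this kind of simulation, here is a minimal sketch in Python. It is not the model from Pannell (2009): the five variable names, the uniform ranges, the multiplicative lognormal error model, the error sizes and the 10% budget share are all assumptions made purely for illustration.

```python
import random

# Hypothetical variables: four benefit factors plus cost (five in total).
BENEFIT_FACTORS = ("value", "effect", "adoption", "success")

def make_projects(n, rng):
    """Generate n random projects (ranges are illustrative assumptions)."""
    return [{"value": rng.uniform(1, 10),
             "effect": rng.uniform(0.1, 1.0),
             "adoption": rng.uniform(0.1, 1.0),
             "success": rng.uniform(0.1, 1.0),
             "cost": rng.uniform(1, 10)} for _ in range(n)]

def true_benefit(p):
    """Benefit is assumed to be the product of the four benefit factors."""
    result = 1.0
    for name in BENEFIT_FACTORS:
        result *= p[name]
    return result

def good_metric(p):
    """Benefit per dollar, using all five variables."""
    return true_benefit(p) / p["cost"]

def poor_metric(p):
    """Same as good_metric but with one variable (adoption) left out."""
    return p["value"] * p["effect"] * p["success"] / p["cost"]

def add_noise(p, sd, rng):
    """Multiply every variable by a random lognormal error (assumed error model)."""
    if sd == 0:
        return dict(p)
    return {k: v * rng.lognormvariate(0, sd) for k, v in p.items()}

def select(projects, observed, metric, budget):
    """Rank by the metric on observed data; fund in order while the true cost fits."""
    order = sorted(range(len(projects)),
                   key=lambda i: metric(observed[i]), reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + projects[i]["cost"] <= budget:
            chosen.append(i)
            spent += projects[i]["cost"]
    return chosen

def average_loss(metric, sd, runs=200, n=100, budget_share=0.1, seed=1):
    """Mean share of benefits lost, relative to the good metric with perfect data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        projects = make_projects(n, rng)
        budget = budget_share * sum(p["cost"] for p in projects)
        observed = [add_noise(p, sd, rng) for p in projects]
        best = select(projects, projects, good_metric, budget)
        got = select(projects, observed, metric, budget)
        b_best = sum(true_benefit(projects[i]) for i in best)
        b_got = sum(true_benefit(projects[i]) for i in got)
        total += 1 - b_got / b_best
    return total / runs

if __name__ == "__main__":
    # Error sizes are assumed values, not those used in Pannell (2009).
    for label, sd in [("none", 0.0), ("small", 0.1), ("medium", 0.3), ("large", 0.6)]:
        print(label,
              "good metric:", round(average_loss(good_metric, sd), 3),
              "poor metric:", round(average_loss(poor_metric, sd), 3))
```

The numbers this prints depend entirely on the assumptions above, so they should not be expected to match the percentages reported in this post. The point is only to show the structure of the comparison: the same projects are ranked with a good or poor metric, using accurate or noisy data, and each outcome is judged by the true benefits of the projects that get funded.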

The results are as follows.

• Small data errors hardly matter at all.
• Medium-sized errors matter a bit. If the budget is extremely tight, and you use a good metric for prioritisation, the cost of the errors is around 14% (compared to a 30-60% cost of using the wrong metric for prioritisation). If you use a poor metric (such as omitting a variable, or adding variables when you should multiply them, both very common errors), the extra cost of using inaccurate data is extremely low, around 1%.
• Large data errors matter a moderate amount, with a cost of up to 23% in my simulations, which is still far less than the cost of using a poor metric. If you use a poor metric, the extra cost of also having large data errors is again very small, around 2%.
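As a small illustration of why adding variables when they should be multiplied counts as a poor metric, consider two made-up projects with equal costs. The variable names and numbers below are hypothetical, chosen only to show how the two formulas can rank the same projects differently.

```python
def multiplied_score(value, adoption, cost):
    """Benefits only arise if the project is adopted, so the factors multiply."""
    return value * adoption / cost

def added_score(value, adoption, cost):
    """A common but flawed alternative: simply add the factors together."""
    return (value + adoption) / cost

# Hypothetical projects: A has high potential value but almost no adoption.
project_a = {"value": 1.0, "adoption": 0.05, "cost": 1.0}
project_b = {"value": 0.4, "adoption": 0.5, "cost": 1.0}

for name, project in [("A", project_a), ("B", project_b)]:
    print(name,
          "multiplied:", multiplied_score(**project),
          "added:", added_score(**project))
```

With these numbers the added score ranks project A first, even though almost nobody would adopt it, while the multiplied score ranks project B first. No improvement in data accuracy can fix a ranking that is wrong for this reason.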

There are some really important messages out of this.

1. For an organisation wishing to select the best projects, it is crucial to use the right metric (including all relevant variables) to score the projects, rather than dropping a variable because of weak data. (This is assuming that the data errors are random. If you suspect that data has been systematically biased in a particular direction, this conclusion may not hold. But it would be better to work on reducing this bias, rather than omitting the variable.)
2. If you currently have both a weak metric and weak data, improving the metric is far more important than improving the accuracy of the data. If you don’t use the right metric, there is almost no benefit from improving the accuracy of data. This is true even if the errors in the data are large!
3. Even if you do use the right metric, the benefits of reducing data errors are only moderate, at best. As long as you consider all relevant factors and combine them appropriately, expert judgments about the values of key variables may be sufficient, rather than requiring highly accurate field measurements.
4. Once you have reduced data errors down to a moderate level, it is unlikely to be worthwhile trying to get them down to a low level.

These findings strongly reinforce the approach we take in INFFER (Pannell et al., 2009). In the design of INFFER, we made sure that all relevant variables are considered, and that they are combined using a good metric. The simulations show that these two things are crucial. At the same time, we use a simplified approach for each individual variable, accepting that strict accuracy about numerical values is not essential. We do emphasise the importance of using the best available information, but argue that the best available is likely to be good enough to work with, even if it is not highly rigorous scientific data. We are not complacent about bad data, and put an emphasis on the need to fill key knowledge gaps, but we recognise that one can work with lower quality data for now, rather than being paralysed by it.

David Pannell, The University of Western Australia