Which experts should we trust?

Over the past two years, we have seen very clearly how the government must rely on expert predictions to decide which policies to implement, some of them strongly limiting freedom. In parallel, in the media (including social platforms), other self-proclaimed experts give their opinions. But who should be trusted? How credible are official and unofficial experts?

Expertise is often measured by education and experience: an expert usually has at least a diploma and some working experience in the relevant field. But this is only one way to approximate expertise. An alternative exists, at least when expertise concerns short-term predictions: we can score experts by asking them to make numerical predictions, so as to measure and keep track of their accuracy.

Why scoring?

Being right and being convincing do not always align. Fake experts, or quacks, can use rhetoric to make their point, and simple tricks to avoid being caught, such as using vague words or general statements. Asking experts to make numerical predictions levels the playing field and makes it possible to compute an accuracy score.

Being right once can just be luck, and many fake experts rely on it. Someone who predicts, every day, that the stock market will crash the following day is bound to be right one day. But he was wrong so many times along the way that we should not trust him. This is why tracking accuracy is crucial: scoring experts again and again. Conversely, establishing track records also protects real experts from bad luck. Under uncertainty, even real experts are bound to be wrong regularly, but on average they will perform far better than quacks.

How would that work?

There is a whole literature on proper scoring rules, which have been studied for decades by economists and decision analysts. Here is the first and simplest one proposed.

In the early 1950s, with the development of weather forecasting came the question of how to identify good forecasters. If forecasters predict that it will rain with a 60% probability and it does rain, how right or wrong are they? Brier [1], from the US Weather Bureau, proposed a simple solution: take the square of the distance between what happens (coded 1 if the event occurs, 0 if it does not) and the forecast. If it rains and the forecast was 60%, the distance is 40%, and its square is 0.16. This is called the Brier score: the smaller it is, the more accurate the forecast. The best possible Brier score is 0 (always perfectly right) and the worst is 1. Someone who always says 50% (and is thus completely uninformative) would get 0.25, so anyone with an average score above 0.25 is producing noise rather than information. A dart-throwing monkey (or someone guessing a probability completely at random) would get 1/3 on average.
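To make the arithmetic concrete, here is a minimal sketch in Python (my own illustration, not code from Brier's paper); the function name brier_score is just a placeholder.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared distance between a probability forecast and what happened.

    forecast: probability assigned to the event, between 0 and 1
    outcome: 1 if the event occurred, 0 otherwise
    """
    return (forecast - outcome) ** 2

# The example from the text: a 60% rain forecast, and it does rain.
print(brier_score(0.6, 1))  # ~0.16 (up to floating-point rounding)

# An uninformative forecaster who always says 50% scores 0.25 either way.
print(brier_score(0.5, 1), brier_score(0.5, 0))  # 0.25 0.25
```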

Is it feasible?

For ten years, the Good Judgment Project, led by Philip Tetlock and Barbara Mellers at the University of Pennsylvania, has been scoring and keeping track of the predictions of a large pool of forecasters of geopolitical events. If anything, such events are even harder to forecast than Covid-related ones, because geopolitical forecasts cannot easily be based on statistical models. The Good Judgment forecasters were initially amateurs and volunteers, but among them Tetlock and Mellers [2] identified so-called Superforecasters: people who were especially, and consistently, accurate. The project has greatly improved our knowledge of what makes good forecasters, how to train them, and the importance of teamwork.

Imagine if, since the beginning of the Covid-19 pandemic, we had asked forecasters for weekly numerical predictions, kept track of them, and scored them. By now, we would have a rich database and a pretty good idea of whether anyone does better than RIVM, for instance. By now, we would know whether self-proclaimed experts deserve the title.
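As a rough sketch of what such tracking could look like (the names and numbers below are purely hypothetical, not actual Covid-19 forecasts or RIVM data), each forecaster's weekly probability forecasts and the realized outcomes could be stored and ranked by average Brier score:

```python
# Hypothetical track records: weekly probability forecasts and realized outcomes (1 or 0).
# All names and numbers are made up for illustration only.
track_records = {
    "official_model": ([0.7, 0.4, 0.8], [1, 0, 1]),
    "self_proclaimed_expert": ([0.95, 0.9, 0.1], [1, 0, 1]),
    "always_50_percent": ([0.5, 0.5, 0.5], [1, 0, 1]),
}

def average_brier(forecasts, outcomes):
    """Mean squared distance between probability forecasts and 1/0 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Lower average Brier score = more accurate track record.
leaderboard = sorted((average_brier(f, o), name) for name, (f, o) in track_records.items())
for score, name in leaderboard:
    print(f"{name}: average Brier score {score:.3f}")
```

Anyone whose average stays above the 0.25 of the always-50% forecaster would, by the standard above, be adding noise rather than information.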

Is there a catch?

The approach I suggest here is not perfect, in many ways. It works well for measuring expertise in terms of short-term predictions, but it may miss other dimensions of expertise, such as long-term predictions, counterfactual thinking or qualitative diagnosis. Yet let us keep in mind the problem we started from: the short-term predictions on which freedom-limiting policies are being based. Shouldn't we score both those who contribute to establishing these policies and those who oppose them? Keeping track of experts' predictions and their accuracy would provide useful data on whom to trust.

Author:

Prof. dr. Aurelien Baillon (professor of ‘Economics of Uncertainty’ at the Erasmus School of Economics)

References:

[1] Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1-3.

[2] Tetlock, P. E., & Gardner, D. (2016). Superforecasting: The art and science of prediction. Random House.
