The aim of science is to establish facts, as accurately as possible. And false positives are alarmingly common in some areas of medical science. How can this happen?

The problem of how to distinguish a genuine observation from random chance is a very old one. It turns on the distinction between induction and deduction.

Science is an exercise in inductive reasoning: it infers general rules from particular observations. Induction can never be certain. In contrast, deductive reasoning is easier: it works out what should be observed if a given rule were true. In the early 20th century, it became the custom to avoid induction by changing the question into one that used only deductive reasoning.

The statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.

Tests of statistical significance proceed by calculating the probability of making our observations, or more extreme ones, if there were no real effect.

The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. But deciding how small the p-value must be before claiming a discovery turns out to be very difficult.

The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect — that the hypothesis is true — given the observations.

And that is a problem of induction. Confusion between these two quite different probabilities lies at the heart of why p-values are so often misinterpreted. Even quite respectable sources will tell you that the p-value is the probability that your observations occurred by chance.

And that is plain wrong. Suppose, for example, that you give a pill to each of 10 people. You measure some response such as their blood pressure.

Each person will give a different response. And you give a different pill to 10 other people, and again get 10 different responses.

How do you tell whether the two pills are really different? The conventional procedure would be to follow Fisher and calculate the probability of making the observations, or more extreme ones, if there were no true difference between the two pills.
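The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blood-pressure figures are invented, and a permutation test is used in place of the t-test that would conventionally be applied, because it needs nothing beyond the standard library and follows the definition of a p-value directly. The names `permutation_p_value`, `pill_a` and `pill_b` are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Under the null hypothesis the two pills are interchangeable, so the
    group labels are shuffled repeatedly and we count how often the
    shuffled difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Invented data: fall in blood pressure (mmHg) for 10 people on each pill.
pill_a = [12, 9, 15, 7, 11, 14, 8, 10, 13, 6]
pill_b = [8, 5, 11, 9, 4, 7, 10, 3, 6, 12]

p = permutation_p_value(pill_a, pill_b)
print(f"estimated p-value: {p:.3f}")
```

The number printed is the probability of a difference this large or larger if the pills were identical. It is not the probability that the pills are identical.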

How can this be so? Take the proposition that the Earth goes round the Sun. Which brings us back to induction.

The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem).

But how to use his famous theorem in practice has been the subject of heated debate ever since.
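For reference, the theorem can be written as follows. The symbols are an assumed notation, not taken from the text: H stands for the hypothesis and D for the observations.

```latex
\[
  P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},
  \qquad
  P(D) = P(D \mid H)\, P(H) + P(D \mid \neg H)\, P(\neg H)
\]
```

The left-hand side is the inductive quantity we want. The term P(D | H) is the deductive one. The conversion needs P(H), the probability of the hypothesis before the observations were made, and the choice of that value is where the practical difficulty lies.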

Instead, Fisher proposed the wholly deductive process of null hypothesis significance testing.

In theory, picking up on the early signs of illness is obviously good. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right negative answer for people who are free of the condition.
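The arithmetic behind a screening test like this can be sketched as below. Only the 95 per cent figure (the specificity) comes from the text. The prevalence of 1 per cent and the sensitivity of 80 per cent are assumptions, chosen because they reproduce the 14 per cent figure quoted for this test. The function name is hypothetical.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of positive test results that are true positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


ppv = positive_predictive_value(
    prevalence=0.01,   # assumed: 1 in 100 people screened has the condition
    sensitivity=0.80,  # assumed: 80% of real cases test positive
    specificity=0.95,  # from the text: 95% of healthy people test negative
)
print(f"{ppv:.0%} of positive tests are correct")  # prints "14% of ..."
```

Because healthy people vastly outnumber those with the condition, even a 5 per cent false positive rate among them swamps the true positives.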

So only 14 per cent of positive tests are correct.

An example should make the idea more concrete. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high.

