Problems with science – and fixing them with more science

My last post might have given the impression that I think science is infallible, or at least that I do not think there is room for improvement. That is the unfortunate side-effect of defending something against a specific claim. For that reason, it is prudent to offer a counterbalance. Science is imperfect in a number of ways, and these imperfections vary in kind: motives, practice and practitioners. This is worth exploring.

If this were a book, I could include Researcher Degrees of Freedom (the fact that there are a number of legitimate ways of designing experiments and interpreting data, which can lead to different conclusions) and how that might preserve Thomas Kuhn’s model of science as always being one generation behind as it lurches from groupthink to groupthink. But I’m already not doing full justice to all the concepts.

Positive publication bias

There’s a tendency in science publications to take little interest in negative results. Unless research is designed to investigate common myths, the preference is to publish positive results. As John Oliver quips, no one is publishing “Nothing much up with Acai berries”.

Selection biases like this end up having a bigger impact on the end result than our intuitions might recognise. Because there are more ways to be wrong than to be right, let’s assume that only a third of all research investigates a question to which there is a positive answer; then two thirds of the investigated questions have a negative answer, and those results get ignored.

If we make another assumption (that 5% of investigations get their answer wrong, a figure I’m basing on ‘significance testing’, which we’ll come to) and do a bit of arithmetic, we find that of 100 investigations, about 36 papers get published, all with positive results, and about 64 don’t get published, all with negative results. More importantly, of those 36 published papers about 4 are wrong (11%); they were questions with a negative answer that the investigators got wrong.
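Here is that arithmetic as a rough sketch (the one-third split and the 5% error rate are the illustrative assumptions above, not real data; the tallies land close to the figures quoted, give or take rounding):

```python
# Rough arithmetic behind the publication-bias figures above (illustrative
# assumptions from the post, not real data): roughly a third of questions have
# a genuinely positive answer, only positive results get published, and 5% of
# investigations of negative-answer questions wrongly come back positive.
investigations = 100
positive_questions = 33                             # questions where the answer really is "yes"
negative_questions = investigations - positive_questions  # 67 where the answer really is "no"
error_rate = 0.05

false_positives = negative_questions * error_rate   # ~3-4 flukes that look like real findings
published = positive_questions + false_positives    # ~36 papers, all reporting positive results
unpublished = investigations - published            # ~64 negative results that never appear

print(f"published: {published:.0f}")
print(f"unpublished: {unpublished:.0f}")
# The post rounds this to "about 4 of 36 papers wrong", i.e. roughly 10-11%.
print(f"wrong among published: {false_positives:.1f} ({false_positives / published:.0%})")
```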

A solution

This is one of those times where the answer to a problem in science is more science. It might not make for interesting newspaper copy, or even the science magazines, and that’s fine; science isn’t done for the newspapers. The scientific journals should feel more comfortable with negative results.

Publishing negative results has two impacts: there’s a greater breadth of science to call on, and that assumed 5% of investigations that get their answer wrong isn’t exaggerated by selection bias, so it stays at 5% of what’s published (as opposed to rising towards 11%).

Significance testing

There are two issues with significance testing: an over-focus on a positive result, and an overconfidence in a positive result. Let’s explore both.

Statistical significance

There is a difference between statistical significance, which is what significance testing is about, and practical significance. Statistical significance is a way of investigating whether a result reflects an actual feature of reality or is just a fluke. And it’s probabilistic. If significance testing concludes there’s only a 5% chance that a result this size would turn up as a fluke, the result is normally accepted as reflecting reality. But that still leaves roughly 1 in 20 such results being wrong.

Imagine it like this: if you go into a forest, measure the height of 10 mature trees at random and note their average height, and then measure a different 10 mature trees at random and note their average height, you don’t expect the two numbers to be exactly the same. Statistical significance testing, as the name suggests, tells you whether the difference between those two averages is significant, and how confident you can be that it reflects a real difference rather than chance.
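Here is a minimal sketch of that comparison as a two-sample t-test (the heights below are invented for illustration, not real measurements):

```python
# Two random samples of tree heights from the same forest, compared with a
# two-sample t-test. The numbers are made up purely to illustrate the idea.
from scipy import stats

sample_a = [18.2, 21.5, 19.8, 22.1, 20.4, 19.1, 23.0, 20.9, 18.7, 21.2]  # metres
sample_b = [19.0, 20.7, 21.9, 18.5, 22.4, 20.1, 19.6, 21.8, 20.3, 19.9]  # metres

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"difference in averages: {sum(sample_a)/10 - sum(sample_b)/10:.2f} m, p = {p_value:.2f}")
# A p-value above 0.05 is conventionally read as "not significant": two random
# samples from the same forest are expected to differ a little just by chance.
```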

Practical significance

Science has become incredibly sensitive. It can detect statistically significant differences even when they are very small, through very large data sets or meticulous experiment design. The factors that bear on a question can be isolated and studied with precision. This has drawbacks.

Imagine for a moment a paper that discovered that eating breakfast reduced the chances of some rare cancer by 20%. Assume it is correct, for now, and that the paper followed millions of people for decades. Is that a compelling reason to eat breakfast? Is the statistical significance enough?

For the sake of this thought, take some person who skips breakfast so that they can benefit from an extra 20 minutes’ sleep. (Pretend, for the moment, that extra sleep has none of its own benefits other than comfort.) How much risk is this person taking on, for 20 minutes of comfort? That depends. I said “some rare cancer”, and if it’s 1-in-10,000,000,000 rare, then after a 20% decrease the risk still reads as 1 in 10,000,000,000 (if we write it to one significant figure).
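As a rough check of that arithmetic (both the 1-in-10,000,000,000 baseline and the 20% reduction are the made-up figures from above):

```python
# Rough arithmetic for the breakfast example (the baseline risk and the 20%
# reduction are the post's illustrative figures, not real data).
baseline_risk = 1 / 10_000_000_000          # "1 in 10,000,000,000 rare"
reduced_risk = baseline_risk * (1 - 0.20)   # a 20% relative reduction

print(f"before: 1 in {1 / baseline_risk:,.0f}")   # 1 in 10,000,000,000
print(f"after:  1 in {1 / reduced_risk:,.0f}")    # 1 in 12,500,000,000
# To one significant figure, both read as "1 in 10,000,000,000": the absolute
# change in risk is far too small to feel in practice.
```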

Medical practitioners and life coaches shouldn’t rush to update their advice on the back of results like this. It’s not practically significant.

Moreover, in reality our person gets many more health benefits from that extra 20 minutes of sleep. The change in cancer risk relating to breakfast is completely swept away in the noise of life, and if you are forced to choose between getting enough sleep and having breakfast, the decision should be to sleep. The breakfast thing wasn’t significant, practically.

Overconfidence

This is pretty simple, really. We don’t treat published science like it has a 5-11% chance of being wrong. That’s it.

A solution

If science took a broader view of its position, there would be more interesting discussion (most good science papers have a ‘Discussion’ section at the end) beyond just ‘is this true?’. A focus on defending statistical significance is possibly born of the positive publication bias: you want a positive result, and the question of whether the result actually matters becomes just another obstacle to getting one. So, again, that positive selection bias needs to be removed.

Also, the solution to the overconfidence problem is more science. One paper with an answer has a considerable chance of being wrong. Several papers on related questions all broadly telling the same story? That quickly diminishes the chances of it being wrong.
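As a back-of-envelope illustration (assuming, generously, that the papers are independent and that each carries roughly the 11% chance of being wrong from earlier):

```python
# Assumed per-paper error rate from earlier in the post; treating agreeing
# papers as independent is generous, but it shows the direction of travel.
chance_wrong = 0.11

for papers in (1, 2, 3, 4):
    print(f"{papers} agreeing paper(s): ~{chance_wrong ** papers:.4f} chance they are all wrong")
# 1 paper: ~0.11; 3 papers: ~0.0013. The odds of a shared fluke collapse fast.
```

Which leads us to the replication problem.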

The replication problem

Precisely how many papers are affected by the replication problem is difficult to know. But the Wikipedia page on the Replication Crisis cites some worrying numbers from a survey of 1,500 scientists: 70% had failed to replicate the results of a study, and 50% had failed to replicate their own results for at least one of their publications. A systematic review of falsified research found that 2% of scientists admitted to having falsified one of their publications, and 14% admitted to knowing someone who had.

If 5% of papers are wrong by virtue of the statistics used, and 11% are wrong once the statistics are combined with a publication bias, then the replication problem was always going to happen. But there is also fraud. The fraud can have any number of motives, from motivated funding to publication bias coupled with publication pressure (jobs often depend on successful publications).

A solution

You might see this coming by now. There are two good solutions: remove the incentives to only find positive results, and more science.

This time, more science should be particularly focussed: replicating experiments exactly, and answering the same question with a different method, just to make sure the answer isn’t a quirk of the methodology.

P-hacking

The “p” in p-hacking refers to significance testing. The value that tells you how likely you’d be to see a difference this big just by fluke is called “p” (because: statisticians).

If you’ve followed the significance testing points in this post so far, you might have cynically realised that you could publish roughly one in every twenty attempted papers just by fluke: hacking the way significance testing is interpreted. There are versions of that happening. One type is rather wittily summarised by XKCD, where the claim that jelly beans cause acne was investigated one colour at a time. The green one showed a link; 19 other colours did not. On one view, the “green” result was simply never replicated. On another view, the experiment was replicated over and over until a fluke result came up (and then the publication process picked out only that one).

The other method is to dredge through very large data sets (like, say, all YouGov surveys). These measure thousands of things (“do you support lockdowns?”, “do you like tomatoes?”, “what is your favourite music genre?” etc.), and so contain millions of possible pairings, of which around 5% should show “significant” relationships purely by accident.
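To see how easily this produces publishable flukes, here’s a minimal simulation of the jelly-bean style p-hack (entirely invented data; no real jelly beans or YouGov surveys involved):

```python
# Run 20 tests where there is genuinely no effect, and count how often at
# least one comes out "significant" at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
experiments_with_a_fluke = 0

for _ in range(1000):                       # repeat the whole 20-colour experiment 1,000 times
    fluke_found = False
    for _ in range(20):                     # 20 colours, none of which does anything
        control = rng.normal(0, 1, 30)      # acne scores without the jelly beans
        treated = rng.normal(0, 1, 30)      # acne scores with this colour of jelly bean
        _, p = stats.ttest_ind(control, treated)
        if p < 0.05:
            fluke_found = True
    if fluke_found:
        experiments_with_a_fluke += 1

# Roughly 1 - 0.95**20, about 64%, of experiments should contain at least one
# publishable "green jelly bean" result despite there being no effect at all.
print(f"{experiments_with_a_fluke / 1000:.0%} of experiments produced at least one fluke")
```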

Versions of each have merit, under specific conceptual frameworks. The XKCD p-hack can also be seen as a way of investigating subgroups and may be a legitimate way of carrying out research. The YouGov p-hack can be used to investigate things you have a good reason to suspect might have a relationship.

A solution

These are misuses of methods that can be valid, and there are two steps that can guard against the misuse. The first is at the research proposal stage: methods this open to misuse need a robust framework proposed at the outset. Science tends not to lurch forward in one massive leap; instead it crawls. So, if the method involves using a huge database or repeating an experiment over and over with only very small alterations, the question should be well defined and the reason for expecting a positive result (or legitimate interest) should be easy to establish; the question should emerge out of the existing literature.

The second step is more science. Results from a p-hacking methodology should be considered a very good reason to design a specific follow-up experiment, and not a lot more. This is related to the replication problem: more experiments.

Purpose and Black Swans

Science’s power lies in creating generalised explanations. These might explain how a system evolves (there are myriad evolutions: biological, stellar, cosmic etc.) or behaves (think of laws like Boyle’s Law or the laws of gravitation), or they might describe newly discovered phenomena and objects, from radiation and distant planets to new species. On its face, it would seem that all this knowledge should allow us to forecast with confidence. That hubris is a failure of science and science communication.

Black Swan Events (BSE) are events that are unpredicted and consequential. In one sense they are common, and in another sense they are rare. They are common in that events we can call Black Swan Events happen frequently. They are rare in that each actual event has few parallels in history. These events are the reason that forecasting is hubris: one is very likely to happen, but you cannot know what it will be or how it will affect your forecast.

Imagine you are trying to forecast the number of employees in an organisation. Your forecast will assume that people will be hired into different roles at the same sorts of ages they always have; that careers will follow the same sort of pattern (promotions after approximately this long); and that careers will end at around the same sorts of ages. You might be looking at that simple forecast and noticing how very naive it is. But the forecaster doesn’t know whether a sudden technological breakthrough will occur that benefits them, or whether an economic downturn will change the pattern of career-ends. Even if one such event does happen, or both do, the forecaster cannot know when or in what order.

Technological breakthroughs are common. Ones that affect your industry are rare. Whether they benefit or hinder your company’s HR hiring is essentially random. These are the characteristics of a BSE.

Climate change forecasts have a BSE in the form of Covid-19. Experts in epidemiology all knew a global pandemic was going to happen, but they couldn’t know when, so it was a BSE. It was even more of a BSE in some countries, as they either had no pandemic response plan or had scrapped it (i.e. it was even less predicted). What the epidemiologists didn’t know was that the political response would be to shut down large parts of the economy (it wasn’t their job to think about that), and what no one knows is whether this will lead to a cultural shift to home-working.

This has huge implications for climate change forecasting. And no forecast made before 2019 considers it. That’s not a criticism: the forecasts also don’t consider the impact of a massive tourist attraction being built in the middle of China, with air travel drastically increasing as people try to get there; that one hasn’t happened. But it could.

Limitations of the limitation

One of the things about BSEs is that if they are properly anticipated they cease to be BSEs. Measures are put in place to build resilience into the normal system, so they are no longer “consequential”. If we had known Covid-19 was coming, we’d have shut down all the airports and borders, it would have been contained in China and disappeared, and before long everything would have returned to the Old Normal (instead of there being questions over the New Normal).

This leads to the Fort Knox Paradox. Fort Knox has never had a break-in, so why is so much invested in securing it? The answer isn’t that Fort Knox is secured so heavily “just in case”: Fort Knox has never been breached because it is so secure. Empirically, it looks like an overreaction. The same paradox is evident in lockdowns: why lock down when Covid is killing so few people? The answer is that it kills so few people because people are having less contact.

My point, here, is that if we knew the BSE was coming it wouldn’t cripple our forecasts. Treating the BSE phenomenon like it annihilates science misses the point: if we knew more, we could make more reliable scientific forecasts. But, that will always be limited.

A solution

All forecasts should be understood to be qualified with “assuming current trends hold…” or “if this or that happens…”. More importantly, all forecasts should be viewed sceptically. Equally, science shouldn’t be seen as just a method of creating forecasts.

2 thoughts on “Problems with science – and fixing them with more science”

  1. The problem with publishing negative results is that the causes of the results are myriad. Only if the “results” can be interpreted is such a thing worth publishing, otherwise we end up publishing mistakes.

    The biggest problem facing science is … funding. Doing science takes money. Only those questions someone is paying to find answers to get done, which leaves huge holes in our curiosity map. It used to be the case that the government funded fundamental research, and it still does, but one of the major political parties wants to cut that funding … further, thinking that it interferes with research that could be done “for profit.” These idiots don’t even understand that conservative scientists want more, not less, fundamental research, as the spin-offs into practical matters are where companies make money, not in doing fundamental research.

    Not having someone in charge is science’s greatest strength … and greatest weakness.

    Oh, and part of the problem is the discussion about the efficacy of science is broken into two parts: one regarding science as a whole and the other regarding science applied to a specific topic. Sometimes people in each camp end up debating each other … to no good end.

    1. True, if negative results are to be published, the experiment design quality has to be watched more carefully. But, I’ve seen a lot of psychology papers and fitness papers with n<5, so maybe that eye on quality should be there regardless.
      That basic issue of political purse strings, from non-scientists, is also a laughably big problem: if you think it's worth money, you probably already think you know the answer… and that's not really a discovery.
