
Data vs. theory

Big Data will fundamentally change how science is done – in fact, we won’t need theories any more. That was the conclusion of Wired editor-in-chief Chris Anderson in July 2008, based in part on a comment reportedly made by Google’s Peter Norvig. The provocative piece looked at the traditional view of the scientific method: we have a hypothesis – a model of the way something works – that we test by doing experiments, refining our model based on the data. Historically – compared to today – datasets have been small, because scientists collected results carefully and methodically. Big Science, however, produces data all the time, in a constant stream, just as the millions of people using large online retailers generate a constant stream of data. Whereas bricks-and-mortar stores might perform market surveys to determine the demographic makeup of their customer base, online retailers know enough about your (and everyone else’s) browsing and buying habits to do away with the concept of demographics altogether. In other words, they don’t construct a model of you based on socioeconomic group, gender or age; they target you as an individual.

Anderson takes what seems to be the logical step and transfers this to big data in science. Why formulate a hypothesis at all? With so much data, why construct a model of reality when reality is staring us in the face?

We will still need theories
There have been arguments against the need for theory before: the Logical Positivist philosophers of the 1920s, who argued that only things that could be directly observed were important and that theories were essentially irrelevant, eventually ran into problems. Not everything is observable, and some things that aren’t observable now might be in the future. But the main fault with drawing this parallel between Big Data in commerce and in science is that single measurements in science don’t produce reliable data. An online retailer knows for sure whether you bought an item or not, but scientific instruments in particle physics and in nascent fields such as nanobiomedicine often operate at the limits of their accuracy, so it’s not as clear cut. To get around this, measurements are taken many times and averaged; outliers – obvious ‘mistakes’ in the data that could skew the result if included – are omitted. Similarly, ‘artefacts’ – what looks like data but is actually the result of instrument operation, similar to a ‘bug’ in computer software – are also omitted. But the only basis for doing this is to have a model or a theory as a means of judging the quality of the data. Theories form part of the context in which data must be considered and judged, especially when it comes to curating and preserving data for posterity.
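The averaging-and-outlier-rejection step described above can be sketched in a few lines of code. This is a minimal illustration, not a method from the article: the readings are invented, and the two-standard-deviation threshold is one arbitrary choice among many – the point is that some prior judgement about what a ‘reasonable’ reading looks like is always built in.

```python
import statistics

def sigma_clip(measurements, n_sigma=2.0):
    """Discard readings more than n_sigma standard deviations
    from the mean, then average what remains. The threshold
    n_sigma encodes a judgement about which readings are
    plausible - an implicit model of the measurement."""
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)
    kept = [x for x in measurements if abs(x - mean) <= n_sigma * sd]
    return statistics.mean(kept)

# Ten repeated readings of the same quantity; 97.0 is an
# obvious instrument glitch that would badly skew a raw average.
readings = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 97.0, 5.0]
print(sigma_clip(readings))  # the glitch is dropped; result is near 5.0
```

Notice that the raw average of these readings would be over 14 – dominated by a single faulty point – whereas the clipped average recovers the true value. Without some expectation of what the instrument should produce, there is no principled reason to drop the 97.0.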

We might not need theories all the time (Matthew Dovey, JISC, UK expert group on digital technology for education and research)
In some branches of applied science, being able to make accurate predictions can be of more practical importance than understanding the underlying models. For example: determining future weather patterns, or choosing between different but established medical treatments based on a patient’s lifestyle. Here, Big Data can be used to identify trends and patterns with improved reliability. Ever more sophisticated analytical tools may even one day take over the role of the theoretical scientist in hypothesising new models. Scientists would then have the task of devising experiments to challenge and test these computer-generated models.
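The idea of predicting directly from data, without an underlying model, can be illustrated with one of the simplest possible techniques: look up the most similar past cases and average their outcomes (k-nearest-neighbours). This is a toy sketch with invented numbers, not a real forecasting method.

```python
def knn_predict(history, query, k=3):
    """history: list of (feature, outcome) pairs.
    Predict the outcome for `query` as the mean outcome of the
    k historical entries whose feature is closest to it.
    No physical model is involved - only past data."""
    nearest = sorted(history, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(outcome for _, outcome in nearest) / k

# Invented records: (morning air pressure in hPa, afternoon temperature in C).
past_days = [(1010, 18.0), (1005, 15.5), (1020, 21.0),
             (1008, 16.0), (1018, 20.5)]

# Predict this afternoon's temperature from this morning's pressure,
# purely by analogy with the three most similar past mornings.
print(knn_predict(past_days, 1012))
</imports>
</imports>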