Unlocking the benefits of open data
Coordinated national approaches to developing open data strategies will help the UK’s researchers to forge ahead on the international stage, says Phil Richards
Robert Boyle, Charles Darwin, Benjamin Franklin, Alessandro Volta. What do all these famous ‘gentleman scientists’ have in common?
For one thing, they would all recognise and feel comfortable with the classical scientific method – systematic experiment, observation, measurement, testing against a hypothesis and then modification where required. For another, they would certainly expect to plan and execute these activities either alone or in a small group.
Then, as now, the academic researcher tends to be a competitive animal who perhaps operates most naturally at this more individualistic scale.
But scientists are a pragmatic bunch – and so, where needs must, larger groups of researchers set personalities and personal differences aside to tackle 'big science' problems such as genomics or the discovery of the Higgs boson.
I have some personal experience of this as part of a European research collaboration at CERN, where 'creative tension' was one of the hallmarks of our co-operative venture!
But now, digital technology is opening up new possibilities for research. A researcher can test a new hypothesis relatively quickly against a sizeable pre-existing set of open digital research data, originating from a whole range of different past experiments in which he or she had no direct involvement, but which can be repurposed at large scale. Could that herald a step-change in the rate of scientific discovery, and associated creation of new knowledge and economic value?
Global organisations such as the G8 certainly think so, which is why this group, representing some of the leading industrialised nations, announced its support for the principles of open data last year.
It remains a difficult area, however. There isn’t even much agreement on just how ‘open’ open data should be. The Open Data Institute (ODI), for example, has a very pure view of what makes data open: it is wholly unrestricted data that is available for anyone to use for any purpose. In such cases, it is axiomatic that there is literally no issue of privacy or data protection.
Others have a more nuanced approach. Particularly for health-related data sets, even where reasonable steps have been taken to remove individual personal data, it could be possible to recover some personal data if somebody tried hard enough.
In cases like this it may not be appropriate to make data sets fully open in the ODI’s sense, but sharing them among trusted, ethical parties can still allow new medical research that could tackle deadly diseases. At Jisc, we are keen to help researchers operate at this pragmatic level of open data, and we’re helping to develop tools that enable ground-breaking medical research to proceed while also protecting individual privacy in a practical and ethically sound way.
At the conceptual level, the potential benefits of open research data are easy to understand; but in practice there are a number of challenges to overcome.
Academic research is increasingly being done on a global scale, and countries that facilitate international collaboration first are likely to reap a greater share of the scientific, cultural and economic benefits; Jisc aims to help the UK maintain its position at the global forefront in this area.
Until now, one of the most significant barriers has been storage, and particularly the cost of storage for large data sets. That barrier is beginning to fall away, as large cloud-based storage providers such as Google and Microsoft start to offer unlimited storage capacity to the education and research sectors. I guess someone senior at Google must have asked the question ‘when did we last need to delete a file because we had run out of space?’ and the answer came back, ‘never’!
The next issue is mark-up – sometimes called metadata – giving raw data sets additional descriptors that are essential to allow their appropriate re-use. It does not seem practical or achievable to devise one single set of metadata that can work to describe data across the full spectrum of academic disciplines; particle physicists are always going to need a different approach from genomicists, and theirs will be different again from those that social scientists find useful.
So, instead, we are seeking to ensure that sets of metadata standards do not proliferate more than is absolutely necessary. If each research group had its own distinct metadata standard we would almost be back to the pre-open-data position!
There is no getting away from the fact that marking up one’s data to make it readily available for re-use takes a significant amount of time and effort. Mandates from UK research councils requiring a formal approach to data management and retention are a helpful start, and we aim to complement this work by developing tools to make the process as painless and low cost as possible for the researcher.
We are also keen to point out that such work is not purely altruistic – in our experience, the researchers who are most in tune with methods for preserving their own data are the ones who will be in the best position to tap effectively into the wider emerging corpus of open research data. They will be able to progress their own research interests at a much more rapid pace in the medium or long term.
There’s the issue, too, of more general ‘discoverability’ of research data sets that have been made openly available and properly marked up. This is an area of active interest, where automated deterministic approaches based on machine-readable standards (the semantic web) will be combined with artificial intelligence (AI) approaches.
While there’s been speculation that this could lead to ‘human-free’ research, with hypotheses being constructed and then tested by machine, my own opinion is that human intelligence will continue to be a key ingredient in the most creative and valuable research.
On Wednesday, 12 November I’ll be at the Inside Government conference, Higher Education Data: Redesigning the Information Landscape, to reflect further on these areas and look in more detail at the work that is being done to make sure UK researchers get the early and full benefit of this transformational new approach to research.
Phil Richards is Jisc’s chief innovation officer