Dare to dream with data
Sian Harris reports back on discussions about data sharing at the recent ALPSP conference
Sometimes, often even, it is a good idea to take a step back from what you do and have a good look at it. This is what some of the speakers at the recent Association of Learned and Professional Society Publishers (ALPSP) conference encouraged publishers to do.
One example is Tony Brookes, an academic from the UK’s University of Leicester who, after two days of listening to talks from publishers, observed that he’d heard about many challenges and changes in the publishing industry at the conference. What he had not heard, however, was excitement about this. Instead there was a lot of ‘navel gazing’.
Brookes was speaking on the topic of research data and arguing that there are many opportunities with data and that many of these opportunities could fall to publishers.
Plenty is already going on with data, and has been for many years. For example, Simon Hodson provided delegates with insight into CODATA, an international, inter-disciplinary project that was established in 1966 for ‘data zealots’. Hodson explained that the project can provide an authoritative voice on data policy such as data sharing and that it hopes to form a focal point on data. He noted that CODATA is in the process of setting up a data policy committee and is involved in a task force on data citation.
Such activities are important as more and more research begins to look at the power of using datasets. Hodson pointed to a report by the Royal Society in June 2012 that argued the case for open data and he noted that, amongst other things, ‘data should be assessable, accessible and intelligible.’
However, there are challenges, according to Hodson. ‘Data is changing and not all is necessarily appropriate to the same model.’ He noted, for example, that there are decisions about which versions of data to share (such as whether it should be the raw data or data that has been processed or aggregated in some way as part of the research process) and that there is often also interest in the algorithms used with data.
He noted some other activities in the data area too. For example, the Jisc-funded Journal Research Data Policy Bank (JoRD) project found that nearly 50 per cent of journals have a data availability policy, although only 25 per cent of these are described as 'strong'. And the data repository Dryad began charging for its service at the start of September. According to Dryad's blog, ‘The Data Publishing Charge (DPC) is a modest fee that recovers the basic costs of curation and preservation, and allows Dryad to make its contents freely available for researchers and educators at any institution anywhere in the world.’
Data initiatives often emerge from the research community in response to an observed need. This is how Integrated Earth Data Applications came about. Kerstin Lehnert, of Columbia University in the USA and director of the project, explained, ‘Open access to data is really important for research. Data must be discoverable, robust and useful.’
She went on to describe what makes data useful. She said that it is about knowing that you can trust the numbers. This means knowing how the data has been derived, what instruments were used and the uncertainty involved in any measurement.
This can be a challenge for institutional repositories, she argued. ‘Institutional repositories don't have the domain expertise to make data available in each domain.’ She believes instead that domain-specific repositories, like the one she is involved with, are best poised to ensure fitness for reuse. However, not all are ready to do this yet.
The process at Integrated Earth Data Applications includes thorough review of data and metadata, she said. Each dataset has its own DOI and points to the DOI of the relevant paper. The repository already provides linking with Elsevier’s Article of the Future and she says the team wants to extend this with more visual representation of data.
‘In the future we need more collaboration, IT involvement and working with publishers. Some domains don't have data repositories. We need a process to streamline linking between journals and repositories,’ she noted. ‘Publishers need to agree to at least ask that data is disclosed and goes into community-acknowledged data centres.’
But there are other challenges too: what if data can’t be shared? This is an issue that Brookes has been tackling through his work on a range of data projects such as the GEN2PHEN project, which connects genotype to phenotype data.
He noted that there is often resistance to data sharing from researchers, hospitals, companies and others. ‘Sometimes you can't share data, especially in biomedicine, or data may need to be anonymised,’ he added.
In such situations there are significant hurdles. ‘Data sharing needs lots of standards. There have been many projects that have made very little progress,’ he told the conference. He added that the most progress has happened with the DataCite project because it uses published data, which is already digital and so there is little resistance.
In the case of other data opportunities, he advocated an alternative approach. ‘Don't share data, exploit knowledge. Instead of sharing you could think about discovery,’ he urged.
He argued that many of the concerns about sharing data could be alleviated if, instead of sharing data, people shared the existence of data and that this could be a great opportunity for publishers to provide services.
He gave an example of a project he is involved in called Cafe Variome. This is a system for openly sharing the existence of data. People can do federated searches of the metadata. Once the system finds relevant data there is a range of options. If the data is open then the user can simply access it. If it is closed but the person searching has authority to use that data then they are also directed to it. If they do not automatically have access they can start up an email dialogue with the holder of the data.
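The three outcomes Brookes described can be sketched in a few lines of Python. This is only an illustration of the decision logic as reported here, not Cafe Variome's actual code or API: the `DatasetRecord` fields, the `Outcome` names and the `resolve_access` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Possible results once a search has found a relevant dataset."""
    DIRECT_ACCESS = "open data: go straight to it"
    AUTHORISED_ACCESS = "closed data: searcher is already authorised"
    CONTACT_HOLDER = "closed data: start an email dialogue with the holder"


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata advertising that a dataset exists (hypothetical schema)."""
    identifier: str
    holder_email: str
    is_open: bool
    authorised_users: frozenset = frozenset()


def resolve_access(record: DatasetRecord, user: str) -> Outcome:
    """Decide what a searcher may do with a dataset they have discovered."""
    if record.is_open:
        return Outcome.DIRECT_ACCESS
    if user in record.authorised_users:
        return Outcome.AUTHORISED_ACCESS
    # The searcher learns that the data exists, but not what it contains.
    return Outcome.CONTACT_HOLDER


# Illustrative record: closed data with one authorised user.
closed = DatasetRecord(
    identifier="example-dataset",
    holder_email="holder@example.org",
    is_open=False,
    authorised_users=frozenset({"alice"}),
)
print(resolve_access(closed, "alice").name)
print(resolve_access(closed, "bob").name)
```

In this sketch only the metadata record is ever searched, so a data holder can advertise that a dataset exists without exposing its contents.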
Sometimes, however, the power is not so much in individual pieces of data but in the aggregated information. For example, if somebody is searching across patient records the search can return information about the data, such as ‘11/55 patients show a response to a particular drug’. This type of result provides valuable insight into the effectiveness of a drug without looking at any of the individual records.
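A minimal sketch of such an aggregate query, in Python, might look like the following. The record layout, the `summarise_response` function and the patient data are invented for illustration and are not taken from any system discussed at the conference.

```python
def summarise_response(records, drug):
    """Return a count-only summary; no individual record leaves this function."""
    treated = [r for r in records if r["drug"] == drug]
    responders = sum(1 for r in treated if r["responded"])
    return f"{responders}/{len(treated)} patients show a response to {drug}"


# Synthetic data: 55 patients on a made-up drug, of whom the first 11 respond.
records = [
    {"patient_id": i, "drug": "drug-x", "responded": i < 11}
    for i in range(55)
]

print(summarise_response(records, "drug-x"))
# prints: 11/55 patients show a response to drug-x
```

The searcher sees only the summary string, while the patient-level records stay with the data holder.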
‘Data discovery has about 90 per cent of the benefits of sharing and 10 per cent of the problems. If you extend it to knowledge sharing you are getting perhaps 99 per cent of the benefits,’ he said.