Publisher's supercomputer assists in research process
Scholarly publishers are accustomed to dealing with big data, and they need the computing power to match. Sian Harris asks Elsevier about its supercomputing technology and how the company came to open up this resource as a service to researchers
Sometimes the numbers that research labs need to crunch outstrip their own computing resources. But help may be on hand from an unexpected source.
Many researchers come into contact with the work of scholarly publishers only in the early days of a project, when they are reading around a subject, and again at the end, when they are writing up their results for publication.
However, publishers, especially large ones like Elsevier, generate and use enormous quantities of data, and they need to handle it in increasingly sophisticated ways. When there are large amounts of data and complex tasks to perform, supercomputing technology plays a very valuable role.
As Darren Person, SVP of product engineering at Elsevier, explained: ‘We’ve been in the business of publishing science for a long time, so naturally we are data rich; some of our collections of content span hundreds of years. To increase the value we deliver to customers, we must dive deeper and deeper into our content collection to harvest its value, and we need serious computing power to do this. We need supercomputer technology for detailed processing of our collections because we’re always discovering new patterns and linkages in our data.’
The large scholarly publisher is also helped by being part of an even bigger company, Reed Elsevier, which, according to Person, invested in big data supercomputing technology more than a decade ago. ‘As Reed Elsevier, we were already putting our own supercomputer technology (HPCC) to use in other parts of the business. So it’s only natural that Elsevier would adopt that same technology for use in our business,’ said Person.
He added: ‘Academic content is more critical than ever. Our world is becoming increasingly analytical and innovative, and trusted academic and published content has a crucial part to play in supporting that innovation. We are trying new approaches to product development – approaches that allow us to cycle through opportunities faster while looking for the next big answer for our customers. Those new approaches include fast computer processing – the kind that allows us to process our entire collection in hours.’
He explained that the supercomputing technology (HPCC) is used in a number of the company’s applications. ‘An example is SciVal; in 2009 we launched a module in SciVal which allows research managers to explore the potential productivity of various collaborations among researchers. This module was built on top of complex analytics of citation information for all of the publications in our Scopus database. Without a supercomputer, we might still be calculating these analytics! We also applied HPCC to the challenge of creating a plot of an institution’s key “competencies” to dramatically increase our throughput while decreasing our cycle time,’ he said.
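Elsevier has not published the mechanics of that SciVal module, but the general flavour of such citation analytics can be sketched in a few lines of Python: tally citation counts per researcher, record who has already co-published, and rank the untried pairings by combined impact. Every name, record and scoring rule below is an illustrative assumption, not SciVal’s actual method.

```python
from collections import defaultdict
from itertools import combinations

# Toy publication records: (author list, citation count).
# All names and the scoring rule are illustrative assumptions,
# not the analytics SciVal actually runs over Scopus.
publications = [
    (["alice", "bob"], 120),
    (["alice", "carol"], 45),
    (["dave"], 80),
    (["bob", "dave"], 30),
]

citations = defaultdict(int)   # total citations per researcher
coauthors = defaultdict(set)   # who has already published with whom

for authors, cites in publications:
    for author in authors:
        citations[author] += cites
    for a, b in combinations(authors, 2):
        coauthors[a].add(b)
        coauthors[b].add(a)

# Rank pairs that have *not* yet collaborated by their combined impact.
candidates = sorted(
    ((citations[a] + citations[b], a, b)
     for a, b in combinations(sorted(citations), 2)
     if b not in coauthors[a]),
    reverse=True,
)

for score, a, b in candidates:
    print(f"{a} + {b}: combined citations {score}")
```

At the scale of Scopus – tens of millions of records rather than four – it is the pairwise step that makes a distributed platform such as HPCC attractive, since the number of candidate pairs grows quadratically with the number of researchers.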
With this array of technology available, researchers began to approach the publisher about using its HPCC resources.
‘Researchers work with us for a number of reasons,’ explained Meeuwis van Arkel, VP of product development at Elsevier. ‘First, we have the raw computing power on offer to tackle such huge datasets. Second, we have a combination of technical and scientific knowledge that means we can apply taxonomies and ontologies that enable more accurate data analysis. Third, our content, given our long history of scientific research, is able to deliver crucial and trusted insights.’
He went on to say that the company’s supercomputing resources are used in a huge range of projects across many different fields. ‘Increasingly, in life sciences for example, our customers are coming to us with vast datasets that they have collected themselves and that they want to interrogate for insights. While having their own proprietary data electronically available sounds great, it in fact creates a challenge of its own – by presenting even more information to be ploughed through, which contains a lot of “noise”. We help in this case by structuring these vast amounts of unstructured content for analysis. One example would be that we ask the HPCC architecture to look for specific relationships and triplets – using Elsevier’s existing taxonomies – in our customer’s proprietary content. The outcome is that we then deliver back all identified biological targets, including the known relationships to chemical compounds.’
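Van Arkel did not detail the pipeline, but the shape of this kind of taxonomy-driven triplet mining can be sketched in Python. The dictionaries, relation terms and sentences below are toy assumptions standing in for Elsevier’s curated taxonomies, and the naive per-sentence matching stands in for work that an HPCC cluster would distribute across millions of documents.

```python
import re

# Toy lookup sets standing in for Elsevier's curated taxonomies;
# every term, relation and sentence here is an illustrative assumption.
TARGETS   = {"egfr", "braf", "tp53"}          # biological targets
COMPOUNDS = {"gefitinib", "vemurafenib"}      # chemical compounds
RELATIONS = {"inhibits", "activates", "binds"}

def extract_triplets(text):
    """Yield (target, relation, compound) triplets found per sentence.

    On an HPCC cluster this per-sentence matching would run in parallel
    across a whole corpus; the logic here is deliberately naive.
    """
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        tokens = re.findall(r"[a-z0-9-]+", sentence)
        rels = [t for t in tokens if t in RELATIONS]
        tgts = [t for t in tokens if t in TARGETS]
        cpds = [t for t in tokens if t in COMPOUNDS]
        for rel in rels:
            for tgt in tgts:
                for cpd in cpds:
                    yield (tgt, rel, cpd)

corpus = ("Gefitinib inhibits EGFR in lung tissue. "
          "Vemurafenib binds BRAF with high affinity.")
for triplet in extract_triplets(corpus):
    print(triplet)
```

Running the sketch prints ('egfr', 'inhibits', 'gefitinib') and ('braf', 'binds', 'vemurafenib') – the kind of target–compound relationship van Arkel describes delivering back, albeit extracted here by the crudest possible matching.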
So how does it work? The process is very collaborative, according to van Arkel. ‘We work with our customers to structure their own proprietary data while applying our knowledge of scientific taxonomies. We also conduct research on the HPCC where Elsevier provides the data and researchers themselves construct the unique algorithms to apply to the data. We are very open to different working relationships, given the dynamic nature of the work that we are doing.’
Person agreed: ‘We are always looking for new ways to help our customers improve their outcomes and research success, and so are continually evaluating new projects where supercomputing technology is needed. We will be extending this concept on a project-by-project basis for researchers where supercomputer technology is the answer – and, as the vast body of information grows, we expect that more and more research will require supercomputer technology to ensure its success. We are also considering what opportunities there are to use the technology to mine the value of our other owned assets.’