Data Preservation: Great data, but will it last?
While most of us fathom out how to navigate the new digital information landscape, those with foresight are worrying about the past - how to preserve electronic content. Vanessa Spedding reports.
The process of saving or storing a document, some data, or a pre-print paper on a computer is a simple and automatic act that (perhaps with the exception of remembering to back up periodically) does not require much thought. But how will today's climate-change data, say, appear to the climate researchers of the future, searching for correlations between the temperature trends of the early 21st century and the crippling Norwegian droughts of 2060? Will the datasets be recognisable by the software of the day? Will it be possible to extract the contents of the electronic file containing the journal paper? If so, it will only be because someone thought well in advance about how to 'future-proof' the digital objects that constitute this information, so that they can be regenerated in the computer applications of tomorrow.
It may seem like an academic problem, to be addressed once computer technology has evolved enough to demand action, but several factors suggest that the time to do something is right now, and momentum is gathering worldwide to work out exactly what.
A key factor is the sheer volume of electronic data currently being produced. Lynne Brindley, chief executive of the British Library, pointed out in a presentation to the November 2002 ALPSP meeting 'Archiving - whose problem is it?' that the world produces 1-2 exabytes (i.e. 1-2 billion gigabytes) of unique information per year, of which only 0.003 per cent is in printed form. That's equivalent to 250MB for every person on the planet. Each day sees the production of 7.3 million new web pages alone.
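As a rough check on that per-person figure, the arithmetic below assumes 1.5 exabytes of unique information per year and a world population of about six billion; neither number is stated that precisely in the talk.

```python
# Back-of-the-envelope check of the 'megabytes per person' figure.
# Assumed inputs (not from the presentation itself): ~1.5 exabytes of
# unique information per year and a world population of ~6 billion.
unique_info_bytes = 1.5e18   # 1.5 exabytes = 1.5 billion gigabytes
world_population = 6e9       # roughly six billion people

megabytes_per_person = unique_info_bytes / world_population / 1e6
print(f"{megabytes_per_person:.0f} MB per person per year")  # ~250 MB
```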
Another factor is the value of those data, both economic and societal. The world economy centres increasingly on knowledge - the ownership, management and licensing of information, or 'intellectual capital'. Not taking adequate steps to ensure the longevity of that information, whether printed or digital, amounts to economic suicide - and, equally importantly, risks placing severe limitations on future research progress. Think of the real worth of all the genome maps, biodiversity databases and environmental datasets that are accumulating year by year. Much of this electronic information, including an increasing number of e-journals, is now 'born digital' and so can only be used, modelled, analysed and managed in digital form.
It's important to distinguish between long-term archiving - commonly termed 'digital preservation' - and short-term storage. The Open Archives Initiative (OAI), for example, with which some readers may be familiar, is a worthy project that ensures access to freely available repositories of research information, but it does not address long-term archiving issues.
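The OAI's contribution is access and interoperability: its Protocol for Metadata Harvesting (OAI-PMH) lets a service pull descriptive records out of a repository over plain HTTP. The sketch below issues one such harvesting request; the base URL is a hypothetical placeholder, and nothing about the request guarantees that the underlying files will survive into the future.

```python
# Minimal OAI-PMH harvesting request: fetch Dublin Core records over HTTP.
# The repository base URL is a hypothetical placeholder, not a real endpoint.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://repository.example.org/oai"  # hypothetical OAI-PMH endpoint

params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
with urlopen(f"{BASE_URL}?{params}") as response:
    xml = response.read().decode("utf-8")

print(xml[:500])  # raw XML metadata: this is access, not long-term preservation
```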
In fact, perhaps surprisingly, the only systems that do make some attempt to confront this challenge are in their pilot stage; any final solution is a long way off. This is partly because of the high costs associated with the required technological developments, and partly because of difficulties in assigning responsibility for undertaking such investments, particularly given that there is no apparent short-term 'return' on them.
Mark Bide, of the UK-based consultancy Rightscom, which specialises in digital content, thinks that the onus must lie with governments to protect their digital heritage, because no other organisations can be expected to act in the public interest over long-term periods - and because the investments required are so high. 'Traditionally, managing repositories has been the responsibility of libraries,' he said, 'but the question now is who is going to pay for this in the long run? It's very expensive. But then so is building national libraries. The US Congress announced $100m for a Library of Congress digital preservation programme: that's the scale of this sort of initiative. Certainly big publishers will also be looking at preservation issues within their own domain, but will they be looking at the next 100 years? Probably not.'
The fact is that there are so many factors to resolve that most of the activity so far has focussed on debating and researching the fundamental issues: for example, how to cope with different data formats and provide a uniform interface; how to ensure interoperability between different archival systems; and how to preserve data so that it can be regenerated and viewed as meaningful information at a later date, by which time existing hardware and software might be completely obsolete. None of these issues is fully resolved. The community (and even individual projects) is divided on the relative merits of the three main approaches - maintenance (of the original technology), emulation (of the original software/hardware environment), and migration (of data formats to new technologies as they evolve) - although a combination of these methods usually emerges as the consensus.
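To make the migration approach concrete, here is a hypothetical sketch (not drawn from any of the projects discussed) of how a repository might derive fresh renditions of a stored object as formats become obsolete, while keeping the original bitstream and a provenance trail:

```python
# Hypothetical sketch of the 'migration' strategy: each time a format
# becomes obsolete, the repository derives a copy in a current format
# but keeps the original bitstream and a record of every conversion.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DigitalObject:
    identifier: str
    original_format: str
    original_bitstream: bytes
    migrations: list = field(default_factory=list)  # provenance trail

    def migrate(self, new_format: str, converter) -> None:
        """Derive a copy in new_format; the original is never discarded."""
        source = self.migrations[-1][2] if self.migrations else self.original_bitstream
        converted = converter(source)
        self.migrations.append((date.today().isoformat(), new_format, converted))

    def current_rendition(self) -> tuple:
        """Return the most recent format and bitstream for delivery."""
        if self.migrations:
            _, fmt, data = self.migrations[-1]
            return fmt, data
        return self.original_format, self.original_bitstream
```

Emulation, by contrast, leaves the original bitstream untouched and instead recreates the software environment needed to render it.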
Different projects are addressing different combinations of all of these issues, using different approaches, so the overall impression of the state of the art is a rather jagged cutting edge.
A good place to get a feel for the latest developments, rather predictably, is the US. The Library of Congress initiative to which Bide referred (specifically, the 'Plan for the National Digital Information Infrastructure and Preservation Program (NDIIPP)', which has just received approval and first stage funding from Congress), is one of several major projects underway there, due largely to a recognition of the urgency of the issue and the investments required.
This is openly acknowledged in a new report from the Advisory Committee for Cyberinfrastructure of the National Science Foundation (NSF), entitled Revolutionising Science and Engineering through Cyberinfrastructure. It calls for a distributed information and communication technologies system to provide a long-term platform that will support scientific research, and emphasises the risks of failing to act quickly. These include: 'leaving key data in irreconcilable formats; long-term failures to archive and curate data collected at great expense; and artificial barriers between disciplines built from incompatible tools and structures'.
None of the existing pilot projects in the US are commercially driven at the moment, although many involve collaboration with large IT companies. Development has so far been hosted by academic institutions with funding from charitable trusts and governments, the end result being freely available, open-source systems.
A leading example is DSpace, an open-source digital repository system that stores intellectual output in multiple formats, from computer simulations to journal papers. It became operational at the end of last year following a two-year collaboration between the Massachusetts Institute of Technology (MIT) Libraries and Hewlett-Packard Laboratories. It is described as a specialised type of digital asset management system that manages and distributes digital items composed of 'bitstreams' (thereby resolving most future hardware compatibility problems).
The system allows the creation, indexing and searching of associated metadata (currently in 'Dublin Core' format) to locate and retrieve the items via the web. According to MacKenzie Smith, associate director for technology at MIT Libraries: 'DSpace was developed as a simple 'breadth-first' digital asset management system (i.e. it handles everything an institution needs to get started, but minimally). It's easy to extend and customise to meet additional needs. Functionally it has a very nice workflow subsystem for submitting things, which I haven't seen elsewhere.'
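For readers unfamiliar with Dublin Core, the sketch below assembles a minimal descriptive record of the kind a repository associates with each deposited item. The element names are standard Dublin Core; the item details and identifier are invented for illustration.

```python
# Minimal Dublin Core descriptive record, serialised as XML.
# The element names are standard Dublin Core; the values are invented.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
for element, value in [
    ("title",      "Early 21st-century temperature trends"),
    ("creator",    "Example, A. Researcher"),
    ("date",       "2003-05-01"),
    ("type",       "Dataset"),
    ("format",     "text/plain"),        # MIME type of the stored bitstream
    ("identifier", "hdl:1234.5/6789"),   # e.g. a persistent identifier (invented)
]:
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```

It is this kind of record, captured at submission time, that the web search and retrieval interface works against.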
In terms of how exactly these digital objects will be preserved and regenerated, Smith explained: 'The DSpace system at MIT has policies for what we promise to preserve (i.e. open, popular, standards-based formats like TIFF and ASCII), and those that we will try our best to preserve but can't promise (e.g. Microsoft, etc). Lots of formats fall somewhere in between (e.g. Adobe PDF) so there we make judgement calls. As for how to preserve them, it's going to vary from format to format. Some will be possible, cheap even, to mass-migrate forward with time. Others will have to be emulated because they aren't really formats (video games or simulations, for example). It's going to be years before we really understand how to preserve these things.'
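Smith's three tiers can be pictured as a simple lookup from file format to preservation commitment. The sketch below is only an illustration of that idea, not DSpace's actual format registry; the level names and format mappings are invented.

```python
# Illustrative three-tier preservation policy keyed on MIME type.
# The tiers echo the policy Smith describes (promise to preserve /
# judgement call / best effort); the specific mappings are invented.
PRESERVATION_POLICY = {
    "image/tiff":         "preserve",     # open, popular, standards-based
    "text/plain":         "preserve",
    "application/pdf":    "judgement",    # widely used but vendor-controlled
    "application/msword": "best-effort",  # proprietary; no promise made
}

def preservation_level(mime_type: str) -> str:
    """Return the preservation commitment for a deposited bitstream."""
    return PRESERVATION_POLICY.get(mime_type, "best-effort")

assert preservation_level("image/tiff") == "preserve"
assert preservation_level("application/x-simulation") == "best-effort"
```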
The MIT Libraries have announced the 'DSpace Federation', a collaboration with six other major North American research universities and with Cambridge University in the UK (which will focus on the regeneration problem), to take the technology further; the project has received a $300,000 grant from the Andrew W. Mellon Foundation. The plan now is to extend the scope of DSpace by encouraging other organisations to install it, run repositories and help to develop further its adaptability and its potential for 'federated collections' - distributed digital libraries held on DSpace repositories in different locations.
Meanwhile, the Harvard University Library and three journal publishers - Blackwell Publishing, John Wiley & Sons, and the University of Chicago Press - have agreed to work together in another project, to develop an experimental archive just for electronic journals. The Andrew W. Mellon Foundation has sponsored this venture too, to the tune of $145,000. The grant challenges Harvard and its publishing partners to 'address fundamental issues in the digital environment,' and to develop a proposal for an archive for these journals. Key to the proposal is the establishment of an agreement between the partners regarding archival rights and responsibilities, the formulation of a technical implementation plan and the creation of organisational and business models.
All this impressive stateside activity doesn't preclude anything happening in Europe. Quite the opposite: a number of long-term projects have already concluded and moved on to a second stage. But few of them have gone beyond gaining an understanding of the issues and proposing solutions - funding for practical applications has yet to materialise at US levels. 'Lots of higher-level, abstract thinking has been done,' explained Bide, 'although transferring to real technology is proving to be as difficult as expected. None of it is going to be cheap.'
Cambridge University in the UK has received funding from the DTI by hitching a lift on an American project: it is contributing to (and benefiting from) the DSpace programme, by working with MIT and the DSpace Federation to improve techniques for regenerating digital objects.
The Cambridge version of the system, known as 'DSpace@Cambridge', has two principal roles. It can capture, index, store, disseminate and preserve digital material created by the academic community, and will also provide a home for the increasing amount of material that is being digitised from the University Library's collections. Peter Morgan, project director for DSpace@Cambridge, explained that the decision to go with DSpace stemmed partly from the existing Cambridge-MIT collaboration and partly from a realisation that researchers' needs were rather more demanding than an electronic journal archive alone could fulfil. However, he acknowledged: 'We are still a long way from getting satisfactory solutions to some of the problems associated with digital repositories. If we don't sort this soon, we are in danger of losing valuable material for ever.'
Also in the UK, the Joint Information Systems Committee (JISC) is addressing the digital preservation issue and distributing funds to research groups. Its Digital Preservation Programme director, Neil Beagrie, pointed out that preserving data is essential for the future of new science. 'To this end JISC is working with a number of partners and has set up the UK Digital Preservation Coalition,' he explained. 'We are discussing a number of practical activities, in the areas of web and library archiving. We're also planning a Digital Curation Centre for e-science, and a distributed network of digital preservation centres.'
JISC has already supported a number of preservation programmes, including CAMiLEON, a joint effort between JISC and the NSF, and Cedars, both of which researched potential ways forward for long-term digital libraries. Beagrie is now keen to explore possible business models and how commercial publishers could fit in. This is an important issue, since it's hard to imagine how a digital preservation system could simultaneously accommodate the different requirements of its users and investors - which could be researchers, authors, research organisations, libraries, intermediaries, aggregators and commercial publishers - let alone determine how to assign the responsibilities and gains, given the different budgets and investment priorities of the stakeholders.
Meanwhile, there are other issues to resolve, such as the choice of technologies and standards, for both data and metadata, and what to do about digital rights management and legal deposit requirements. All of these complicate the job yet further, and communication between all parties - commercial, library and research - will be absolutely essential to solving the whole challenging problem. It may even force a warming of some of the strained relationships brought about by other, new business implications of the electronic publishing revolution.
There are certainly signs of hope. Probably the most impressive example of no-nonsense collaboration comes from within Europe: namely the three-way arrangement established in the Netherlands, with its national library, the Koninklijke Bibliotheek (KB), at the fulcrum. The KB has teamed up with IBM-Netherlands in a long-term project to develop a comprehensive digital preservation solution, called the Deposit of Netherlands Electronic Publications-implementation (DNEP-i). It has also involved Elsevier Science in a new electronic archiving agreement.
These two initiatives combined give the KB both the technology and the appropriate business arrangement to provide the first official digital archive for Elsevier Science journals. The result is that responsibility for the long-term archiving has been placed with a specific body to the benefit of all (the KB will ensure migration of the content and associated software as technologies change), allowing Elsevier to relax about long-term preservation issues. Elsevier will, meanwhile, ensure permanent availability of their journals to the KB, where they will be accessible to walk-in users of the library at any time.
It can be hard, in all this, to see where profit comes in, if indeed it does at all. But there must be some potential benefit for companies like IBM, Sun and HP, given their investments in some of the big US projects. In fact they claim to be learning enough through the R&D process to be able to deploy their new-found wisdom in future commercial archiving products aimed at the academic (and other information-reliant) communities. It won't be long, think both Beagrie and Smith, before they and other companies, probably operating in the records management or digital asset management sectors at the moment, join what could ultimately be a very profitable digital preservation movement, at which point we will see a whole new market open up.
Further information
The British Library
www.bl.uk
Rightscom
www.rightscom.com
US Congress Digital Preservation Project (NDIIPP)
The NSF Advisory Committee for Cyberinfrastructure report
www.communitytechnology.org/nsf_ci_report/
The National Science Foundation (NSF)
www.nsf.gov
DSpace
www.dspace.org
DSpace@Cambridge
www.lib.cam.ac.uk/dspace/
Harvard Library Project
JISC Digital Preservation Programme
www.jisc.ac.uk/dner/preservation/
The Cedars project report
The Digital Preservation Coalition
www.dpconline.org
Digital preservation handbook (Digital Preservation Coalition)
www.dpconline.org/graphics/handbook
The Dutch KB and DNEP
www.kb.nl
The European Commission on Preservation and Access (ECPA)