Opening library data
Libraries generate and hold large amounts of data. David Stuart looks at the potential benefits of sharing and allowing people to innovate with it
An increasingly wide range of organisations are starting to recognise the advantages of making some of their data publicly available online, and allowing external individuals and organisations to innovate around it. Although such practices were once primarily associated with leading-edge web organisations such as Google and Amazon, they are increasingly expected to be part of every organisation’s online strategy. The organisations involved range from supermarkets such as Tesco (www.techfortesco.com/forum), looking to exploit the commercial potential of making data available, to government departments responding to demands for increased transparency and access to potentially valuable publicly-funded data (for example, data.gov.uk).
Like any other type of organisation, libraries have much to gain from making their data publicly available. And, while they have a long tradition of sharing data and resources, it is important that they look beyond the small community of library and information professionals.
Generally the role of the library in the web of data is focused on helping researchers make their data available online in institutional repositories, as well as facilitating access to data that is already available online. However, it is important to recognise that libraries also have considerable amounts of their own data that would have value if it were publicly available. I am not talking here of publishable data that may exist within a library’s collection, but the vast quantities of data that are collected during the provision of library services. This includes data collected both within internal library systems and within those of external parties: from the data stored in public catalogues, to download rates within institutional repositories and subscription services. When such data is made publicly available, external developers can start to build applications that extend library services in ways the library may not have had the time, money, or capability to pursue.
This effect can already be seen with Twitter, the microblogging site. Its range of application programming interfaces (APIs), which enable computer programs to interact with a web service, has allowed the creation of numerous applications that interact with data from the core service. Applications have been built on a host of different platforms, including several mobile-phone operating systems and social network sites. The basic Twitter service of sharing 140-character text messages has been extended to include the sharing of richer content, such as photos, video, and location information. Most importantly, the data can be combined with data from other sources to provide new insights and services, for example displaying ‘tweets’ on a map to reveal differing opinions on a breaking news story.
Comparisons between Twitter releasing data and the average library making its data available can only go so far, as the average library doesn’t have access to anywhere near the same number of potential developers as an international service, the early adopters of which were the technical elite. Nonetheless, not only do libraries have established communities of users, but their data has recognisable value that is likely to appeal to developers beyond their immediate community of users, especially when it can be aggregated from multiple sources.
The most obviously useful and visible library dataset is the library catalogue. In fact, as an increasing number of library resources are provided in an electronic format, it may be a user’s only visible point of interaction with library services. There is obvious value for a library’s users from such services being made available on numerous different platforms. Library users shouldn’t be forced to interact with a library’s services in a single place (albeit a virtual one) but should be able to interact with such services where and when they want them. This could be a user scanning the barcode of a book they stumble across in their local book store to see if it is available in a library they have access to, or someone being able to search their local catalogue without leaving the social network site where they increasingly spend much of their time.
Borrowing insight
While the data in the library catalogue is useful, so too is the other data collected by the library and information services. The practice of proactively suggesting books of potential interest to readers has been an integral part of Amazon’s online success. Libraries making usage data available could enable the development of numerous applications to help users in identifying relevant resources.
Unlike Amazon, which necessarily focuses on the books that people buy, libraries are able to provide a far richer picture of the way people use resources. They not only have information about the books people borrow, but also about how long those books are borrowed for. They also have more information about the people who are borrowing the resources.
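To make the idea of recommendations from usage data concrete, here is a minimal sketch of a ‘people who borrowed this also borrowed’ lookup. It assumes loan records are available as simple (borrower identifier, title) pairs; that record layout, the function name and the sample data are all hypothetical and not taken from any real library system.

```python
from collections import Counter, defaultdict


def co_borrowed(loans, title, top_n=3):
    """Return up to top_n titles most often borrowed by people who borrowed `title`.

    `loans` is an iterable of (borrower_id, title) pairs. This layout is an
    assumption made for illustration only.
    """
    # Group the titles borrowed by each borrower.
    titles_by_borrower = defaultdict(set)
    for borrower_id, loan_title in loans:
        titles_by_borrower[borrower_id].add(loan_title)

    # Count the other titles held by everyone who borrowed the target title.
    counts = Counter()
    for titles in titles_by_borrower.values():
        if title in titles:
            counts.update(titles - {title})

    return [other for other, _ in counts.most_common(top_n)]


# Hypothetical sample data.
sample_loans = [
    ("a", "Web Metrics"), ("a", "Link Analysis"),
    ("b", "Web Metrics"), ("b", "Link Analysis"), ("b", "Data Mining"),
    ("c", "Data Mining"),
]
suggestions = co_borrowed(sample_loans, "Web Metrics")
```

A real service would probably also weight by loan length or recency, which is exactly the kind of extra detail libraries hold and booksellers do not.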
It is important that the necessary steps are taken to make sure that a user’s anonymity is maintained, but the potential value of making use of such data is increasingly recognised. The JISC MOSAIC project in the UK has highlighted the potential of such data in the scholarly arena (ie-repository.jisc.ac.uk/466), noting that the University of Huddersfield had, during a general period of downturn in book borrowing, actually increased both the number of titles borrowed per student, and the number of unique titles borrowed. The value of usage data for identifying resources is not limited to the physical items within a catalogue, but extends to the electronic resources that are available.
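One possible first step towards protecting borrowers before usage data is released is to replace each borrower identifier with a keyed hash. The sketch below assumes the same hypothetical (borrower identifier, title) records and a secret key that the library never publishes; the function names are illustrative. Note that this is pseudonymisation rather than full anonymisation: distinctive borrowing patterns can still identify a person, so it would be only one part of any real safeguard.

```python
import hashlib
import hmac


def pseudonymise(borrower_id: str, secret_key: bytes) -> str:
    """Replace a borrower identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, borrower_id.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_release(loans, secret_key: bytes):
    """Return loan records with borrower identifiers replaced by pseudonyms.

    `loans` is an iterable of (borrower_id, title) pairs, an assumed layout.
    """
    return [(pseudonymise(borrower_id, secret_key), title) for borrower_id, title in loans]


# Hypothetical records and key; a real key would be long, random and kept private.
released = prepare_for_release(
    [("u1001", "Web Metrics"), ("u1001", "Link Analysis"), ("u2002", "Web Metrics")],
    secret_key=b"replace-with-a-private-random-key",
)
```

Because the same borrower always maps to the same pseudonym, co-borrowing patterns survive the transformation while the original identifiers do not appear in the released data.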
It is also important to recognise that the value of library data is not limited to the direct provision of library services, but can contribute to the overall information and research systems. There is increased interest in the potential of bibliometrics and webometrics to provide objective indicators of researchers’ outputs and identify emerging areas of research interest. The European Commission has funded research investigating the potential of webometric indicators to capture less formal collaborations and research that may not result in traditional publications. Releasing the data stored within library systems offers the opportunity to provide a much richer picture of the impact of research and emerging areas of scientific interest.
Most importantly, the release of library data offers the opportunity for it to be used in ways unthought-of by the library and information community – both data from a single library and from multiple libraries combined together. For this to happen, however, libraries need the data to be accessible to as wide a range of potential developers as possible, something that the library community is not necessarily good at.
Beyond the library community
The community of library and information professionals has a well-established tradition of sharing information and resources, with widely-adopted standards (such as MARC) and protocols (such as Z39.50) predating the web. However, the standards and protocols that they use have not necessarily extended beyond the library community, and if the data that they make available is to reach its potential value, it is important that it is available in as accessible a manner as possible.
Accessibility may be seen as one of the main drivers behind the adoption, outside the library community, of RESTful APIs, which allow queries to be structured as URIs and sent over HTTP. Such APIs lower the barrier to interacting with an organisation’s data: a query is simply embedded in a URI and sent as an HTTP request. In practice this means that users can often investigate the potential of a data source by constructing a query in the address bar of their web browser. In comparison, few users would know where to start with constructing a query that could be automatically sent to most library catalogues.
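As an illustration of how little is needed to form such a query, the following sketch builds a search URI from a title. The endpoint and the parameter names (title, format) are invented for the example and do not describe any real catalogue’s API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://catalogue.example.org/api/search"


def build_search_url(title, fmt="json", base_url=BASE_URL):
    """Return a URI that encodes a title search as query parameters."""
    # The parameter names are assumptions; a real API defines its own.
    params = {"title": title, "format": fmt}
    return f"{base_url}?{urlencode(params)}"


url = build_search_url("web metrics")
print(url)
```

The resulting string could be pasted into a browser’s address bar or fetched by a program, which is the low barrier to entry described above.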
Increasingly, the barriers to entry created by library-specific protocols and standards are being recognised within the library community, and a growing number of institutions are providing simple APIs (for example, Cambridge University Library, www.lib.cam.ac.uk/api, and North Carolina State University, www.lib.ncsu.edu/dli/projects/catalogws). However, catalogues are only the beginning of the kind of data that can be made available, and few institutions are yet making even this limited portion of their data easily accessible.
Cultural and technical challenges
As in many other industries, the widespread adoption of the web has brought about many changes in the library and information profession, and organisational change is rarely enthusiastically adopted. However, the most successful organisations are rarely those that try to cling on to their traditional role, but rather those that embrace the new opportunities that are offered. The role of the library as the sole provider of entry points into its collections has served libraries well for many years, but if libraries are to provide services fit for the twenty-first century they need to release their data and take advantage of the wisdom of the crowd.
David Stuart is a research associate at the Centre for e-Research, King’s College London, as well as an honorary research fellow in the Statistical Cybermetrics Research Group, University of Wolverhampton