Switzerland builds next-generation metacatalogue
Four languages, five metadata standards, 13 OPACs, four federated search platforms and more than 800 libraries... Tobias Viegener reveals the challenges in building a nationwide search platform for Switzerland
Users often turn to search engines rather than their libraries’ OPACs as they can be a simpler way to find relevant information. In 2006 this so-called OPAC crisis and analysis of new search platforms prompted the University Library of Basel to start considering a nationwide search platform for Switzerland. In 2007 the first call for proposals for E-Lib.ch, a project aiming to build a nationwide Swiss electronic library, was held. At the same time, the idea of a next-generation metasearch tool covering all Swiss university libraries and the Swiss National Library was born – SwissBib.
From a user perspective it would seem logical to bring such resources together – and this is not the first attempt to do so. But, as always in a federal environment with different political bodies funding their own universities and the Swiss Confederation doing the same at the central level, it needed a spark to get people united on the project and centralised funding to start collaborative work.
E-Lib.ch provided both the spark and the funding. Its mission is to deliver a central national gateway with a single point of access for academic information provision and information research in Switzerland and to establish it as a long-term entity. The project is financed with around €7.3 million and runs for four years, from 2008 to 2011. It has two major areas, ‘research and use’ and ‘digital content’, which together cover around 30 projects.
A central entry point
SwissBib aims to build a central entry point with a consistent user interface for searching, finding, and accessing scientific information in Switzerland. It spans all Swiss university libraries and the Swiss National Library. It mainly focuses on metadata, which is still the core business of the libraries involved. In addition to the libraries’ catalogue data, it includes the metadata and, if possible, the full text of the institutions’ document repositories and academic e-archiving projects. These also cover digitised images and old manuscripts.
Looking at the problems the library world is currently facing, especially the OPAC crisis, several factors and functionalities were defined as crucial. A major area is access to resources, either to online content or printed material. The latter is quite complex because of the federal structure of the Swiss library system. There are different lending and access policies for over 800 libraries ranging from the big university libraries down to the community library or the small research library.
Another important issue is multilingualism. At least four languages (French, Italian, German and English) have to be supported in the interface as well as at the search engine level. Therefore it is very important to make use of the different methods used in Swiss university libraries for subject indexing and classification. In order to allow the project partners to make extensive use of the search tools provided by SwissBib, it must be possible to filter results by criteria such as region, language or content. These filters must also be linked with a customised user interface to allow institutions to present their data separately as their own service.
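The filtering requirement can be illustrated with a minimal sketch. The record fields (`language`, `region`, `content`) and the sample data are assumptions made for illustration only; they do not reflect the actual SwissBib data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # All field names are hypothetical, chosen to mirror the criteria in the text.
    title: str
    language: str  # language code, e.g. "de"
    region: str    # region label, e.g. "Basel"
    content: str   # content type, e.g. "book"


def filter_records(records, **criteria):
    """Keep only the records whose attributes match every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


RECORDS = [
    Record("Die Physiker", "de", "Basel", "book"),
    Record("Les Misérables", "fr", "Genève", "book"),
    Record("Basler Stadtplan", "de", "Basel", "map"),
]

# A customised institutional view could be little more than a fixed set of criteria.
basel_books = filter_records(RECORDS, region="Basel", content="book")
```

In a sketch like this, an institution's own service is simply the shared index seen through a preset filter, which is the idea behind linking filters to customised interfaces.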
Another prerequisite for SwissBib is that it has to cope with different types of bibliographic metadata without interfering with the local data providers. This is of special importance because, within Switzerland, there are three different adaptations of AACR2 in use, as well as two different breeds of MARC. Even library consortia that use the same rules interpret them differently so that the same resource may be catalogued in at least eight different ways. This situation occurs because, although a large portion of Swiss libraries is organised into three consortia, there are eight different server locations. Each of these has its own bibliographic database as well as authority databases. It becomes slightly more complicated at the system level as the consortia use two different library systems, Aleph and Virtua. For SwissBib this means that, in addition to the handling of bibliographic data, it has to work on two solutions for availability, backlinking and authentication.
One big issue at the data level is the high percentage of duplicates. This stems from the multiplicity of institutions serving the universities of the different cantons, coupled with a low rate of copy cataloguing over the years, although the situation has improved during the last decade. At the moment we estimate that, of around 17 million bibliographic records, approximately three million will be duplicates.
The complex organisational situation and the heterogeneity of systems and data formats have also led to different approaches with federated search tools. Currently, in addition to 13 OPACs, there are four different federated search platforms, all covering the same data pool but built by the different consortia. The products used include KVK (Karlsruhe Virtual Catalogue), Metalib and a home-grown solution based on Z39.50. This situation, coupled with local user data administration, makes it difficult for a standard university user to get materials over these portals.
How the different components of SwissBib fit together
Looking at this background, it became clear that SwissBib should not be just another federated search tool. A technical solution was needed that could prepare the bibliographic data independently of the local system. Furthermore, the component for data preparation must be able to cope with the differences in cataloguing rules and must provide flexible mechanisms for deduplication and clustering. The search engine must be powerful and speedy to deliver results considerably faster than the OPACs and federated search tools while also getting the maximum out of the data with respect to multilingualism. And the user interface must be versatile and open enough to allow a connection to the local services. Additionally, it should deliver interfaces to present the SwissBib index to other interested parties of E-Lib.ch so they are able to process the SwissBib data however they want.
Looking at the different functionalities required, it was decided that the SwissBib system should be layered and modular, and that the modules should be as independent as possible. The different modules should communicate over open interfaces using standard protocols as far as possible. Communication with the ‘outside world’ should be possible and organised via open standards. Another benefit of modularity is the option to expand or exchange parts of the system if necessary.
Picking a solution
Most of the current products on the market cover some of the functionality described above. Unfortunately it was difficult to find solutions that could cope with all the areas that are needed for SwissBib. The most problematic part was the component for data preparation as the focal point of most vendors is the search application coupled with an up-to-date user interface.
SwissBib chose, as the result of a GPA-conforming tender, a solution from OCLC consisting of three stand-alone components that interact via standard protocols and data exchange mechanisms. The component for data preparation is CBS (Central Bibliographic System), FAST is used as a search engine and the interface for user interaction as well as machine-to-machine communication is provided through TouchPoint.
Data preparation
To keep up with the local systems’ capabilities and meet the needs of different metadata providers, three mechanisms for data delivery and update are provided: file transfer, OAI and ‘SRU record update’. The latter would be the most elegant for the local library systems but is currently not supported by Aleph or Virtua.
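For readers unfamiliar with OAI harvesting, the following minimal sketch shows how an incremental `ListRecords` request can be built and how the `resumptionToken` that pages through large result sets can be read from a response. The base URL, the `marc21` metadata prefix and the sample response are assumptions; the prefixes actually offered depend on each data provider.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH a resumption token is an exclusive argument:
        # it replaces all other arguments except the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only records changed since then
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumption token of a response, or None on the last page."""
    token = ET.fromstring(response_xml).find(f".//{OAI_NS}resumptionToken")
    if token is not None and token.text:
        return token.text.strip()
    return None


# Hypothetical endpoint and a hand-written response fragment for illustration.
BASE_URL = "https://example.org/oai"
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)

first_page = list_records_url(BASE_URL, "marc21", from_date="2009-01-01")
following_page = list_records_url(
    BASE_URL, resumption_token=next_token(SAMPLE_RESPONSE))
```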
For the library systems, the MARCXML format for data delivery is preferred and ISO2709 is possible as a fallback. For the repository data, DC, MODS and MARCXML are welcome.
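To give an impression of what a delivered MARCXML record looks like and how its fields can be read, here is a minimal sketch using only the Python standard library. The sample record is invented for illustration, and the helper name `subfields` is our own.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML ("MARC 21 slim") schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, not taken from any real catalogue.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">demo-0001</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Dürrenmatt, Friedrich</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">Die Physiker :</subfield>
    <subfield code="b">eine Komödie in zwei Akten</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return the text of every subfield `code` in every datafield `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [sf.text for sf in record.findall(path, MARC_NS)]


record = ET.fromstring(SAMPLE)
record_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
title = subfields(record, "245", "a")   # main title (field 245, subfield a)
author = subfields(record, "100", "a")  # main entry, personal name
```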
Generally, CBS is very flexible when importing data as it uses an internal format for data storage and supports different formats for import and export. The data is normalised, converted, deduplicated, enriched and clustered at this stage. All the different stages of data preparation can be defined according to import source. The mechanism for evaluating and merging possible duplicates is highly configurable, which is of great use given the complex situation. It is normally run during the import process but deduplication of data already loaded into the database is also possible. FRBR-clustering is done at this level. Together with the deduplication it should ease the problems caused by the varying cataloguing practices.
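The general idea behind this kind of deduplication can be sketched with a simple match key. This is not the CBS algorithm, which is configurable and considerably more sophisticated; the choice of title, author surname and year as the key, and the sample records, are assumptions made purely for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def match_key(record):
    """Build a key that tolerates small differences in cataloguing practice."""
    surname = normalise(record["author"].split(",")[0])
    return (normalise(record["title"]), surname, record["year"])


def cluster(records):
    """Group record ids whose match keys are identical."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[match_key(rec)].append(rec["id"])
    return dict(clusters)


# Invented records: A1 and B7 describe the same edition, catalogued differently.
RECORDS = [
    {"id": "A1", "title": "Die Physiker", "author": "Dürrenmatt, Friedrich", "year": 1962},
    {"id": "B7", "title": "Die Physiker.", "author": "Dürrenmatt, F.", "year": 1962},
    {"id": "C3", "title": "Der Besuch der alten Dame", "author": "Dürrenmatt, Friedrich", "year": 1956},
]

groups = cluster(RECORDS)
```

A strict key like this one errs on the side of missing duplicates rather than merging distinct works; how strict to be per import source is exactly the kind of decision a configurable mechanism leaves open.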
Enrichment is carried out at several stages in SwissBib. The easiest of these is deduplication where many weak records – results of retrospective cataloguing – are enriched by just combining them with richer records from other databases. A more promising approach on the metadata level is enrichment via WorldCat as CBS has a bidirectional interface to WorldCat. The third step will be at the level of FAST where available full text as well as abstracts and indexes (the products of catalogue enrichment at the local library level) will be ingested.
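The first of these steps, enriching a weak record by combining it with a richer duplicate, can be sketched as a field-by-field merge. The merge policy shown (first non-empty value wins, list fields are united) and the field names are assumptions for illustration, not a description of how CBS merges records.

```python
def merge_records(records):
    """Combine duplicates: keep the first non-empty value, unite list fields."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if isinstance(value, list):
                existing = merged.setdefault(field, [])
                for item in value:
                    if item not in existing:
                        existing.append(item)
            elif value and not merged.get(field):
                merged[field] = value
    return merged


# Invented example: a sparse retrospective record and a fuller duplicate.
weak = {"title": "Die Physiker", "author": "", "subjects": []}
rich = {
    "title": "Die Physiker",
    "author": "Dürrenmatt, Friedrich",
    "subjects": ["Drama"],
    "publisher": "Arche",
}

enriched = merge_records([weak, rich])
```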
After preparation, all the data is stored in a database, as the data preparation at the CBS stage does not necessarily need to discard bibliographical or administrative information that is not important for the search interface or hinders the preparation of a good index.
This leads to the export component of CBS where the data processed by CBS is exported – in our system again as MARCXML – to be imported and prepared by FAST. At this stage it would be possible to bulk export data back to the local systems or to other interested parties in different metadata formats.
Searching and indexing
FAST ESP (Enterprise Search Platform) is a search and index engine that is specialised in high performance and scalability. Although mainly focused on full text, it works well with structured metadata too. This combination makes it very interesting as many university libraries start large digitisation projects. FAST has the standard features such as ranking of result sets, faceted searches and linguistic algorithms that it applies to improve the search results. FAST has a modular design that makes it possible to integrate it easily with CBS on the side of data loading as well as with TouchPoint on the search side. It is also possible to bypass CBS and access other sources directly. FAST ESP currently supports 80 languages and provides dictionaries that can be used for enriching the indexes and for helping users with the definition of search queries in the user interface. The dictionary feature is of interest as it allows the creation of our own index-based dictionaries that guarantee hits. This is especially interesting with regard to authority data, either name or subject headings.
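The appeal of index-based dictionaries is that every suggestion offered to the user is a term that actually occurs in the index, so selecting it cannot produce an empty result. The following minimal sketch illustrates that principle with a sorted term list and prefix lookup. It does not use or imitate the FAST ESP API; the class name and the sample headings are hypothetical.

```python
import bisect


class IndexDictionary:
    """Suggest only terms that occur in the index, so each suggestion has hits."""

    def __init__(self, indexed_terms):
        # Store (lower-cased key, original form) pairs in sorted order.
        self._entries = sorted({(t.lower(), t) for t in indexed_terms})

    def suggest(self, prefix, limit=5):
        """Return up to `limit` indexed terms starting with `prefix`."""
        key = prefix.lower()
        start = bisect.bisect_left(self._entries, (key,))
        matches = []
        for entry_key, original in self._entries[start:]:
            if not entry_key.startswith(key) or len(matches) == limit:
                break
            matches.append(original)
        return matches


# Hypothetical name headings taken from an authority file.
names = IndexDictionary([
    "Dunant, Henry",
    "Frisch, Max",
    "Dufour, Guillaume-Henri",
    "Keller, Gottfried",
])

suggestions = names.suggest("du")
```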
The data import from CBS to FAST is done via the SRU Record Update protocol to secure the continuous update of the index. The initial load is done by file transfer.
Communicating with users and machines
TouchPoint is an end user application. It combines the advantages of the traditional meta-search engines and their access to remote databases with the much more controlled access to internal databases. It offers the possibility to load and index frequently-accessed databases and to access remotely those that are less frequently used. The connection to the FAST index is established by a connector that interprets the queries as well as the results and brings them to the interface. This mechanism makes TouchPoint very flexible when it comes to the inclusion of other search engines based on local indexes, for example Lucene/SOLR. The possibility to include external indexes is generally interesting although not relevant for the standard SwissBib application we are building.
Technically, TouchPoint is a JSP application deployed on Tomcat. The styling is organised by tiles. The functionality of TouchPoint can be addressed via taglibs and, if necessary, new taglibs can be added, which makes it very flexible. In the interface it uses modern JavaScript functionality to ensure extensibility in the field of Web 2.0. It includes basic Web 2.0 functionality such as tagging, reviewing and personalised lists.
The architecture of TouchPoint makes it easy to implement different views. The degree of difference compared with the original layout is up to the designer and the staff resources available.
In general, TouchPoint fits in well with the Swiss library world as it supports OpenURL and SRU. It also has powerful backlinking mechanisms and supports Shibboleth-based authentication. This allows a cross-domain single sign-on and removes the need for content and service providers to maintain databases for user names and passwords. In addition, via SRU it is possible to access the FAST index with an open standard protocol and get back bibliographic data as well as facets and spell checking suggestions.
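As an illustration of what machine access via SRU involves, the sketch below builds a standard `searchRetrieve` request with a CQL query. The endpoint URL is hypothetical, and the `marcxml` record schema name and the `dc.title` and `dc.creator` indexes are assumptions, since the schemas and indexes on offer depend on the server configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10,
                   record_schema="marcxml", version="1.1"):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "startRecord": start_record,        # 1-based position of the first record
        "maximumRecords": maximum_records,  # page size
        "recordSchema": record_schema,      # assumed schema name, server-dependent
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
query = 'dc.title = "physiker" and dc.creator = "dürrenmatt"'
url = sru_search_url("https://example.org/sru", query, maximum_records=5)

# Decoding the URL again shows that the query survives the encoding intact.
decoded = parse_qs(urlsplit(url).query)
```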
Current status
The first version of SwissBib is planned for this year. This will provide a basic version that will be extended over the next two years until 2011 when the E-Lib.ch project officially ends. In the project team’s opinion, with the software solution chosen, SwissBib’s foundations are solid. This secures the further development and enhancement of the service to let it cope with the lively Swiss library environment.
Tobias Viegener is project coordinator for SwissBib.
E-Lib.ch: www.e-lib.ch/index_e.html
SwissBib: www.swissbib.org