Introducing REPOX: a tool to manage metadata spaces
Here we introduce how REPOX, a software tool we have been developing, makes it easier for our partners to share their data.
Nowadays, data synchronisation is a 'must have' functionality for metadata repositories. With huge amounts of data moving around networks, changes happen often, and end users expect up-to-date information.
Incremental updates help to improve data synchronisation. They allow an existing repository to be updated by requesting only the data added, changed or deleted within a particular time period. This shortens synchronisation by reducing both the data that must move through the network and the amount of data that needs to be processed.
Incremental updates are one of the features of OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. It gives data providers a simple technical option for making their metadata available to other services, based on the open standards HTTP and XML. Version 1.0 was released to the public in January 2001. OAI-PMH is very popular with institutions that need to share distributed resources. Galleries, libraries, archives and museums are very familiar with it, as they have lots of data that needs to be shared. Introducing the protocol was one of the reasons the Open Archives Initiative was funded.
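As a rough illustration of an incremental request, the sketch below builds an OAI-PMH ListRecords URL that asks only for records changed since a given date. The verb and the metadataPrefix, from, until and set arguments come from the OAI-PMH specification; the function name, the base URL http://example.org/oai and the set name "maps" are made up for the example and do not refer to any real repository.

```python
from datetime import date
from urllib.parse import urlencode


def build_incremental_request(base_url, metadata_prefix, since, until=None, set_spec=None):
    """Build a ListRecords URL limited to records changed on or after `since`.

    Hypothetical helper for illustration only. Dates are sent with day
    granularity (YYYY-MM-DD), which every OAI-PMH repository must accept.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since.isoformat(),
    }
    if until is not None:
        params["until"] = until.isoformat()
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Assumed endpoint and set name, not a real repository.
url = build_incremental_request(
    "http://example.org/oai", "oai_dc", date(2015, 1, 1), set_spec="maps"
)
print(url)
```

A harvester would typically remember the date of its last successful harvest and pass it as `since` the next time, so that only the records added, changed or deleted in between are transferred.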
Sharing data was one part of the equation, while the other was working out how to handle such a large amount of data. When providers have hundreds of metadata records everything is fine, but once the data scales to hundreds of thousands of records, problems arise in transferring them.
OAI-PMH was built to support harvesting large amounts of data by using so-called “resumption tokens”. A resumption token is, essentially, a value used to paginate the metadata records retrieved. The harvester therefore receives the data in chunks, which means that not only is the flow of data controlled but also, in the case of a network failure, the state of the harvest is preserved and the next harvest continues from the point the unfinished one had reached. A collection of data is called a dataset, and an OAI-PMH repository can hold multiple datasets as well as multiple XML formats for each of them. It is important to mention that the link between the metadata and the digital object it describes is not defined by the OAI-PMH protocol.
REPOX
REPOX is a software tool to manage metadata spaces. It was originally implemented by Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC-ID). Europeana and The European Library, as well as their partners, have used REPOX extensively for sharing their data with each other. After INESC-ID stopped supporting REPOX, ownership was handed over to Europeana some years ago so that the customisations needed in the software could be handled internally.
REPOX comes with an integrated OAI-PMH server for hosting your data where others can find it, and a harvester for retrieving data that other institutions are sharing. REPOX has a comfortable graphical interface to manage all of its functionality. It comprises several channels (OAI-PMH, HTTP, FTP, File System, Z39.50) to import data from data providers, services to transform data between schemas according to user-specified rules, and services to share your data. REPOX also provides scheduled tasks, so that harvest operations can be planned and incremental updates automated.
Europeana has been using REPOX to manage all its harvest operations, schedule thousands of datasets, and handle its datasets in one place. Europeana is currently developing Europeana Cloud, a project that will provide a cloud-based environment for easier storing and sharing of data from data providers. REPOX will become a part of Europeana Cloud. Internally, REPOX uses databases to store its data, and with the newest version of REPOX it will also be possible to store data in Europeana Cloud. Several months ago, Europeana took the opportunity to re-architect the REPOX project in a more modular way, as well as choosing to provide new features. One of the features already implemented is a new REST API built on the latest standards, which will give developers better programmatic control over REPOX.
More information about REPOX can be found on its GitHub page, and the available wiki pages have more detailed information about the installation process. The older version of REPOX and its documentation, from before the handover to Europeana, can be found here.