
Posted on Thursday December 18, 2014

Updated on Monday November 6, 2023

Update on Europeana Cloud's Technical Infrastructure


It has been a while since we updated you on the latest news about the technical infrastructure developed by the Europeana Cloud project. This post shares our recent progress and introduces some of our plans for next year.

In 2014, team members from Poznań (Poland), The Hague (Netherlands), Pisa (Italy) and Milton Keynes (United Kingdom) worked together to build the new infrastructure following the design proposed one year ago.

We are now almost ready for the alpha release of our infrastructure, which will serve as the basis of our first experiments with real metadata and content.

As proposed in the original design, the system consists of a distributed database for storing technical metadata, a distributed file system for storing metadata records as well as digital objects, and several distributed frontend and backend services. The word distributed appears three times in the previous sentence for a reason: it reflects our choice of distributed components, in other words, components that can run in parallel on several machines. This will help us to build a scalable and reliable infrastructure.

Frontend Services are made available via a standard API. Using this API, data holders, such as aggregators and data partners, will be able to easily upload content to Europeana Cloud, download it from there and apply standard processing tasks.
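
As a rough illustration of what working with such an API could look like, here is a minimal Python sketch. The base URL, endpoint paths, parameter names and response fields are invented for this example and are not the actual Europeana Cloud API.

```python
import requests

# Hypothetical base URL, shown only to illustrate the idea of a
# standard REST API for uploading and downloading records.
BASE = "https://cloud.example.org/api"

# Upload a metadata record on behalf of a data provider.
with open("record_0001.xml", "rb") as f:
    resp = requests.post(
        f"{BASE}/records",
        params={"providerId": "my-aggregator", "localId": "record_0001"},
        data=f.read(),
        headers={"Content-Type": "application/xml"},
    )
resp.raise_for_status()
cloud_id = resp.json()["cloudId"]  # identifier allocated by the service

# Download the same record later using the identifier returned above.
downloaded = requests.get(f"{BASE}/records/{cloud_id}")
print(downloaded.text)
```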

The Unique Identifiers Service ensures that for every metadata or content record, Europeana Cloud both allocates its own unique identifier and stores the original (local) identifier used by the client who uploaded it. This will allow clients to refer to records using their local identifiers (rather than storing Europeana Cloud identifiers).
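
A hedged sketch of how a client might interact with the Unique Identifiers Service, again with invented endpoint and field names:

```python
import requests

BASE = "https://cloud.example.org/api"  # illustrative only

# Ask the Unique Identifiers Service for a Europeana Cloud identifier,
# registering the provider's own (local) identifier at the same time.
resp = requests.post(
    f"{BASE}/cloudIds",
    params={"providerId": "my-aggregator", "recordId": "local-record-42"},
)
cloud_id = resp.json()["id"]

# Later the client can keep using its local identifier and let the
# service resolve it to the Europeana Cloud identifier.
lookup = requests.get(
    f"{BASE}/cloudIds",
    params={"providerId": "my-aggregator", "recordId": "local-record-42"},
)
assert lookup.json()["id"] == cloud_id
```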

The Metadata and Content Service is the de facto uploading and downloading mechanism. It is coupled with the Data Lookup Service, which allows records to be searched by a set of criteria, commonly referred to as Europeana Cloud’s administrative metadata.
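
A sketch of what a lookup by administrative metadata could look like; the criteria names (provider, data set, representation) are our own illustration of the idea rather than the service's actual query interface.

```python
import requests

BASE = "https://cloud.example.org/api"  # illustrative only

# Ask the Data Lookup Service for all records matching a set of
# administrative-metadata criteria, e.g. provider and data set.
hits = requests.get(
    f"{BASE}/records",
    params={
        "providerId": "my-aggregator",
        "dataSetId": "newspapers-1900-1920",
        "representation": "edm",  # which flavour of the record we want
        "from": 0,
        "limit": 100,
    },
).json()

for record in hits["results"]:
    print(record["cloudId"], record["creationDate"])
```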

At the heart of all these services is a simple but powerful data model. Its objective is to allow the building of standard aggregation workflows, so that individual workflows used by current and future partners can be easily mapped to this model and implemented using it.
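
The post does not spell the model out, but one plausible reading is a record-centric hierarchy: a record identified by its cloud and local identifiers, holding one or more representations (metadata flavours or content), each of which may exist in several versions with attached files. The sketch below uses our own class and field names purely for illustration, not the project's published schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class File:
    file_name: str
    mime_type: str
    content_uri: str          # where the bytes live in the file store

@dataclass
class Representation:
    name: str                 # e.g. "edm", "marcxml", "tiff-master"
    provider_id: str
    version: str              # each processing step can create a new version
    files: List[File] = field(default_factory=list)

@dataclass
class Record:
    cloud_id: str             # identifier allocated by Europeana Cloud
    local_id: str             # identifier used by the uploading client
    representations: List[Representation] = field(default_factory=list)
```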

Aggregation workflows will use the Data Processing Service, which consists of two parts. The first is a standard component, Apache Storm. This system (also used by Twitter) can parallelize processing tasks by distributing and managing them over multiple computational resources.

The second is an API that allows interaction with Storm. The actual aggregation workflows will be implemented separately. A transformation between XML records (an XSLT transformation) is already working as a sample workflow, and soon we will ensure that more workflows are supported out of the box by the system.
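
To make the sample workflow concrete, here is what the XSLT step does, reduced to a single record and written with Python's lxml library. In the real system this step runs inside a Storm topology over many records in parallel; the file names here are placeholders.

```python
from lxml import etree

# Take a source metadata record and a stylesheet, produce a transformed record.
source = etree.parse("record_0001.xml")
stylesheet = etree.XSLT(etree.parse("to_edm.xsl"))
transformed = stylesheet(source)

with open("record_0001_edm.xml", "wb") as out:
    out.write(etree.tostring(transformed, pretty_print=True))
```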

The Notification Service will inform clients about any changes made to records, such as the creation of new versions.

In addition to the core storage components Apache Cassandra and OpenStack, chosen during the design phase, we have added two more: Apache Kafka for communication between the services and Apache ZooKeeper to manage groups of services and ensure high availability.
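
As an illustration of the messaging style this enables, here is a small consumer sketch using the kafka-python client. The topic name, group id and message format are invented for the example and are not the project's actual configuration.

```python
from kafka import KafkaConsumer  # kafka-python client

# One service publishes events to a Kafka topic; another consumes them.
consumer = KafkaConsumer(
    "record-updates",
    bootstrap_servers=["kafka1.example.org:9092"],
    group_id="notification-service",
    value_deserializer=lambda raw: raw.decode("utf-8"),
)

for message in consumer:
    # Each message could carry, for instance, the cloud identifier of a
    # record that has just received a new version.
    print("record changed:", message.value)
```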

The system implements standard authentication and authorisation mechanisms following common security practices. Permissions on records are stored together with the rest of administrative metadata.
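
On the client side, "standard authentication" can be as simple as sending credentials with every call and letting the service check the permissions stored with the record; the sketch below assumes HTTP Basic authentication over TLS and an invented endpoint, purely for illustration.

```python
import requests

BASE = "https://cloud.example.org/api"  # illustrative only

# The service decides, based on the permissions stored alongside the
# record's administrative metadata, whether this caller may read it.
resp = requests.get(
    f"{BASE}/records/some-cloud-id",
    auth=("aggregator-user", "secret"),
)
if resp.status_code == 403:
    print("authenticated, but not permitted to read this record")
```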

All in all, the emerging system is quite complex, but by using standard, open-source components we hope to make it more extensible and configurable in the future.

On the operations front, the Poznań Supercomputing and Networking Center has set up the first instance of a pre-production system, which will be used for experimenting with real content as of 2015. Next year we will also start looking at how the new system can be integrated into our existing aggregation flows and fulfill parts of their storage and computation requirements. We will continue to refine the system and make it fit for its big purpose: serving as the infrastructure for the entire Europeana aggregation ecosystem.

The Europeana Cloud technical infrastructure is still young but very gifted and dreams of a big purpose. While it is dreaming, we wish you a very Happy New Year!
