Europeana Cloud: Ingesting for the Future

Marian Lefferts of CERL (a Europeana Cloud project partner) blogs about some of the content which will be ingested into Europeana Research, to allow academic use of data held in the Cloud we are building.

A key strand of the Europeana Cloud project is the creation of services and tools targeted at researchers, collectively labelled Europeana Research, to allow scholarly use and re-use of the data held in our newly-built cloud.

As part of that goal, much work is being done to explore what material Europeana and The European Library hold that will be of interest to scholars, and how this might best be made available to them. In addition, we are ingesting a great variety of data that we feel will be interesting to academics.

It will probably come as no surprise that this ingestion covers digitised maps, manuscripts, incunables, archival materials, pamphlets, playbills, dissertations and journals, and visual materials such as portraits, architectural drawings, photographs, images of plaster casts, films and videos. Further datasets have also been included for their special relevance to scholars in the Humanities and Social Sciences (the core target audience of Europeana Research).

Diego Ribero's map of the world, 1529. Image by Wellcome Library, CC-BY

The project has, for example, ingested the Directory of Open Access Books, which brings together metadata for Open Access books contributed by publishers of academic, peer-reviewed books. Aggregators can integrate the records into their commercial services, and libraries can integrate the directory into their online catalogues, helping scholars and students to discover the books.

The European Library also aggregated the Bielefeld Academic Search Engine (BASE), one of the world's most voluminous search engines for academic open-access web resources. BASE collects, normalises and indexes data from repository servers that use OAI-PMH, and currently supports access to over 60 million documents from over 3,000 sources.
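
To give a feel for what harvesting over OAI-PMH involves, here is a minimal sketch of building a ListRecords request and reading titles from a response. It is not BASE's actual harvester: the endpoint URL, the function names and the sample response are invented for illustration. Only the request parameters (verb, metadataPrefix, resumptionToken) and the XML namespaces come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (helper name is hypothetical).

    When continuing a partial list, the protocol expects the
    resumptionToken to be sent instead of metadataPrefix.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_titles(xml_text):
    """Return (list of dc:title values, resumption token or None)."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iter("{%s}title" % DC_NS)]
    token_el = root.find(".//{%s}resumptionToken" % OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token


# A made-up response, trimmed to the elements the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example open-access book</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai"))
    print(parse_titles(SAMPLE_RESPONSE))
```

A real harvester would fetch each URL, keep requesting with the returned token until none is supplied, and then normalise the records before indexing them.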

Research organisations DANS and CESSDA will contribute further datasets. DANS provides access to thousands of scientific datasets, e-publications and other research information in the Netherlands, while CESSDA is an umbrella organisation for the European national data archives (including DANS). CESSDA's major objective is to provide seamless access to data across repositories, nations, languages and research purposes, and to encourage standardisation of data and metadata, data sharing and knowledge mobility across Europe.

With all of this ingestion, it is important to stress that we are aggregating not only metadata for digital objects (including the all-important link to the object) but also the actual digital objects themselves. Both will be stored within the Cloud, with the aim of creating a supportive environment for innovative exploration and analysis of Europe’s digitised content.

At the moment we are speaking with content providers to determine how much content they can make available, what type of content, in what format and how this content can be delivered to the Cloud.

A substantial amount of the content will follow the tried-and-tested route of aggregation and enrichment by the team of The European Library, who will then deliver the metadata in the Europeana Data Model (EDM) format to Europeana and will transfer the digital objects to the Europeana Cloud. But we are also exploring how it might be possible to ingest data directly into the Cloud.

We are looking at issues such as what impact this has on the data ingestion workflows of Europeana and The European Library, how we can organise it so that enrichment of metadata takes place and how we can convert data in the original format (as it is stored in the Cloud) to EDM. This last point is aimed at facilitating integration of data into the tools that are being developed.

We strongly suspect that content ingestion may have further implications for EDM, specifically in terms of metadata transactions. There will likely be a need for more ‘administrative metadata’ related to the transactions (versioning, who enriched which data and when, record management, etc.). The definition of what would be required, and how this could be accommodated in EDM, will be developed in the course of 2015.
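
Purely as an illustration of the kind of information such administrative metadata might need to carry, the sketch below models a single enrichment transaction and a simple version counter. Every name in it (EnrichmentTransaction, record_enrichment and all field names) is an assumption made for this example; none of them are EDM properties, and the real requirements are still to be defined.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EnrichmentTransaction:
    """One hypothetical administrative entry: who did what to which record, and when."""
    record_id: str       # which record was touched
    version: int         # version of the record after this transaction
    agent: str           # who (or what service) made the change
    action: str          # what was done
    timestamp: datetime  # when it happened


def record_enrichment(history, record_id, agent, action, timestamp=None):
    """Append a transaction whose version is one above the latest for this record."""
    latest = max((t.version for t in history if t.record_id == record_id), default=0)
    entry = EnrichmentTransaction(
        record_id=record_id,
        version=latest + 1,
        agent=agent,
        action=action,
        timestamp=timestamp or datetime.now(timezone.utc),
    )
    history.append(entry)
    return entry


if __name__ == "__main__":
    history = []
    record_enrichment(history, "rec-1", "The European Library", "added language tag")
    record_enrichment(history, "rec-1", "enrichment-service", "linked place name")
    for entry in history:
        print(entry.record_id, entry.version, entry.agent, entry.action)
```

Keeping the history append-only, rather than overwriting a record in place, is one simple way to preserve versioning and an audit trail at the same time.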