Cultural heritage data for the social sciences and humanities
Bridging the Europeana and CLARIN infrastructures
Europeana is the European digital platform for cultural heritage. CLARIN has been a partner in Europeana's Digital Service Infrastructure (DSI) project since it started in April 2015. CLARIN stands for "Common Language Resources and Technology Infrastructure". It is a research infrastructure founded on the vision that all digital language resources and tools from all over Europe and beyond should be accessible through a single sign-on online environment to support researchers in the humanities and social sciences.
During DSI-2, the recently completed second iteration of this project, CLARIN established an integration of Europeana data into its infrastructure. As of September 2017, the project is in its third phase (DSI-3, see also this post from Photoconsortium, one of the DSI partners).
Europeana provides access to digitised cultural resources from a wide range of cultural institutions all across Europe. Its aim is to give users the possibility to search and access knowledge in all the languages of Europe, either directly via its web portals, or indirectly via third-party applications built on top of its data services. The Europeana service is based on the aggregation and exploitation of (meta)data about digitised cultural heritage objects (images, text, audio, video and even 3D models) from very different contexts. Europeana has developed infrastructures and workflows for ingesting, indexing, normalising and publishing data, and provides seamless, efficient services on top of these. The Europeana Network has defined the Europeana Data Model (EDM) as its model for metadata exchange within the Network and with other communities. To achieve this wide range of interoperability, EDM was designed in line with the vision of linked open vocabularies. One of Europeana's lines of action is to facilitate research on the digitised content of Europe’s galleries, libraries, archives and museums, especially for the digital humanities and the social sciences. This work is conducted within Europeana Research, which addresses issues affecting the research re-use of cultural heritage data and content (e.g. licensing, interoperability and access).
Completed work in Europeana DSI
In DSI’s first phase (DSI-1), which ran from April 2015 to June 2016, CLARIN contributed (with most of the work carried out at the centre at the Berlin-Brandenburg Academy of Sciences and Humanities) to the Europeana Research distribution plan, which describes the task of “[placing] Europeana data in CLARIN infrastructures” (Dunning, 2015). An analysis was carried out that resulted in a selection of relevant data sets. CLARIN also carried out an analysis of the outcome of the digitisation of historical newspapers that took place in the context of Europeana Newspapers.
In the second phase (DSI-2), which ran until August 2017, a pipeline was implemented that harvests Europeana metadata (using OAI-PMH), converts it (via XSLT) and imports it into the Virtual Language Observatory (VLO). Records from a selection of collections can now be found in the VLO. The included resources were selected on the basis of their relevance to the CLARIN community, the quality of their data and metadata, and their suitability for machine processing. They cover scanned and OCR'd newspapers from Austria, Slovenia and Finland, historical travel books from Hungary and Slovenia, Finnish travel brochures, and audio recordings of Romanian poetry from the 1960s. Many of these resources can already be processed with tools that are listed in the Language Resource Switchboard (LRS). For such resources, users can open the LRS directly from the VLO, choose a tool that supports the resource's format and language, and then immediately run that tool on the selected resource. The principles and workings of the LRS are explained in detail in this paper by Claus Zinn; you can also watch a video of this presentation.
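To give an idea of what the harvesting step of such a pipeline involves, here is a minimal Python sketch of an OAI-PMH ListRecords exchange. It is not the code used in the project: the endpoint URL, the "edm" metadata prefix, the set name and the sample response are assumptions made purely for illustration, and the XSLT conversion and VLO import steps are left out. The sketch only builds request URLs and reads record identifiers and the resumption token (the protocol's paging mechanism) from a response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response format.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="edm", set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, a resumption token is an exclusive argument: when one is
    given, metadataPrefix and set must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:record/oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    token = token.strip() if token and token.strip() else None
    return identifiers, token


# Hypothetical, hand-written response page (not real Europeana data).
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>demo:record-1</identifier></header></record>
    <record><header><identifier>demo:record-2</identifier></header></record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "https://example.org/oai"  # hypothetical endpoint
    print(list_records_url(base, set_spec="demo_set"))
    ids, token = parse_list_records(SAMPLE_RESPONSE)
    print(ids, token)
    if token:
        print(list_records_url(base, resumption_token=token))
```

A real harvester would fetch each URL, repeat the request with the returned token until none is given, and then pass every harvested record on to the conversion step.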
The report for milestone 2.2 "Results and Impact of Sharing Europeana Data with CLARIN" lists a number of recommendations for improving the content provided by Europeana (both metadata and data) and the infrastructural components upon which the integration between Europeana and CLARIN, as well as other potential infrastructures, is built. These improvements would contribute to the inclusion of larger amounts of Europeana records in CLARIN's infrastructure, and would allow for a better integration with other infrastructure components such as the LRS.
Future and ongoing work
In DSI-3, which runs until the end of August 2018, CLARIN and Europeana will improve and expand upon the integration established during DSI-2. Limitations of the current state of integration will be assessed and, wherever possible, addressed in cooperation with the data and tool providers in order to support functional discovery and processing in a way that is both 'deep' (working seamlessly from resource querying to the retrieval of processing results) and 'broad' (working for many resources and tools, covering a large portion of languages, resource formats and processing types). In addition, we aim to add many new good-quality resources from different collections to the set that is directly accessible through the CLARIN infrastructure. Within the same project phase, Europeana will also continue working with DARIAH, and investigate connections to other large research infrastructures, such as the European Holocaust Research Infrastructure.
Besides the additional data now accessible through the VLO, there are millions of other resources, including paintings, photographs, audio and video recordings, that can be accessed through Europeana's collection browser and search facilities at europeana.eu. We highly recommend exploring them: chances are that you will stumble upon data relevant to your area of research while you're at it!