Enrichment plays a fundamental role in Europeana’s activities. In our context, enrichment can be defined as generating metadata from the data provided by our partners, adding extra value to the data we receive. We use the combination of original and enriched metadata for indexing our records, and this lets us build functionalities that allow people to search and browse our collections, and receive recommendations. Achieving automatic enrichment using machine learning algorithms is one of the objectives of the Europeana Strategy 2020-2025, triggering projects such as Saint George on a Bike.
Europeana's R&D team is exploring how computer vision techniques (systems which can make sense of visual data) can improve the enrichment Europeana conducts. We decided to start a pilot on image classification, where we build a model that is able to classify images from digitised cultural heritage objects into a set of predefined categories. We believe that a system trained with the selected categories would prove useful in enriching our collections.
Deep learning techniques, based on a certain type of mathematical model called neural networks, are the method of choice for this type of problem. In order to train a neural network, we need to obtain a training dataset containing a large number of images already classified into the selected categories. In simple terms: if we show a computer model images of paintings and tell the model that all these images are paintings, we train that model to recognise whether an image it has never seen is a painting or not.
The first steps in building the image classification model were to select a target vocabulary and to gather a training dataset using the Europeana Search API. Explore how we did this below.
Defining a vocabulary for classification
Controlled vocabularies are sets of predefined and uniquely identified concepts, which can be used to index data and make it interoperable. The use of vocabularies in information retrieval is a convenient way to organise and reference knowledge.
At Europeana, we use concepts from vocabularies (identified by Uniform Resource Identifiers, URIs) as part of the metadata for indexing cultural heritage objects. For this project, we focused on a selection of concepts from the Europeana Entity Collection that have equivalents in the Getty Art and Architecture Thesaurus (AAT). This vocabulary was originally gathered for organising the sourcing of content for our thematic collections. We included 20 categories, such as photographs, paintings, sculptures, clothing, and jewellery. Explore the complete list on GitHub.
Accessing data using the Europeana Search API
Once we had our vocabulary, we wanted to access images belonging to the different categories for training our model. We did this through the Europeana Search API, one of the many interfaces that allow us to retrieve cultural heritage objects displayed at europeana.eu. Given a query and a set of parameters, the Search API will return a machine readable response containing the metadata of the resulting objects. The API response serves the data following the Europeana Data Model.
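To make the request and response cycle more concrete, here is a minimal Python sketch of how a Search API call could be assembled and how the returned items could be read. It is an illustration rather than the code from the project: the endpoint URL, the helper names and the sample response are assumptions, and the response is a hand-written stand-in, not real API output. Check the Search API documentation for the current endpoint and field names.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current Search API documentation.
SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_search_url(query, api_key, rows=12):
    """Return a Search API request URL for a query (hypothetical helper)."""
    params = {"wskey": api_key, "query": query, "rows": rows}
    return f"{SEARCH_URL}?{urlencode(params)}"


def extract_items(response):
    """Collect the identifier and first preview image URL of each returned object."""
    records = []
    for item in response.get("items", []):
        previews = item.get("edmPreview") or [None]
        records.append({"id": item.get("id"), "preview": previews[0]})
    return records


# Hand-written stand-in for a parsed JSON response (placeholder values only).
sample_response = {
    "success": True,
    "itemsCount": 1,
    "items": [
        {"id": "/0000/example_object", "edmPreview": ["https://example.org/thumb.jpg"]}
    ],
}

url = build_search_url("painting", api_key="YOUR_API_KEY")
items = extract_items(sample_response)
```

In practice the URL would be fetched with an HTTP client and the JSON body passed to extract_items in place of the sample dictionary.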
In our setting, we considered that there was only one possible category for each image. This allowed us to assemble an annotated dataset by querying the Search API for images corresponding to the different concepts in our vocabulary, and using this concept as the label. In this way we assembled the dataset automatically and no manual annotation was necessary.
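The labelling idea can be sketched in a few lines of Python. The function below takes the object identifiers returned for each concept and uses the concept itself as the label. The function name, the placeholder URIs and the rule for objects that appear under more than one concept (keep the first label seen) are assumptions made for this example, not a description of the project's code.

```python
# Placeholder concept URIs, for illustration only.
PAINTING = "http://data.europeana.eu/concept/EXAMPLE-PAINTING"
SCULPTURE = "http://data.europeana.eu/concept/EXAMPLE-SCULPTURE"


def label_items(items_by_concept):
    """Map each object identifier to the concept it was retrieved for.

    items_by_concept maps a concept URI to the list of object identifiers
    returned when querying for that concept.
    """
    labels = {}
    for concept_uri, object_ids in items_by_concept.items():
        for object_id in object_ids:
            # Assumption: one label per object, so the first concept seen wins.
            labels.setdefault(object_id, concept_uri)
    return labels


results = {
    PAINTING: ["/0000/object_a", "/0000/object_b"],
    SCULPTURE: ["/0000/object_b", "/0000/object_c"],
}
labels = label_items(results)
```

Here object_b is returned for both concepts and keeps the painting label, which reflects the single-category assumption. Dropping such objects entirely would be an equally reasonable choice.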
Since we wanted our dataset to follow the FAIR (findable, accessible, interoperable and reusable) principles, we uniquely identified both the concepts and the cultural heritage objects retrieved, and we only used openly licensed content. The metadata served by the Search API is under an open license, whereas the content of the cultural heritage objects might be subject to copyright. For this pilot we only considered images that can be freely reused, by setting the reusability parameter to open.
In our case, we wanted to retrieve objects indexed with the different concepts of the vocabulary. Instead of using the human readable version of the concepts, we made a query for the concept URI directly by using the skos_concept parameter (one of the search parameters of the API).
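A query of this kind could look like the sketch below. It assumes that skos_concept is written as a field inside the query parameter, and it adds a TYPE:IMAGE filter through qf, which is our own assumption rather than something stated above. The concept URI is a placeholder and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode

# Placeholder URI, for illustration only; real concept URIs come from the vocabulary.
EXAMPLE_URI = "http://data.europeana.eu/concept/EXAMPLE"


def concept_query_params(concept_uri, api_key, rows=100):
    """Build request parameters for openly reusable images tagged with a concept."""
    return {
        "wskey": api_key,
        # Search on the concept URI rather than on its human readable label.
        "query": f'skos_concept:"{concept_uri}"',
        # Only return objects whose content is openly reusable.
        "reusability": "open",
        # Assumption: restrict results to image objects.
        "qf": "TYPE:IMAGE",
        "rows": rows,
    }


params = concept_query_params(EXAMPLE_URI, api_key="YOUR_API_KEY")
encoded = urlencode(params)  # query string, with the URI and quotes percent-encoded
decoded = parse_qs(encoded)  # round trip, to check that nothing was lost
```

Querying by URI avoids ambiguity between languages and spellings, since the same concept can have many labels but only one identifier.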
We were interested in keeping track of the objects used to assemble our dataset. For each object retrieved we stored relevant information in a CSV file. This file is easy to share, and it contains all the necessary information for assembling the dataset. The images will eventually need to be downloaded and stored on disk for training the image classification model.
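A CSV file of this kind can be written and read back with Python's standard csv module, as in the sketch below. The column names and the sample row are hypothetical; the actual columns are whatever information is considered relevant for each object.

```python
import csv
import io

# Hypothetical column names for the dataset file.
FIELDS = ["category", "concept_uri", "object_id", "image_url"]


def write_dataset_csv(rows, handle):
    """Write one line per retrieved object, preceded by a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_dataset_csv(handle):
    """Read the dataset file back as a list of dictionaries."""
    return list(csv.DictReader(handle))


# Placeholder row, for illustration only.
rows = [
    {
        "category": "painting",
        "concept_uri": "http://data.europeana.eu/concept/EXAMPLE",
        "object_id": "/0000/example_object",
        "image_url": "https://example.org/image.jpg",
    }
]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_dataset_csv(rows, buffer)
buffer.seek(0)
loaded = read_dataset_csv(buffer)
```

Keeping the image URL next to the object identifier and the label means the images can be downloaded later from the CSV file alone, without querying the API again.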
Find out more
The image training dataset can now be used for building an image classification model that will output one of the concepts of the vocabulary given an input image. We are planning to continue our work by evaluating whether this dataset contains enough information for training an image classification model, and assessing whether the resulting model is suitable for automatic enrichment. We will share updates through Europeana Pro news!
We hope this post encourages engineers and researchers interested in experimenting with cultural heritage to use our Search API for assembling datasets for machine learning, and in particular to use our collections for training and applying computer vision algorithms! Feel free to check out the GitHub repository, where you can find the vocabularies used, the datasets gathered, and code for harvesting the dataset and training an image classification model. Don’t forget to contact us at firstname.lastname@example.org if you have any questions, ideas or experience to share!
If you are interested in finding out more about AI and digital cultural heritage, explore our AI theme on Europeana Pro.