Hierarchical clustering: making sense of Europeana data

By Valentine Charles, Europeana Interoperability Specialist.

Europeana now gives access to 27 million objects (books, paintings, sound recordings, videos, etc.) contributed by over 2,200 cultural heritage institutions from all over Europe.

Aggregating such heterogeneous collections raises issues such as ambiguity between original and derivative versions of the same object, or even duplication when different providers give access to the same object. Relationships between objects within Europeana, or between Europeana and external objects or collections, are often lost. The loss of these semantic relationships is mostly due to a metadata quality issue: simple formats like Europeana’s former data model (Europeana Semantic Elements, ESE) are not good at capturing internal and external semantic links between objects. In addition, Europeana’s providers are not always in a position to provide rich data. The result can be an unsatisfying browsing experience for Europeana’s visitors.

In May 2012, a small team of experts from the Europeana Office and from Online Computer Library Centre (OCLC) Research Europe started collaborating to address these challenges. In particular, the team investigated automatic clustering (grouping) of cultural heritage objects as a way to find relationships between objects within Europeana.

From left to right: Titia Van de Werf (OCLC), Antoine Isaac (Europeana), Shenghui Wang, Rob Koopman (OCLC), Valentine Charles (Europeana).

The main goals of the experiment were to create semantic links between objects and to detect duplicates. The OCLC team developed an advanced clustering methodology which was applied to the entirety of the Europeana dataset. The results have been analysed and categorised by looking for similarities in the metadata and digital representations of the grouped objects. Examples of clusters include: all parts of the same object (e.g. scanned pages of a book), translated copies of the same archive, multiple letters belonging to the same set of correspondence, and multiple digital representations of the same object.
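This post does not describe the OCLC methodology in detail, so the following is only a toy sketch of the general idea of hierarchical clustering on metadata, not the method actually used. It assumes records are compared by word overlap (Jaccard similarity) between their titles and merged with simple single-link agglomerative clustering. The record identifiers, titles, thresholds and function names are all hypothetical.

```python
import re

def tokens(text):
    """Lowercase a metadata string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(records, threshold):
    """Single-link agglomerative clustering (illustrative assumption):
    repeatedly merge the two most similar clusters until no pair of
    clusters reaches the similarity threshold."""
    token_sets = {rid: tokens(title) for rid, title in records.items()}
    clusters = [[rid] for rid in records]

    def similarity(c1, c2):
        # Single link: two clusters are as similar as their closest members.
        return max(jaccard(token_sets[x], token_sets[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = similarity(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical records: two pages of one letter, a copy of the first page
# supplied by a second provider, and an unrelated object.
records = {
    "a1": "Letter to a friend, page 1",
    "a2": "Letter to a friend, page 2",
    "b1": "Letter to a friend, page 1",
    "c1": "View of the harbour",
}

duplicates = cluster(records, threshold=1.0)  # strict level: identical titles only
related = cluster(records, threshold=0.5)     # looser level: parts of the same object
print(duplicates)
print(related)
```

Running the same procedure at a strict and a looser threshold gives two nested levels: the strict level groups only the likely duplicates, while the looser level also pulls in the other page of the same letter. Real metadata would need far richer comparison than title overlap.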

You can find a more detailed account of the work in Shenghui Wang’s presentation 'Hunting for Semantic Clusters: How Can We Find Interesting Stuff in Over 22 Million Europeana Objects?', given at the OCLC Annual Meeting in 2012.

The work on the categorisation of similar clusters is very relevant to Europeana as it will provide new ways of organising and visualising objects in Europeana. This work is also crucial for Europeana’s data quality improvement plans.

The research findings of the project are discussed in the paper 'Hierarchical structuring of Cultural Heritage objects within large aggregations', which will be presented at the next conference on Theory and Practice of Digital Libraries in September 2013.