Building on state-of-the-art machine translation services

The challenge of multilingual metadata

Europeana works with collections described in no fewer than 37 languages and strives to match them with search terms that may occur in any language. All items in the collections on the Europeana website are described in a set of metadata fields that convey essential information about them, such as their title and creator. This information helps people to discover and understand the objects they are interested in. Currently, the majority of records contain terms in a single language: that of the data provider. This lack of multilingual metadata hampers Europeana’s goal of offering broad access to its collections across languages.

Addressing multilinguality in this respect is quite a challenging endeavour. To begin with, metadata isn’t a natural language with complete sentences and predictable grammar; it is often presented in short phrases or even single words, which means that the context needed for an accurate translation is difficult to find. In addition, the terms used can be very specific; they may look like a general term but have a different meaning when used in a cultural heritage context.

For example, the Greek religious term reflecting the Last Supper could be incorrectly translated as Secret Dinner. The repercussion of this inaccurate translation - or the absence of a translation to English altogether - would be that Greek artefacts with a title or description referring to the particular theme would not appear among the results when someone searches for paintings about the Last Supper on the Europeana website.

Building a bridge between Europeana and eTranslation Digital Service communities

How is the Europeana Translate project working with other stakeholders and tools to address this challenge?

eTranslation, developed by the European Commission, is a machine translation tool built with the newest AI technologies and trained on large amounts of data, both available in-house and gathered through an EU-wide language resource collection effort. Cultural heritage is underrepresented in the ELRC-SHARE repository used by the eTranslation DSI and, as a result, existing technology solutions are less well equipped to handle the specific aspects of cultural heritage data.

In this context, building collaborations between stakeholders from the Europeana and eTranslation communities is key to customising machine translation tools so that they can serve the particular needs of the cultural heritage domain. Europeana Translate seeks to bring the eTranslation and the Europeana communities together to address challenges encountered by both sectors. Improving multilingual access to digital cultural heritage requires a number of complementary roles and expertise, which are served by the diverse partners of Europeana Translate (see them here).

Experiments with machine translation

Over the past several months, project partners have worked together to select, segment and clean metadata records from the Europeana website. Project partner Pangeanic then used this data on top of 12 million translation segments from existing generic language resources to improve the accuracy of machine translation algorithms when translating cultural heritage metadata.
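To give a feel for what segmenting and cleaning short metadata values can involve, here is a minimal Python sketch. It is an illustration only, not the project's actual pipeline: the function name clean_segments, the use of semicolons as a separator for multi-valued fields, and the rule of dropping values that contain no letters are all assumptions made for this example.

```python
import re
import unicodedata


def clean_segments(values):
    """Normalise, split and de-duplicate short metadata strings (illustrative only)."""
    seen = set()
    cleaned = []
    for value in values:
        # Normalise Unicode and collapse runs of whitespace.
        text = unicodedata.normalize("NFC", value)
        text = re.sub(r"\s+", " ", text).strip()
        # Assumption: semicolons separate multiple values within one field.
        for segment in text.split(";"):
            segment = segment.strip()
            # Skip empty segments and ones without letters (e.g. bare years or ids).
            if not segment or not any(ch.isalpha() for ch in segment):
                continue
            # De-duplicate case-insensitively, keeping the first spelling seen.
            key = segment.casefold()
            if key not in seen:
                seen.add(key)
                cleaned.append(segment)
    return cleaned
```

Called on a handful of raw values such as "  The Last   Supper ", "the last supper" and "1495; oil on canvas", the function keeps one copy of the title, drops the bare year and keeps the material description as its own segment.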

Pangeanic conducted a number of experiments with different combinations of training data. These included bilingual metadata from Europeana, synthetic data produced from metadata in one language, and multilingual vocabularies relevant to the cultural heritage domain. Alternative sources of data, beyond Europeana, were also considered for languages for which few or no resources with translations to English exist. The automatic evaluation of these experiments using established metrics allowed partners to decide on the setup for the best-quality automatic translations and compare them with the results achieved by other translation tools, such as Google Translate and eTranslation. Overall, the evaluation shows improvements over generic models for most languages.
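The article does not name the metrics used, so the following Python sketch only illustrates the general idea of automatic evaluation: each engine's output is scored against reference translations and the engines are ranked by their average score. The character n-gram F-score below is a simplified stand-in chosen for this example, not the project's metric, and the engine names and function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(hypothesis, reference, n=3):
    """F1 overlap of character n-grams between a translation and its reference."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_engines(outputs, references):
    """outputs maps an engine name to its translations, aligned with references.

    Returns (name, mean score) pairs, best first. Assumes at least one segment.
    """
    scores = {}
    for name, translations in outputs.items():
        per_segment = [ngram_f1(h, r) for h, r in zip(translations, references)]
        scores[name] = sum(per_segment) / len(per_segment)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With the reference "The Last Supper", a hypothetical domain-adapted engine that outputs the same phrase ranks above a generic engine that outputs "Secret Dinner", which shares no character trigrams with the reference.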

The machine translation engines resulting from this process will be used to translate metadata from 23 official EU languages to English (the 24th official language). They will generate automatic English translations for at least 25 million metadata records on the Europeana platform. The translations will be indexed and displayed, improving the multilingual user experience on the platform. To return to the person searching for artefacts inspired by the religious theme of the 'Last Supper': once Europeana Translate is complete, they will also be able to access paintings from Greece, Romania and many other countries that are currently not included in the search results.

Moreover, Europeana Translate will make openly available the selected and appropriately processed language resources it produced via the ELRC-SHARE repository under a free reuse licence (CC0). This will enable the machine translation community to make use of open data to train, adapt and test their translation services in the cultural heritage domain.

Involving humans in the loop

In the coming months, two complementary evaluations of the automatic translations produced by the experiments will be carried out by linguists and cultural heritage professionals.

The Machine Translation Evaluation Tool will be used to evaluate the accuracy and performance of all 23 translation engines. Three crowdsourcing campaigns will be organised to engage cultural heritage professionals to help test and evaluate automatic translation (the languages to be evaluated in this respect include French, Italian, and Dutch). The campaigns will also engage audiences and raise awareness in the cultural heritage community about the power of automatic translation services. The CrowdHeritage platform will be used to present the automatic translations in the context of the cultural heritage items to which they refer.

The results of these evaluations will provide useful insights and be used to determine the acceptable quality threshold for publishing automatic translations to Europeana and for use on cultural heritage organisations’ own platforms.
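As a rough illustration of how a quality threshold might be applied once it has been agreed, the Python sketch below selects the languages whose average human rating meets a given threshold. The 0 to 100 rating scale, the sample threshold and the function name are hypothetical; the actual threshold and scoring scheme are exactly what the evaluations are meant to establish.

```python
from statistics import mean


def languages_ready_to_publish(ratings, threshold):
    """Return, in sorted order, the languages whose mean rating meets the threshold.

    ratings maps a language code to a list of human scores (hypothetical 0..100 scale).
    Languages with no ratings yet are left out rather than guessed.
    """
    return sorted(
        language
        for language, scores in ratings.items()
        if scores and mean(scores) >= threshold
    )
```

For example, with ratings of {"fr": [80, 90], "it": [60, 70], "nl": []} and a threshold of 75, only "fr" would be selected; "nl" is excluded because it has not been rated yet.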

Find out more and get involved

To find out more, you can watch an introductory video or a video about the project’s first results, or read about the Europeana Translate architecture in this paper presented at the European Association for Machine Translation 2022. Professionals in the audiovisual, fashion and museum fields will have the chance to contribute to the project by helping evaluate the results in our niche-sourcing campaigns, which will take place at the beginning of 2023. Keep an eye on the Europeana Pro event page to find out more.