The eTranslation CEF Telecom project Europeana Translate aims to strengthen the connections between the eTranslation Infrastructure and the common European data space for cultural heritage, deployed by the Europeana Initiative, for the benefit of both. On the one hand, the project aims to improve the usability of cultural heritage resources by enriching cultural heritage datasets with multilingual metadata. On the other, it enhances the language resources made openly available through the European Language Resource Coordination with metadata from millions of cultural heritage objects, which were carefully selected, cleaned and normalised to make them suitable for training purposes.
For these purposes, Europeana Translate has developed and deployed machine translation tools adapted to the needs of the cultural heritage sector. The tools are being applied to translate the metadata of more than 25 million records currently available through Europeana’s infrastructure from 22 official EU languages to English, improving the multilingual experience provided to its users.
Over the course of the project, partners trained a set of translation engines provided by partner Pangeanic on a selection of metadata from the Europeana infrastructure, including bilingual and monolingual data as well as multilingual vocabularies. Additional data from the OPUS collection website were also considered for languages that were not sufficiently represented. A number of experiments were performed to decide on the best combination of training data and engine set-up for each language. The data were split into training and test sets, and an automatic evaluation based on standard metrics (such as BLEU and TER) was performed for all language pairs. For most languages, the results demonstrate considerable improvement over both the generic Pangeanic models (before the in-domain training) and the eTranslation DSI.
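The held-out evaluation described above can be sketched as follows. This is a minimal illustration and not the project's actual pipeline: the function names, the 10% test fraction and the sample strings are assumptions, and the metric is a simplified edit rate (word-level edit distance divided by reference length) that, unlike real TER, does not model block shifts.

```python
import random


def split_train_test(pairs, test_fraction=0.1, seed=0):
    """Shuffle (source, reference) pairs and hold out a fraction for testing.

    The fraction and seed are illustrative assumptions.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simplified_ter(hypotheses, references):
    """Corpus-level edit rate: total word edits / total reference words.

    Simplification: real TER also allows block shifts, which are ignored here.
    """
    edits = sum(
        word_edit_distance(h.split(), r.split())
        for h, r in zip(hypotheses, references)
    )
    ref_words = sum(len(r.split()) for r in references)
    return edits / ref_words if ref_words else 0.0


if __name__ == "__main__":
    # Made-up example: one missing word against a four-word reference.
    print(simplified_ter(["portrait of woman"], ["portrait of a woman"]))
```

Lower values indicate fewer edits needed to turn the engine output into the reference. For reported results, an established implementation of BLEU and TER is preferable to a hand-written one.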
Evaluation of the automatic translation by human experts
The automatic translations also underwent extensive evaluation by linguists and cultural heritage experts. Evaluators were asked to rate the automatic translations into English on a scale from 0 to 100, considering aspects such as fluency (grammatical correctness), accuracy (general meaning), and adequacy (proper use of terminology). They were also asked to provide additional feedback, including reports of important and recurrent errors. Three crowdsourcing campaigns were organised through the CrowdHeritage platform to engage members of the cultural heritage sector. In total, 44 expert linguists and 29 cultural heritage professionals took part, giving high ratings (above 80%) for the majority of the 22 languages.
The results of the human evaluation gave us insight into the behaviour of the machine translation engines for different languages. An in-depth statistical analysis of the human ratings, correlated with the automatic confidence scores calculated by the machine translation engines, allowed us to determine appropriate quality thresholds for publishing translations from the various languages to the Europeana infrastructure.
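One way such a threshold could be derived is sketched below. The project's actual statistical analysis is not described here, so everything in this example is an assumption: the function name, the target rating of 80 and the sample data are hypothetical. The idea is to find the lowest confidence cut-off for which the translations kept (those at or above the cut-off) still reach a target mean human rating.

```python
def pick_confidence_threshold(samples, target_rating=80.0):
    """Return the lowest confidence cut-off whose retained translations
    (confidence >= cut-off) have a mean human rating >= target_rating.

    samples: iterable of (confidence, human_rating) pairs for one language.
    Returns None if no cut-off reaches the target.
    Hypothetical helper; not the project's published method.
    """
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    best = None
    total = 0.0
    i = 0
    while i < len(ordered):
        conf = ordered[i][0]
        # Add every sample sharing this confidence value before testing it.
        while i < len(ordered) and ordered[i][0] == conf:
            total += ordered[i][1]
            i += 1
        if total / i >= target_rating:
            best = conf
    return best


if __name__ == "__main__":
    # Made-up (confidence, rating) pairs for a single language.
    ratings = [(0.9, 95), (0.8, 85), (0.6, 70), (0.4, 40)]
    print(pick_confidence_threshold(ratings))
```

Running this separately per source language would yield a different cut-off for each one, which matches the idea that the engines behave differently across languages.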
Benefits for users and cultural heritage institutions
The translation engines are being used by the Europeana infrastructure to produce, index, share and display automatic English translations of metadata, which will allow people to better discover, analyse, and reuse material.
The positive impact of this work has been confirmed by an impact assessment survey completed by 27 linguists and 18 cultural heritage experts. When asked about the added value that automatic English translations can bring to the search and display of cultural heritage items on the Europeana website, both communities considered it important. They also appreciated the expected increase in the number of search results, which would include cultural heritage items that are not currently returned when searching in English: 83.4% of the cultural heritage experts and 62.9% of the linguists considered this improvement valuable.
Moreover, the translation engines set up by the project can be useful to data providers who wish to translate the metadata of their collections to English, improving their collections’ accessibility. Users of the MINT aggregation platform can make direct use of the existing API-interlinking with the engines, while cultural heritage institutions with technical expertise can take advantage of the readily deployable machine translation engines made openly available on the ELG repository. All cultural heritage experts who participated in the survey declared that they would consider using the Europeana Translate tools to enrich the collections of their organisation with automatic translations to improve discoverability.
Europeana Translate Event - how machine translation & multilingual access impacts cultural heritage
Are you interested in learning more about the Europeana Translate project, its methodology and results? Would you also like to deepen your knowledge of state-of-the-art machine translation technologies and how they can be applied in the cultural heritage sector?
Then join us at the Europeana Translate Event - How machine translation & multilingual access impacts cultural heritage. This is an online event taking place on April 13, 2023, from 14:00 to 17:00 CEST. You will hear project partners explain in detail the methodology and the results obtained in these two years of work. Similar projects will also be discussed, with a critical look at the importance of automated translation of cultural heritage data and metadata, and with reflections on future steps, usability and the challenges of AI technology for the cultural heritage sector.