
Posted on Thursday July 18, 2024

Updated on Thursday July 18, 2024


Marco Rendina

Managing Director, European Fashion Heritage Association


Tom Vanallemeersch

Machine translation specialist, CrossLang

Close Encounters with AI: transcribing multilingual cultural heritage text with AI

EFHA’s Marco Rendina and CrossLang’s AI advisor Tom Vanallemeersch delve into the challenges of automatically transcribing texts. How can Optical Character Recognition (OCR) transcriptions be further enhanced for Machine Translation applications, and how does this benefit cultural heritage institutions? Read the interview - part of the AI4Culture interviews series - to find out.

Two digitally illustrated green playing cards on a white background, with the letters A and I in capitals and lowercase calligraphy over modified photographs of human mouths in profile. 'Handmade A.I' by Alina Constantin, Better Images of AI.

Marco Rendina: Let's start by unpacking OCR. What is it, and why is it relevant to the preservation of cultural heritage?

Tom Vanallemeersch: OCR (Optical Character Recognition) or HTR (Handwritten Text Recognition) is a technology that produces a digital transcript of printed or handwritten texts. Transcriptions of scanned documents are important mainly for searchability: they allow users to find a specific document by keyword, or to locate a specific passage within a document. To enhance this searchability further, transcriptions can be machine-translated, enabling users to search documents in different languages using, for example, only an English search term.

MR: How effective is current state-of-the-art OCR technology?

TV: Recent years have seen remarkable progress in OCR technology, and some OCR models perform impressively well, especially on modern printed texts. There is also a wide array of increasingly specialised models catering for different needs, such as 18th-century texts or handwritten WWII letters.

However, despite these advancements, challenges persist due to factors like different handwriting styles and text layouts, the languages involved, or the presence of ‘noise’ (degraded characters or bleed-through in double-paged documents, where the ink of the backside appears on the front side). Issues like the misrecognition of characters can dramatically impact the accuracy of OCR transcriptions, a problem that becomes particularly evident when these outputs are used for translation purposes.

Based on our experience at CrossLang with the development of systems for multilingual document processing and translation automation, we addressed these challenges head-on to ensure that the OCR output is not just accurate, but also translation-ready.

MR: Can you walk us through how you make OCR transcriptions ready for translation?

TV: Certainly. Making the transcriptions translation-ready is a multi-step process.

Firstly, the document or image is uploaded and OCR is applied to generate a digital transcript. This involves analysing the page layout and identifying the characters in the text areas. Because this process is automated, the resulting output may contain errors such as misrecognised characters and missing spaces. The OCR output also typically lacks segmentation: it presents lines of printed or handwritten characters exactly as they appear in the image, without grouping them into sentences. This may be acceptable as long as the end user only needs to read the text in the original language, but feeding the raw OCR output, with its spelling errors and missing segmentation, directly into machine translation will very likely produce inaccurate translations.

We employ various techniques to address these inaccuracies. I’ll mention two main approaches. First, segmentation and dehyphenation techniques are employed to identify and separate sentences within the text and remove word-splitting hyphens at the end of lines. Second, to further enhance the accuracy of the OCR output, we use lexicon-based tools and Large Language Models (LLMs), including open-source chatbots, for automatically identifying and correcting errors in words to align the text as closely as possible with the original image.
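The dehyphenation and segmentation steps described above can be sketched in a few lines. This is a deliberately minimal illustration, not CrossLang's actual tooling: it joins words split by end-of-line hyphens, merges OCR lines, and splits sentences with a naive punctuation rule (real systems handle abbreviations, other scripts and languages, and much more).

```python
import re

def dehyphenate(text: str) -> str:
    """Join words split across lines by end-of-line hyphens,
    e.g. 'docu-\\nment' -> 'document'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def merge_lines(text: str) -> str:
    """Collapse the remaining OCR line breaks into spaces."""
    return re.sub(r"\s*\n\s*", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    """Naive segmentation: split after terminal punctuation
    that is followed by whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

ocr_lines = "The archive holds many docu-\nments from the war. They are fragile."
text = merge_lines(dehyphenate(ocr_lines))
print(split_sentences(text))
# -> ['The archive holds many documents from the war.', 'They are fragile.']
```

Segmenting into sentences matters because machine translation systems operate on sentence-like units; feeding them raw OCR lines cut mid-word or mid-sentence degrades the output.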

Finally, with the corrected OCR output, MT can be applied to generate translations that are more accurate. This step relies on the quality of the input text, making the previous two automatic correction steps crucial for achieving useful MT results.

Fragment of a Dutch letter from World War II. Correcting errors in the OCR output using various techniques and identifying sentences in the output improves the results of automated translation.

MR: How do you evaluate whether this correction process has been successful?

TV: We use automated metrics such as Character Error Rate (CER) and Translation Edit Rate (TER) to assess the accuracy and quality of the corrected OCR output and its translation. These metrics allow us to compare the corrected OCR output with the ground truth (the desired transcription), providing valuable insights into the efficacy of our methods. We have observed significant enhancements in this regard, as both CER and TER generally decrease after the correction of OCR output.
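The Character Error Rate mentioned here has a simple definition: the minimum number of character insertions, deletions and substitutions needed to turn the OCR output into the ground truth, divided by the length of the ground truth. A minimal sketch (the example strings are invented for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalised by
    the length of the ground-truth reference."""
    return edit_distance(hypothesis, reference) / len(reference)

# One misrecognised character plus one missing space = 2 edits
# over 18 reference characters, i.e. a CER of about 0.11.
print(cer("Tbe archiveis old", "The archive is old"))
```

Translation Edit Rate (TER) follows the same idea at the word level, measuring how many edits a translation needs to match a reference translation, so a drop in both CER and TER after correction indicates the pipeline is working.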

We also occasionally conduct manual inspections to ensure the overall accuracy of a text, as even a minor error can alter a sentence's meaning, possibly resulting in misunderstandings or inaccuracies. There are also cases where someone (like a historian) wishes to preserve certain elements of the text, including potential errors (such as wrongly spelt words); here, an LLM might ‘overcorrect’ (similarly, it may replace words written in an older variant of a language with their newer forms). Such preservation-oriented scenarios (‘diplomatic transcription’) require careful manual inspection.

MR: What advice would you give to cultural heritage institutions that want to integrate advanced OCR and translation technologies into their preservation efforts?

TV: The paramount advice I can offer is to closely follow the developments of the AI4Culture project. In October 2024, we will offer an online workshop targeted at cultural heritage students and experts, in which we explain the application of OCR and MT to scanned documents in a hands-on fashion and provide more technical detail on aspects such as the automated correction of OCR output. So stay tuned to the AI4Culture social media accounts.

Find out more

In September 2024, the AI4Culture project will launch a platform where open tools, like the OCR tools presented above, will be made available online, together with related documentation and training materials. Keep an eye on the project page on Europeana Pro for more details, and follow the project's LinkedIn and X accounts!
