
2 minutes to read Posted on Thursday June 13, 2024

Updated on Monday July 1, 2024


Marco Rendina

Managing Director, European Fashion Heritage Association


Mauro Cettolo

Senior Research Scientist, Fondazione Bruno Kessler

Close Encounters with AI: an interview on automatic subtitling

What are the benefits and challenges of automatic subtitling? Mauro Cettolo, Senior Researcher with thirty years of experience at Fondazione Bruno Kessler, explains to EFHA’s Marco Rendina what subtitling is and how it will be made available to cultural institutions in the context of the AI4Culture project. This interview is part of the AI4Culture project news series.

A man seated at a table. He speaks into a microphone while watching several TV screens.
TV watching (TV tittande), 4 February 1967
Örebro Kuriren
Örebro County Museum

Marco Rendina: Let’s start from the beginning. Can you give us a definition of subtitles?

Mauro Cettolo: Sure. Subtitles are short pieces of text that usually appear at the bottom of a screen. Many, if not all of us, have seen subtitles at least once in our lives, for example, when watching a film in a language we do not speak. They extend the accessibility of audiovisual content to people who either do not know the language in which it is spoken or, for various reasons, cannot listen to the audio.

MR: Ah, of course, so subtitles are translations of what is being said?

MC: Actually, there are different types of subtitling. In addition to subtitles presenting users with actual translations of what is being said, there is subtitling in the same language as the speech, as well as a richer form of subtitling, which includes the description of sounds, making content more accessible.

MR: What type of subtitling is the AI4Culture project working on?

MC: We are focusing on cross-lingual subtitling, following our dream of making video content accessible across languages to an increasingly diverse audience. This is an active and challenging line of research that in recent years has seen the emergence of various automatic approaches. These include the so-called “cascade” approaches, where the task is tackled by a pipeline of separate AI components for audio segmentation, speech transcription, text translation and temporisation. It also includes novel solutions, where the task is performed by a single neural model designed to execute all the steps of the process.
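By way of illustration, the four stages of a cascade approach can be sketched as a chain of components, each feeding the next. This is purely a toy sketch with stub implementations and hypothetical names, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # seconds
    end: float
    audio: bytes = b""    # raw audio slice (placeholder)

def segment(audio: bytes) -> list[Segment]:
    """Stage 1: audio segmentation splits the continuous stream into utterances."""
    # Stub: pretend voice activity detection found two utterances.
    return [Segment(0.0, 2.5, audio), Segment(2.8, 5.0, audio)]

def transcribe(seg: Segment) -> str:
    """Stage 2: speech recognition turns each segment into source-language text."""
    return "bonjour tout le monde"  # stub output

def translate(text: str) -> str:
    """Stage 3: machine translation produces the target-language text."""
    return {"bonjour tout le monde": "hello everyone"}.get(text, text)

def temporise(seg: Segment, text: str) -> dict:
    """Stage 4: temporisation attaches timing so subtitles align with the speech."""
    return {"start": seg.start, "end": seg.end, "text": text}

def subtitle(audio: bytes) -> list[dict]:
    """Run the full cascade on one audio stream."""
    return [temporise(s, translate(transcribe(s))) for s in segment(audio)]
```

In a real cascade, each stub would be a trained neural component; the point here is only the pipeline structure, where errors in an early stage propagate to all later ones.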

MR: What challenges does the development of automatic approaches for subtitling pose?

MC: Cross-lingual subtitling is not a mere translation. It is a multifaceted task, made more complicated by the need to balance many aspects simultaneously.

We start from the audio input: this aspect alone, taken in isolation, presents challenges in a research area that is very active today, known as Speech Translation. Consider, for example, the fact that words in written text are delimited by spaces, while speech reaches us as a continuous audio stream, in which words are often hard to distinguish from one another.

If we add to this the fact that spoken words reach us distorted by particular accents, pronunciation, hesitations, with the interference of music and background noises, or with the confusion caused by the overlap of multiple speakers, we can imagine the difficulties a machine, a software model, faces in a seemingly simple task like translating speech.

MR: Now we understand why you defined subtitling as a multi-faceted task! What else makes it difficult?

MC: Well - the kind of translation required by subtitling is a typical example of what we call constrained translation. A good subtitle must meet specific requirements: it has to be minimally invasive. To be user-friendly, subtitles must minimise the cognitive load required for the user to read the text while watching the content. This way, a person can enjoy the video content without distractions and, above all, without excessive effort due to reading.

MR: What constraints must a subtitle meet to avoid being invasive?

MC: Constraints are temporal, spatial and syntactic. From a temporal point of view, subtitles must be perfectly aligned with the video stream, to avoid situations where someone is speaking but we cannot read what they are saying. From a spatial point of view, subtitles must be concise enough not to require too much time to read and to reduce the eye movements (known as saccades) necessary for reading. Finally, there are syntactic constraints: the splitting of a subtitle into lines should not separate the constituents of phrases. These are not general principles: there are strict rules, albeit slightly different across content providers.

MR: Is it possible for machines to perform these tasks that, just a few years ago, were considered unachievable?

MC: In part, yes, thanks also to projects like AI4Culture. Today we have neural network-based models capable of generating acceptable subtitles for different language pairs. ‘Acceptable’ means that they are certainly not suitable for major Hollywood productions, but usable for that enormous amount of audiovisual material that otherwise would remain forever inaccessible due to language barriers and lack of resources for translation. Sometimes our models still make mistakes, even funny ones, but we are on the right track: we train models on specific languages, and the results are sufficient to convey the meaning of what was said and, if possible, are suitable for manual revisions - way better than starting from scratch!

MR: Sounds great - what are the next challenges we will face then?

MC: I'll mention three.

The first one concerns the automatic evaluation of systems. At the moment, our evaluations are fragmented into a multitude of metrics to assess models against each of the constraints at play. Combining these judgments into a single score remains a complex problem, as well as one of my main research interests in the immediate future.
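One naive way to collapse such per-constraint judgments into a single number is a weighted average of normalised metric scores. This is purely a sketch, not a proposed solution to the research problem: the open question is precisely how to choose the weights, and whether a linear combination is adequate at all. All metric names, scores and weights below are made up:

```python
def combine(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of metric scores, each assumed normalised to [0, 1].

    The weights are simply given here; choosing them so that the combined
    score correlates with human judgment is the hard, unsolved part.
    """
    total = sum(weights.values())
    return sum(scores[m] * weights[m] for m in scores) / total

quality = combine(
    {"translation": 0.8, "timing": 0.9, "length": 1.0},  # made-up scores
    {"translation": 0.5, "timing": 0.3, "length": 0.2},  # made-up weights
)
```

The example combines a hypothetical translation-quality score with timing and length compliance into one figure, which is exactly the kind of aggregation that current evaluation practice fragments across separate metrics.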

The second one is that of language coverage: today we are able to deal with a very limited set of language pairs, mostly English-centric. However, there are over 7,000 languages in the world and, for most of them, there is no data, nor computer tools and models.

The third challenge is environmental. Today's AI is capable of doing great things, but the energy costs of the so-called foundation models, which depend on huge computational resources, are extremely high. Still a lot to do, but projects like AI4Culture give us the chance to share our work with the world and collectively advance in the field.

MR: Thank you for your insights into this challenging and exciting research area. From now on, we will enjoy subtitles with a completely different and much more aware perspective!

Find out more

Later this summer, the automatic subtitling pipeline presented above is going to be integrated into an open-source and user-friendly automatic subtitling tool. It will allow cultural heritage institutions to automatically create subtitles in eight languages for their audiovisual materials, while also enabling their manual editing and validation.

In September 2024, AI4Culture will also launch a platform where open tools, like the automatic subtitling tool, will be made available online, together with related documentation and training materials.

Keep an eye on the project page on Europeana Pro for more details and stay tuned on the project LinkedIn and X accounts! For now, anyone interested in deploying the automatic subtitling pipeline can explore the open-source code available on GitHub.