
2 minutes to read Posted on Thursday October 15, 2020


Twan Goosen

Software developer, CLARIN ERIC

Exploring new resources in CLARIN’s Virtual Language Observatory

Since 2017, CLARIN and Europeana have worked together to increase the number of cultural heritage objects available for quick and easy discovery as well as processing by humanities and social sciences scholars. In this post, we take a look at the new resources integrated into CLARIN’s Virtual Language Observatory.


The Virtual Language Observatory 

CLARIN is a research infrastructure that aims to support researchers in the humanities and social sciences by making digital language resources and tools from all over Europe and beyond accessible through a single sign-on online environment. As partners in the Europeana Digital Service Infrastructure (DSI), Europeana and CLARIN are working together to embed cultural heritage content into CLARIN’s infrastructure. Since an initial pilot integration in 2017, CLARIN has regularly updated and extended the selection of cultural heritage objects it includes in its Virtual Language Observatory (VLO). This online search and discovery service focuses on the needs of scholars looking for language resources, and is integrated into the wider CLARIN infrastructure. 

New resources for researchers 

A key part of this integration is improving user access to online analysis and processing options for any resource found through the VLO. These options are available for a wide variety of cultural heritage resources 'harvested' through Europeana, ranging from Renaissance-era manuscripts and digitised newspapers to historical children’s books and oral history recordings.

In April 2019, we wrote about the first resource integration. We showed a powerful example of how people can process a language resource directly from their browser with a few clicks after discovering it. At that point, about 135,000 records had been sourced from Europeana and included in the VLO. Since then, we have carried out two additional iterations of selection and integration, resulting in over 275,000 records from Europeana, which is more than any other individual provider of metadata records currently in the VLO. Below, we present two additional examples of resources that are currently available, and demonstrate how they can be processed further.

‘O kimmeryjskich pomnikach w Krymie’

'O kimmeryjskich pomnikach w Krymie' is a Polish book from 1882, provided by the Federacja Bibliotek Cyfrowych as a PDF, with its full text available thanks to OCR (optical character recognition). As the animation below shows, someone using the VLO can explore processing options by selecting a link to an individual file and processing it with the Language Resource Switchboard. For this record, a variety of interesting natural language processing tools are available, most of them provided by the Polish CLARIN-PL consortium.

Computational linguists might want to see the result of the various types of linguistic analyses available, while humanities scholars might find it interesting to explore the output of the keyword extractor, which provides a ranked list of topics automatically detected as being relevant to the text. The tool that offers this type of analysis for Polish is ReSpa. It can be started directly from the Switchboard, and by doing so researchers can quickly gain an understanding of the content of a work without even opening it! This can also be helpful to those who don’t read Polish, as the topic list can easily be translated using a generic text translation tool such as Google Translate. For this example, we can find out within a few minutes that, based on the content of the book, its main topic is monuments.

Title: An example of the output of processing by the ReSpa tool for keyword extraction, pasted into Google Translate. Listing of common words within the text and their translation.

Creator: Twan Goosen

Date: 2020

CC BY-SA

‘Een theepartijtje van Mevrouw Poes: eene vertelling uit Katsland’

Our second example is a digitised 19th-century children’s book provided by the National Library of the Netherlands: 'Een theepartijtje van Mevrouw Poes: eene vertelling uit Katsland'. A direct link to a PDF is available for this resource. Besides the scans of the story and its rich illustrations, the PDF also encodes the full content of the book as machine-readable text.

By using the Language Resource Switchboard, a user can find out that the Voyant distant reading tool is an available processing option. Once the resource is loaded into Voyant, the text is presented beside various metrics and a set of tools that allow a scholar to carry out quantitative analyses of the terms within the text, as in the example below. 

Title: Word cloud of terms appearing in 'Een theepartijtje van Mevrouw Poes: eene vertelling uit Katsland'.

Creator: Twan Goosen

Date: 2020

CC BY-SA

This corpus has 1 document with 2,836 total words and 1,010 unique word forms. Vocabulary density: 0.356. Average words per sentence: 32.2. Most frequent words in the corpus: mevrouw (49); poes (38); mademoiselle (18); theepartijtje (17); monsieur (14).
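For readers curious how metrics like these are derived, here is a minimal Python sketch. It is not Voyant's implementation: the function name `corpus_summary`, the simple letter-based tokenizer and the sample sentence are all assumptions made for illustration, and a different tokenizer would give somewhat different counts. It treats vocabulary density as the number of unique word forms divided by the total number of words.

```python
import re
from collections import Counter


def corpus_summary(text: str, top_n: int = 5) -> dict:
    """Compute simple distant-reading metrics for a single text.

    Hypothetical helper for illustration only; Voyant's own tokenizer
    and settings may produce different numbers.
    """
    # Assumption: a "word" is any run of letters, compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total_words = len(tokens)
    unique_forms = len(counts)
    # Vocabulary density: unique word forms divided by total words.
    density = unique_forms / total_words if total_words else 0.0
    return {
        "total_words": total_words,
        "unique_forms": unique_forms,
        "vocabulary_density": round(density, 3),
        "most_frequent": counts.most_common(top_n),
    }


if __name__ == "__main__":
    # Made-up sample sentence, not a quotation from the book.
    sample = "Mevrouw Poes geeft een theepartijtje. Mevrouw Poes schenkt thee."
    print(corpus_summary(sample, top_n=2))
```

The figures reported above are consistent with this definition of density: 1,010 unique word forms divided by 2,836 total words is approximately 0.356.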

Find out more

Some other interesting collections added since our last report that you can now explore via the VLO include:

If you are curious about these and the many other collections available in the Virtual Language Observatory, and would like to explore the tools available for analysing and processing them, visit vlo.clarin.eu, enter some search terms and start exploring!
