Beth Daley

Editorial Adviser, Europeana Foundation

Matteo Romanello

Mining and exploring 200 years of newspapers: the impresso project

With the launch this week of the Europeana Newspapers thematic collection, Dr Matteo Romanello, winner of a Europeana Research Grant in 2018, tells us about impresso - a new project exploring ways to use digitised newspaper data for research.

Hi Matteo! Can you tell us about the impresso project?

Matteo: Impresso (literally ‘what has been printed’) is a collaborative and interdisciplinary research project funded by the Swiss National Science Foundation under the Sinergia funding scheme. The project’s aim is to create a technological framework to extract, process, link, and explore data from print media archives on a large scale. 

The project involves computational linguists, digital humanists, designers, historians, librarians and archivists, who are tackling the challenge of how to enrich, represent, visualise and analyse a large corpus of historical digitised newspapers for research purposes. Partners in this project are EPFL’s DHLAB, the Luxembourg Center for Contemporary and Digital History (C2DH) and the Institute of Computational Linguistics at the University of Zurich. The interdisciplinary nature of impresso is also reflected in the principle of co-design that we apply throughout the project. In practice, this means that the data we create and the tools we are developing for working with digitised newspapers are shaped by a constant dialogue between historians, designers, computational linguists and digital humanists.

As for the conception of and motivation for impresso: before it began, the DHLAB had been involved in a research project with the Swiss newspaper Le Temps, aimed at providing access to two digitised newspapers - Journal de Genève and Gazette de Lausanne (which merged in 1998 to become Le Temps). The outcomes of this project, as well as the challenges that had emerged, laid the ground for impresso. The idea of creating an archive of digitised newspapers lent itself well to being scaled up to include more sources and to look beyond national borders. A series of encounters at conferences and workshops between Maud Ehrmann (DHLAB), Lars Wieneke (C2DH), Marten Düring (C2DH) and Simon Clematide (UZH) helped to strengthen and articulate this idea into what became a successful funding proposal.

How did you get involved with the project? 

My colleague and project coordinator Maud Ehrmann asked me to join the project in the summer of 2017, when an unexpected change in the project team opened up the possibility of having another post-doc researcher to support her in the tasks that the DHLAB was leading. At that time, I was working on Linked Books, another SNF-funded project on citation mining of scholarly literature about the history of Venice. The work on named entity processing and disambiguation that we are carrying out in impresso is at the core of my research interests. There is also a continuity with Linked Books and my previous research on information extraction from large-scale digital archives in the Humanities, with citations (and more generally named entities) being one of my main areas of interest. 

What is the importance of newspaper datasets for historical research?

Historical newspapers are invaluable primary sources for humanities scholars at large, not only historians. In fact, they contain and preserve a kind of fossilised trace of our current and past societies. They record all kinds of events, from declarations of war to Saturday evening dances in the countryside, and they document many aspects of day-to-day life and culture. They contain extremely rich and dense information, which is also continuous, since in many cases these newspapers ran for a long time and were published on a very regular basis.

A crucial challenge that we are addressing in impresso is how to devise a tool that supports researchers in working with large archives of digitised newspapers. The tool integrates natural language processing technologies (e.g. named entity processing or topic modelling) to capture the semantics of newspaper contents, in order to make these (enhanced) sources usable for research. An important principle we are following in its design is transparency, meaning we strive to make explicit and visible to users all aspects of the data - or of the processing we perform on the data - that often risk remaining hidden in search interfaces. Aspects we want to make more transparent include, for example, OCR quality, as well as gaps in the data due to damaged digital archives.

How are impresso tools being used? 

Despite the fact that the impresso project is still in the making, its corpus and tools are actively being used both for research and teaching. 

On the research side, Dr. Estelle Bunout (C2DH) - one of the (digital) historians in our project - is working on a case study entitled ‘Resistance to Europe’, which involves the analysis of debates on the European idea in digitised newspapers from Luxembourg, Switzerland and beyond, with the aim of identifying tensions around the European idea from the late 19th century to 1945. Researchers from our associated partners, the Infoclio association and the University of Lausanne’s History Department, are also contributing to the reflection on how to apply impresso tools to historical research questions in the context of concrete use cases.

Finally, we issued a Call for Associated Researchers during the first year of the project in order to extend the circle of historians affiliated to the project. As a result, about 20 historians, mainly from Benelux, France, Germany and Switzerland, expressed their interest in both the tools and the collections brought together by impresso and have become involved in the project. Their association entails not only the use of the project’s output but also a regular dialogue with the impresso team, via workshops and a final conference aimed at collecting feedback on their use of impresso tools and their research, and at discussing epistemological issues raised by digitised newspapers.

The diversity of topics and methods of the associated researchers reflects the appeal of Swiss and Luxembourgish (digitised) newspapers as historical sources. They include prosopographical research on experts and female war correspondents, as well as work on the ‘history of thoughts’, such as the rise of liberal internationalism at the end of the 19th century, or banking history. Each of these research topics requires a particular use of the newspapers, a particular way of querying them, which in turn helps to shape how users interact with the impresso collection. These diverse uses are nevertheless made available to all researchers in the same interface, in an effort to diversify these interactions and enrich every type of research practice, including teaching, in the spirit of generous interfaces.

On the teaching side, Martin Grandjean and Sandra Bott have been using part of the impresso corpus in teaching a Digital Humanities/Digital History course, part of EPFL’s Social and Human Sciences programme. The course focuses on how the big events of the 20th century were covered in the press; digital archives of newspapers provide the students with a rich source of materials on which a range of digital methods and tools can be tested. The same course is planned for next year and will be based on the impresso interface and tools, thus allowing us to test the strengths and weaknesses of these tools specifically in a teaching (rather than research) context.

As part of Ranke2, the C2DH platform offering teaching materials on how to practise digital source criticism, the impresso project is preparing a module dedicated to the use of digitised newspapers. This module draws on the lessons learned in preparing a transparent interface and is adapted to bachelor-level and secondary school teaching, bringing the latest research practices into the classroom.

Impresso interface (alpha version): display of search results. Matteo Romanello, 2019, Italy.

Where are you up to in the project - and what is the next step? 

The beta version of the impresso interface was released in May 2019. For now it’s a private release, mostly aimed at getting feedback on the interface design and functionalities from our associated historians. In terms of data, the interface gives access to 22 Swiss newspapers for a total of almost 3.2 million pages, 360,000 newspaper issues, and over 26 million content items (e.g. articles, advertisements, etc.), mostly in French and German.

As for the interface functionalities, the beta release contains all the basic features you expect from a newspaper interface: search, search facets and a viewer which lets you read and explore newspaper articles. Additionally, it provides some more advanced features, like the ability to search for named entities, to use topic models as filters to narrow down search results, and the possibility for the user to create and save collections of items. New functionalities that were added in the latest release include the first version of visual search (ability to filter all available images, by date and newspaper) and the bulk download of metadata.

What will happen next? In July we will release the public version of the interface, with new functionalities as well as new newspaper sources (most notably the digitised materials of the Luxembourg National Library). The best way to follow the project as it continues to develop is to join the impresso mailing list - and our associated historians’ group - or follow us on Twitter, as there will be a few exciting new developments in the coming months!
