Studying the impact of digital collections with Wiki-Data

This blog post was written by Marcin Wilkowski from the Digital Humanities Laboratory at the University of Warsaw. You can find more on his blog, wilkowski.org, and on Twitter (@marcinwilkowski).

It's clear that in order to study the impact of digital collections, the GLAM sector needs data that is useful and easy to aggregate. Some basic information can be accessed relatively easily: the number of files made available in digital repositories, and the view and user statistics from Google Analytics or Piwik. These can be obtained without much effort and presented in an institution's annual reports. Europeana does this, for example, via its Statistics Dashboard. But data of this type is of limited use for analysing impact because it is site-specific: it is gathered in relation to one website, while digital content is available in many places on the Internet and is, by its nature, shareable.

For example, the famous “Milkmaid” from the Rijksmuseum’s collection is available on thousands of websites. Yet when we measure the impact of that digital reproduction with Google Analytics, we can only tell how many views it has generated on the museum’s official site. What about the other places where the Milkmaid can be found? Should we ignore them only because they are unofficial and often publish the files in low quality? Tools to measure an image’s reach on the Web do exist: think of popular ones such as Google Images search. But even if we were able to find the alternative and unofficial publishers of our digital heritage content, more work would be needed to measure the impact of those mirror sites. An institution would need further research to find out how many people reach the digital reproductions and for what purposes they are being used.

Different versions of Vermeer's Milkmaid as found on Google Images search in 2013. By Joris Pekel (CC0).

But there is one popular, widely known and massively used digital space where our content can easily be republished: Wikipedia and its sister projects, notably Wikimedia Commons. Wikimedia projects, with hundreds of millions of unique visitors per month, can be a reliable source of impact data for our GLAM efforts, thanks to their general open policy (open licenses or public domain for content, and public domain status for the data itself). In practice, this means that public data gathered via Wikipedia and its sister projects should be available as open datasets for free reuse and for all research purposes. You can access the content and data from Wikimedia projects on the data dumps page, by using a special API, or even by querying the database with the Quarry interface. Using data generated by Wiki projects as a source of impact data can therefore help us build a more reliable framework for measuring the impact of digital collections.
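As a rough illustration of the API route, the sketch below builds a request URL for the MediaWiki API on Wikimedia Commons, asking where a given file is used across Wikimedia projects (the `globalusage` property). The helper name `global_usage_url` and the file title `File:Example.jpg` are placeholders chosen for this example, not something prescribed by the tools discussed in this post.

```python
from urllib.parse import urlencode

# Public MediaWiki API endpoint of Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def global_usage_url(file_title, limit=50):
    """Build an API URL asking where a Commons file is used on other wikis.

    `file_title` is a placeholder; replace it with a file from your collection.
    """
    params = {
        "action": "query",
        "prop": "globalusage",  # usage of the file across Wikimedia projects
        "titles": file_title,
        "gulimit": limit,       # maximum number of usages returned per request
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)


print(global_usage_url("File:Example.jpg"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a JSON document listing the pages that embed the file.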

But what exactly can be measured?

When our files are copied into Wikimedia Commons, we can find out not only how many views they have generated over a period of time, but also for what purposes they have been used within the Wikipedia environment. Even if an institution doesn’t contribute to Wikimedia Commons directly, we can still find out the impact of its digital collections. Tools and resources to do this include:

BaGLAMa 2: BaGLAMa displays page view statistics for pages on Wikipedia and other Wikimedia projects that contain Commons files from a specific category. If your institution publishes its collection on Wikimedia Commons within one category, you can check which files have been used to illustrate specific Wikipedia entries, or which are used in other Wikimedia projects (i.e. Wikipedia articles in other languages, or sister projects like Wikivoyage or Wikinews). This information helps us to answer the following questions:

- for what purposes is our visual content used?
- how many views does it generate within Wikipedia and other Wikimedia projects?
- why is our content, available on Wikimedia Commons, not reused on Wikipedia pages?
- what are the differences in usage of our files/media between Wikipedia language versions?

GLAMorous: GLAMorous allows you to check the usage of specific images across different Wikimedia projects. You can check how many times an image is currently used on Wikimedia project pages and in which language versions. The data is also available in XML format, so you can easily reuse it for your own purposes. Questions that can be answered with that data include:

- how many of our files are used within all Wikimedia projects?
- for what purposes is our visual content used?
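To give an idea of how such usage data could be reused, here is a minimal sketch that counts usages per wiki. It does not parse the GLAMorous XML export; instead it assumes a response shaped like the one returned by the MediaWiki API `globalusage` property, and the sample data (file name, article titles) is invented purely for illustration.

```python
from collections import Counter


def usage_by_wiki(api_response):
    """Count how many pages on each wiki use the files in the response."""
    counts = Counter()
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        # Each usage entry names the wiki and the page that embeds the file.
        for use in page.get("globalusage", []):
            counts[use["wiki"]] += 1
    return counts


# Invented sample in the assumed response shape, not real usage data.
sample_response = {
    "query": {
        "pages": {
            "1": {
                "title": "File:Example.jpg",
                "globalusage": [
                    {"title": "Example article", "wiki": "en.wikipedia.org"},
                    {"title": "Another article", "wiki": "en.wikipedia.org"},
                    {"title": "Przykład", "wiki": "pl.wikipedia.org"},
                ],
            }
        }
    }
}

for wiki, count in usage_by_wiki(sample_response).most_common():
    print(wiki, count)
```

A tally like this gives a quick view of which language versions reuse a collection most, which is one way to approach the questions listed above.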

You can also estimate the future impact of your content on Wikimedia Commons and decide which photos or scans are worth adding to the repository. To do this, check the traffic statistics of Wikipedia entries using the tool available at http://stats.grok.se/. If you discover that many users are interested in a certain topic, you can prepare some visual resources to illustrate related articles and upload relevant content to Wikimedia Commons.
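Traffic figures of this kind can also be collected programmatically. The sketch below does not use stats.grok.se; it assumes the Wikimedia Pageviews REST API as a comparable source, builds a per-article request URL, and sums the views in a response. The article title, the date range and the numbers in `sample_response` are illustrative assumptions, not real measurements.

```python
from urllib.parse import quote

# Base path of the Wikimedia Pageviews REST API (per-article endpoint).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end, granularity="monthly"):
    """Build a per-article pageviews URL; dates are strings in YYYYMMDD form."""
    # Article titles use underscores and must be percent-encoded in the path.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/all-access/user/{title}/{granularity}/{start}/{end}"


def total_views(api_response):
    """Sum the 'views' field over all items in a pageviews response."""
    return sum(item["views"] for item in api_response.get("items", []))


# Invented numbers in the assumed response shape, for illustration only.
sample_response = {
    "items": [
        {"timestamp": "2016010100", "views": 100},
        {"timestamp": "2016020100", "views": 200},
    ]
}

print(pageviews_url("en.wikipedia", "The Milkmaid (Vermeer)", "20160101", "20160229"))
print(total_views(sample_response))
```

Comparing totals for several candidate articles is one simple way to decide which topics deserve new images first.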

But what if your institution cannot publish on Wikimedia Commons because of copyright issues? The Wikimedia environment accepts only openly licensed or public domain content; this is the basic rule for publishing anything to that repository. If an institution already publishes scans in its own digital library or digital archive, some Wikipedia articles may link to these external sources. You can easily check this by using the external link tracker. Such information can support your efforts to make specific content from your institution available on Commons. More Wikimedia tools for the culture sector can be found on the GLAM-Wiki page.
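For readers who prefer a scripted check of external links, the following sketch builds a MediaWiki API request using the `exturlusage` list, which reports pages linking to a given domain, and extracts the page titles from a response. The domain `collections.example.org` and the sample response are hypothetical; replace them with your institution's own address and real API output.

```python
from urllib.parse import urlencode

# API endpoint of one language version; other wikis follow the same pattern.
WIKI_API = "https://en.wikipedia.org/w/api.php"


def external_link_query_url(domain, limit=50):
    """Build an API URL listing pages that link to the given domain."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,   # domain to search for, without the protocol
        "eulimit": limit,
        "format": "json",
    }
    return WIKI_API + "?" + urlencode(params)


def linking_pages(api_response):
    """Return the sorted, de-duplicated titles of pages found in the response."""
    hits = api_response.get("query", {}).get("exturlusage", [])
    return sorted({hit["title"] for hit in hits})


# Hypothetical response in the assumed shape, for illustration only.
sample_response = {
    "query": {
        "exturlusage": [
            {"title": "Warsaw", "url": "http://collections.example.org/item/1"},
            {"title": "Johannes Vermeer", "url": "http://collections.example.org/item/2"},
            {"title": "Warsaw", "url": "http://collections.example.org/item/3"},
        ]
    }
}

print(external_link_query_url("collections.example.org"))
print(linking_pages(sample_response))
```

Running the same query against several language versions shows where your own site is already being cited.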