Staying persistent: EuropeanaTech community help to pave the way for new Europeana URIs
What happens when you ask a community of over 350 technical experts in digital cultural heritage how they feel about URIs? Well, you get a lot of feedback! Europeana is in the process of creating persistent identifiers (URIs) for the platform and we asked the EuropeanaTech Community to help us decide which option is best for them.
Why create new URIs?
To start, Europeana gathers information about entities that constitute a 'semantic layer' of concepts, persons and places around its core of cultural objects. This includes, for instance, names, titles, or time periods in different languages, e.g. the 4th quarter of the 19th century: [4e quart 19e siècle] (fr); [4-я четверть 19-го века] (ru). Enrichments and extensions like this come in droves and the data is always in flux, meaning that concepts, persons, places etc. are always susceptible to change. This can result in broken or inconsistent links, which is not very user-friendly. That’s why there is a high demand for secure and persistent URIs.
As Europeana gathers more and more data about these entities, they percolate through the entire ecosystem, benefitting services like search, display, and enrichment. But, as always, you can have too much of a good thing; as Europeana grows, information management issues appear.
By gathering this information Europeana is building an internal database of knowledge. These resources can be seen as new entities, or bundles thereof, and are worth sharing with our data re-users, especially through data services such as the API or our linked data at data.europeana.eu.
Image from Phil Archer, Dos and Don'ts for Persistent URIs.
One of the challenges we face is how to give these entities identifiers that are both easy to assign and relatively future-proof. There are a number of best practices, but a couple of options still seemed worth investigating. Europeana is in a unique situation because all the data is second-hand, meaning that standard best practices used by the data providers, for instance, may not be the most suitable for Europeana. Specifically, we wanted to find out whether it is worth trying to make identifiers that are human-readable rather than purely numerical IDs.
A human-readable label could look like this:
http://data.europeana.eu/agent/johannes_sebastian_bach
A bare numerical identifier could look like this:
http://data.europeana.eu/agent/12345
Both have their pros and cons. So before making such a big decision, we felt it best to consult the EuropeanaTech Community, due to the wealth of expert advice held within, to help us come to a decision which we will then implement in the coming months. The feedback we received was astounding.
Let’s see what the experts had to say, shall we?
"If you are going to work with identifiers in a linked (open) data situation then you need to be certain that the URIs don't change. Although human readable URIs sound like a good idea because we can read them, in my experience will also be an invitation to debate and change... Johannes Sebastian Bach may become Johann Sebastiaan Bach etc. Words change, names change and people have a tendency to change human readable stuff…"
"As a compromise, you could follow the approach of VIAF, where Bach has a numerical identifier (http://viaf.org/viaf/12304462/), but is presented also as http://viaf.org/viaf/12304462/#Bach,_Johann_Sebastian,_1685-1750."
"From a technical perspective, option 1 or 2 make no difference, so its a social engineering thing. If you want to avoid human confusion in the case of changing names, [numerical identifiers are] safer."
A group of men conducting a serious conversation. The Wellcome Library, CC BY.
"Section 2.3.1 of the Architecture of the WWW suggests that [minting both forms of URL] is not good practice because of the additional load it places on consumers of the linked data."
"[Human readable URIs] have the advantage of giving the human user a clue about what is hidden behind the url (this might save developers a lot of time)..."
"Any URI may have to become human-readable at some point, even if it is not meant to, unfortunately. So, I would say that one either should implement dual URIs (numeric and 'meaningful')."
"There is a famous statement by Dan Brickley: 'First rule of namespace URI design: you're more likely to regret things you included, than things you omitted'."
The above are just a small portion of the numerous contributions we received in less than 24 hours, and you can read some more here. Since numbers are sometimes easier to read than graphemes, we also held a Doodle poll for everyone to vote for either human-readable URIs or just numerical URIs. The final tally came to:
Numerical URI: 24
Human readable URI: 5
I don’t know: 2
We have still not chosen a final design for our URIs. However, we now know which option is most favoured, which is most widely used, and some tricks to harness some of the benefits from both options. Perhaps one of the most valuable lessons we learned is that, whenever a question like this appears, the EuropeanaTech Community is on hand to provide the highest quality expert advice available in Europe.
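One of those tricks, the VIAF-style compromise quoted above, can be illustrated with a minimal Python sketch. It is not Europeana's chosen design: the function names are hypothetical, 12345 is the placeholder identifier used earlier in this post, and the convention of carrying the readable label in the URI fragment is an assumption borrowed from the VIAF example. The idea is that the persistent part depends only on the number, while the label is a disposable hint that can change without breaking links.

```python
from urllib.parse import quote, urldefrag

# Namespace taken from the example URIs in this post.
BASE = "http://data.europeana.eu/agent/"


def mint_uri(numeric_id: int) -> str:
    """Return the persistent URI, which depends only on the numeric id."""
    return f"{BASE}{numeric_id}"


def display_uri(numeric_id: int, label: str) -> str:
    """Append a human-readable hint as a fragment (hypothetical convention)."""
    hint = quote(label.replace(" ", "_"), safe=",_-")
    return f"{mint_uri(numeric_id)}#{hint}"


def canonical(uri: str) -> str:
    """Strip the fragment to recover the persistent identifier."""
    return urldefrag(uri).url


# The label is corrected over time, but the identifier stays the same.
old = display_uri(12345, "Johannes Sebastian Bach")
new = display_uri(12345, "Johann Sebastian Bach")
print(old)
print(new)
print(canonical(old) == canonical(new))
```

Because the fragment is stripped before comparison, a debate over how to spell a name only ever touches the hint and never the identifier that other datasets link to.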
A huge thank you to everyone for your input and stay tuned to see the new Europeana URIs!