Who's Using What: LibreCat

Who's Using What is a blog series by Gregory Markus from The Netherlands Institute for Sound and Vision as part of EuropeanaTech. The idea behind the series was to raise awareness about the brilliant open source software options now available to institutions, and to encourage collaboration in the digital heritage community. What better way to show off these tools than talk to the developers making use of and developing them? You can also find lots of OS options in the new EuropeanaTech FLOSS Inventory.

Catmandu is an open source toolkit to manipulate (semi-)structured data. It was developed as part of the LibreCat Project - a collaboration between Ghent University, Lund University and Bielefeld University.

At its simplest, Catmandu offers a command line client and a suite of tools to ease the import, storage, retrieval, export and transformation of data. It solves all sorts of problems that institutions encounter in terms of data management, with a firm emphasis on the importance of open source. Developed by librarians for librarians, it attracts an active developer community.

EuropeanaTech caught their presentation late last year at the SWIB Conference and decided to follow up with them to see which tools and components went into Catmandu, and what they have planned for the future.

Answers from Patrick Hochstenbach - Ghent University.

LibreCat-Catmandu

1. What open source tools are you currently working with?

We export data from sources such as library catalogues, institutional repositories, web services, relational and NoSQL databases, and data dumps (in formats such as MARC, XML, RDF, JSON and YAML).

This data can be transformed using our Fix language and stored once more in catalogues, databases or data dumps. The process is known as ETL (extract, transform, load) and has existed for many years as a billion-dollar market, but, until recently, not many ETL tools targeted at library datasets were available - and certainly not as open source tools.
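To make the extract, transform and load steps concrete, here is a minimal sketch in Python. It is not Catmandu and does not use the Fix language; the sample records, the field names (ti, yr) and the function names are assumptions made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample input: one JSON record per line (not real catalogue data).
RAW = """\
{"id": "1", "ti": "  Open Data in Libraries ", "yr": "2014"}
{"id": "2", "ti": "Linked Data Basics", "yr": "2015"}
"""


def extract(text):
    """Extract: parse each non-empty line as a JSON record."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def transform(record):
    """Transform: rename fields, trim whitespace, convert the year to an int."""
    return {
        "id": record["id"],
        "title": record["ti"].strip(),
        "year": int(record["yr"]),
    }


def load(records, out):
    """Load: write the cleaned records as CSV to a file-like object."""
    writer = csv.DictWriter(
        out, fieldnames=["id", "title", "year"], lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)


def run_etl(text):
    """Run the three steps in order and return the CSV output as a string."""
    out = io.StringIO()
    load((transform(r) for r in extract(text)), out)
    return out.getvalue()


if __name__ == "__main__":
    print(run_etl(RAW), end="")
```

In a real setup the extract step would read from a catalogue or repository and the load step would write to a data store; the three-stage shape stays the same.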

The code is written in Perl and makes heavy use of existing CPAN modules to process the various data formats. In our projects, we like to work with open source tools such as MySQL, PostgreSQL, MongoDB, ElasticSearch and Solr as our backend data stores.

For web development we use Perl Dancer and Plack, which give very easy access to PSGI for creating fast, lightweight web applications. In our coding community we make use of tools such as GitHub, Travis and Coveralls to keep our code clean and tidy and to help our core maintainer, Nicolas Steenlandt, build our CPAN releases.

2. What open source tools have you used in the past to develop larger applications?

Our developers are all running GNU/Linux environments such as Debian and CentOS. Then there are tools such as Docker, Vagrant and Puppet to manage installations and automate a lot of the system administration tasks.

To process large datasets in parallel, we use GNU parallel. We rely very much on great existing open source applications such as Fedora Commons, Blacklight and Koha to implement front-ends. To create statistics we use R. To transform unstructured data we use Pandoc.

There are so many great open source tools available. As coders we don’t favor Perl, Python, Ruby or Java; rather, we choose the best tool for the job.

3. What are you currently developing?

There are several big projects we are working on currently. We are creating a new version of our institutional repository LibreCat based on Catmandu to add support for research data and more CRIS functionality. The code is available on GitHub and will very soon go into production at Bielefeld University.

Then, we are working with iMinds at Ghent University to create tools for processing and publishing Linked Data Fragments. We think this is a great new technology to publish linked data on the Web. Jakob Voss at GBV is creating a great suite of Perl modules to manipulate RDF. Jonas Smedegaard is pushing Catmandu into Debian.

In the next release of Debian (and Ubuntu, etc.), installing Catmandu will be as simple as an 'apt-get install catmandu'. But there is so much more. If you look at GitHub, there are dozens of Catmandu repositories where our community creates new data processing tools.

4. What would you like to see developed?

It would be great if, in the library world, we could exchange more data manipulation recipes. Our intent with the Fix language was to create data mappings that can be read by both librarians and programmers.
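As a rough illustration of what a shareable, human-readable recipe could look like, the sketch below parses a small text recipe and applies it to a record. This is a toy interpreter written for this example, not the Fix language implementation; the step names loosely mirror Fix-style operations, and the recipe syntax, record fields and helper names here are assumptions.

```python
import re

# A toy recipe, one step per line. This parser is a hypothetical
# illustration and is not part of Catmandu.
RECIPE = """
move_field(ti, title)
trim(title)
upcase(title)
remove_field(tmp)
"""


def move_field(rec, old, new):
    """Rename a field if it exists."""
    if old in rec:
        rec[new] = rec.pop(old)


def trim(rec, field):
    """Strip leading and trailing whitespace from a string field."""
    if isinstance(rec.get(field), str):
        rec[field] = rec[field].strip()


def upcase(rec, field):
    """Convert a string field to upper case."""
    if isinstance(rec.get(field), str):
        rec[field] = rec[field].upper()


def remove_field(rec, field):
    """Delete a field; do nothing if it is absent."""
    rec.pop(field, None)


STEPS = {
    "move_field": move_field,
    "trim": trim,
    "upcase": upcase,
    "remove_field": remove_field,
}

STEP_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_recipe(text):
    """Turn recipe text into a list of (function, arguments) pairs."""
    steps = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = STEP_RE.match(line)
        if match is None or match.group(1) not in STEPS:
            raise ValueError(f"unknown step: {line!r}")
        args = [a.strip() for a in match.group(2).split(",") if a.strip()]
        steps.append((STEPS[match.group(1)], args))
    return steps


def apply_recipe(steps, record):
    """Apply each step in order to a copy of the record."""
    rec = dict(record)
    for func, args in steps:
        func(rec, *args)
    return rec
```

Because the recipe is plain text, it can be reviewed by someone who does not program and exchanged between institutions independently of the code that runs it.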

There are so many open source data manipulation tools available these days in Perl, Ruby, Python, Java, NodeJS (See also: http://librecat.org/Catmandu/#related-projects). It would be great to have these projects work together and exchange best practices.

There are already examples of how this could be done, for instance by combining LODRefine and Catmandu. In our LibreCat institutional repository we have a strategic direction to use more R programming for the statistics of data processing. There is no need to reinvent the wheel if you have such a great open source tool available.

Then in my dream scenario, I would open RStudio, write a little Fix script to manipulate a giant MARC/OAI-PMH/RDF dump and generate cool graphs using R. In the end, we do this all to get the data out, cleaned, on the web so it's freely accessible with open licenses.