Who's Using What - Douglas Duhaime Developer Profile
This week’s “Who’s Using What” spotlight goes to Douglas Duhaime from the University of Notre Dame. Douglas’s research takes him to the intersection of early modern natural philosophy and classical political economy in eighteenth-century literary works. He has been pursuing the relationship between these two fields using natural language processing techniques, running handwritten scripts on text data from Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO), and the Philosophical Transactions, looking to trace patterns in Enlightenment-era literary history.
You can follow Douglas on Twitter, GitHub, and DHCommons, and learn more about his work on his website.
1. What open source tools are you currently working with?
I'm currently working on a few different text mining projects, a few of which use libraries like WEKA and scikit-learn for analysis and ggplot or D3 for visualization. Because a lot of my work revolves around natural language processing, tools like the Stanford NLP pipeline, Princeton's WordNet, and the Snowball stemmer are standard resources. When working with early modern texts, I've also been using VARD2 and the variant tables in the MorphAdorner package for orthographical normalization.
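For readers who haven't used these libraries, here is a minimal sketch of the kind of normalization step Douglas describes, assuming NLTK is installed with its WordNet corpus downloaded; the sample words are invented for illustration.

```python
# A minimal sketch of stemming and WordNet lookups with NLTK. Assumes the
# 'wordnet' corpus has been fetched via nltk.download('wordnet').
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import wordnet

stemmer = SnowballStemmer("english")

# Collapse inflected forms to a shared stem before counting frequencies
print([stemmer.stem(w) for w in ["philosophers", "reasoning", "natures"]])
# e.g. ['philosoph', 'reason', 'natur']

# Query WordNet synsets to expand a search vocabulary
for synset in wordnet.synsets("commerce"):
    print(synset.name(), "-", synset.definition())
```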
2. What open source tools have you used in the past to develop larger applications?
A few of the tools I've built draw upon Selenium's browser automation framework, which handles JavaScript- and AJAX-rich environments well, and Whoosh's full-text indexing functions. Almost everything I write draws upon Python's Natural Language Toolkit (NLTK) and BeautifulSoup at some point. When working on applications that require fuzzy string matching, I like to use difflib.
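difflib ships with Python's standard library, so fuzzy matching of the sort Douglas mentions needs no extra dependencies. Here is a small illustrative sketch; the early modern spelling variants below are invented.

```python
# Match invented early modern spelling variants against modern forms
import difflib

modern = ["natural", "philosophy", "economy", "astronomy"]

for variant in ["naturall", "philosophie", "oeconomy"]:
    # get_close_matches returns candidates whose similarity beats the cutoff
    matches = difflib.get_close_matches(variant, modern, n=1, cutoff=0.6)
    if matches:
        ratio = difflib.SequenceMatcher(None, variant, matches[0]).ratio()
        print(f"{variant} -> {matches[0]} ({ratio:.2f})")
```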
3. What are you currently developing?
I'm working with others in the University of Notre Dame's Text Mining Working Group to develop an open source web service capable of identifying literary allusions in user-provided texts. The prototype runs custom algorithms against an SQLite database with a Django back-end, and uses many of the text processing libraries discussed above.
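The working group's actual algorithms aren't described here, so the following is only a hypothetical sketch of one common approach to allusion detection: matching shared word n-grams between a candidate text and a known source.

```python
# Hypothetical sketch: flag shared word n-grams as possible allusions.
# This is not the working group's method, just a common baseline approach.
def ngrams(text, n=4):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

source = "of man's first disobedience and the fruit of that forbidden tree"
candidate = "she sang of that forbidden tree whose taste brought woe"

# Any n-gram shared with a known source is a candidate allusion to review
print(ngrams(source) & ngrams(candidate))
# {'of that forbidden tree'}
```

A service like the one described could precompute such n-grams for its source corpus and store them in the SQLite database for fast lookup.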
4. What would you like to see developed?
I would like to see a library capable of estimating the likelihood that a given text contains one or more simple ciphers (substitution or null, for instance). Such a resource would be especially useful for screening large collections of early modern texts for concealed writing. I also eagerly await both the eMOP team's OCR engine for early modern typography and a package capable of achieving high sentiment classification scores on older prose. Both will be very valuable tools.
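As a rough idea of what such a library might do for substitution ciphers, here is a hypothetical sketch that scores a text with a chi-squared statistic against approximate English letter frequencies; the frequency table and sample sentences are our own, not part of any existing package.

```python
# Hypothetical sketch: score a text's letter distribution against expected
# English frequencies. A simple substitution cipher reassigns frequencies to
# the wrong letters, which inflates the chi-squared score.
from collections import Counter

# Approximate relative frequencies of letters in English prose
ENGLISH_FREQ = {
    'e': .127, 't': .091, 'a': .082, 'o': .075, 'i': .070, 'n': .067,
    's': .063, 'h': .061, 'r': .060, 'd': .043, 'l': .040, 'c': .028,
    'u': .028, 'm': .024, 'w': .024, 'f': .022, 'g': .020, 'y': .020,
    'p': .019, 'b': .015, 'v': .010, 'k': .008, 'j': .002, 'x': .002,
    'q': .001, 'z': .001,
}

def chi_squared(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return sum((counts[ch] - total * p) ** 2 / (total * p)
               for ch, p in ENGLISH_FREQ.items())

plain = "natural philosophy and political economy share a common vocabulary"
atbash = "mzgfizo ksrolhlksb zmw klorgrxzo vxlmlnb hsziv z xlnnlm elxzyfozib"

# Plain English scores low; the same sentence under an Atbash substitution
# scores far higher, flagging it for closer inspection
print(chi_squared(plain), chi_squared(atbash))
```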