Training Named Entity Recognition for Author Names

For my first foray into network analysis, I’m using a test case of one file: volume 1 (1879) of the Journal of the American Chemical Society.  For those interested, I took the volume from HathiTrust (my collection is available at https://babel.hathitrust.org/shcgi/mb?a=listis;c=1649210391) and ran a quick-and-dirty OCR pass through Adobe.  The file’s not great in terms of accuracy, but it should be good enough for test purposes.

Purpose:  I’d like to recognize author names and then analyze who makes up the network of authors and whom they refer to.  I don’t mean formal citations; I mean whether they mention other scientists within the author network when they discuss their experiments in the Journal.

Test: So, I ran the Stanford NER on this sample volume. After many hours of hammering away, I found that:

  1. It actually does reasonably well with names; in some ways I was surprised by how good the results were.
  2. It also thinks that some substances are people (e.g. it leaves Glucose alone but tags Dextro-Glucose as a person).
  3. English names do fairly well, but names from other languages fare less well (particularly French and German in this corpus).
  4. It also picks up names that are not authors (e.g. Peligot tube), which is good, but also bad if I’m trying to separate author names and references from equipment.
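
One possible stopgap before any retraining is to post-filter the tagger's output. The sketch below is only an illustration of that idea, not something I have run on the volume. It assumes the NER output has already been converted into a list of (token, label) pairs, and the two word lists are hypothetical seeds based on the errors noted above. Eponymous equipment like the Peligot tube is set aside rather than thrown away, since those mentions may still be interesting.

```python
import re

# Hypothetical seed lists based on the error types noted above; extend by hand.
SUBSTANCE_WORDS = {"glucose"}
EQUIPMENT_NOUNS = {"tube", "flask", "burner", "apparatus"}

def person_spans(tagged):
    """Group consecutive PERSON tokens into (name, next_token) pairs."""
    spans, current = [], []
    # A sentinel non-PERSON token flushes a span that ends the input.
    for token, label in list(tagged) + [("", "O")]:
        if label == "PERSON":
            current.append(token)
        elif current:
            spans.append((" ".join(current), token))
            current = []
    return spans

def sort_person_tags(tagged):
    """Split PERSON spans into likely people, eponyms, and substances."""
    result = {"people": [], "eponyms": [], "substances": []}
    for name, next_token in person_spans(tagged):
        parts = re.split(r"[\s-]+", name.lower())
        if any(part in SUBSTANCE_WORDS for part in parts):
            result["substances"].append(name)
        elif next_token.lower() in EQUIPMENT_NOUNS:
            result["eponyms"].append(name)
        else:
            result["people"].append(name)
    return result

# Made-up sentence; "J. Smith" is a placeholder, not a real author.
tagged = [
    ("Dextro-Glucose", "PERSON"), ("was", "O"), ("heated", "O"),
    ("in", "O"), ("a", "O"), ("Peligot", "PERSON"), ("tube", "O"),
    ("by", "O"), ("J.", "PERSON"), ("Smith", "PERSON"), (".", "O"),
]
result = sort_person_tags(tagged)
print(result)
```

Hand-built lists like these will not help with the French and German names; that part probably still needs training data.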

Conclusion: These results are not really surprising, but I need to think about the next step for training the NER to do what I need.  If anyone has done some work on training, particularly for author recognition, any advice would be appreciated.

Why a blog on a graduate student project?

In discussing this project with one of my professors, I thought I would provide a brief reflection about why it is important to keep a record of a project like this in a blog format, and, more importantly, why I think it is important for future graduate students, historians and scholars to do the same.

Personally, I think that keeping a record of ongoing projects like this one is important for three reasons:

  1. It is a way to further “scholarly communication” (what my whole topic is about) in some important new ways.
  2. It creates a record of the project for future scholars to follow.
  3. It allows others to use my work for future scholarship.

“Scholarly communication” at its core is about finding ways to share research with colleagues (and other audiences who may be interested in my work).  Doing this kind of work publicly allows me to

  • find others who are working on similar projects, or even those who might be able to help with technical issues as I play with files.
  • show others who may be doing similar projects what does and does not work (and hopefully help them avoid my mistakes).
  • build on what I did to refine my ideas and further the debate about my findings.

Electronic methods of dissemination help to perform scholarship in some unique ways, and though there may still be a role for formal published and peer-reviewed articles and books (more on that maybe in a future post), I think there is also value in informal communication in venues like this.

Others may have different opinions, but for all of the reasons above, and maybe more reasons as I do more work on the topic, I think it is important for me to share my nascent ideas on the history of scholarly communication in an informal channel like this one.

First Project: Journal of the American Chemical Society

For my first foray into studying the history of scholarly communication, I’d like to study the history of a particular journal.  Since there has already been some work done on this, I’m starting with the Journal of the American Chemical Society.  Fortunately, the complete run of the journal is available at HathiTrust, and I’ve created a collection of all the issues from 1876 (the journal’s founding) until 1922 (the last year before copyright restrictions apply).

I would like to

  1. Do a network analysis of the authors in the Journal and see who is writing for it and what relationships exist.
  2. Do some topic modelling to see what these authors are talking about and what, if any, relationship there is between these topics and the network of authors.
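
To make the first goal concrete, here is a rough sketch of how the author network might be assembled once the text is available. It assumes each article has already been reduced to a (writer, text) pair and that a list of author surnames exists; the function name, the sample names, and the sample sentences are all made up for illustration. An edge from A to B means that an article by A mentions B somewhere in its text.

```python
import re
from collections import Counter

def mention_edges(articles, authors):
    """Count (writer, mentioned_author) pairs across articles.

    articles: list of (writer_surname, article_text) pairs
    authors:  list of surnames to look for in each text
    """
    # Word-boundary patterns so "Smith" does not match "Smithson".
    patterns = {a: re.compile(r"\b" + re.escape(a) + r"\b") for a in authors}
    edges = Counter()
    for writer, text in articles:
        for name, pattern in patterns.items():
            # Skip self-mentions; count each article at most once per name.
            if name != writer and pattern.search(text):
                edges[(writer, name)] += 1
    return edges

# Placeholder data, not drawn from the Journal.
authors = ["Smith", "Jones", "Brown"]
articles = [
    ("Smith", "Following Jones, I heated the mixture."),
    ("Jones", "Smith and Brown obtained similar results."),
    ("Brown", "No other workers are discussed here."),
]
edges = mention_edges(articles, authors)

# How often each author is mentioned by someone else.
mentioned = Counter()
for (writer, name), count in edges.items():
    mentioned[name] += count

print(dict(edges))
print(dict(mentioned))
```

Surname matching is crude (two authors can share a surname, and OCR errors will hide some mentions), so the counts would be a starting point for checking by hand rather than a finished network.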

Issues to solve:

  1. How to get text out of HathiTrust
  2. Once I get text, how to deal with the Proceedings of the American Chemical Society

On issue 1:

I’ve written to HathiTrust and am trying to get rsync set up.  If anyone has done this before and could help, that would be much appreciated.

On issue 2:

The journal started in 1876 as the Proceedings and became the Journal after about a year.  The Proceedings continued to be published, however.  How do these two differ?  How should I analyze them?

Studying the History of Scholarly Communication

Scholarly communication is becoming increasingly important in modern universities as the larger higher education system (including faculty members, students, publishers, funders, readers, government, and many other stakeholders) grapples with how to disseminate and evaluate scholarship both in print and online.

This page, run by Shawn Martin, IDEASc Fellow at Indiana University (more on me and this site’s mission in the About section), is dedicated to gathering resources and serving as a central source for materials on the history of scholarly communication.

If anyone out there in internet land is interested in this topic, please follow this page and feel free to get in touch with me.  I’d be interested in hearing your ideas!