American Journal of Science

(Second edition of the first volume of the journal, available from Carnegie Mellon’s digital collection)

Before today’s professional scholarly journal system, there was only one major journal for American science: the American Journal of Science, which still exists and now focuses on geology. In the nineteenth century, however, the journal covered every scientific topic. The table of contents for the issues of the first volume (pictured) includes:

  1. Mineralogy, Geology & Topography
  2. Botany
  3. Zoology
  4. Fossil Zoology
  5. Mathematics
  6. Miscellaneous
  7. Physics, Mechanics, & Chemistry
  8. Fine Arts
  9. Useful Arts
  10. Agriculture & Economics
  11. Intelligence

Each article is roughly two to three pages, and each issue contains an “Intelligence” section, which seems to be general news. This section continues into the twentieth century, by which time the journal was more focused on geology, yet it still reports important findings in physics, chemistry, and other scientific areas.

The journal was founded by Benjamin Silliman and later edited by his son. There is a good overview of the foundation of the journal, and of course multiple references to it, but so far I have not been able to find any articles that take a computational approach to analyzing its contents. In particular, I think it would be a great candidate for the topic modelling and query sampling techniques I have used earlier. It may even be a good candidate for a network analysis, something I haven’t done much of in the past (I intended to do so for the Journal of the American Chemical Society), since it would contain a large number of scientists in the United States and could show the network as it was beginning to split into different disciplines. Fortunately, there are also over 100 years of textual data available for this journal in the public domain, making it a potentially very rich source. I am going to see whether some initial tests produce interesting results, and I’m looking forward to seeing whether this journal helps explain the professionalization of science and the origins of the scholarly communication system in even more interesting ways than the Journal of the American Chemical Society has so far.

Training Named Entity Recognition for Author Names

For my first foray into network analysis, I’m using a test case of one file: volume 1 (1879) of the Journal of the American Chemical Society. For those interested, I took the volume from HathiTrust (my collection is available at https://babel.hathitrust.org/shcgi/mb?a=listis;c=1649210391) and ran a quick and dirty OCR through Adobe. So the file’s not great in terms of accuracy, but it should be good enough for test purposes.

Purpose: I’d like to recognize author names and then analyze who makes up the network of authors and whom they refer to. I don’t mean formal citations; I mean whether they mention other scientists within the author network when they discuss their experiments in the Journal.
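To make the goal concrete, here is a minimal sketch of how such mentions could be counted once author names are known. Everything in it is an assumption for illustration: the function name `mention_edges`, the (author, text) input format, and the sample surnames are hypothetical and are not taken from the Journal.

```python
import re
from collections import Counter

def mention_edges(articles, known_surnames):
    """Count (author, mentioned_surname) pairs across articles.

    articles: iterable of (author_surname, body_text) tuples (hypothetical format).
    known_surnames: set of surnames that belong to the author network.
    """
    edges = Counter()
    for author, text in articles:
        # Keep only runs of letters so punctuation does not block a match.
        tokens = set(re.findall(r"[A-Za-z]+", text))
        for surname in known_surnames:
            # Skip self-mentions; count each mentioned author once per article.
            if surname != author and surname in tokens:
                edges[(author, surname)] += 1
    return edges

# Hypothetical sample data, not real Journal content.
sample = [
    ("Smith", "Following the method of Jones, the precipitate was washed twice."),
    ("Jones", "Smith and Brown have reported similar results."),
]
edges = mention_edges(sample, {"Smith", "Jones", "Brown"})
```

The resulting counter can be read as a weighted, directed edge list (who mentions whom, and in how many articles). Exact surname matching is a deliberate simplification; with noisy OCR, some fuzzy matching would probably be needed.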

Test: So, I ran the Stanford NER on this sample volume. After many hours of hammering away, I found that:

  1. It actually does reasonably well with names; in some ways I was surprised at how good the results were
  2. It also thinks that some substances are people (e.g. Glucose is not tagged as a person, but Dextro-Glucose is)
  3. English names do fairly well, but names from other languages fare less well (particularly French and German in this corpus)
  4. It also picks up on names that are not authors (e.g. Peligot tube), which is good, but also bad if I’m trying to separate author names and references from equipment names
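As a stopgap before any retraining, findings 2 and 4 suggest a simple post-filter on the tagger’s output. The sketch below is only a rough heuristic under stated assumptions: the equipment word list, the chemical suffixes, and the (entity, next word) input format are all hypothetical choices of mine, not part of Stanford NER, and the suffix rule would wrongly drop real surnames with the same endings.

```python
import re

# Hypothetical word lists; they would need tuning against the actual corpus.
EQUIPMENT_NOUNS = {"tube", "flask", "burner", "condenser", "apparatus"}
CHEMICAL_SUFFIXES = ("ose", "ide", "ate")

def filter_person_candidates(candidates):
    """Drop likely false positives from PERSON-tagged entities.

    candidates: list of (entity_text, next_word) pairs, where next_word is
    the token that follows the entity in the text (hypothetical format).
    """
    kept = []
    for entity, next_word in candidates:
        # "Peligot tube" names a piece of equipment, not a cited person.
        if next_word.lower() in EQUIPMENT_NOUNS:
            continue
        # Check the last part of hyphenated or multi-word entities,
        # so "Dextro-Glucose" is judged on "glucose".
        last_part = re.split(r"[\s-]+", entity.strip())[-1].lower()
        if last_part.endswith(CHEMICAL_SUFFIXES):
            continue
        kept.append(entity)
    return kept

tagged = [("Peligot", "tube"), ("Dextro-Glucose", "was"), ("Peligot", "found")]
people = filter_person_candidates(tagged)
```

A filter like this does nothing for finding 3 (French and German names), which is where additional training data would more likely help.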

Conclusion: These results are not really surprising, but I need to think about the next step for training the NER to do what I need.  If anyone has done some work on training, particularly for author recognition, any advice would be appreciated.