For my first foray into network analysis, I’m using a test case of one file: volume 1 (1879) of the Journal of the American Chemical Society. For those interested, I took the volume from HathiTrust (my collection is available at https://babel.hathitrust.org/shcgi/mb?a=listis;c=1649210391) and ran a quick-and-dirty OCR through Adobe. So the file’s not great in terms of accuracy, but it should be good enough for test purposes.
Purpose: I’d like to be able to recognize author names and then do some analysis to work out who is in the network of authors and whom they refer to. I don’t mean formal citations; I mean whether they mention other scientists within the author network when they discuss their experiments in the Journal.
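To make that goal concrete, here is a minimal sketch of the kind of mention network I have in mind, assuming I already had each article's author and text, plus a list of author surnames. The function name, the data layout, and the sample articles are all made up for illustration; none of this comes from the actual volume.

```python
import re
from collections import Counter

def mention_edges(articles, known_authors):
    """Count (author -> mentioned author) pairs across articles.

    articles: iterable of (author_surname, text) pairs (hypothetical layout).
    known_authors: set of surnames that make up the author network.
    """
    edges = Counter()
    for author, text in articles:
        # Crude tokenisation: runs of letters (Unicode-aware), so that
        # "Smith's" yields "Smith" and accented surnames stay whole.
        words = set(re.findall(r"[^\W\d_]+", text))
        for other in known_authors:
            if other != author and other in words:
                edges[(author, other)] += 1
    return edges

# Made-up articles, purely for illustration.
articles = [
    ("Smith", "Following the method of Jones, the residue was weighed."),
    ("Jones", "Smith's apparatus was used; Brown obtained similar results."),
]
edges = mention_edges(articles, {"Smith", "Jones", "Brown"})
for (src, dst), n in sorted(edges.items()):
    print(src, "->", dst, n)
```

Each pair counts at most once per article, so the weights are "number of articles in which A mentions B". Exact surname matching like this will miss OCR-mangled names, which is part of why the name recognition step matters.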
Test: So, I ran the Stanford NER on this sample volume. After many hours of hammering away, I found that:
- It actually does reasonably well with names; I was surprised in some ways by how good the results were
- It also thinks that some substances are people (e.g. it does not tag Glucose as a person, but it does tag Dextro-Glucose)
- English names do fairly well, but names from other languages fare less well (particularly French and German in this corpus)
- It also picks up on names that are not authors (e.g. the Peligot in “Peligot tube”), which is good in that it finds the name, but bad if I’m trying to separate author names and references from equipment
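Short of retraining, some of these false positives could be filtered after tagging. The sketch below assumes the tagger's output is in the inline word/TAG form (other output formats would need a different parser). The stop-lists and function names are hypothetical placeholders that would have to be built from the corpus, and the "J. Smith" in the sample sentence is invented.

```python
# Hypothetical stop-lists; real ones would be compiled from the corpus.
EQUIPMENT_WORDS = {"tube", "flask", "burner", "apparatus", "condenser"}
SUBSTANCES = {"glucose", "fructose", "sucrose"}

def split_token(tok):
    """Split 'word/TAG' on the last slash; untagged tokens get tag 'O'."""
    word, sep, tag = tok.rpartition("/")
    return (word, tag) if sep else (tok, "O")

def person_spans(tagged_text):
    """Group consecutive PERSON tokens; return (name, following_word) pairs."""
    spans, current = [], []
    tokens = [split_token(t) for t in tagged_text.split()]
    for word, tag in tokens + [("", "O")]:  # sentinel flushes a final span
        if tag == "PERSON":
            current.append(word)
        elif current:
            spans.append((" ".join(current), word))
            current = []
    return spans

def likely_people(tagged_text):
    """Drop PERSON spans that look like equipment or substances."""
    keep = []
    for name, next_word in person_spans(tagged_text):
        if next_word.lower() in EQUIPMENT_WORDS:
            continue  # eponymous equipment, e.g. "Peligot tube"
        if name.lower().split("-")[-1] in SUBSTANCES:
            continue  # compound substance, e.g. "Dextro-Glucose"
        keep.append(name)
    return keep

sample = ("The/O Peligot/PERSON tube/O was/O filled/O with/O "
          "Dextro-Glucose/PERSON ./O J./PERSON Smith/PERSON reports/O")
people = likely_people(sample)
print(people)
```

This is rule-based clean-up, not training, so it only catches the patterns I already know about. It also throws away the Peligot mention entirely, which may not be what I want if eponymous equipment turns out to be interesting in its own right.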
Conclusion: These results are not really surprising, but I need to think about the next step for training the NER to do what I need. If anyone has done some work on training, particularly for author recognition, any advice would be appreciated.