Creating a Corpus of Articles

I’m now thinking about technical ways to construct my corpus of the Journal of the American Chemical Society.  Since Hathi-Trust (where I am getting the full text) pulls down texts page by page, I estimate that I will have about 60,000 individual texts.  Some of those pages will be blank, some will have multiple articles, and some will have tables of contents, charts, and other ephemera.

I have decided that separating the texts into individual articles, rather than keeping them as entire (year-long) volumes, will be the most useful for my future work.  I think that if I want to try to determine a network and see the influence that individual scholars have on the corpus as a whole, it will be a great deal easier with a list of files that are named with authors and perhaps a few words of the title and a date (e.g. Smith_Cool-Chemistry_1885).  So, on my test files, I have been using the tables of contents from the pdfs, and then using the command line to merge files.  For instance, if the table of contents says that Smith’s article covers pages 1-20, I simply take files 1-20 and merge them.  This gets problematic in several ways.

  1. Image numbers do not match up to page numbers (each file is basically the OCR of one page image).
  2. Articles often overlap.  One article may end on page 10 and another start on the same page.  This means I have to go back and copy and paste beginnings of articles into a new file.
  3. There are lots of blank pages and other ephemera (e.g. charts that don’t OCR) which ideally I should discard, but it is hard to tell where those things appear just from tables of contents.
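The merging step described above could be scripted along these lines.  This is only a rough sketch in Python, not a tested pipeline: the file naming scheme (`00000013.txt` and so on), the idea of a constant offset between image number and printed page number, and the `min_chars` blank-page threshold are all assumptions I made up for illustration.

```python
from pathlib import Path


def merge_article(page_dir, first_page, last_page, offset, out_path, min_chars=40):
    """Merge the OCR page files for printed pages first_page..last_page.

    offset is the (assumed constant) difference between image number and
    printed page number, e.g. offset=12 if printed page 1 is image 13.
    Pages with fewer than min_chars non-space characters are treated as
    blank and skipped. Returns the number of pages actually merged.
    """
    kept = []
    for page in range(first_page, last_page + 1):
        # Hypothetical naming scheme: zero-padded image number plus ".txt".
        path = Path(page_dir) / f"{page + offset:08d}.txt"
        if not path.exists():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if len("".join(text.split())) < min_chars:
            continue  # likely a blank page or a chart that did not OCR
        kept.append(text.strip())
    Path(out_path).write_text("\n\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)
```

A call such as `merge_article("vol01_pages", 1, 20, 12, "Smith_Cool-Chemistry_1885.txt")` would cover the Smith example.  It does nothing clever about overlapping articles: a page shared by two articles simply ends up in both merged files, which would still need trimming by hand.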

In all, this has taken me multiple hours to do, and I have barely scratched the surface of the 60,000+ files I will likely need to do this for.  Does anyone know of any potentially better, perhaps automated ways that I could use to accomplish this?

Also, since I know any automation will likely bring in some mistakes, are there ways I can try to correct for those, while realizing of course that probably no process is perfect?

Processing (Continued) and Moving Toward Topic Modelling

I’m still working on the issues of processing the corpus of text that I have, and will hopefully be able to finish that sometime next week, which moves me on to what will be my next step:  topic modelling the corpus of the Journal of the American Chemical Society between 1879 and 1922.

Based on some sample files, there is some good news.  The topics I seem to be getting mention acids, bases, chemical compounds, and the kinds of things I would expect to see in a topic model of a chemistry journal, and there are no extremely strange topics that I would not expect to see.  That, I think, tells me that the text will be good enough to move forward and do some good mining.

On a side note, as part of my processing I have also been extracting all of the tables of contents from the journal.  Ideally this should be done automatically, but I’ve been doing it manually so that I can put some editorial notes in various parts of my spreadsheet (which I will share when I’ve finished).  For now, the spreadsheet contains a list of all of the officers of the American Chemical Society separated out by year.  Surprising (at least to me) is the fact that there is not as much overlap as I would have expected.  Some officers do continue to serve year after year, but there is actually a fairly high turnover.  New officers seem to come in every year.  The spreadsheet also contains every author in the journal between those years, what articles they’ve published, whether I consider them “prolific” (i.e. published many articles), and whether there is any information about them in Wikipedia.  If someone knows of a more comprehensive database, specifically for chemists, let me know; so far, I’m not seeing many of the early authors/officers listed in Wikipedia.  This spreadsheet, I hope, can serve as a guide while I’m processing and can tell me if I make any significant errors when I start dividing up articles and years in the larger corpus.
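One way to put a number on the officer turnover would be to compare each year's officer list with the next year's.  The sketch below uses Jaccard overlap (shared officers divided by all officers across the two years); that measure is my own choice for illustration, and the names are placeholders, not real officers of the society.

```python
def yearly_overlap(officers_by_year):
    """Return {(year, next_year): Jaccard overlap of the two officer sets}."""
    years = sorted(officers_by_year)
    overlap = {}
    for prev, curr in zip(years, years[1:]):
        a, b = set(officers_by_year[prev]), set(officers_by_year[curr])
        union = a | b
        overlap[(prev, curr)] = len(a & b) / len(union) if union else 0.0
    return overlap


# Placeholder names only; the real lists would come from the spreadsheet.
officers = {
    1879: ["Officer A", "Officer B", "Officer C"],
    1880: ["Officer A", "Officer D", "Officer E"],
    1881: ["Officer D", "Officer E", "Officer F"],
}
for pair, score in yearly_overlap(officers).items():
    print(pair, round(score, 2))
```

A score near 1 would mean almost the same slate of officers two years running, and a score near 0 would mean nearly complete turnover.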

All of this is a preface to try and get to the question I’m asking.  What is the network of scientists involved in the journal, and are the officers/editorial board influencing the content in any measurable way?  Originally, I had thought that a spreadsheet like the one I’m creating would help to answer this question.  I had thought that editors of the journal would be some of the most prolific authors, and I thought that there would be a significant continuity of officers over this time period.  I had not anticipated so many unique authors contributing to the journal, nor had I thought that the officers of the association would turn over as frequently.

There may still be a way to get at the question I’m asking, though.  I think that by topic modelling the corpus and seeing if particular authors are tied to particular topics, that may at least help to answer whether specific people have more influence over the journal’s content than others.  Also, I’m sure others have tried to tie Wikipedia information to networks like this.  Like I said, so far I’m not finding many scientists who have Wikipedia entries, though that may change as I move further into the twentieth century.  Perhaps even if I can find authors who have high influence over the corpus and a Wikipedia entry that may tell me something.

In any case, that’s where I am at the moment, and if there are thoughts about what might be useful to do (before I move into heavy duty processing of lots of files), let me know.


It is interesting how much of the research process (and figuring out the question I want to ask) comes from just structuring the data.  Over the past few days, I’ve been working on a test sample of the first few years of the Journal of the American Chemical Society, for now just the first five years.  I’m still drawing from the same collection (;c=1649210391) and running some quick and dirty OCR myself.  What I found over these few days is, first, that the questions I originally intended to ask may need to be different, and second, that what those questions are will have a direct impact on how I structure my text files.

Originally, I had hoped to do a network analysis of names within the corpus.  I had two ways I thought about doing this.  First, I thought about doing Named Entity Recognition looking for author names to see who they were talking about and see if there might be some network to that.  I did a brief blog post about that earlier.  Subject to all of those problems I mentioned in that post, I quickly realized that in order to do that, I need to restructure my data more effectively.  Currently, the Hathi-Trust corpus is set up by year, meaning that I have every volume as a single file.  If I really want to be able to do a network analysis, I will need, I think, to separate out all of the individual articles, so that I can associate Author A with Names A, B, and C.  Doing the analysis the way I did, in big year by year sections, I seem to end up with a whole list of names that it is hard to make sense of (it might be worth doing a topic model on those names, but more on that below).
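To make the "Author A with Names A, B, and C" idea concrete, here is a minimal Python sketch that assumes the articles have already been separated and that I have a list of surnames to look for (for instance from the tables of contents).  It uses plain whole-word matching rather than NER output, which is a simplification, and all of the names are placeholders.

```python
import re
from collections import Counter


def mention_edges(articles, known_names):
    """Count (article author, mentioned name) pairs.

    articles: iterable of (author_surname, text) tuples, one per article.
    known_names: surnames to look for. Self-mentions are ignored, and each
    article contributes at most one count per (author, name) pair.
    """
    patterns = {n: re.compile(rf"\b{re.escape(n)}\b") for n in known_names}
    edges = Counter()
    for author, text in articles:
        for name, pattern in patterns.items():
            if name != author and pattern.search(text):
                edges[(author, name)] += 1
    return edges


# Placeholder articles, not text from the journal.
sample = [
    ("Smith", "As Jones has shown, the acid reacts slowly."),
    ("Jones", "The apparatus was heated for two hours."),
    ("Smith", "Jones again reports Brownian motion."),
]
print(mention_edges(sample, ["Smith", "Jones", "Brown"]))
```

The resulting pairs are directed edges (who mentions whom) with a weight, which is the shape most network tools expect.  OCR errors in names would of course break exact matching like this.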

After thinking through whether it was worth separating out the articles, I tried another hypothesis that I based on some of the reading I’d done.  I thought that members of the editorial board would be some of the most prolific authors in the early years of the journal.  In other words, editors wrote many of the articles for the journal themselves.  This does not seem to be the case in the Journal of the American Chemical Society (at least for the first five years).  The editors are not writing that many articles compared to the whole.  I then thought that by extending “editor” to mean one of the officers of the association (like president, secretary, etc.) I might solve my problem.  No luck.  It seems like many of the most prolific authors are also not officers of the association.  So, there doesn’t seem to be much possibility for network analysis.  This might change when I look at the larger corpus, but for now it seems like I need to think about this in some new ways.

After going through all of those steps, my second line of thought was to see if I could measure in some way the number of publications by individual authors.  The immediate problem with this approach is that many of the authors repeat in issue after issue because they are writing about issues in the field or translating articles from foreign journals, or they have “abstracts” which eventually become a completely different journal.  For my purposes, the same “abstractors” write in every issue.  I decided, therefore, just to exclude those authors initially and focus on completely unique journal articles.  Using the first two years (1879-1880) there are 24 issues, some of which are almost entirely abstracts or translations of articles from foreign journals.  Overall, there are 33 authors for 75 unique articles during those two years.  One author, who incidentally was one of the vice-presidents for both of those years, accounts for 15 of those articles.  The next three authors account for five articles each (for two of them) and seven (for the third).  One of those three authors is an officer of the association and the others are not.  Overall that means that for the first two years, 32 of the articles (15 + 5 + 5 + 7) were published by just four authors, two of whom were officers of the association.  The remaining articles are largely written by separate authors, some, but few, of whom are officers of the association.
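The counting described above is easy to script once the tables of contents are in a spreadsheet.  A minimal sketch, assuming each row has an `author` and a `kind` column (both column names and the sample rows are hypothetical):

```python
from collections import Counter


def author_counts(toc_rows, skip_kinds=("abstract", "translation", "proceedings")):
    """Count unique articles per author, ignoring the recurring sections."""
    return Counter(row["author"] for row in toc_rows if row["kind"] not in skip_kinds)


def share_of_top(counts, n):
    """Fraction of all counted articles written by the n most prolific authors."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(n))
    return top / total if total else 0.0


# Placeholder rows; real ones would be read from the spreadsheet.
rows = [
    {"author": "Author A", "kind": "article"},
    {"author": "Author A", "kind": "article"},
    {"author": "Author B", "kind": "article"},
    {"author": "Author C", "kind": "abstract"},
    {"author": "Author B", "kind": "translation"},
]
counts = author_counts(rows)
print(counts.most_common(), share_of_top(counts, 1))
```

For the two years discussed above, 32 of 75 unique articles works out to a share of roughly 0.43.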

Obviously this is quite a small sample from the 40 years I’m interested in, but I think it shows that the situation is somewhat more complex than I had originally thought.  I also need to think more about how these other areas of the journal (translations, abstracts, even the “proceedings” I talked about in earlier posts) should be analyzed.

All of this gets to my current thoughts about how I might prepare the corpus for analysis.

  • If I intend to keep the corpus year by year, then I think it may be necessary to do a kind of topic-model by author name, to see if certain authors seem to influence certain years more than others.
  • Another way to work with author topic models would be to separate out the articles so I could measure whether certain authors talk about certain topics (either they mention certain authors, or they talk about certain chemicals).
  • Related to that problem, however, if I break the corpus out by articles, I need to figure out how I want to deal with the different kinds of articles in the corpus (e.g. proceedings, translations, abstracts, unique articles).
  • A third way to try network analysis of authors would be to pull just the tables of contents for all of these years and do some analysis of authors, titles, and years (independent of the full text of the actual articles).
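For the author and topic options above, whichever topic modelling tool I end up using, the last step would look roughly the same: take the topic weights for each article and average them by author.  The sketch below assumes those per-article weights already exist as plain lists; the article ids, authors, and numbers are invented for illustration.

```python
from collections import defaultdict


def author_topic_profile(doc_topics, doc_authors):
    """Average per-article topic weights into one profile per author.

    doc_topics: {article_id: [weight for topic 0, topic 1, ...]}
    doc_authors: {article_id: author}
    """
    sums = {}
    counts = defaultdict(int)
    for doc_id, weights in doc_topics.items():
        author = doc_authors[doc_id]
        if author not in sums:
            sums[author] = [0.0] * len(weights)
        sums[author] = [s + w for s, w in zip(sums[author], weights)]
        counts[author] += 1
    return {a: [s / counts[a] for s in total] for a, total in sums.items()}


# Invented weights for two topics across three articles.
doc_topics = {"a1": [0.8, 0.2], "a2": [0.4, 0.6], "b1": [0.1, 0.9]}
doc_authors = {"a1": "Smith", "a2": "Smith", "b1": "Jones"}
print(author_topic_profile(doc_topics, doc_authors))
```

An author whose profile is dominated by one or two topics would be a candidate for someone who is "tied to" those topics.  This only works on the corpus split by article, which is part of why the structuring decision matters.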

Each of these ways has implications for how I do the initial processing of my corpus.  Do I separate out the articles?  Do I keep the text together as years?  Do I need to do both?  Should I separate out tables of contents and analyze those separately?  What can I accomplish in the course of a semester?

It may be possible over the course of the entire project to do everything I mention, but I would like to show some proofs of concept at least by the end of the semester.  If anyone has experience doing this or thoughts about processing a corpus of text for analysis, thoughts would be appreciated.  In any case, I suspect it is a good thing that I’m refining my question, but how best to strategically divide the corpus was not something I had thought about all that much.  Processing is turning out to be a more complicated step than I had thought.

Training Named Entity Recognition for Author Names

For my first foray into network analysis, I’m using a test case of one file, volume 1 (1879) of the Journal of the American Chemical Society.  For those interested, for the purposes of my tests, I took the journal from Hathi-Trust (my collection is available at;c=1649210391), and I ran a quick and dirty OCR through Adobe.  So, the file’s not great in terms of accuracy, but should be good enough for test purposes.

Purpose:  I’d like to be able to recognize author names and then do some analysis to determine who the network of authors is, and who they may be referring to, not in terms of formal citations, but in terms of whether they mention other scientists within the author network when they discuss their experiments in the Journal.

Test: So, I ran the Stanford NER on this sample volume. After many hours of hammering away, I found that:

  1. It actually does reasonably well with names; I was surprised in some ways by how good the results were
  2. It also thinks that some substances are people (e.g. Glucose is not a person but Dextro-Glucose is)
  3. English names do fairly well, but names from other languages fare less well (particularly French and German in this corpus)
  4. It also picks up on names that are not authors (e.g. Peligot tube) which is good, but also bad if I’m trying to separate author names and references from equipment
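Short of retraining, some of the problems in this list could be caught by filtering the NER output afterwards.  The Python sketch below is a rule-based post-filter, not a way of training the Stanford NER: it assumes I already have the token positions that were tagged PERSON, and both the equipment word list and the chemical-suffix rule are crude guesses of mine.

```python
EQUIPMENT_WORDS = {"tube", "flask", "burner", "condenser", "apparatus"}  # illustrative list
SUBSTANCE_HINTS = ("ose", "ine", "ate", "ide")  # crude chemical-suffix heuristic


def classify_person_hits(tokens, person_spans):
    """Sort NER PERSON spans into likely people, eponyms, and substances.

    tokens: list of word tokens for one passage.
    person_spans: list of (start, end) token indices tagged PERSON
    (end exclusive), however they were obtained.
    """
    result = {"person": [], "eponym": [], "substance": []}
    for start, end in person_spans:
        name = " ".join(tokens[start:end])
        following = tokens[end].lower() if end < len(tokens) else ""
        if following in EQUIPMENT_WORDS:
            result["eponym"].append(name)  # e.g. "Peligot" in "Peligot tube"
        elif name.lower().endswith(SUBSTANCE_HINTS):
            result["substance"].append(name)  # e.g. "Dextro-Glucose"
        else:
            result["person"].append(name)
    return result


tokens = "The Peligot tube was used by Smith with Dextro-Glucose".split()
print(classify_person_hits(tokens, [(1, 2), (6, 7), (8, 9)]))
```

The suffix rule will certainly misfire on real surnames that happen to end in "-ine" or "-ose", so anything it flags would need a manual look.  Keeping the eponyms in their own bucket rather than discarding them seems useful, since a name attached to a piece of equipment is still a kind of reference.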

Conclusion: These results are not really surprising, but I need to think about the next step for training the NER to do what I need.  If anyone has done some work on training, particularly for author recognition, any advice would be appreciated.