Creating a Corpus of Articles

I’m now thinking about technical ways to construct my corpus of the Journal of the American Chemical Society.  Since HathiTrust (where I am getting the full text) delivers texts page by page, I estimate that I will have about 60,000 individual page files.  Some of those pages will be blank, some will contain parts of multiple articles, and some will have tables of contents, charts, and other ephemera.

I have decided that separating the texts into individual articles, rather than keeping them as entire (year-long) volumes, will be the most useful for my future work.  If I want to map a network and see the influence that individual scholars have on the corpus as a whole, it will be a great deal easier with files that are named with the author, a few words of the title, and a date (e.g. Smith_Cool-Chemistry_1885).  So, on my test files, I have been using the tables of contents from the PDFs and then merging files from the command line.  For instance, if the table of contents says that Smith’s article covers pages 1-20, I simply take files 1-20 and merge them.  This gets problematic in several ways.

  1. Image numbers (the files are essentially OCR of page images) do not match up to printed page numbers.
  2. Articles often overlap.  One article may end on page 10 and another start on the same page.  This means I have to go back and copy and paste beginnings of articles into a new file.
  3. There are lots of blank pages and other ephemera (e.g. charts that don’t OCR) which ideally I should discard, but it is hard to tell where those things appear just from the tables of contents.
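
One way the merging step might be automated is sketched below in Python.  This is only a minimal sketch under several assumptions of mine: the page texts are already loaded into a dictionary keyed by image number, the gap between printed page numbers and image numbers is a constant `offset` within a volume (it may well not be), and a page with fewer than `min_chars` letters or digits counts as blank.  The function names, the table-of-contents tuple layout, and the thresholds are all hypothetical.  A page shared by two articles is simply included in both, so the overlap still has to be trimmed afterwards.

```python
def is_probably_blank(text, min_chars=40):
    """Treat a page as blank if it has fewer than min_chars letters or digits.

    The threshold is a guess; pages with charts that fail OCR often
    contain only a few stray characters.
    """
    return sum(ch.isalnum() for ch in text) < min_chars


def merge_article(pages, first_page, last_page, offset=0, min_chars=40):
    """Join the OCR text for printed pages first_page..last_page.

    pages:  dict mapping image number -> OCR text (assumed layout).
    offset: assumed constant difference, image number = printed page + offset.
    """
    parts = []
    for printed in range(first_page, last_page + 1):
        text = pages.get(printed + offset, "")
        if not is_probably_blank(text, min_chars):
            parts.append(text.strip())
    return "\n\n".join(parts)


def build_corpus(pages, toc, offset=0):
    """Build {file name: article text} from table-of-contents entries.

    toc: list of (author, short_title, year, first_page, last_page) tuples,
    a hypothetical layout typed in by hand from the table of contents.
    """
    corpus = {}
    for author, title, year, first, last in toc:
        name = f"{author}_{title}_{year}"
        corpus[name] = merge_article(pages, first, last, offset)
    return corpus
```

With a table-of-contents entry such as `("Smith", "Cool-Chemistry", 1885, 1, 20)` and an offset of 10, this would merge images 11 through 30 under the name Smith_Cool-Chemistry_1885, skipping any page that looks blank.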

In all, this has taken me multiple hours to do, and I have barely scratched the surface of the 60,000+ files I will likely need to do this for.  Does anyone know of better, perhaps automated, ways to accomplish this?

Also, since I know any automation will likely introduce some mistakes, are there ways I can correct for those, while recognizing of course that no process is perfect?
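
One possible approach to catching such mistakes, offered only as a sketch, is to flag suspicious output files for manual review rather than trying to fix them automatically.  The Python below assumes the merged articles sit in a dictionary keyed by names of the form Author_Title_Year, and it applies two rough checks that I have made up for illustration: the article is very short, or the author's surname does not appear near the start of the text.  The function name and both thresholds are hypothetical, and OCR misspellings of names will cause false alarms.

```python
def flag_for_review(corpus, min_words=200, head_chars=500):
    """Return {file name: [reasons]} for articles that look wrong.

    corpus: dict mapping names like "Smith_Cool-Chemistry_1885" to text.
    min_words and head_chars are guessed thresholds to tune on test files.
    """
    flagged = {}
    for name, text in corpus.items():
        author = name.split("_")[0]
        reasons = []
        # A very short article may mean a wrong page range or blank pages.
        if len(text.split()) < min_words:
            reasons.append("too short")
        # The byline usually appears on the first page of an article.
        if author.lower() not in text[:head_chars].lower():
            reasons.append("author not found near start")
        if reasons:
            flagged[name] = reasons
    return flagged
```

Checking only the flagged files by hand, plus a small random sample of the unflagged ones, would give a rough sense of the error rate without reviewing every file.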


