It is interesting how much of the research process (and figuring out the question I want to ask) comes from just structuring the data. Over the past few days, I’ve been working on a test sample from the Journal of the American Chemical Society, for now just the first five years. I’m still drawing from the same collection (https://babel.hathitrust.org/shcgi/mb?a=listis;c=1649210391) and doing some quick and dirty OCR myself so that I have text to work with. What I found over these few days is, first, that the questions I originally intended to ask may need to change, and second, that what those questions are will have a direct impact on how I structure my text files.
Originally, I had hoped to do a network analysis of names within the corpus. I had two ways I thought about doing this. First, I thought about doing Named Entity Recognition, looking for author names to see who the authors were talking about and whether there might be some network to that. I did a brief blog post about that earlier. Setting aside all of the problems I mentioned in that post, I quickly realized that in order to do that, I need to structure my data more effectively. Currently, the HathiTrust corpus is set up by year, meaning that I have every volume as a single file. If I really want to be able to do a network analysis, I will need, I think, to separate out all of the individual articles, so that I can associate Author A with Names A, B, and C. Doing the analysis the way I did, in big year-by-year sections, I end up with a long list of names that is hard to make sense of (it might be worth doing a topic model on those names, but more on that below).
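As a rough sketch of what that restructuring would buy, here is one way per-article records could be turned into weighted author-to-name edges. The record layout, the `mention_edges` function, and all of the names are placeholders for illustration, not data from the journal, and the mentions are assumed to come from an NER pass that has already been run.

```python
from collections import Counter
from itertools import product

# Hypothetical per-article records. "authors" would come from the article
# byline and "mentions" from NER output; every value here is a placeholder.
articles = [
    {"authors": ["Author A"], "mentions": ["Name B", "Name C"]},
    {"authors": ["Author A"], "mentions": ["Name B"]},
    {"authors": ["Author D"], "mentions": ["Name C"]},
]

def mention_edges(articles):
    """Count, for each (author, mentioned name) pair, the number of articles
    in which that author mentions that name."""
    edges = Counter()
    for article in articles:
        # set() so that a name repeated within one article counts once
        for author, name in product(article["authors"], set(article["mentions"])):
            edges[(author, name)] += 1
    return edges

edges = mention_edges(articles)
for (author, name), weight in sorted(edges.items()):
    print(author, "->", name, weight)
```

With whole volumes as single files, every name in a year would attach to every author in that year, which is why the article-level split matters for this kind of edge list.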
After thinking through whether it was worth separating out the articles, I tried another hypothesis that I based on some of the reading I’d done. I thought that members of the editorial board would be some of the most prolific authors in the early years of the journal. In other words, editors wrote many of the articles for the journal themselves. This does not seem to be the case in the Journal of the American Chemical Society (at least for the first five years). The editors are not writing that many articles compared to the whole. I then thought that by extending “editor” to mean one of the officers of the association (like president, secretary, etc.) I might solve my problem. No luck. It seems like many of the most prolific authors are also not officers of the association. So, there doesn’t seem to be much possibility for network analysis. This might change when I look at the larger corpus, but for now it seems like I need to think about this in some new ways.
After going through all of those steps, my second line of thought was to see if I could measure in some way the number of publications by individual authors. The immediate problem with this approach is that many of the authors repeat in issue after issue because they are writing about issues in the field, translating articles from foreign journals, or contributing “abstracts,” which eventually become a completely different journal. For my purposes, the same “abstractors” write in every issue. I decided, therefore, just to exclude those authors initially and focus on completely unique journal articles. In the first two years (1879-1880) there are 24 issues, some of which are almost entirely abstracts or translations of articles from foreign journals. Overall, there are 33 authors for 75 unique articles during those two years. One author, who incidentally was one of the vice-presidents for both of those years, accounts for 15 of those articles. The next three most prolific authors account for seven, five, and five articles. One of those three is an officer of the association and the others are not. Overall, that means that for the first two years, 32 of the 75 articles were published by just four authors, two of whom were officers of the association. The remaining articles are largely written by separate authors, some, but few, of whom are officers of the association.
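The counting described here can be sketched in a few lines, assuming each item in the journal has already been tagged with an author and a kind. The records, the kind labels, and the function names below are hypothetical placeholders, not entries from the actual issues.

```python
from collections import Counter

# Hypothetical (author, kind) records; placeholders, not real journal data.
records = [
    ("Author A", "article"),
    ("Author A", "article"),
    ("Author B", "abstract"),
    ("Author B", "abstract"),
    ("Author C", "article"),
    ("Author A", "translation"),
]

# Kinds of items left out when counting unique articles (an assumed labeling).
EXCLUDED_KINDS = {"abstract", "translation", "proceedings"}

def count_unique_articles(records):
    """Count articles per author, skipping recurring non-article items."""
    return Counter(author for author, kind in records if kind not in EXCLUDED_KINDS)

def top_share(counts, n):
    """Fraction of all counted articles written by the n most prolific authors."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total if total else 0.0

counts = count_unique_articles(records)
print(counts.most_common())
print(round(top_share(counts, 1), 2))
```

Applied to the figures reported above (15, 7, 5, and 5 articles out of 75), the same calculation gives 32/75, or roughly 43 percent of the unique articles.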
Obviously this is quite a small sample from the 40 years I’m interested in, but I think it shows that the situation is somewhat more complex than I had originally thought. I also need to think more about how these other areas of the journal (translations, abstracts, even the “proceedings” I talked about in earlier posts) should be analyzed.
All of this gets to my current thoughts about how I might prepare the corpus for analysis.
- If I intend to keep the corpus year by year, then I think it may be necessary to do a kind of topic-model by author name, to see if certain authors seem to influence certain years more than others.
- Another way to work with author topic models would be to separate out the articles so I could measure whether certain authors talk about certain topics (either they mention certain authors, or they talk about certain chemicals).
- Related to that, however, if I break the corpus out by article, I will need to figure out how I want to deal with the different kinds of articles in the corpus (e.g., proceedings, translations, abstracts, unique articles).
- A third way to try network analysis of authors would be to pull just the tables of contents for all of these years and do some analysis of authors, titles, and years (independent of the full text of the actual articles).
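The tables-of-contents option is probably the cheapest of these to prototype. Below is a minimal sketch, assuming the contents pages have been transcribed into a CSV with `year`, `author`, `title`, and `kind` columns. That layout, the function names, and every row are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical transcription of tables of contents; all rows are placeholders.
TOC_CSV = """year,author,title,kind
1879,Author A,Placeholder title 1,article
1879,Author B,Placeholder title 2,abstract
1880,Author A,Placeholder title 3,article
1880,Author C,Placeholder title 4,translation
"""

def authors_by_year(csv_text):
    """Map each year to the set of authors listed in that year's contents."""
    by_year = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_year[int(row["year"])].add(row["author"])
    return dict(by_year)

def recurring_authors(by_year):
    """Return authors who appear in more than one year, with those years."""
    years_seen = defaultdict(set)
    for year, authors in by_year.items():
        for author in authors:
            years_seen[author].add(year)
    return {a: sorted(ys) for a, ys in years_seen.items() if len(ys) > 1}

by_year = authors_by_year(TOC_CSV)
print(recurring_authors(by_year))
```

Keeping the `kind` column in the same table means the question of how to treat proceedings, translations, and abstracts can be deferred: the rows can be filtered later without re-processing the full text.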
Each of these ways has implications for how I do the initial processing of my corpus. Do I separate out the articles? Do I keep the text together as years? Do I need to do both? Should I separate out tables of contents and analyze those separately? What can I accomplish in the course of a semester?
It may be possible over the course of the entire project to do everything I mention, but I would like to show some proofs of concept at least by the end of the semester. If anyone has experience doing this, or thoughts about processing a corpus of text for analysis, I would appreciate hearing them. In any case, I suspect it is a good thing that I’m refining my question, but how best to divide the corpus strategically was not something I had thought about all that much. Processing is turning out to be a more complicated step than I had thought.