Professionalization and Combining Methods

In an earlier post I discussed some topic modeling I did on the Journal of the American Chemical Society (JACS).  That research showed that after 1892 (the journal began publishing in 1879), there appeared to be a significant increase in discussion of methodology, society business, and other topics not directly associated with chemistry experiments.  Though I thought this was an interesting finding, I also thought it best not to make too much of the result.

Why should I not treat the results of this topic model as significant?  Topic modeling is, after all, an abstraction of the data.  I had the full text of all material from JACS, and I asked a computer to find which words had a statistically significant probability of appearing near each other.  I then categorized the resulting topics into “unexpected” topics (methods, society business, etc.) and “expected” topics (chemistry experiments of various kinds).  In essence, I was dealing with an abstraction of an abstraction.  Thus, it seemed best not to call this a significant result when it could have been an artifact of my categorization of the topic models.

I am beginning to change my mind about that earlier instinct, however.  Why?  I recently completed some additional statistical tests.  I created a new data set comprising a sample of words from these topic models: 74 words that I thought might best signify discussion of “unexpected”/non-chemistry topics.  I included words such as president, committee, and election, which would likely only show up in discussions of society business.  I also included a few words like method, which admittedly could appear both in chemistry articles and in articles about the methodology of chemistry.  I then created a word frequency list for all of these words and subdivided it into two groups.  One group contained the years from the journal’s beginning in 1879 up to 1892.  The other contained the years from 1892 to 1903.  My hope was to see whether there was any significant difference in these word frequencies right around the year (1892) when my earlier graph showed that “unexpected” topics were increasing.
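The data-preparation step above can be sketched in a few lines of Python.  This is only a minimal illustration, not my actual pipeline: the marker-word list here is just a handful of the 74 real words, and the two-year sample corpus is invented.

```python
from collections import Counter

# A few hypothetical marker words signalling "unexpected"
# (non-chemistry) content -- the real list had 74 such words.
MARKERS = {"president", "committee", "election", "method"}

def marker_counts_by_year(texts_by_year):
    """Count total occurrences of the marker words per year.

    texts_by_year: dict mapping year -> journal text for that year
    (assumed lower-cased and splittable on whitespace).
    """
    counts = {}
    for year, text in texts_by_year.items():
        tokens = Counter(text.split())
        counts[year] = sum(tokens[w] for w in MARKERS)
    return counts

# Invented two-year toy corpus for illustration only.
corpus = {
    1890: "the committee met and the president called an election",
    1895: "a new method for titration the committee approved",
}

counts = marker_counts_by_year(corpus)
pre_1892  = [c for y, c in sorted(counts.items()) if y < 1892]
post_1892 = [c for y, c in sorted(counts.items()) if y >= 1892]
```

The two resulting lists are what then get compared statistically.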

Using SPSS, I compared these two groups with a dependent (paired) t-test.  My critical t value (the threshold that determines whether the test is statistically significant) was 1.6.  My calculated t value (which measures whether the means of the two groups are statistically different from each other) was 7.6, with an effect size (a measure of the magnitude of the difference between the two means) of 0.89.  I can therefore say that there is quite a significant difference between the word frequencies of these two groups: word frequencies for words about society business and methods increase significantly after 1892.
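For those curious what the dependent t-test is doing under the hood, here is a rough sketch in plain Python.  SPSS did the real work; the per-word frequency numbers below are invented for illustration, with each pair representing one marker word’s frequency in the earlier and later periods.

```python
import math
from statistics import mean, stdev

def paired_t_test(group1, group2):
    """Dependent (paired) t-test: returns the t statistic and
    Cohen's d, both computed from the per-pair differences."""
    diffs = [b - a for a, b in zip(group1, group2)]
    d_bar = mean(diffs)
    sd = stdev(diffs)                  # sample std. dev. (n - 1)
    n = len(diffs)
    t = d_bar / (sd / math.sqrt(n))    # calculated t value
    cohens_d = d_bar / sd              # effect-size measure
    return t, cohens_d

# Invented frequencies: one pair per marker word,
# pre-1892 period vs. 1892-1903 period.
pre_1892  = [12, 30, 8, 21, 15, 40, 9, 17, 25, 11]
post_1892 = [25, 48, 20, 35, 31, 66, 22, 30, 44, 26]

t, d = paired_t_test(pre_1892, post_1892)
# A calculated t above the critical value (1.6 in my test) with a
# large effect size indicates a significant post-1892 increase.
```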

What does all of this statistical work really do for me?  First, I think these tests show that the topic models (and my categorizations) did capture something important happening in the journal: it does seem that the journal published more about methods and society business after 1892.  Furthermore, I think that combining topic modeling with statistical tests can prove quite useful.  Nonetheless, traditional humanistic methods remain important.  My next step will be to go back to the articles where these words appear and see what they are actually discussing.  These computational and quantitative methods helped me discover a pattern in the journal that I would likely never have noticed otherwise.  I look forward to seeing where this research goes.

American Journal of Science


(Second edition of the first volume of the journal, available from Carnegie Mellon’s digital collection)

Prior to today’s professional scholarly journal system, there was only one major journal for American science: the American Journal of Science, which still exists today and now focuses on geology.  In the nineteenth century, however, the journal covered every scientific topic.  The table of contents for the issues of the first volume (pictured) includes:

  1. Mineralogy, Geology & Topography
  2. Botany
  3. Zoology
  4. Fossil Zoology
  5. Mathematics
  6. Miscellaneous
  7. Physics, Mechanics, & Chemistry
  8. Fine Arts
  9. Useful Arts
  10. Agriculture & Economics
  11. Intelligence

Each article is roughly two to three pages, and each issue contains an “Intelligence” section that seems to be general news.  This section continues into the twentieth century, when the journal was more focused on geology, but it still covers important findings in physics, chemistry, and other scientific areas.

The journal was founded by Benjamin Silliman and later edited by his son.  There is a good overview of the founding of the journal, and of course multiple references to it, but so far I have not been able to find any articles using a computational approach to analyze its contents.  In particular, I think it would be a great candidate for the topic modeling and query sampling techniques I have used earlier.  I haven’t done much of this in the past (I intended to do so for the Journal of the American Chemical Society), but this journal may even be a good candidate for network analysis, since it covers a large number of scientists in the United States and could potentially show the scientific network as it was beginning to split into different disciplines.  Fortunately, there are also over 100 years of textual data for this journal in the public domain, making it a potentially very rich source.  I am going to see whether some initial tests yield interesting results, and I’m looking forward to seeing whether this journal helps me understand the professionalization of science and the origins of the scholarly communication system in even more interesting ways than the Journal of the American Chemical Society has so far.

Query Sampling Results

After finishing my first test run of query sampling the Journal of the American Chemical Society (JACS) against the Stanford Encyclopedia of Philosophy (SEP), I’m not sure that I can say much meaningful, other than that there are some potentially interesting questions to ask once I am able to clean up the data more.

The top articles in the query sampling were:

  1. Philosophy of Chemistry
  2. Chaos
  3. Reductionism in Biology
  4. Mechanisms in Science
  5. Models in Science

Article 1 of the SEP at least shows that the query sampling recognized that the articles in JACS were about chemistry.  Articles 4 and 5 may show a recognition that the JACS articles also discuss methodological issues.  Articles 2 and 3 are, to me, the most mysterious.  Article 3 may show that the query sampling is picking up on terminology within chemistry (the article is largely about whether biology can be reduced to chemistry).  Article 2 also discusses positivism and unpredictability within complex systems, so it may be picking up on what are largely the experimental procedures within this data.

I also tried to see whether I could confirm some of the trends the query sampling showed with some topic modeling from the InPho Topic Explorer.  For example, here is a quick visualization of the year-by-year trend for the topic of “Life.”  A score of 10 means that “Life” is the number 1 article for that year; a score of 0 means that the article does not show up at all.

So, “Life” does appear as the number 2 article for a few years, but then drops off significantly and, after 1900 or so, becomes an unimportant topic according to this data.
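For clarity, the score described above is just an inverted rank.  A small helper makes the conversion explicit; this is my own illustration (the function and the sample ranks are invented, not the Topic Explorer’s API).

```python
def rank_to_score(rank, top_n=10):
    """Convert an article's rank for a given year into a 0-10 score:
    rank 1 -> 10, rank 10 -> 1, unranked or below top_n -> 0."""
    if rank is None or rank > top_n:
        return 0
    return top_n - rank + 1

# Hypothetical year -> rank of the "Life" article (None = absent).
life_ranks = {1885: 2, 1890: 2, 1895: 6, 1900: None, 1905: None}

# Year -> score series, ready to plot.
life_scores = {y: rank_to_score(r) for y, r in life_ranks.items()}
```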

If we do a topic model on words like “organic” and “protein,” which might signify discussion of life, we get this:


The top of the graph shows the years when the topic of “life” is most prevalent, in this case 1900, and this graph does not seem to reflect the same trends as the earlier one.

One of the big problems here, I think, is that I only have the data broken out by year.  When I can slice off finer chunks of data (like just the methodology articles for certain years), I think I may get more interesting results.  Another problem is that the SEP does not say much about chemistry, so it might also be interesting to compare this data against other subjects, like physics, that are better covered.  Does physics show similarly haphazard trends, or does it reflect the historiography of the field better?

In all, I think this is an interesting proof of concept, but it would be significantly more interesting with cleaner data and perhaps some comparisons across subjects.

Visually Representing History

I’m beginning to think about some visual ways of representing what is happening in the Journal of the American Chemical Society.  This presents some interesting challenges.  Though there is some historiography about the society, for the most part what I have been able to find is a history of the society written in 1952 in celebration of its 75th anniversary.  From a historian’s point of view, this history has multiple methodological problems (technological determinism, whiggism, details without much analysis, and I could go on).  It is, however, what I have.  Thus, one way I am thinking about representing my topic models is by using this (admittedly problematic) history to create a kind of timeline.  To put this another way, the history presents a narrative of topical progress between the founding of the society and 1952, or at least it states what certain chemists thought was topical progress.  My data shows what actually appeared (at least within the flagship journal).  So, I think it would be interesting to construct some topics from the 1952 history and then see whether the journal does or does not reflect that narrative.

For those of you who are more visually oriented: as I was trying to think about how to do this, I went to Ted Underwood’s site and looked at some pages about methodology.  I think my visualization might look something like this:


The line (though it probably wouldn’t be as curved as this one) would represent the topics that the 1952 history says are happening over a set number of years (in my case 1879–1922).  The dots would represent the actual topics.

This visualization would, I think, show whether the 1952 history accurately reflects the topics in the journal itself.

Tied to this, I am also doing query sampling of the same data against the Stanford Encyclopedia of Philosophy (another blog post on that later).  What I hope to answer is a related question with a different method: does a standard reference work in the field reflect what is actually happening?  So far, not surprisingly, the query sampling shows that my corpus is mostly related to chemistry.  I suspect, though, that an analysis of the secondary entries will be what is most interesting.

Do Digital Methods Change History?

Obviously I’m being provocative with the title, but that is part of my point, which I’ll get to a bit later.  Of course, I realize this is a topic that has been discussed to death in multiple journals, but I’m really just trying to reflect on larger issues as they apply to my own work.  So, on to the meat of what (I apologize) will be a fairly long blog post.  As I’ve been thinking about my own digital project, I wanted to take a step back and think about how I initially approached the project, where I am now, how things have changed, and what exactly it all means, especially for people like me who are doing both history and information science (and thus sit at a kind of fringe between the two groups).

I started out my project on the Journal of the American Chemical Society thinking about the books I had read on the history of scholarly communication (which, admittedly, is a relatively small literature).  In those works, authors seemed to agree that the late nineteenth/early twentieth century was a time of transition in scholarly communication.  The system was moving away from a “republic of letters,” in which individual scholars communicated with each other via a correspondence network.  Because of the explosion of scientific information, the changes in American higher education (and presumably in other countries), industrialization, and any number of other factors too complicated to get into in a blog post, the system of scholarly communication switched to a journal system.  However, according to historians of scholarly communication, scientists and academics still relied on that earlier method of identifying journals by the eminent professionals within their fields.  In other words, the editorial board was the primary way to determine which journals were important.  If Professor Jones was on the editorial board and was picking the articles that got into the journal, it must be good.  Over time, people began to identify less with Professor Jones and more with the journal itself, and thus the system changed.

This makes for a good story, but at least with my journal it appears to be wrong.  I originally thought that the editors would be among the most prolific contributors, or that particular prominent authors (whom I could hopefully identify) would appear within the pages of the journal.  I did not find that happening.  There are large numbers of authors who publish in the Journal of the American Chemical Society only once and whom, at least based on limited research, I cannot identify.  Second, I thought that if that didn’t work, perhaps I could determine the author network from the topics the authors were discussing.  I’m still working on this, but the topics seem chaotic, and so do the authors who discuss any given topic.  So, when I hoped to find an author network via topics, that method also failed (although, as I said, I’m still working on it, so maybe there’s hope).

I hesitate to come to a grand conclusion about digital methods in history based on very preliminary research, but here goes (with all of the appropriate disclaimers and calls for civilized discussion).  Through digital methods, have I actually disproved earlier scholars’ theories?  Or have I just not read enough?  Could I have come to the same conclusions by just reading all of the journals in question?  And what does all of this have to do with digital history?

I can offer only some preliminary thoughts on those questions, but I want to begin by asking some different ones.  Are digital methods simply a tool for finding interesting things to investigate in more detail (with more traditional historical research)?  Alternatively, are digital methods a way to prove hypotheses created through more traditional historical research?  In other words, is it easier to do a small-scale historical study, come up with a hypothesis, and then test that hypothesis against something like the Hathi-Trust corpus, giving your hypothesis more impact if you can say you discovered something that is true across thousands of books and sub-disciplines?  Or, the last question: are digital methods something else?

Personally, I think digital methods are something else.  Relating my work on the Journal of the American Chemical Society to these larger questions, I think I probably could have come to the same conclusions by other means (like more extensive reading).  On the other hand, the methods helped me think through these issues in ways that make me ask different questions than if I had used only traditional historical research.  What do I mean by that?

I approached this topic with a particular hypothesis (editors are significantly influencing journal content) and particular digital methods I wanted to use (network analysis).  Had I just started doing lots of reading and manually mapping out the network, it would have taken a great deal of time and would simply have been a slower way of arriving at the same conclusion.  Score one for the digital: it’s faster.  Having said that, there are other digital methods that traditional scholarship alone would probably not have allowed (e.g., topic modeling).  I was able to further disprove my original hypothesis via topic modeling, because there does not seem to be (at least as far as I can tell) a connection between topics and authors.  Score two for the digital: it provides more ways to disprove bad hypotheses.  Finally, though I’m still working on assembling a larger corpus, at least in theory I can test these hypotheses across roughly fifty years of journal issues and thousands of pages within a matter of minutes.  Furthermore, as I move forward with this research, I will be able to test the same hypotheses against other corpora (like other journals).  Score three for the digital: it scales well.  Despite those advantages, I will probably still have to resort to old-fashioned manuscript studies of the journal editors and closer reading of particular parts of the journal to understand what is happening.  Score one for the traditional: you still have to do it (and since that is still a lot of work, it should probably count for more than one point).

Coming to the end of what has been a long, rambling blog post, here’s my take on my project and its relationship to the larger debate over whether digital methods change history.  I believe that they do, with one caveat: I think the two approaches (traditional research and digital methods) are complementary.  Digital methods are a great way of scanning a large corpus and testing hypotheses (perhaps even quirky or strange ones) very quickly, which can help historians find anomalies or places for further research.  Also, if a hypothesis has already been supported by a smaller-scale historical study, digital methods can be a great way to see whether it holds more broadly.  Thus, digital methods can be either a way to form a hypothesis or a way to further prove one.  This is not a particularly controversial point (at least I don’t think so), but for scholarly communication, I think it is a highly relevant one.

As we think about ways to talk about the research process and how it works, particularly in the future, we need to find ways to integrate the kinds of exploration I have discussed here, along with new ways of showing what scholars have done, how they have changed course, and why they are thinking of doing things differently.  Traditional scholarly communication, particularly in history, has not done this.  When we publish a finished (often print) monograph with our arguments, people don’t see the ways we changed course, the different ways we formulated our process, and how history is just as much a journey as it is a final result.  I think students often don’t understand this.  They assume things happened in a certain way and that historians have most (if not all) of the answers.  Perhaps digital methods, and more importantly the ways we document and disseminate them, can change the way we think about and communicate history in the future.

Now I’m done, and I look forward to hearing whether others have better ways of talking about this than my ramblings.

Training Named Entity Recognition for Author Names

For my first foray into network analysis, I’m using a test case of one file: the Journal of the American Chemical Society, volume 1 (1879).  For those interested, for the purposes of my tests, I took the journal from Hathi-Trust (my collection is available at;c=1649210391), and I ran a quick-and-dirty OCR through Adobe.  So the file isn’t great in terms of accuracy, but it should be good enough for test purposes.

Purpose:  I’d like to be able to recognize author names and then do some analysis to determine the network of authors and whom they may be referring to — not in terms of formal citations, but whether they mention other scientists within the author network when they discuss their experiments in the Journal.

Test: I ran the Stanford NER on this sample volume.  After many hours of hammering away, I found that:

  1. It actually does reasonably well with names; I was surprised in some ways by how good the results were.
  2. It also thinks that some substances are people (e.g., it leaves Glucose alone but tags Dextro-Glucose as a person).
  3. English names do fairly well, but names from other languages fare less well (particularly French and German in this corpus).
  4. It also picks up on names that are not authors (e.g., Peligot tube), which is good in one sense, but bad if I’m trying to separate author names and references from equipment.
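One way to deal with the substance and equipment false positives would be a post-filter on the NER output.  The sketch below is purely illustrative: the stoplists and the example NER output are invented (a real filter would need a proper chemistry vocabulary), and it assumes the NER output has already been collected as (text, tag) pairs.

```python
# Hypothetical stoplists, invented for illustration only.
SUBSTANCE_SUFFIXES = ("-glucose", "ose", "ate", "ide")
EQUIPMENT_TERMS = {"tube", "flask", "burette"}

def filter_persons(entities):
    """Keep names tagged PERSON that don't look like chemicals
    or eponymous equipment (e.g. 'Peligot tube')."""
    people = []
    for name, tag in entities:
        if tag != "PERSON":
            continue
        lowered = name.lower()
        if lowered.endswith(SUBSTANCE_SUFFIXES):
            continue  # likely a substance, not a person
        if any(term in lowered.split() for term in EQUIPMENT_TERMS):
            continue  # likely eponymous equipment
        people.append(name)
    return people

# Invented sample of NER output illustrating the failure modes above.
ner_output = [
    ("Dextro-Glucose", "PERSON"),   # false positive: a substance
    ("Peligot tube", "PERSON"),     # false positive: equipment
    ("Ira Remsen", "PERSON"),       # a real chemist
    ("Paris", "LOCATION"),
]

authors = filter_persons(ner_output)  # only plausible names survive
```

This is obviously crude — suffix matching would also throw away real surnames that happen to end in “ate” or “ide” — but it shows the general shape of a post-processing pass, as opposed to retraining the NER model itself.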

Conclusion: These results are not really surprising, but I need to think about the next steps for training the NER to do what I need.  If anyone has done work on training, particularly for author-name recognition, any advice would be appreciated.