Do Digital Methods Change History?

Obviously I’m being provocative with the title, but that is part of my point, which I’ll get to a bit later.  Also, of course I realize that this is a topic that has been discussed to death in multiple journals, but I’m really just trying to reflect on larger issues here as they apply to my own work.  So, on to the meat of what (I apologize) will be a fairly long blog post.  As I’ve been thinking about my own digital project, I wanted to take a step back and think about how I initially approached the project, where I am now, how things have changed, and what exactly it all means, especially for people like me who are doing both history and information science (and thus are at a kind of fringe between the two groups).

I started out my project on the Journal of the American Chemical Society thinking about the books I had read on the history of scholarly communication (which admittedly is a relatively small literature).  In those works, authors seemed to agree that the late nineteenth/early twentieth century was a time of transition in scholarly communication.  The system was moving away from a “republic of letters” in which individual scholars communicated with each other via a correspondence network.  Because of the explosion of scientific information, the changes in American higher education (and presumably changes in other countries), industrialization, and any number of other factors too complicated to get into in a blog post, the system of scholarly communication switched to a journal system.  However, according to historians of scholarly communication, scientists and academics still relied on that earlier method of identifying journals by using eminent professionals within their fields.  In other words, the editorial board was the primary way to determine what journals were important.  If Professor Jones was on the editorial board and was picking the articles that got into the journal, it must be good.  Over time, people began to identify less with Professor Jones and more with the journal itself, and thus the system changed.

This makes for a good story, but at least with my journal it appears to be wrong.  I originally thought that the editors would be the most prolific contributors or that you would see particular prominent authors (whom I could hopefully identify) appearing within the pages of the journal.  I did not find that happening.  There are large numbers of authors who publish in the Journal of the American Chemical Society who appear only once and, at least based on limited research, I cannot figure out who they were.  Second, if that didn’t work, perhaps I could determine who the author network is by the topics they were discussing.  I’m still working on this, but it seems the topics are chaotic and the authors who talk about certain topics are also chaotic.  So, while I was hoping to find an author network via topics, that method also failed (although like I said I’m still working on it, so maybe there’s hope).

I hesitate to come to a grand conclusion about digital methods in history based on very preliminary research, but here goes (with all of the appropriate disclaimers and calls for civilized discussion).  Through digital methods have I actually disproved earlier scholars’ theories?  Or, have I just not read enough?  Or, could I have come to the same conclusions by just reading all of the journals in question?  What does all of this have to do with digital history?

I can offer only some preliminary thoughts to all of those questions, but I want to begin to answer those questions by asking some different questions.  Are digital methods simply a tool for finding interesting things to investigate in more detail (with more traditional historical research)? Alternatively, are digital methods a way to prove hypotheses created through more traditional historical research?  In other words, is it easier to do a small scale historical study, come up with a hypothesis, and then test that hypothesis against something like the Hathi-Trust corpus, thus making your hypothesis have more impact if you can say that I discovered something that is true across thousands of books and sub-disciplines?  Or, the last question, are digital methods something else?

Personally, I think digital methods are something else.  Relating my work to these larger questions about the Journal of the American Chemical Society, I think I probably could have come to these conclusions via some other means (like more extensive reading).  On the other hand, I think the methods helped me think through these issues in ways that do make me ask different questions than if I had just used more traditional historical research.  What do I mean by that?

I approached this topic with a particular hypothesis (editors are significantly influencing journal content), and particular digital methods I wanted to use (network analysis).  Had I just started doing lots of reading and manually mapping out the network, it would have taken me a great deal of time and would have simply been a slower way of arriving at the same conclusion.  Score one for the digital: it’s faster.  Having said that, there are other digital methods out there that have no real equivalent in traditional scholarship (e.g. topic modelling).  Basically I was able to further disprove my original hypothesis via topic modelling because there does not seem to be (at least as far as I can tell) a connection between topics and authors.  Score two for the digital: it provides more ways to disprove bad hypotheses.  Finally, though I’m still working on getting a larger corpus, at least theoretically I can test these hypotheses across roughly fifty years of journal issues and thousands of pages all within a matter of minutes.  Furthermore, as I move forward with this research, I will be able to test the same hypotheses against other corpora (like other journals).  Score three for the digital: it can scale well.  Despite those advantages, I will probably still have to resort to old-fashioned manuscript studies of the journal editors and closer reading within particular parts of the journal to understand what is happening.  Score one for the traditional: you still have to do it (and since that is still a lot of work it should probably count for more than just one point).

Coming to the end of what has been a long rambling blog post, here’s my take on my project and its relationship to the larger debate of whether digital methods change history.  I believe that they do.  I want to add a caveat to that, though.  I think the two methods (traditional research and digital) are complementary.  First, I think digital methods are a great way of scanning a large corpus and testing hypotheses (perhaps even quirky or strange ones) very quickly.  Doing this can allow historians to find anomalies or places to look for further research.  Second, if one has a hypothesis that is already supported by a smaller scale historical research study, digital methods could be a great way to see if that hypothesis holds more broadly.  Thus, digital methods can be either a way to form a hypothesis or a way to further support one.  This is not a particularly controversial point (at least I don’t think so), but for scholarly communication, I think that it is a highly relevant one.

As we think about ways to talk about the research process and how it works, particularly in the future, we need to find ways to integrate the kinds of exploration that I have discussed here, along with new ways of showing what scholars have done, how they have changed course, and why they are thinking of doing things differently.  Traditional scholarly communication, particularly in history, has not done this.  When we publish a finished (often print) monograph with our arguments, people don’t see the ways that we have changed course, the different ways that we formulated our process, and how history is just as much a journey as it is a final result.  I think students often don’t understand this.  They assume things happened in a certain way and that historians have most (if not all) of the answers.  Perhaps digital methods, and more importantly the ways we document and disseminate them, can change the way we think about and communicate history in the future.

Now I’m done, and I look forward to hearing if others have better ways of talking about this than my ramblings.

Analyzing Some Preliminary Topic Modelling

I’ve started some work on topic modelling, using at the moment two programs: the InPho Topic Explorer (http://inphodata.cogs.indiana.edu/) from Indiana University’s Cognitive Science Program and Mallet (http://mallet.cs.umass.edu/index.php).  I’m also open to other suggestions for potentially trying out different methods, but this gets me started.  Right now, I’m still using some sample files, but look forward to trying these out on the entire corpus of the Journal of the American Chemical Society once I have the data.

From InPho, the topics I get are

Topic 0 per, found, precipitate, sulphuric, weight, substance, soluble, liquid, thus, total, nitric, hydrochloric, sulphate, value, gas, described, second, grams, oxygen, follows
Topic 1 per, new, chemical, iron, copper, ore, action, author, cent, review, germany, oil, see, assignor, research, richards, york, parts, zinc, analyses
Topic 2 gram, cent, per, sulphur, gave, method, methods, ten, hydrochloric, determinations, steel, tungsten, proteid, ammonium, cement, weighed, matter, oil, sulphide, aluminum
Topic 3 calc, sulfate, yield, secretary, sec, since, ave, held, electrons, prepared, system, boiling, atoms, product, curve, series, benzene, sulfuric, conductivity, lewis
Topic 4 sugar, action, etc, author, c.c, upon, heated, acids, oil, chem, grms, abstracts, obtained, liquid, air, oxygen, gas, ozone, lime, soda
Topic 5 new, form, table, per, weight, meeting, cent, slightly, der, book, tube, hydrogen, conductivity, west, sulphate, sec, sulphuric, estimation, oxygen, magnesium
Topic 6 section, city, ave, american, mass, chicago, william, water, charles, sodium, john, report, society, chemistry, washington, steel, philadelphia, members, sulphur, university
Topic 7 hydrogen, section, equation, concentration, ion, measurements, mercury, therefore, university, content, ann, curves, increase, theory, concentrations, points, derivatives, velocity, ions, room
Topic 8 cent, per, found, made, first, grams, soluble, sulphate, added, precipitate, small, containing, weight, much, sulphuric, gas, ether, would, form, substance
Topic 9 found, liquid, substance, ether, soluble, hydrochloric, thus, precipitate, gas, described, alcohol, value, second, alkali, glass, oxygen, acid, ethyl, material, specific
Topic 10 cent, much, hydrogen, grams, containing, ether, experiment, dried, fact, even, upon, form, place, separated, treated, various, following, soil, shall, magnesium
Topic 11 made, first, would, small, added, shown, pure, could, must, use, well, due, part, dioxide, order, table, gives, organic, precipitated, concentrated
Topic 12 acid, solution, water, one, results, method, may, two, chloride, used, potassium, alcohol, also, sodium, amount, temperature, time, present, experiments, salt
Topic 13 obtained, given, upon, nitrogen, following, mixture, chemical, crystals, chemistry, copper, conditions, heating, preparation, dilute, oxide, dry, case, error, known, iron
Topic 14 form, value, chair, constant, sulfuric, fig, table, chloride, temperature, experimental, solid, mixture, sulfide, subs, values, heat, carbon, hydroxide, slightly, journal
Topic 15 weight, values, reaction, solutions, point, much, concentration, pressure, fact, even, containing, temperature, equilibrium, sodium, experiment, dried, acids, melting, case, iodide
Topic 16 per, acids, heated, calculated, laboratory, journal, hydroxide, society, true, book, contain, cause, values, received, satisfactory, showing, special, year, manganese, show
Topic 17 form, table, constant, sulfuric, two, solution, slightly, value, cell, grams, surface, gave, solid, meeting, negative, phase, experimental, fig, hydroxide, however
Topic 18 grams, cent, gram, book, value, per, der, calculated, general, values, physical, und, sulfate, constant, inorganic, normal, die, ion, hydrazine, salts
Topic 19 water, new, milk, meeting, chemical, process, one, iron, apparatus, york, carbon, society, mass, read, fat, analysis, lime, secretary, furnace, use

From Mallet I get

topicId words..
1 chemical chemistry society american journal dr book work committee general
2 solutions concentration solution salts salt ion solubility ions conductivity cell
3 acid ch acids nh ester ii methyl ethyl obtained acetic
4 sugar experiments time action effect reaction starch amount rate power
5 theory atoms number surface oxygen molecules form hydrogen energy case
6 color oil red obtained mixture liquid yellow white water small
7 compounds compound reaction action group bromine chloride carbon derivatives formed
8 cc solution acid water added precipitate hydrochloric dissolved filtered excess
9 cent nitrogen gram results grams total amount weight sample found
10 alcohol water ether soluble solution found salt crystals melting acid
11 chem milk fat oil extract protein oils ash composition soc
12 method results made methods determination error obtained standard determinations found
13 work present paper fact great part study made question view
14 chloride potassium sodium silver solution acid ammonia ammonium nitrate oxide
15 analysis determination water soil organic matter plant chemical analyses methods
16 temperature values table pressure point heat equation constant data vapor
17 weight io atomic oo ii lead arsenic separation series oxide
18 st section meeting city mass pa secretary ave university york
19 iron process der steel copper coal gas gold ore furnace
20 tube air gas apparatus glass water temperature platinum liquid mercury

As I said in a previous post, many of these results are not surprising.  There are, however, some topics that merit some further analysis.

In InPho some names come up in topics 1 and 3 (richards and lewis), which could signal some important people in the field; it would be worth seeing if they show up in other more focused contexts.  Also topic 6, which has the words “section, city, ave, american, mass, chicago, william, water, charles, sodium, john, report, society, chemistry, washington, steel, philadelphia, members, sulphur, university,” has very few words that have much to do with chemistry (aside from sodium, chemistry, and sulphur).  My guess is that these have to do with meetings or with business of the society.  Unhelpfully, the names in this topic seem to be first names, which might just indicate that there are lots of people named Charles and John.  I also wonder whether this might just be a catchall topic.  For instance the words American, Society, and Report would, I think, come up in almost any issue of the journal.

In Mallet, the topics that come up are actually quite different, though substantively similar I think.  The first topic is, I think, a catchall with words that probably show up in every journal article, like “American Chemical Society.”  Nonetheless, I also think that “committee” is an interesting one in that topic, probably reflecting the reports of various committees that show up in the sections of the journal that report the activities of the society.  Topic 13 is also interesting, with words (“work present paper fact great part study made question view”) that do not seem to be discussing chemistry directly but rather seem to be talking about the work of doing chemistry (like publishing).
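
One rough way to put a number on "quite different, though substantively similar" is to compare the top words of a topic from each tool.  The following is only a minimal sketch, not something either program does internally: it computes the Jaccard overlap (shared words divided by all distinct words) between InPho topic 6 and Mallet topic 18 as listed above.  The function name `jaccard` is a hypothetical helper introduced for illustration.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two topics have in common (0 to 1)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Top words copied from the topic lists above.
inpho_topic_6 = (
    "section city ave american mass chicago william water charles "
    "sodium john report society chemistry washington steel "
    "philadelphia members sulphur university"
).split()
mallet_topic_18 = (
    "st section meeting city mass pa secretary ave university york"
).split()

shared = sorted(set(inpho_topic_6) & set(mallet_topic_18))
score = jaccard(inpho_topic_6, mallet_topic_18)
print(shared, round(score, 2))
```

The two lists are different lengths (20 words versus 10), so the score is only a crude indicator; running every InPho topic against every Mallet topic would show which pairs, if any, line up.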

In addition to thinking about the network of people and how they are institutionalizing the journal, I also think it might be interesting to look at these topics as a kind of window into the philosophy of chemistry.  In other words, what are chemists talking about, and more importantly, does this match up with what historians and philosophers of science say was going on in the nineteenth century?  It might be interesting to see if I can match up concepts from the Stanford Encyclopedia of Philosophy (http://plato.stanford.edu/entries/chemistry/) with the topic models I’m getting.  More on that in a later post.

In all, I think that these are some exciting preliminary results, and I look forward to doing some more in-depth topic modelling to see whether things change over time, see if I can understand more about the individual authors (and who is publishing on certain topics), and also see whether historians of chemistry are accurately understanding the topics of the day (at least in terms of what the flagship journal seems to think important).

Creating a Corpus of Articles

I’m now thinking about technological ways for constructing my corpus of the Journal of the American Chemical Society.  Since Hathi-Trust (where I am getting the full text) pulls down texts page by page, I am estimating that it means I will have about 60,000 individual texts.  Some of those pages will be blank, some will have multiple articles, and some will have tables of contents, charts, and other ephemera.

I have decided that separating the texts into individual articles, rather than keeping them as entire (year long) volumes will be the most useful for my future work.   I think that if I want to try to determine a network and see the influence that individual scholars have on the corpus as a whole, it will be a great deal easier with a list of files that are named with authors and perhaps a few words of the title and a date (eg. Smith_Cool-Chemistry_1885).  So, on my test files, I have been using the tables of contents from the pdfs, and then using command lines to merge files.  For instance, if the table of contents says that Smith’s article covers pages 1-20, I am simply going through files 1-20 of my files and merging them.  This gets problematic in several ways.

  1. Image numbers (the files are basically OCR of page images) do not match up to page numbers.
  2. Articles often overlap.  One article may end on page 10 and another start on the same page.  This means I have to go back and copy and paste beginnings of articles into a new file.
  3. There are lots of blank pages and other ephemera (eg. charts that don’t OCR) which ideally I should discard, but it is hard to tell where those things appear just from tables of contents.

In all, this has taken me multiple hours to do, and I have barely scraped the surface of the 60,000+ files I will likely need to do this for.  Anyone know of any potentially better, perhaps automated ways that I could use to accomplish this?
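
A minimal sketch of one possible automated approach, under stated assumptions: the table of contents has already been typed into a list of entries (author, title, year, start page, end page), the OCR text is held in a dictionary keyed by image number, and the difference between image number and printed page number is constant within a volume (which may well not hold).  All function and field names here are hypothetical.  A page shared by two articles is simply included in both, so the overlap still has to be trimmed by hand.

```python
def article_filename(author, title, year):
    """Build a name like Smith_Cool-Chemistry_1885.txt."""
    slug = "-".join(title.split()[:3])  # first few words of the title
    return f"{author}_{slug}_{year}.txt"


def merge_pages(pages, start_page, end_page, offset=0):
    """Join the OCR text for printed pages start_page..end_page.

    pages maps image number -> OCR text. offset is the (assumed constant)
    difference between image number and printed page number.
    Blank or missing pages are skipped.
    """
    parts = []
    for page in range(start_page, end_page + 1):
        text = pages.get(page + offset, "").strip()
        if text:
            parts.append(text)
    return "\n\n".join(parts)


def build_articles(pages, toc, offset=0):
    """Return {filename: merged text} for every table-of-contents entry."""
    return {
        article_filename(e["author"], e["title"], e["year"]):
            merge_pages(pages, e["start"], e["end"], offset)
        for e in toc
    }


# Toy data: image 5 holds printed page 1, so offset = 4. Image 6 is blank.
pages = {
    5: "Smith begins.",
    6: "  ",
    7: "Smith ends. Jones begins.",
    8: "Jones ends.",
}
toc = [
    {"author": "Smith", "title": "Cool Chemistry", "year": 1885,
     "start": 1, "end": 3},
    {"author": "Jones", "title": "On Acids", "year": 1885,
     "start": 3, "end": 4},
]
articles = build_articles(pages, toc, offset=4)
```

Each value in `articles` could then be written out with `pathlib.Path(name).write_text(text)`.  Charts that do not OCR would only be dropped here if their pages come back empty; anything else would still need a manual check.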

Also, since I know any automation will likely bring in some mistakes, are there ways I can try to correct for those, while realizing of course that probably no process is perfect?

Processing (Continued) and Moving Toward Topic Modelling

I’m still working on the issues of processing the corpus of text that I have, and will hopefully be able to finish that sometime next week, which moves me on to what will be my next step:  topic modelling the corpus of the Journal of the American Chemical Society between 1879 and 1922.

Based on some sample files, there is some good news.  The topics I seem to be getting mention acids, bases, chemical compounds, and the kinds of things I would expect to see in a topic model of a chemistry journal, and there are no extremely strange topics that I would not expect to see.  That, I think, tells me that the text will be good enough to move forward and do some good mining.

On a side note with my processing I have also been extracting all of the tables of contents from the journal.  Ideally this should be done automatically but I’ve been doing it manually so that I can put some editorial notes in various parts of my spreadsheet (which I will share when I’ve finished).  For now, the spreadsheet contains a list of all of the officers of the American Chemical Society separated out by year.  Surprising (at least to me) is the fact that there is not as much overlap as I would have expected.  Some officers do continue to serve year after year, but there is actually a fairly high turnover.  New officers seem to come in every year.  The spreadsheet also contains every author in the journal between those years, what articles they’ve published, whether I consider them “prolific” (i.e. published many articles), and if there is any information about them in Wikipedia.  If someone knows of a more comprehensive database, specifically for chemists, let me know; so far, I’m not seeing many of the early authors/officers listed in Wikipedia.  This spreadsheet, I hope, can serve as a guide while I’m processing and hopefully can tell me if I make any significant errors when I start dividing up articles and years in the larger corpus.
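
The officer turnover could be measured rather than eyeballed.  The following is a minimal sketch, assuming the spreadsheet has been exported to a mapping from year to a list of officer names; the names below are invented purely for illustration, and `yearly_turnover` is a hypothetical helper.  For each year it reports the share of officers who did not serve in the previous year on record.

```python
def yearly_turnover(officers_by_year):
    """Share of each year's officers who were not officers the year before.

    Years are compared in sorted order, so a gap in the data means the
    comparison is with the previous year that has data.
    """
    years = sorted(officers_by_year)
    result = {}
    for prev, curr in zip(years, years[1:]):
        previous = set(officers_by_year[prev])
        current = set(officers_by_year[curr])
        if current:
            result[curr] = len(current - previous) / len(current)
    return result


# Invented names, purely for illustration.
officers_by_year = {
    1879: ["Adams", "Baker", "Clark", "Davis"],
    1880: ["Adams", "Evans", "Foster", "Davis"],
    1881: ["Green", "Evans", "Hall", "Irwin"],
}
turnover = yearly_turnover(officers_by_year)
print(turnover)
```

Name variants (initials versus full names, OCR misspellings) would count as different people here, so the names would need to be normalized first.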

All of this is a preface to try and get to the question I’m asking.  What is the network of scientists involved in the journal, and are the officers/editorial board influencing the content in any measurable way?  Originally, I had thought that a spreadsheet like the one I’m creating would help to answer this question.  I had thought that editors of the journal would be some of the most prolific authors, and I thought that there would be a significant continuity of officers over this time period.  I had not anticipated so many unique authors contributing to the journal, nor had I thought that the officers of the association would turn over as frequently.

There may still be a way to get at the question I’m asking, though.  I think that by topic modelling the corpus and seeing if particular authors are tied to particular topics, that may at least help to answer whether specific people have more influence over the journal’s content than others.  Also, I’m sure others have tried to tie Wikipedia information to networks like this.  Like I said, so far I’m not finding many scientists who have Wikipedia entries, though that may change as I move further into the twentieth century.  Perhaps even if I can find authors who have high influence over the corpus and a Wikipedia entry that may tell me something.
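
To make the idea of tying authors to topics concrete, here is a minimal sketch.  It assumes the corpus has already been split into articles and that the topic modelling tool reports, for each article, a proportion for every topic; the numbers and names below are invented, and both helper functions are hypothetical.  It averages the topic proportions over each author's articles and picks the topic with the largest average.

```python
from collections import defaultdict


def author_topic_profiles(articles):
    """Average the topic proportions of each author's articles.

    articles is a list of (author, [p_topic0, p_topic1, ...]) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for author, weights in articles:
        if author not in sums:
            sums[author] = [0.0] * len(weights)
        sums[author] = [s + w for s, w in zip(sums[author], weights)]
        counts[author] += 1
    return {a: [s / counts[a] for s in total] for a, total in sums.items()}


def dominant_topic(profile):
    """Index of the topic with the largest average proportion."""
    return max(range(len(profile)), key=profile.__getitem__)


# Invented proportions for three topics, purely for illustration.
articles = [
    ("Smith", [0.75, 0.25, 0.0]),
    ("Smith", [0.25, 0.75, 0.0]),
    ("Jones", [0.0, 0.25, 0.75]),
]
profiles = author_topic_profiles(articles)
for author, profile in profiles.items():
    print(author, profile, dominant_topic(profile))
```

An author who appears only once has a "profile" that is just that one article, so with so many one-time authors this would say more about the prolific contributors than about the journal as a whole.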

In any case, that’s where I am at the moment, and if there are thoughts about what might be useful to do (before I move into heavy duty processing of lots of files), let me know.

Alchemists Don’t Share?

I’ve been following some of Cory Doctorow’s talks about open access lately (e.g. https://opensource.com/life/16/1/cory-doctorow-predict-future-influence-it or this YouTube video: https://www.youtube.com/watch?v=ln7U_Bm3S_Q).  Great stuff.  I’m wondering as a historian, however, about this origin story that “science” is born when practitioners openly disclose their results.

In principle I agree with Doctorow (at least in saying that we should openly disclose results, otherwise I wouldn’t be blogging about this openly on the web), but I wonder where the story came from and when we started telling it.  More importantly, as a scholar interested in the institutionalization of academic publishing, I’m interested in seeing how alchemists did (or did not) share as compared to early scientists.  If every alchemist drank mercury for instance, you would think alchemy would die off as a profession rather quickly.  Also, modern scientists don’t share everything, particularly if the results of their work can be commercialized.

Pamela Long for instance talks about how authors thought about disclosing their results between the Classical age and the Renaissance.  I’m more interested in the late nineteenth and twentieth centuries, but I wonder if the story may be more complicated.

Processing

It is interesting how much of the research process (and figuring out the question I want to ask) comes from just structuring the data.  Over the past few days, I’ve been working on a test sample of the first few years of the Journal of the American Chemical Society, for now just the first five years.  I’m still drawing from the same collection (https://babel.hathitrust.org/shcgi/mb?a=listis;c=1649210391) and doing some quick and dirty OCR myself to have something to run.  What I found over these few days is, first, that the questions I originally intended to ask may need to be different and, second, that what those questions are will have a direct impact on how I structure my text files.

Originally, I had hoped to do a network analysis of names within the corpus.  I had two ways I thought about doing this.  First, I thought about doing Named Entity Recognition looking for author names to see who they were talking about and see if there might be some network to that.  I did a brief blog post about that earlier.  Subject to all of those problems I mentioned in that post, I quickly realized that in order to do that, I need to restructure my data more effectively.  Currently, the Hathi-Trust corpus is set up by year, meaning that I have every volume as a single file.  If I really want to be able to do a network analysis, I will need, I think, to separate out all of the individual articles, so that I can associate Author A with Names A, B, and C.  Doing the analysis the way I did, in big year by year sections, I seem to end up with a whole list of names that is hard to make sense of (it might be worth doing a topic model on those names, but more on that below).
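
A minimal sketch of what "associate Author A with Names A, B, and C" could look like once the articles are separated.  Note that this swaps in plain matching against a known list of surnames rather than actual Named Entity Recognition, so it is only a stand-in for the idea; the article texts, the name list, and the function `mention_edges` are all invented for illustration.

```python
import re


def mention_edges(articles, known_names):
    """Return sorted (author, mentioned_name) pairs.

    articles is a list of (author, text) pairs; known_names is a list of
    surnames to look for. Matching is whole-word and case-insensitive,
    and an author's own name is ignored.
    """
    edges = set()
    for author, text in articles:
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        for name in known_names:
            if name.lower() in words and name.lower() != author.lower():
                edges.add((author, name))
    return sorted(edges)


# Invented snippets, purely for illustration.
articles = [
    ("Smith", "As Jones has shown, and as Brown later confirmed ..."),
    ("Jones", "The method of Smith gives lower values."),
    ("Brown", "Brown's apparatus was described previously."),
]
edges = mention_edges(articles, ["Smith", "Jones", "Brown"])
print(edges)
```

The resulting pairs are a directed edge list that a network tool could read.  Surnames that are also ordinary words, and different people who share a surname, would produce false edges, which is one reason real Named Entity Recognition and some manual checking would still be needed.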

After thinking through whether it was worth separating out the articles, I tried another hypothesis that I based on some of the reading I’d done.  I thought that members of the editorial board would be some of the most prolific authors in the early years of the journal.  In other words, editors wrote many of the articles for the journal themselves.  This does not seem to be the case in the Journal of the American Chemical Society (at least for the first five years).  The editors are not writing that many articles compared to the whole.  I then thought that by extending “editor” to mean one of the officers of the association (like president, secretary, etc.) I might solve my problem.  No luck.  It seems like many of the most prolific authors are not officers of the association either.  So, there doesn’t seem to be much possibility for network analysis.  This might change when I look at the larger corpus, but for now it seems like I need to think about this in some new ways.

After going through all of those steps, my second line of thought was to see if I could measure in some way the number of publications by individual authors.  The immediate problem with this approach is that many of the authors repeat in issue after issue because they are writing about issues in the field or translating articles from foreign journals, or they have “abstracts,” which eventually become a completely different journal.  For my purposes, the same “abstractors” write in every issue.  I decided, therefore, just to exclude those authors initially and focus on completely unique journal articles.  Using the first two years (1879-1880) there are 24 issues, some of which are almost entirely abstracts or translations of articles from foreign journals.  Overall, there are 33 authors for 75 unique articles during those two years.  One author, who incidentally was one of the vice-presidents for both of those years, accounts for 15 of those articles.  The next top 3 authors account for 5 articles each (for two of them) and 7 (for the other).  One of those authors is an officer of the association and the others are not.  Overall that means that for the first two years, 32 of the 75 articles were published by just 4 authors, two of whom were officers of the association.  The remaining articles are largely written by separate authors, some, but few, of whom are officers of the association.
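
The tally above can be checked with a few lines of code.  This is a minimal sketch using placeholder labels rather than the real author names: it reproduces the reported counts (15, 7, 5, and 5 articles for the most prolific authors, 33 authors and 75 articles overall), while the way the remaining 43 articles are spread over the other 29 authors is made up.  `top_author_share` is a hypothetical helper.

```python
from collections import Counter


def top_author_share(authors, n):
    """Return (top n authors with counts, their share of all articles).

    authors has one entry per article.
    """
    counts = Counter(authors)
    top = counts.most_common(n)
    share = sum(c for _, c in top) / len(authors)
    return top, share


# Placeholder labels reproducing the counts reported above.
authors = ["A"] * 15 + ["B"] * 7 + ["C"] * 5 + ["D"] * 5
# The split of the remaining 43 articles among 29 authors is invented.
authors += [f"other_{i % 29}" for i in range(43)]

top, share = top_author_share(authors, 4)
print(top)
print(f"{sum(c for _, c in top)} of {len(authors)} articles ({share:.1%})")
```

With the real table-of-contents spreadsheet, `authors` would simply be the author column with the abstractors and translators filtered out.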

Obviously this is quite a small sample from the 40 years I’m interested in, but I think it shows that the situation is somewhat more complex than I had originally thought.  I also need to think more about how to deal with how these other areas of the journal (translations, abstracts, even the “proceedings” I talked about in earlier posts) should be analyzed.

All of this gets to my current thoughts about how I might prepare the corpus for analysis.

  • If I intend to keep the corpus year by year, then I think it may be necessary to do a kind of topic-model by author name, to see if certain authors seem to influence certain years more than others.
  • Another way to work with author topic models would be to separate out the articles so I could measure whether certain authors talk about certain topics (either they mention certain authors, or they talk about certain chemicals).
  • Related to that problem, however, if I break the corpus out by articles, I need to figure out how I want to deal with the different kinds of articles in the corpus (eg. proceedings, translations, abstracts, unique articles).
  • A third way to try network analysis of authors would be to pull just the tables of contents for all of these years and do some analysis of authors, titles, and years (independent of the full text of the actual articles).

Each of these ways has implications for how I do the initial processing of my corpus.  Do I separate out the articles?  Do I keep the text together as years?  Do I need to do both?  Should I separate out tables of contents and analyze those separately?  What can I accomplish in the course of a semester?

It may be possible over the course of the entire project to do everything I mention, but I would like to show some proofs of concept at least by the end of the semester.  If anyone has experience doing this or thoughts about processing a corpus of text for analysis, thoughts would be appreciated.  In any case, I suspect it is a good thing that I’m refining my question, but how best to strategically divide the corpus was not something I had thought about all that much.  Processing is turning out to be a more complicated step than I had thought.