Professionalization and Combining Methods

In an earlier post I discussed some topic modeling I did on the Journal of the American Chemical Society (JACS).  That research showed that post 1892 (about 11 years after the journal begins publishing in 1879), there appeared to be a significant increase in discussion of methodology, society business, and other topics not directly associated with chemistry experiments.  Though I thought this was an interesting finding, at the same time I thought that it was best not to make too much out of this result.

Why should I not treat the results of this topic model as significant?  Topic modeling is, after all, an abstraction of the data.  I had the full text of all material from JACS, and I then asked a computer to find which words had a statistically significant probability of appearing together.  After doing that, I then categorized the data into “unexpected” topics (or topics on methods, society business, etc.) and “expected” topics (chemistry experiments of various kinds).  So, in essence I was dealing with an abstraction of an abstraction.  Thus, it seemed best not to say that this was a significant result when in reality it could have just been an artifact of my categorization of topic models.

I am beginning to change my mind on my earlier instinct, however.  Why?  I recently completed some additional statistical tests.  I created an additional data set comprising a sample of words from these topic models.  It contained 74 words which I thought might best signify discussion of “unexpected”/non-chemistry topics.  I included words such as president, committee, and election, which would likely only show up in discussions of society business.  I also included a few words like method, which admittedly could appear both in chemistry articles and in articles about the methodology of chemistry.  I then created a word frequency list for all of these words and subdivided the frequencies into two groups.  One group contained the 11 years prior to 1892 (from the journal’s beginning in 1879).  The other group contained the 11 years from 1892 to 1903.  My hope was to see if there was any kind of significant difference in these word frequencies right around the year (1892) in which my earlier graph showed that “unexpected” topics were increasing.
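As a rough illustration of how such a frequency list could be built, here is a minimal Python sketch.  It is not the procedure actually used for this post: the function name, the four-word marker list (a stand-in for the 74 words), and the sample sentences are all hypothetical, and it assumes the corpus is available as plain text keyed by year.

```python
import re
from collections import Counter

# Hypothetical subset of the 74 marker words; the full list is not reproduced here.
MARKER_WORDS = {"president", "committee", "election", "method"}

def marker_frequencies(texts_by_year, start, end):
    """Count marker words in all texts whose year falls in [start, end)."""
    counts = Counter({word: 0 for word in MARKER_WORDS})
    for year, text in texts_by_year.items():
        if start <= year < end:
            # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
            for token in re.findall(r"[a-z]+", text.lower()):
                if token in MARKER_WORDS:
                    counts[token] += 1
    return counts

# Invented sentences standing in for the real journal text.
sample = {
    1885: "The method of analysis gave good results.",
    1895: "The committee met and the president called an election. A new method was read.",
}
early = marker_frequencies(sample, 1879, 1892)
late = marker_frequencies(sample, 1892, 1904)
```

Each word then contributes one pair of numbers (its count in the earlier group and its count in the later group), which is the shape of data a dependent t-test expects.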

Using SPSS, I compared these two groups using a dependent (paired-samples) t-test.  My t-critical value (the threshold the calculated value must exceed for the result to be statistically significant) was 1.6.  My t-calculated (the statistic that measures how far apart the means of the two groups are relative to the variation in the data) was 7.6, with an effect size (a measure of the magnitude of the difference between the two means) of 0.89.  Therefore I can say that there is actually quite a significant difference between the word frequencies of these two groups.  Word frequencies for words about society business and methods increase significantly post 1892.
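For readers who want to see what such a test computes, here is a minimal sketch in Python using only the standard library.  It is not the SPSS procedure itself, and the function name and sample numbers are invented for illustration.  It assumes the effect size is Cohen's d for paired data (the mean difference divided by the standard deviation of the differences), which the post does not specify.  Under that assumption d equals t divided by the square root of the number of pairs, and 7.6 divided by the square root of 74 is roughly 0.88, close to the 0.89 reported above.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom, Cohen's d) for paired samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(n))
    d = mean_diff / sd_diff  # paired-data Cohen's d (an assumed definition)
    return t, n - 1, d

# Invented frequencies for four marker words (not the real data).
before = [2, 5, 1, 4]
after = [6, 9, 4, 10]
t, df, d = paired_t(before, after)
```

The calculated t is then compared against the critical value for the given degrees of freedom; if the differences are all identical, the standard deviation is zero and the statistic is undefined.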

What does all of this statistical work really do for me?  First, I think that these statistical tests show that the topic models (and my categorizations) actually did show that something important was happening in the journal.  Indeed it seems that the journal is publishing more about methods and society business after 1892.  Furthermore, I think that combining methods like topic modeling and statistical methods can prove quite useful.  Nonetheless, I think that traditional humanistic methods can also be important.  My next step will be to go back to the articles where these words appear and see what they are talking about.  So, these other computational and quantitative methods helped me to discover a pattern in the journals that otherwise I would likely never have noticed.  I look forward to seeing where this research goes.

American Journal of Science

(Second Edition of the first volume of the journal, available from Carnegie Mellon’s digital collection)

Prior to the professional scholarly journal system of today, there was only one major journal for American science: the American Journal of Science, which still exists today and is focused on geology.  In the nineteenth century, however, the journal covered every scientific topic.  The table of contents for the issues of the first volume (pictured) includes:

  1. Mineralogy, Geology & Topography
  2. Botany
  3. Zoology
  4. Fossil Zoology
  5. Mathematics
  6. Miscellaneous
  7. Physics, Mechanics, & Chemistry
  8. Fine Arts
  9. Useful Arts
  10. Agriculture & Economics
  11. Intelligence

Each article is roughly two to three pages, and each issue contains an “intelligence” section, which seems to be general news.  This section continues into the twentieth century, when the journal was more focused on geology, but the intelligence section still discusses important findings in physics, chemistry, and other scientific areas.

The journal was founded by Benjamin Silliman and later edited by his son.  There is a good overview of the foundation of the journal, and of course multiple references to it, but so far I have not been able to find any articles using a computational approach to analyzing its contents.  In particular, I think it would be a great candidate for the topic modelling and query sampling techniques I have used earlier.  This journal may even be a good candidate for a network analysis, something I haven’t done much of in the past (I intended to do so for the Journal of the American Chemical Society), since it would contain a large number of the scientists in the United States and could potentially show the network as it was beginning to split into different disciplines.  Fortunately, there are also over 100 years of textual data available for this journal in the public domain, making it a potentially very rich source.  I am going to see if some initial tests may get some interesting results, and I’m looking forward to seeing whether this journal helps understand the professionalization of science and the origins of the scholarly communication system in even more interesting ways than the Journal of the American Chemical Society has done so far.

Do Journals Help Understand Professionalization?

I have been working on trying to make sense of my data about the history of the Journal of the American Chemical Society, and have done some rough categorization of “expected” topics (meaning topics that my historiography on the American Chemical Society specifically mentions) and “unexpected” topics (which are not specifically mentioned in the historiography).  Here is a graph of the expected and unexpected topics year by year.

[Graph: expected vs. unexpected topics, by year]

Roughly 20% of the time (sometimes less, and occasionally more), the topics in the journal are discussing the kinds of issues the historiography describes.  Thus, one might conclude that the historiography does a fairly good job of understanding the history of the field.  What is interesting, however, is how widely divergent the unexpected topics are.  There is tremendous variation, which opens up another question: might there be some external influence that is causing this variation?  The historiography of the society also divides its timelines of the journal by editor.  Therefore, I decided to see what the expected vs. unexpected topics looked like if you viewed them by editorial years (note that there is some overlap between editors).

[Graph: expected vs. unexpected topics, by editor]

Here it appears that the number of unexpected topics nearly doubles during later years.  This might indicate that the journal is indeed influenced by particular editorial policies.  If one breaks out the unexpected topics from these later years, the division looks something like this.

[Graph: breakdown of unexpected topics in the later years]

Aside from the appearance of one article on chemistry education, the division of unexpected topics seems to be primarily society business (e.g. who is being elected president, who are presiding officers, or where the annual meeting should be held), and methodology (e.g. what is chemistry, what kinds of experimental procedures are acceptable).  Generally speaking, when looking at the influence of the editorial board, it seems that methodology (somewhat surprisingly) takes a leading role.

I want to return to my original question, however: Do Journals Help Understand Professionalization? (at least in this particular case of one professional society).  On the one hand, I am inclined to say yes.  We actually see how the writers for the journal are deciding on what chemistry is, and though one might think that this would be a more important issue at the start of a journal’s lifetime, the major debates seem to be happening thirty to forty years after the journal’s foundation.

On the other hand, I am also inclined to question this data.  Though I am confident that I have broken down the topics that are not dealing with chemical experiments and reasonably confident that I have been able to separate out society business and methodology topics, I am somewhat skeptical of what this really shows.  I used Mallet to create topic models for years in a journal.  In essence this is showing me not article by article what is being discussed, but rather topics, or general ideas being discussed in the journal.  Thus, my topic models are a somewhat generalized overview of the journal data itself.  Furthermore, I have abstracted out that data into even higher topics of expected and unexpected.  Finally, when I divided by editors, certain trends seemed to become even greater (like the doubling of unexpected topics).

What does all of this mean?  I am trying to determine how a professional society defined itself by using its means of communication, the journal.  There could be multiple ways of doing this, and I experimented with topic modelling.  I am very happy that I was able to find some interesting trends, but I wonder how much the generalization of particular articles (and my interpretations on top of that) may be skewing things even more.  As historians think about using big data for interpretation, I think this question becomes even more important.  When we choose a method in which a computer models what we’re seeing, and we’re not doing the traditional close readings that historians do, does that methodology then skew our results?  Furthermore, if we then apply our findings to policy or other practical ends, what are the implications of that?  I don’t pretend to have the answers to these questions in a blog post, but I think we need to be asking these questions, and also thinking about ways of discussing how our data were manipulated in order to make it clear to our colleagues critiquing our work and also to our readers, who may not understand the particularities of dealing with statistical topic modeling or working with historical data sets.

Computational vs. Traditional Methods

After doing some further work on my topic models for the Journal of the American Chemical Society I began to think a bit about the methods we use for doing history and how computational analyses fit into this.  I know I’m not the first to think about these issues, and, in fact, I have already thought about them briefly just in this project.  Nonetheless, at the risk of excessive navel gazing, what I found interesting is that even with extensive topic models for every year, I still had to go topic by topic and do some organization of the topics according to my interpretation of a history of the American Chemical Society in a long, cumbersome, and manual process.  Thus what started out as a somewhat quantitative project based on numbers of words and putting them into topics became an extremely qualitative and very subjective analysis.  Once I have had a chance to clean up my spreadsheets, I can share them on this blog.  Suffice it to say, however, that in the end I still had to resort to the same kinds of methods that historians and social scientists use when analyzing data, that is, categorizing things in a way that makes sense to an individual scholar.  When I was thinking about how to analyze this data, I thought about all kinds of ways of trying to come up with statistical methods or writing scripts that could do the tasks I wanted to do more objectively.  In the end, though, I could not think of a way of doing those things that ultimately would get at the question I wanted to answer.

My goal with this project has changed over time.  Originally I wanted to determine if there was a network between editors and authors.  There was no meaningful one that I could find.  My goal now is to determine what the topics I have mean, and whether they reflect conventional historiography about the history of the American Chemical Society.  I couldn’t topic model a history book in any meaningful way I could think of, and even if I did, trying to measure those topics against the journal topics would be like comparing apples to oranges.  Therefore, I decided to read the history book (or at least the bits relevant to the journal), create my own topic categories, and then manually assign each of the topics Mallet so kindly found for me to a category.
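Once topics have been assigned to categories by hand, the tallying itself is mechanical.  The following is a minimal Python sketch of that step, assuming the manual assignments are stored as (year, category) pairs.  The function names, the editor labels, the year spans, and the sample assignments are all hypothetical and are not taken from my spreadsheets.

```python
from collections import defaultdict

def category_shares(assignments):
    """Given (group, category) pairs, return {group: {category: share of topics}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for group, category in assignments:
        counts[group][category] += 1
    shares = {}
    for group, by_category in counts.items():
        total = sum(by_category.values())
        shares[group] = {cat: n / total for cat, n in by_category.items()}
    return shares

def shares_by_span(assignments, spans):
    """Pool yearly assignments into named year spans (inclusive; spans may overlap)."""
    pooled = []
    for name, (first, last) in spans.items():
        pooled.extend((name, cat) for year, cat in assignments if first <= year <= last)
    return category_shares(pooled)

# Invented manual assignments: one (year, category) pair per topic.
manual = [
    (1890, "expected"), (1890, "expected"), (1890, "unexpected"), (1890, "expected"),
    (1900, "unexpected"), (1900, "expected"), (1900, "unexpected"), (1900, "expected"),
]
yearly = category_shares(manual)
editors = shares_by_span(manual, {"Editor A": (1885, 1895), "Editor B": (1895, 1905)})
```

The interpretive work stays in the hand-built list of assignments; the code only counts what the scholar has already decided.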

Admittedly doing this work over such a large corpus would have been impossible without computational methods.  In the long run, though, I still resorted to the old fashioned way of making sense out of this information.  Do all computational methods in the end come down to human sense-making?  Perhaps they do, but as we think about how quantitative and qualitative methods interact, this seems to me an interesting example.  Mining a corpus of textual articles is certainly quantitative.  In the end, however, it took qualitative analysis to really attempt to understand what was going on.

Finding Patterns in Scientific Journals

The last topic models I showed for the Journal of the American Chemical Society showed topics across the entire corpus I have (all issues between 1879 and 1922). Now, I have been working on seeing if there are any patterns in the topics from year to year.  Since I ran a 20 topic model using Mallet, the list is quite long, so I created another page for those who want to look at the original data.  For now, I’ll just summarize what I think is happening.  First a few general points.

  • The word molecule first appears within a topic in 1883 and in 1891 it appears in three different topics.  It continues to appear throughout the corpus but not regularly.
  • The word atom first appears in a topic for 1880 and seems to appear more regularly than molecule.
  • The word patent first appears in 1884 but then does not show up all that frequently (only 6 times within the topic models and only until 1892).
  • Many years, though not all, also have topics that seem to pertain to the business of the society with words like journal, meeting, or city names.  This was also one of the topics in the overall model, but it is interesting to see how the topic seems to be more dominant in some years than in others.
  • The word method shows up in the topics practically every year and seems to appear more frequently in the earlier years of the journal.
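Observations like these can be checked mechanically once the yearly topics are in a structured form.  Here is a minimal Python sketch, assuming each year maps to a list of topics and each topic is a list of words.  The function names are hypothetical, and the tiny data set is invented so that it mirrors the observations about molecule and atom above; it is not the real model output.

```python
def first_appearance(topics_by_year, word):
    """Return the earliest year in which `word` occurs in any topic, or None."""
    years = [year for year, topics in topics_by_year.items()
             if any(word in topic for topic in topics)]
    return min(years) if years else None

def topic_hits(topics_by_year, word):
    """Return {year: number of topics containing `word`} for years with a hit."""
    hits = {}
    for year, topics in topics_by_year.items():
        n = sum(1 for topic in topics if word in topic)
        if n:
            hits[year] = n
    return hits

# Invented yearly topics (each topic is a list of its top words).
toy_topics = {
    1880: [["atom", "acid"], ["water", "salt"]],
    1883: [["molecule", "atom"], ["iron", "ore"]],
    1891: [["molecule"], ["molecule", "gas"], ["molecule", "acid"]],
}
first_molecule = first_appearance(toy_topics, "molecule")
molecule_hits = topic_hits(toy_topics, "molecule")
first_patent = first_appearance(toy_topics, "patent")
```

Because each topic is a list, the membership test matches whole words only, so a search for atom would not be satisfied by atomic.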

These are just some general observations, admittedly from someone who is not trained as a chemist.  There may be other interesting issues that might be clearer to a trained eye.  For my next steps on this project I intend to look at two sources on the history of chemistry:

There may be other sources, but I think I can at least try to show a proof of concept on these two. Hopefully, there is some way to measure what the topic models are showing against what these more general histories say is happening in the history of the society and in chemistry more generally.

Analyzing Some Preliminary Topic Modelling

I’ve started some work on topic modelling, using at the moment two programs: the InPho Topic Explorer (http://inphodata.cogs.indiana.edu/) from Indiana University’s Cognitive Science Program and Mallet (http://mallet.cs.umass.edu/index.php).  I’m also open to other suggestions for potentially trying out different methods, but this gets me started.  Right now, I’m still using some sample files, but look forward to trying these out on the entire corpus of the Journal of the American Chemical Society once I have the data.

From InPho, the topics I get are

Topic 0 per, found, precipitate, sulphuric, weight, substance, soluble, liquid, thus, total, nitric, hydrochloric, sulphate, value, gas, described, second, grams, oxygen, follows
Topic 1 per, new, chemical, iron, copper, ore, action, author, cent, review, germany, oil, see, assignor, research, richards, york, parts, zinc, analyses
Topic 2 gram, cent, per, sulphur, gave, method, methods, ten, hydrochloric, determinations, steel, tungsten, proteid, ammonium, cement, weighed, matter, oil, sulphide, aluminum
Topic 3 calc, sulfate, yield, secretary, sec, since, ave, held, electrons, prepared, system, boiling, atoms, product, curve, series, benzene, sulfuric, conductivity, lewis
Topic 4 sugar, action, etc, author, c.c, upon, heated, acids, oil, chem, grms, abstracts, obtained, liquid, air, oxygen, gas, ozone, lime, soda
Topic 5 new, form, table, per, weight, meeting, cent, slightly, der, book, tube, hydrogen, conductivity, west, sulphate, sec, sulphuric, estimation, oxygen, magnesium
Topic 6 section, city, ave, american, mass, chicago, william, water, charles, sodium, john, report, society, chemistry, washington, steel, philadelphia, members, sulphur, university
Topic 7 hydrogen, section, equation, concentration, ion, measurements, mercury, therefore, university, content, ann, curves, increase, theory, concentrations, points, derivatives, velocity, ions, room
Topic 8 cent, per, found, made, first, grams, soluble, sulphate, added, precipitate, small, containing, weight, much, sulphuric, gas, ether, would, form, substance
Topic 9 found, liquid, substance, ether, soluble, hydrochloric, thus, precipitate, gas, described, alcohol, value, second, alkali, glass, oxygen, acid, ethyl, material, specific
Topic 10 cent, much, hydrogen, grams, containing, ether, experiment, dried, fact, even, upon, form, place, separated, treated, various, following, soil, shall, magnesium
Topic 11 made, first, would, small, added, shown, pure, could, must, use, well, due, part, dioxide, order, table, gives, organic, precipitated, concentrated
Topic 12 acid, solution, water, one, results, method, may, two, chloride, used, potassium, alcohol, also, sodium, amount, temperature, time, present, experiments, salt
Topic 13 obtained, given, upon, nitrogen, following, mixture, chemical, crystals, chemistry, copper, conditions, heating, preparation, dilute, oxide, dry, case, error, known, iron
Topic 14 form, value, chair, constant, sulfuric, fig, table, chloride, temperature, experimental, solid, mixture, sulfide, subs, values, heat, carbon, hydroxide, slightly, journal
Topic 15 weight, values, reaction, solutions, point, much, concentration, pressure, fact, even, containing, temperature, equilibrium, sodium, experiment, dried, acids, melting, case, iodide
Topic 16 per, acids, heated, calculated, laboratory, journal, hydroxide, society, true, book, contain, cause, values, received, satisfactory, showing, special, year, manganese, show
Topic 17 form, table, constant, sulfuric, two, solution, slightly, value, cell, grams, surface, gave, solid, meeting, negative, phase, experimental, fig, hydroxide, however
Topic 18 grams, cent, gram, book, value, per, der, calculated, general, values, physical, und, sulfate, constant, inorganic, normal, die, ion, hydrazine, salts
Topic 19 water, new, milk, meeting, chemical, process, one, iron, apparatus, york, carbon, society, mass, read, fat, analysis, lime, secretary, furnace, use

From Mallet I get

topicId words..
1 chemical chemistry society american journal dr book work committee general
2 solutions concentration solution salts salt ion solubility ions conductivity cell
3 acid ch acids nh ester ii methyl ethyl obtained acetic
4 sugar experiments time action effect reaction starch amount rate power
5 theory atoms number surface oxygen molecules form hydrogen energy case
6 color oil red obtained mixture liquid yellow white water small
7 compounds compound reaction action group bromine chloride carbon derivatives formed
8 cc solution acid water added precipitate hydrochloric dissolved filtered excess
9 cent nitrogen gram results grams total amount weight sample found
10 alcohol water ether soluble solution found salt crystals melting acid
11 chem milk fat oil extract protein oils ash composition soc
12 method results made methods determination error obtained standard determinations found
13 work present paper fact great part study made question view
14 chloride potassium sodium silver solution acid ammonia ammonium nitrate oxide
15 analysis determination water soil organic matter plant chemical analyses methods
16 temperature values table pressure point heat equation constant data vapor
17 weight io atomic oo ii lead arsenic separation series oxide
18 st section meeting city mass pa secretary ave university york
19 iron process der steel copper coal gas gold ore furnace
20 tube air gas apparatus glass water temperature platinum liquid mercury

As I said in a previous post, many of these results are not surprising.  There are, however, some topics that merit some further analysis.

In InPho some names come up in topics 1 and 3 (richards and lewis), which could signal some important people in the field; it would be worth seeing if they show up in other more focused contexts.  Also topic 6, which has the words “section, city, ave, american, mass, chicago, william, water, charles, sodium, john, report, society, chemistry, washington, steel, philadelphia, members, sulphur, university”, has very few words that have much to do with chemistry (aside from sodium, chemistry, and sulphur).  My guess is that these have to do with meetings or with business of the society.  Unhelpfully, the names in this topic seem to be first names, which might just indicate that there are lots of people named Charles and John.  I also wonder whether this might just be a catchall topic.  For instance the words American, Society, and Report would, I think, come up in almost any issue of the journal.

In Mallet, the topics that come up are actually quite different, though substantively similar I think.  The first topic is, I think, a catchall with words that probably show up in every journal article, like “American Chemical Society.”  Nonetheless, I also think that “committee” is an interesting one in that topic, probably reflecting the reports of various committees that show up in the sections of the journal that report the activities of the society.  Topic 13 is also interesting with “work present paper fact great part study made question view” that do not seem to be discussing chemistry directly but rather seem to be talking about the work of doing chemistry (like publishing).
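One rough way to flag topics like these without reading every line is to score each topic by the share of its words that look like society business.  Below is a minimal Python sketch of that idea.  The marker word list, the function names, and the 0.3 threshold are my own assumptions for illustration, and the parser expects the format printed above (a topic number followed by words); Mallet's raw topic keys file may contain additional columns.

```python
# Hypothetical marker words suggesting society business rather than chemistry.
SOCIETY_WORDS = {"society", "meeting", "committee", "secretary", "section",
                 "city", "members", "report", "journal"}

def parse_topic_line(line):
    """Split a line such as '18 st section meeting ...' into (topic id, word list)."""
    topic_id, *words = line.split()
    return int(topic_id), words

def society_share(words, markers=SOCIETY_WORDS):
    """Fraction of a topic's words that appear in the marker set."""
    return sum(w in markers for w in words) / len(words) if words else 0.0

# Three of the Mallet topics listed above.
lines = [
    "1 chemical chemistry society american journal dr book work committee general",
    "8 cc solution acid water added precipitate hydrochloric dissolved filtered excess",
    "18 st section meeting city mass pa secretary ave university york",
]
scores = {tid: society_share(words) for tid, words in map(parse_topic_line, lines)}
flagged = sorted(tid for tid, share in scores.items() if share >= 0.3)
```

A score like this only points to candidates for closer reading; deciding whether a topic is really about society business still takes a human judgment.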

In addition to thinking about the network of people and how they are institutionalizing the journal, I also think it might be interesting to look at these topics as a kind of window into the philosophy of chemistry.  In other words, what are chemists talking about, and more importantly, does this match up with what historians and philosophers of science say was going on in the nineteenth century?  It might be interesting to see if I can match up concepts from the Stanford Encyclopedia of Philosophy (http://plato.stanford.edu/entries/chemistry/) with the topic models I’m getting.  More on that in a later post.

In all, I think that these are some exciting preliminary results, and I look forward to doing some more in-depth topic modelling to see whether things change over time, see if I can understand more about the individual authors (and who is publishing on certain topics), and also see whether historians of chemistry are accurately understanding the topics of the day (at least in terms of what the flagship journal seems to think important).

Processing (Continued) and Moving Toward Topic Modelling

I’m still working on the issues of processing the corpus of text that I have, and will hopefully be able to finish that sometime next week, which moves me on to what will be my next step:  topic modelling the corpus of the Journal of the American Chemical Society between 1879 and 1922.

Based on some sample files, there is some good news.  The topics I seem to be getting mention acids, bases, chemical compounds, and the kinds of things I would expect to see in a topic model of a chemistry journal, and there are no extremely strange topics that I would not expect to see.  That, I think, tells me that the text will be good enough to move forward and do some good mining.

On a side note with my processing I have also been extracting all of the tables of contents from the journal.  Ideally this should be done automatically but I’ve been doing it manually so that I can put some editorial notes in various parts of my spreadsheet (which I will share when I’ve finished).  For now, the spreadsheet contains a list of all of the officers of the American Chemical Society separated out by year.  Surprising (at least to me) is the fact that there is not as much overlap as I would have expected.  Some officers do continue to serve year after year, but there is actually a fairly high turnover.  New officers seem to come in every year.  The spreadsheet also contains every author in the journal between those years, what articles they’ve published, whether I consider them “prolific” (i.e. published many articles), and if there is any information about them in Wikipedia.  If someone knows of a more comprehensive database, specifically for chemists, let me know; so far, I’m not seeing many of the early authors/officers listed in Wikipedia.  This spreadsheet, I hope, can serve as a guide while I’m processing and hopefully can tell me if I make any significant errors when I start dividing up articles and years in the larger corpus.
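The impression of high turnover could be put into numbers by comparing the officer lists of consecutive years.  Here is a minimal Python sketch using Jaccard overlap (officers shared by both years divided by officers appearing in either year).  The function name and the roster are hypothetical placeholders, not names from my spreadsheet.

```python
def yearly_overlap(officers_by_year):
    """Jaccard overlap of officer sets for each pair of consecutive recorded years."""
    years = sorted(officers_by_year)
    overlap = {}
    for prev, curr in zip(years, years[1:]):
        a, b = set(officers_by_year[prev]), set(officers_by_year[curr])
        union = a | b
        # 1.0 means identical rosters, 0.0 means complete turnover.
        overlap[(prev, curr)] = len(a & b) / len(union) if union else 0.0
    return overlap

# Placeholder roster; real names would come from the spreadsheet.
roster = {
    1890: ["Officer A", "Officer B", "Officer C"],
    1891: ["Officer A", "Officer D", "Officer E"],
    1892: ["Officer D", "Officer E", "Officer F"],
}
result = yearly_overlap(roster)
```

Low values year after year would support the observation that new officers come in every year, though inconsistent spellings of names in the source would need to be cleaned up first.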

All of this is a preface to try and get to the question I’m asking.  What is the network of scientists involved in the journal, and are the officers/editorial board influencing the content in any measurable way?  Originally, I had thought that a spreadsheet like the one I’m creating would help to answer this question.  I had thought that editors of the journal would be some of the most prolific authors, and I thought that there would be a significant continuity of officers over this time period.  I had not anticipated so many unique authors contributing to the journal, nor had I thought that the officers of the association would turn over as frequently.

There may still be a way to get at the question I’m asking, though.  I think that by topic modelling the corpus and seeing if particular authors are tied to particular topics, that may at least help to answer whether specific people have more influence over the journal’s content than others.  Also, I’m sure others have tried to tie Wikipedia information to networks like this.  Like I said, so far I’m not finding many scientists who have Wikipedia entries, though that may change as I move further into the twentieth century.  Perhaps even if I can find authors who have high influence over the corpus and a Wikipedia entry that may tell me something.

In any case, that’s where I am at the moment, and if there are thoughts about what might be useful to do (before I move into heavy duty processing of lots of files), let me know.