Processing (Continued) and Moving Toward Topic Modelling

I’m still working on the issues of processing the corpus of text that I have, and will hopefully be able to finish that sometime next week, which moves me on to what will be my next step:  topic modelling the corpus of the Journal of the American Chemical Society between 1879 and 1922.

Based on some sample files, there is some good news.  The topics I seem to be getting mention acids, bases, chemical compounds, and the kinds of things I would expect to see in a topic model of a chemistry journal, and there are no extremely strange topics that I would not expect to see.  That, I think tells me that the text will be good enough to move forward and do some good mining.

On a side note with my processing I have also been extracting all of the tables of contents from the journal.  Ideally this should be done automatically but I’ve been doing it manually so that I can put some editorial notes in various parts of my spreadsheet (which I will share when I’ve finished).  For now, the spreadsheet contains a list of all of the officers of the American Chemical Society separated out by year.  Surprising (at least to me) is the fact that there is not as much overlap as I would have expected.  Some officers do continue to serve year after year, but there is actually a fairly high turnover.  New officers seem to come in every year.  The spreadsheet also contains every author in the journal between those years, what articles they’ve published, whether I consider them “prolific” (i.e. published many articles), and if there is any information about them in Wikipedia.  If someone knows of a more comprehensive database, specifically for chemists, let me know; so far, I’m not seeing many of the early authors/officers listed in Wikipedia.  This spreadsheet, I hope, can serve as a guide while I’m processing and hopefully can tell me if I make any significant errors when I start dividing up articles and years in the larger corpus.

All of this is a preface to try and get to the question I’m asking.  What is the network of scientists involved in the journal, and are the officers/editorial board influencing the content in any measurable way?  Originally, I had thought that a spreadsheet like the one I’m creating would help to answer this question.  I had thought that editors of the journal would be some of the most prolific authors, and I thought that there would be a significant continuity of officers over this time period.  I had not anticipated so many unique authors contributing to the journal, nor had I thought that the officers of the association would turn over as frequently.

There may still be a way to get at the question I’m asking, though.  I think that by topic modelling the corpus and seeing if particular authors are tied to particular topics, that may at least help to answer whether specific people have more influence over the journal’s content than others.  Also, I’m sure others have tried to tie Wikipedia information to networks like this.  Like I said, so far I’m not finding many scientists who have Wikipedia entries, though that may change as I move further into the twentieth century.  Perhaps even if I can find authors who have high influence over the corpus and a Wikipedia entry that may tell me something.

In any case, that’ s where I am at the moment, and if there are thoughts about what might be useful to do (before I move into heavy duty processing of lots of files), let me know.


One thought on “Processing (Continued) and Moving Toward Topic Modelling

  1. In terms of approach, there are a few things that might work, some of which we’ve talked about. In particular, an article-based topic model that is then used to create a network of similar topics could go hand in hand with an automatically generated NER network of those same articles. Yoking them together shores up the value of one or the other. The additional value of an NER network is that recognized named entities like “Bunsen Burner” will stand out more in the network-analysis phase, which in turn will support the development of a specialized training corpus for projects that look at a broader set of chemistry journals.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s