Managing Big Data – Again

I was reading the recent Distillations magazine from the Chemical Heritage Foundation and saw an article on Information Overload.  It reminded me of the post I wrote a while ago on big data in the 19th century, along with multiple posts about the American Chemical Society and Libraries in the 19th century.  Sarah Everts, the author of the information overload article, rightly points out that having to manage vast amounts of data is not necessarily a new problem, as multiple other authors have pointed out.  She concludes by asking “how should we collect this metadata intelligently and in useful moderation when we don’t even know what research questions will be interesting to future generations of scientists?” and suggests that “modern data curators may wish to learn from the classical collectors: natural-history museums.”  She also discusses the importance of metadata in order to facilitate such management.

I wholeheartedly agree with all of Everts’ conclusions, but think that it is also important to look at two other organizations that are particularly relevant to scholarly communication: libraries and scholarly societies.  Both of these groups are also essential to managing information overload, and, I think, form a mutual dependency (similar in some ways to the mutual dependencies created by academic journals).  Additionally, I think that there is a social dimension to both libraries and scholarly societies (as well as to natural-history museums) that underlies much of what Everts is discussing.  In the case of libraries and scholars, there is also a kind of divide between the two groups that provides an interesting twist on Everts’ argument.

So far in my own work I have been focusing largely on the history of “big data” in the nineteenth century, particularly as it relates to the American Chemical Society.  Other historians of science have looked more broadly at such issues, however.  For example, Alex Csiszar has argued that “The key point was not the increasing volume of papers coming into print,” even though the growing volume of publications is usually the argument one hears in modern discussions of information overload.  Rather, according to Csiszar, scientists in the nineteenth century attempted to replicate social organizations that were “safeguarding scientific value that had once been the putative territory of the societies and academies.”  I have found similar patterns in my work.  Certainly J. Lawrence Smith of the American Association for the Advancement of Science, and later the American Chemical Society, argued that research should be “pure” and free from the interference of the outside forces Csiszar discusses.

What does this have to do with libraries?  During the nineteenth century, libraries were also transitioning.  My somewhat ancillary study of Theophilus Wylie, the first librarian at Indiana University, demonstrates this fairly well.  Wylie argued for a library that reflected the educational curriculum of the university, and also represented a tradition in which academics, not professional librarians, managed collections.  Universities, however, were changing to meet the need for professional education.  Libraries changed with universities, and increasingly focused on becoming complete collections of all published work.  Thus, there was a tension between the two organizations.  On the one hand, scholarly societies were struggling to maintain a social order that differentiated “pure” research from the vast amount of unscientific periodical literature available.  Libraries, on the other hand, tried to collect everything and provide tools for their patrons to navigate this sea of information.

Therefore, at least in the late nineteenth century, there were two ways of creating order out of the chaos brought on by information overload.  First, there was the scholarly method of using social organization (and eventually peer review and the other mechanisms that came with it).  Second, there was a set of methods in libraries that relied on specialists and classification systems to help library users navigate the explosion of information available to them.  Csiszar hints at an important aspect dividing these two communities: authority.  Libraries and scholars derive their authority from different sources and from different philosophical viewpoints.  The question is, given the current explosion in “big data” and the correct assertion quoted by Everts that “Producing and saving a huge amount of data that nobody will reuse has doubtful value,” whether it is even possible to solve this crisis of authority for the problem of big data.

There may be an answer that is found within the discipline of information science.  Archival studies has a sub-discipline called diplomatics that endeavors to understand the authority of a particular document within a particular historical context.  Modern scholars in diplomatics have developed a concept they call “organic information,” which treats all information (print and electronic) as a kind of living organism whose meaning and authority depend on social context.  Philosophers of science have also noticed the link between information and living organisms.  Natural history museums of the type that Everts discusses provide an interesting analogy to this concept of organic information since they, quite literally, collect examples of living organisms.  Therefore, in a way, Everts’ article has uncovered an interesting link that needs to be further explored.

The last sentence of Everts’ article on information overload says, “with its overabundance of information, managers and creators of big data may find their inspiration in the most analog of collections.”  I agree, but think there are some interesting twists on that line of argument.  In the case of nineteenth-century academic information, a divide grew between libraries and scholarly societies that were attempting to manage the first explosion of “big data.”  This division between the groups arguably still exists today, and may contribute in part to the problems of scholarly communication.  The way to resolve this division, however, goes beyond just the provision of good metadata in the ways Everts suggests.  Rather, it may have to rely on the creation of a new method for deriving authority over information that is continually in flux.  Diplomatics may provide one framework to help reconcile this division between libraries, scholars, and many other groups.  There is one clear lesson from history in this case, however.  Given the vast quantities of data that continue to be produced, an explosion that will only grow over time, this is a problem that we, both as a society and as an enterprise for higher education, cannot afford to get wrong the second time around.



19th Century Information Use

I’ve finished gathering data on Theophilus Wylie’s personal library and his work as the librarian of Indiana University.  Overall, what I find interesting is a clear indication that Wylie had different ideas about what was important to his own work as a scholar and what was important for the library to maintain.

First, some visualization of his personal library.  It contains about 700 books, and thanks to the director of the Wylie House, I have a list of all of the books which are still held at the Wylie House Museum. I went through all of the titles and created some general categories to see what we might say were the most important subjects in the collection.

[Figure: subject categories in Wylie’s personal library]

Religious subjects are clearly favored, forming the largest category, with nearly even coverage of the Humanistic and Scientific disciplines (Science having a slight edge), followed by books about education and a few miscellaneous items (like cookbooks).  It seems that Wylie took his role as a Presbyterian minister quite seriously, and it is likely that many of the religious works helped him prepare his sermons.  Wylie also taught science and languages, with science being his primary subject in his later career.

There are some additional questions, though.  Did Wylie collect the same subjects for the Indiana University Library?  If not, how were they different?  Why?  There are a few ways to answer these questions.  Unfortunately no complete catalog of the library exists from Wylie’s tenure as librarian.  The library burned down twice between 1840 and 1880 and many of the records were lost.  There are, however, a few hints.

The first is a catalog that Wylie created of the library in 1842, shortly after he took over as librarian.  It likely does not show much of his collecting interest, but it does show what the subjects of the library were when he took over.  Fortunately there is a dissertation by Mildred Lowell on the history of the Indiana University Library which has already done some analysis on this topic.  Instead of re-categorizing the thousands of books held in the library, I mapped her work onto the categories I used for Wylie’s personal library, and this is what the subject categorization looks like.


Clearly there is quite a difference.  The Humanities are very dominant.  The “other” category contains mostly reference works (like dictionaries and encyclopedias of various kinds), and neither science nor religion is particularly well represented.  The question still remains, though, as to what influence Wylie himself may have had when he collected books for the library.

There are two lists of books Wylie procured for the library both through gift and donation, one of which is available digitally.  Though this sample, containing just over 100 books, is probably not representative, it is the best I could find to try and answer this question.  Here is the visualization of that sample.

[Figure: subject categories of the books Wylie procured for the library]

Again there are some interesting differences.  The stress on the humanities seems to be the same.  There is clearly more emphasis on scientific subjects, a slight increase in religious subjects, and somewhat less emphasis on “other” subjects.

In all, it seems like there are some clear differences between what Wylie felt was important for a university library to hold and what it was important for him to use personally.  I am still working through the Indiana University archives which house his papers.  Fortunately there are some existing reports on his activities as librarian and a lecture he gave on books and libraries.  Perhaps there are some hints there about his views on the difference between personal information use and the perceived information needs of the students and faculty of Indiana University.

Purpose of (19th Century) University Libraries

In doing some more research on Theophilus Wylie, librarian of Indiana University from 1840 to 1880 (among other positions like professor of chemistry, interim president, and Presbyterian minister), I ran across an interesting speech among his papers.  In the speech, entitled “On Books and Libraries” (delivered sometime in January of 1878), Wylie gave a brief history of what books and libraries are, but also gave some unique ideas of what he thought a college library should be.  Nowadays we think of libraries as a kind of center for scholarly communication in which we collect, preserve, and disseminate research.  Wylie felt differently.

First, he gives some idea of the importance of books in a scholar’s life by saying that “It is not the number of books that make the scholar.  We sometimes think we know what we have in our books.  This is a mistake.  We must make knowledge a piece of our minds.”  He then goes on to say that “Some books we must appreciate and digest, others consult.”  In other words, some books need to be investigated in depth, but others need only be browsed for facts or quick information.  He illustrates his point by giving an example of a dictionary (which one only uses to look up definitions of words), and says “A library is like a dictionary for consultation.”

Wylie was living in a time when universities were transitioning from institutions that primarily taught a classical liberal arts curriculum into the research universities we think of today.  In his time, the libraries were used primarily to help with the curriculum for teaching.  However, Wylie was also purchasing journals and periodicals of scientific literature which he may have used in his research.  Professional librarians (with master’s-level training and full-time jobs) would not exist at Indiana University for another thirty years, and it would be another ten years before Dewey established the School of Library Economy at Columbia.  So, in many ways, Wylie’s statements serve as a kind of scholar’s view of what libraries and librarians should be before full-scale professionalization of librarianship began.

Similar to Wylie’s time, we might argue that universities and libraries are going through another transition.  One might also consider that the kind of information overload we talk about today is simply an acceleration of a trend that began even before Wylie.  For scholarly communication and libraries today, Wylie’s advice, I think, still holds.  There are certain books that scholars need to “appreciate and digest, others consult.”  The question is how do we make those distinctions now and how can or should librarians become “consultants” in a meaningful way?