Managing Big Data – Again

I was reading the recent Distillations magazine from the Chemical Heritage Foundation and saw an article on Information Overload.  It reminded me of the post I wrote a while ago on big data in the 19th century, along with multiple posts about the American Chemical Society and Libraries in the 19th century.  Sarah Everts, the author of the information overload article, rightly points out that having to manage vast amounts of data is not necessarily a new problem, as multiple other authors have pointed out.  She concludes by asking “how should we collect this metadata intelligently and in useful moderation when we don’t even know what research questions will be interesting to future generations of scientists?” and suggests that “modern data curators may wish to learn from the classical collectors: natural-history museums.”  She also discusses the importance of metadata in order to facilitate such management.

I wholeheartedly agree with all of Everts’ conclusions, but think that it is also important to look at two other organizations that are particularly relevant to scholarly communication: libraries and scholarly societies.  Both of these groups are also essential to managing information overload, and, I think, form a mutual dependency (similar in some ways to the mutual dependencies created by academic journals).  Additionally, I think that there is a social dimension to both libraries and scholarly societies (as well as to natural-history museums) that underlies much of what Everts is discussing.  In the case of libraries and scholars, there is also a kind of divide between the two groups that provides an interesting twist on Everts’ argument.

So far in my own work I have been focusing largely on the history of “big data” in the nineteenth century, particularly as it relates to the American Chemical Society.  Other historians of science have looked more broadly at such issues, however.  For example, Alex Csiszar has argued that “The key point was not the increasing volume of papers coming into print” which is usually the argument one hears in modern discussions of information overload.  Rather, according to Csiszar, scientists in the nineteenth century attempted to replicate social organizations that were “safeguarding scientific value that had once been the putative territory of the societies and academies.”  I have found similar patterns in my work.  Certainly J. Lawrence Smith of the American Association for the Advancement of Science, and later the American Chemical Society, argued that research should be “pure” and free from interference of the outside forces Csiszar discusses.

What does this have to do with libraries?  During the nineteenth century, libraries were also transitioning.  My somewhat ancillary study of Theophilus Wylie, the first librarian at Indiana University, demonstrates this fairly well.  Wylie argued for a library that reflected the educational curriculum of the university, and also represented a tradition in which academics, not professional librarians, managed collections.  Universities, however, were changing to meet the needs for professional education.  Libraries changed with universities, and increasingly focused on becoming complete collections of all published work.  Thus, there was a tension between the two organizations.  On the one hand, scholarly societies were struggling to maintain a social order that differentiated “pure” research from the vast amount of unscientific periodical literature available.  Libraries, on the other hand, tried to collect everything and provide tools for their patrons to navigate this sea of information.

Therefore, at least in the late nineteenth century, there were two ways of creating order out of the chaos brought on by information overload.  First, there was the scholarly method of using social organization (and eventually peer review and the other mechanisms that came with it).  Second, there was a set of methods in libraries that relied on specialists and classification systems to help library users navigate the explosion of information available to them.  Csiszar hints at an important aspect dividing these two communities: authority.  Libraries and scholars derive their authority from different sources and from different philosophical viewpoints.  The question is, given the current explosion in “big data” and the correct assertion quoted by Everts that “Producing and saving a huge amount of data that nobody will reuse has doubtful value,” whether it is even possible to solve this crisis of authority for the problem of big data.

There may be an answer within the discipline of information science.  Archival studies has a sub-discipline called diplomatics that endeavors to understand the authority of a particular document within a particular historical context.  Modern scholars in diplomatics have developed a concept they call “organic information,” which treats all information (print and electronic) as a kind of living organism whose meaning and authority depend on social context.  Philosophers of science have also noticed the link between information and living organisms.  Natural-history museums of the type that Everts discusses provide an interesting analogy to this concept of organic information since they, quite literally, collect examples of living organisms.  Therefore, in a way, Everts’ article has uncovered an interesting link that needs to be further explored.

The last sentence of Everts’ article on information overload says, “with its overabundance of information, managers and creators of big data may find their inspiration in the most analog of collections.”  I agree, but think there are some interesting twists on that line of argument.  In the case of nineteenth-century academic information, a divide grew between libraries and scholarly societies, both of which were attempting to manage the first explosion of “big data.”  This division between the groups arguably still exists today, and may contribute in part to the problems of scholarly communication.  The way to resolve this division, however, goes beyond just the provision of good metadata in the ways Everts suggests.  Rather, it may have to rely on the creation of a new method for deriving authority over information that is continually in flux.  Diplomatics may provide one framework to help reconcile this division between libraries, scholars, and many other groups.  There is one clear lesson from history in this case, however.  Given the vast quantities of data that continue to be produced, an explosion that will only grow over time, this is a problem that we, both as a society and as an enterprise for higher education, cannot afford to get wrong the second time around.



Big Data – A 19th Century Problem

Recently, I was reading an article entitled “Big data problems we face today can be traced to the social ordering practices of the 19th century.”  It led me to think a bit about this history of scholarly communication project, which I think is closely related to the larger issues the authors are discussing.  My first reaction, at least from a historian’s point of view, is that the “Big Data” conversation is not the second time this has happened but (at least) the third.  The first arguably would be what Ann Blair has discussed in Too Much to Know, which shows how the large amount of information produced by the explosion of print also led to new ways of thinking about information management.  Additionally, Peter Burke’s two-volume Social History of Knowledge traces some of the same trends over an even longer period of time.  All of that said, what I think is unique in the link Robertson and Travaglia make here is the connection between the explosion of data and its political implications.  For the first time, managers felt the need to tie the information society collected to things like performance, productivity, and other metrics, particularly through statistical methods of analysis.  This practice of measuring people by statistically sampling data certainly continues today, and I would agree that in some ways it almost seems like an extension of these earlier trends.

I wanted to comment specifically, however, on the implications for scholarly communication of the larger issues the article discusses, some of which the authors actually mention briefly by stating “In some ways growing academic specialisations created a situation in which what was gained through a narrowing of focus and growth in sub-disciplinary activity was also lost in generalisability. This distinctly Victorian problem endures to the present day despite interdisciplinary projects of various kinds.”  They then go on to suggest that nineteenth-century ideologies (including some that are distinctly contrary to modern notions of equality) have continued into the analyses of present-day big data issues, and that those underlying ideologies need to be changed.

One of the ideologies not specifically mentioned, but I think very relevant, is a “Whiggish” view of historical progress.  To put that another way, in the early nineteenth and even into the twentieth century, there was a view that the world would continually progress into something new and better.  One of the other strands of historical argument that has played into this Whiggish notion of progress is a belief in technological determinism, which posits that technological change drives such progress.  Though such ideas are mostly anathema now, I think the discussions we currently have about the university and its purpose might be tied to these Whiggish views.  For instance, arguments that we should eliminate humanities departments and increase STEM education seem, at least to me, to fit into these notions of Whiggish and technologically deterministic history.

What does this have to do with scholarly communication?  As Robertson and Travaglia suggest, disciplinarity and the ways universities developed in the nineteenth and twentieth centuries arguably play into these very notions of creating a method for continuing progress.  The fields of history, history of science, and other disciplines have moved on to other philosophies of interpreting the past, and may even be using big data to show why technological determinism and Whig history are wrong.  How do we bring this discourse into the conversation, especially since policy makers even now may be using discredited Whig notions to decide the future of university education and the production of knowledge?