The Unseen Scholars
Herbert Van de Sompel and Johan Bollen discuss researching information in the digital age
Collecting the Data
Johan: We've collected perhaps the largest existing set of usage data in the world: over a billion "clicks" gathered over the years from some of the world's most significant publishers and aggregators and from a set of institutional consortia that includes the University of California, California State University, the University of Texas, and many others. It's an enormous dataset that we believe covers a good chunk of the online activity pertaining to research in science and the humanities, including medicine.
1663: Did you have to twist arms to get institutions to part with their usage data?
Johan: I often joke that all of my gray hair has been acquired in the past year from begging these people for their usage data. Really, most were eager to collaborate with us, in part because of the reputation that our team and the Research Library have in the community. They also know the data have value; they just don't know how to exploit them yet. I tell them right off that usage data can be used to assess value because they reveal immediately how many people are reading which papers.
That information could be used, for example, to price the journals or to reward the authors. And the value assessment would be statistically more accurate than a citation-based value because a poorly cited paper may nonetheless be read thousands of times.
But as Herbert said, we can also look at relationships between papers, or between journals, and define, say, a "bridge value" metric that quantifies to what extent a paper connects normally disparate groups. We've come up with dozens of metrics that can be used to measure value and to improve our understanding of science.
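To make the two ideas above concrete, here is a minimal sketch in Python. It is not the team's actual method: the interview does not define "bridge value," so the sketch substitutes betweenness centrality (computed with Brandes' algorithm) on a co-usage graph as one plausible stand-in. The click sessions, the paper identifiers, and the rule that links two papers viewed back to back are all assumptions made for illustration.

```python
from collections import Counter, defaultdict, deque

# Hypothetical click log: each inner list is one reader's session, in
# viewing order. All paper identifiers are made up for this sketch.
sessions = [
    ["bio1", "bio2", "stats1"],
    ["bio2", "bio1"],
    ["stats1", "phys1", "phys2"],
    ["phys2", "phys1"],
]

def read_counts(sessions):
    """Count how many times each paper was viewed."""
    return Counter(paper for session in sessions for paper in session)

def co_usage_graph(sessions):
    """Assumed linking rule: connect papers viewed back to back."""
    graph = defaultdict(set)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def betweenness(graph):
    """Betweenness centrality (Brandes) for an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)   # number of shortest paths
        sigma[source] = 1
        dist = dict.fromkeys(graph, -1)
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

counts = read_counts(sessions)
bridge = betweenness(co_usage_graph(sessions))
for paper in sorted(bridge, key=bridge.get, reverse=True):
    print(paper, "reads:", counts[paper], "bridge:", bridge[paper])
```

In this toy log every paper is read exactly twice, so a raw read count cannot tell them apart, while the stand-in bridge score singles out stats1 as the only paper connecting the biology readers to the physics readers.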
1663: Wow! You may change the entire notion of what constitutes a good research institution or who should get tenure.
Herbert: That's a general theme of the Prototyping Team's work: use the new capabilities of the digital era to improve scientific communication. Another example is the Object Reuse and Exchange (ORE) project, which we have worked on for the past two years. Its starting point was the consideration that in so-called eScience, a publication is not just a paper, but rather the aggregation of a paper, a dataset, maybe a video recording of a computer simulation, some software, etc. All these resources sit on different Web servers, but they form a logical whole: a digital-era scientific publication. So, somehow we must be able to express that these distributed resources belong together. We need to glue them together.
The Web gives us a fantastic mechanism, the URI, to talk about each of those resources individually by means of its Web address. It does not, however, give us a way to talk about an aggregation of resources. I have worked with my team and with colleagues around the world to give the Web the ability to handle such aggregations. The resulting solution is based on the principles of the Semantic Web (the Web for machines), and the specifications were recently published. The Mellon Foundation, the National Science Foundation, and Microsoft funded this project. There are already groups in the United States, Europe, and Australia implementing these new specifications, and the library is also developing compliant tools. Pretty cool.
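As a rough illustration of what "gluing resources together" can look like, the Python sketch below writes a handful of RDF statements in N-Triples form using terms from the published ORE vocabulary (ResourceMap, Aggregation, describes, aggregates). It is a simplified sketch, not a complete Resource Map: it leaves out the descriptive metadata a full one would carry, and every example.org address and both function names are hypothetical.

```python
# Namespace of the ORE vocabulary and the standard RDF "type" property.
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def resource_map_triples(rem_uri, aggregation_uri, parts):
    """Return (subject, predicate, object) URI triples stating that the
    resource map describes an aggregation made up of the given parts."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    triples += [(aggregation_uri, ORE + "aggregates", part) for part in parts]
    return triples

def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# Hypothetical parts of one digital-era publication, each on its own server.
parts = [
    "http://journal.example.org/paper.pdf",
    "http://data.example.org/dataset.csv",
    "http://video.example.org/simulation.mp4",
    "http://code.example.org/software.tar.gz",
]

print(to_ntriples(resource_map_triples(
    "http://example.org/publication/rem",
    "http://example.org/publication/aggregation",
    parts,
)))
```

The aggregation gets a URI of its own, so a machine can refer to the publication as a whole just as it can refer to any single paper or dataset.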
1663: Scientific communication will never be the same.
Herbert: Not if we have it our way.


