The Unseen Scholars
Herbert Van de Sompel and Johan Bollen discuss researching information in the digital age
Below the Computer Interface
1663: What do you mean by standards?
Herbert: Standard specifications. Most of my work applies to a level that's way below the computer interface that users see. Basically, I find ways for information systems to work with each other better, and I create specifications that describe how they can do that. For example, a specification might be a set of instructions that tells two servers how to exchange information. Once a specification is released as a standard, it can be adopted by information systems on the Web. There's nothing fancy about it. It's plumbing, like the pipes running beneath the house. People are never aware of the plumbing, but because it's there, they can build fancy bathrooms and kitchens.
Johan: Plumbing is Herbert's private joke. Many people are acutely aware of his work because it's had such an influence on the way the academic and research communities access and exchange information.
1663: Joke or not, plumbing's a great analogy. Can you give us a concrete example?
Herbert: There's the Protocol for Metadata Harvesting (PMH). Soon after the Web emerged, hundreds of scientific publishers around the world started making their journals and associated article metadata available online. That was a good thing. The bad thing was that one had to search each publisher's metadata separately. In order to overcome this problem, you wanted to collect the metadata into one large pool and search it there. But there was no uniform way to collect metadata from the publishers' information systems, and they used different metadata conventions.
Johan: It was crazy. You couldn't just tell a search engine to look for an author; you had to do multiple searches in several systems. But there were hundreds of publishers, and you couldn't cover all the bases.
Herbert: In 1999, Paul Ginsparg, who created the Los Alamos preprint archive, Rick Luce, then the director of the Research Library, and I founded the Open Archives Initiative (OAI). Its goal was (and still is) to improve the dissemination of scholarly information through technical means. Under the OAI umbrella, several colleagues and I began to develop a protocol, a set of commands that would tell one computer system how to present metadata in a standard way, no matter how it was stored internally, so another computer system could grab it. The protocol created an interface for metadata exchange between the two systems, and it became a standard.
Systems around the world now use PMH in a variety of ways. Our own OPPIE uses it to obtain its metadata from its underlying content archive. Via PMH, OPPIE checks whether new content is available and if so, grabs it (also via PMH) and adds it to the search engine. OPPIE harvests about 90,000 new records a week in this way.
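To make the harvesting loop Herbert describes more concrete, here is a minimal Python sketch of how a harvester might ask a PMH repository for records added since a given date and read back one page of the reply. It is an illustration only, not OPPIE's actual code: the base URL, the sample response, and the function names `list_records_url` and `parse_page` are hypothetical. The `ListRecords` verb, the `metadataPrefix`, `from`, and `resumptionToken` arguments, and the XML namespace come from the published OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since=None, metadata_prefix="oai_dc", token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When continuing a partial list, the resumption token is sent
    on its own, alongside the verb.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if since:
            params["from"] = since  # only records changed on or after this date
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [
        header.findtext(OAI_NS + "identifier")
        for header in root.iter(OAI_NS + "header")
    ]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# A made-up response page, standing in for what a repository might return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2008-01-02</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2008-01-03</datestamp>
    </header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "http://archive.example.org/oai"  # hypothetical repository address
    print(list_records_url(base, since="2008-01-01"))
    ids, token = parse_page(SAMPLE)
    print(ids, token)
    if token:
        print(list_records_url(base, token=token))
```

A real harvester would fetch each URL over HTTP, keep requesting pages until no resumption token comes back, and remember the date of its last run so the next request picks up only new content.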


