Digital Collections, Digital Libraries and the Digitization of Cultural Heritage Information

This paper is based on the transcript of a largely extemporaneous keynote address given at the Web-Wise 2002 Conference on March 20, 2002 at Johns Hopkins University. It has been edited, but it preserves the character of an informal talk rather than a formal paper. I have taken the opportunity to expand upon or clarify a few points, and have also added a few footnotes and pointers to additional information on some of the topics discussed. Parts of the question and answer segment that were captured as part of the transcript have also been included, though I've had the advantage of being able to reconsider some of my answers while the questioners have not had that opportunity; my apologies to them.

It's really a pleasure to be here and to open this conference.

I need to tell you that I've spent the last three days in a pair of meetings dealing with the digitization of cultural heritage information and related matters. The first meeting ran two days and took an international perspective on programs and issues related to large scale digitization projects. And yesterday my organization, the Coalition for Networked Information, held a workshop looking at some of the more specifically U.S. related issues in selecting, prioritizing, digitizing and making available cultural heritage content [1].

So, the questions that this conference focuses on, and that I want to address today, have been very much at the forefront of my thinking for the past few days, and I think much of the discussion that I have had the opportunity to participate in at those two meetings is quite relevant here. I've chosen to use this talk as a way of framing some of these issues, which are central to our development of digital collections. And I'm going to draw reasonably heavily on some of the ideas that have been in play in these meetings in my presentation this morning, so let me thank all of the participants in advance as they have been influential in shaping my thinking.

One of the things that is very striking to me is the growing and persistent demand for more and more digital content. You hear this from many quarters. In the U.S., interestingly enough, one place where it's shown up is in some rather ironic twists in the debate about broadband. I don't know how many of you have been following the intricacies of the proposed Tauzin-Dingell legislation to (depending on how you look at it) deregulate or sort of re-regulate broadband as a follow-on to the 1996 Telecommunications Act. But there have been some very, very interesting observations and allegations floated and let me just share a few of them with you.

There is a now widespread piece of conventional wisdom that says that broadband is not rolling out fast enough in the U.S. and this is a bad thing. This is a bad thing because broadband is intrinsically good for consumers and small businesses for a lot of reasons but it's also a bad thing because if we had more broadband deployment people would need to buy new computers, software, and network services and this would help to pull our information technology and telecommunications industries out of their current slump.

I saw it suggested in the editorial page of the Wall Street Journal a couple of weeks ago, that one of the reasons why the long haul fiber carriers - companies like Global Crossing, Williams Communications and the like - are not doing well right now, and many of them are filing for bankruptcy, is because demand didn't materialize fast enough to provide revenue to service their debt loads, and the reason demand didn't materialize fast enough is that broadband to the home - the on and off ramps to their long-haul backbone networks, and the driver of demand for backbone network capacity - didn't roll out fast enough. In other words, the long haul carriers were the victims of policies that inhibited the deployment of residential and small business broadband.

The other very interesting variation that we are hearing is that the problem isn't just barriers to deployment and lack of availability for broadband consumer services, but that one of the reasons why consumers aren't rushing to sign up for broadband service at enormous rates and beating on the broadband providers about availability is because there isn't enough compelling content and services there. In other words, it's a problem of demand rather than availability. There is some evidence to support this; even where broadband is available, and even considering factors like the observation that a home really needs a computer before it's likely to be interested in broadband connectivity, the "take" rates are modest. Though in fairness, nobody knows what a reasonable take rate would be at this stage of the evolution and rollout of the technology, and there are other deterring factors like price and the very well-known installation and other customer service problems associated with broadband offerings. Be this as it may, the folks in the so-called content industries, most notably the music industry (RIAA) and the film industry (MPAA), have seized upon this and used it as a justification for their own agenda. They've said, "Oh, yes, well, the reason that there isn't any compelling content and services is that only we can provide these, and we won't because we don't have good digital rights protection. If we imposed what we think is sufficient protection on everything we'd happily at least think about putting our music and our movies, which we're currently distributing on DVD and CD audio, out there and everything would be happy and people would flock to broadband."

What they mean by good digital rights protection is something like the Hollings bill [2], a piece of legislation so frightening and so appalling that it would take me all morning to do justice to the damage it would do. But that would take us far afield.

I have more than a little trouble with the content industry argument. But even more than having some trouble with that argument I find it disappointing because it implies a view of the broadband digital future that is so limiting, so deeply based in a picture of broadband services as a direct extension of the current consumption of entertainment products. Somehow I want to believe that broadband is going to be an environment that's going to bring us more than a new delivery channel for recycled movies and music which we can currently pick up at the store pretty conveniently. And I have to believe that among the things that are going to make the new broadband networked information environment compelling are cultural materials, learning materials, broadly speaking materials in the public interest. (I also believe that broadband is going to be an exciting new creative medium for authorship and for new interactive services such as massively multiplayer games involving the building of virtual worlds and virtual societies which have great potential as learning environments as well as for entertainment. The broadband digital world will be about much more than just getting the same old movies and music without getting off the couch. But this again is another story that will take us far afield.)

I think that one of the things that people don't remember enough any more is that when the Internet and shortly thereafter the Web really took off as consumer services in the mid to late 1990s that whole universe was already richly seeded with free content that had been supplied by universities, by museums, by cultural heritage institutions, by government agencies. Content in the public interest, as opposed to commercial entertainment content. There was a whole lot of really interesting content and some provocative, thought-provoking, engaging applications out there and, in fact, if you look particularly at what was going on in the early days of the consumer migration to the Net, before the whole dot com boom and the commercial gold rush that reshaped the Internet, that noncommercial material was very attractive to the general public as they began to explore the Net. Indeed a lot of material provided by the non-profit sector remains attractive and compelling today and remains an important reason why the public increasingly relies on the Net as an information source.

So, I believe that we can make the case for continued emphasis on this notion that we need to create lots of compelling content to create demand for broadband services, and the education, research and the cultural heritage sectors have a very important role to play here. You'll find this notion echoed in the recent U.S. National Research Council study, Broadband: Bringing Home the Bits [3], for example. I need to say that I was privileged to serve on the committee that wrote that report, which came out late last year, and I also need to say that my comments here about broadband and content providers are my own opinions and don't necessarily reflect the views of that committee.

I heard this idea about nonprofit-sector content as a potential driver for broadband demand reflected at the international meeting I was at earlier this week. It's a real consideration in the digitization strategies of other nations. And let me assure you that other nations are making very substantial investments in digitizing cultural content. For example, we heard from our colleagues in the U.K. that they're making very, very heavy investments now and interesting to me, they're following a tripartite investment strategy. One part is connectivity, which we've done in the U.S. through a whole array of different programs, things like the library E-rate program and the NSF new connections program to get high bandwidth connectivity to our research universities. The second thing they were investing in is content itself. I'll come back to that issue because that's where I'm going to spend considerable time, but let me just note now that there is no government (particularly federal government) funded program currently in place in the United States at a scale similar to what's happening in the U.K. focused on actually creating digital content, despite the good work that IMLS and to a lesser extent agencies like NEH and NSF are carrying out here. Of course, in the U.S. we also find a much more distributed funding pattern in the creation of digital cultural heritage content, with private foundations and state governments making important funding contributions as well.

The third area of investment within the U.K., and I found this really interesting, is in training. They are investing in people. Training librarians, outreach to train people to help the public learn to use and navigate and appraise material. As well as training in the creation of digital resources. So in the U.K. they really saw this as not a two-part activity the way that technologists often tend to parse this - get the connectivity in and get the content up - but really a three part program involving people, content, and connectivity. And I think there's an interesting lesson there that it behooves us to think pretty hard about. IMLS has also been investing in the human resources through community building activities like the Web-Wise meetings, and NSF has had a very explicit goal of building a community around its digital library initiatives through meetings and through funding vehicles like D-lib magazine, but the investment has been more in community building than in direct training - we are assuming that if we create communities, the participants in these communities will naturally learn from and teach each other.

But let me return to focus on this issue of content. I want to first note that we talk a lot about digital libraries. That's been certainly an important phrase in our lexicon since at least the mid 1990s. The strongest focus that has been established around digital libraries comes out of the work that the National Science Foundation (along with an array of collaborating agencies including DARPA, NASA, the National Library of Medicine and others) has been funding in the United States; there are also parallel international programs. It's worth noting that there's still not a clear consensus about exactly what constitutes a digital library within this community; to get a sense of these issues, there's a nice article that Chris Borgman of UCLA wrote a couple of years ago for Information Processing and Management [4]. We also have been talking - particularly within the communities heavily represented at this conference - about digital collections. Digital collections are not an idea that is yet really well established or internalized within the research-oriented digital library community that I just described.

One of the things that I think is starting to become clear is that digital collections and digital libraries aren't the same thing. This is a crucial observation. We need to understand the distinction between the two, and the relationships between digital collections and digital libraries. It helps to clarify our goals in creating digital collections and also helps with the difficulties about defining digital libraries. And, in fact, we've got some very complicated and subtle questions that I think demand close scrutiny as we try and understand the distinctions between digital collections and digital libraries, and in particular understand the role of digital collections and some of the issues around them. I want to explore some of these issues in the next part of my talk.

Let me start by just making a few observations on digital collections as collections.

We're getting pretty good at digitizing material at scale. We have a wealth of experience and a large number of successful projects (not to mention some highly educational failures) to build upon. With the exception of relatively esoteric materials in specialized formats or that have some really unusual characteristics, this is not really research any more. Or to put it another way, the research questions are less about how to do it at all and more about how to optimize - how to do it more efficiently or effectively, how to be sure that you've chosen the most appropriate strategies and technologies. We are training a large cadre of people qualified to plan, manage, and execute digitization projects through vehicles like the Schools for Scanning. Best practices are becoming well established - consider the work that IMLS has done in this area, or the Digital Library Federation, or the forthcoming Guide to Good Practice [5] in preparation by the National Initiative for a Networked Cultural Heritage (NINCH). Costs are becoming more predictable for these projects. There are commercial and non-commercial mass production operations that are becoming well established to support organizations that want to do large-scale digitization; one no longer has to do it in house as part of a research and development effort.

If you look at many of the projects that are going to be highlighted at this meeting, for example, you will see there are lots and lots of materials being digitized. Our museums, our libraries, our archives, our historical societies, are all running digitization programs. And we have efforts like JSTOR, AMICO, and the forthcoming ARTstor, and scholarly societies digitizing back runs of their journals; here large-scale digitization projects have been established in what are intended to be sustainable economic frameworks.

All of these efforts are producing numerous large collections of material, databases which are open to exploration and presentation in dozens of different directions. These are collections - raw material. The focus is on creating large amounts of digital content and providing some fairly simple access tools, rather than upon sophisticated systems for ongoing use or apparatus providing interpretation. Now, what's interesting to me is to contrast this to so much of the public interest rhetoric which speaks not just to raw materials but to learning materials, to the need to package raw content from collections up in various ways such as learning experiences or curated exhibitions or interpretation and analysis. We need to study the lines of demarcation between raw cultural heritage materials, if you will, and interpretation or teaching, or presentations of these materials. This is a boundary line that I don't think we really have a very clear understanding of. It gets to the historic mission differences among museums, libraries and archives, and the growing confusion about those distinctions in the digital world; it involves the historical and perhaps changing roles of scholars, teachers, curators, and librarians. It invites questions about which audiences or user communities we are teaching to use uninterpreted databases of raw cultural heritage materials, and the methods we are teaching them to use in exploiting these resources. We have to ask questions about just how "interpretation-neutral" a collection of raw materials can really be - surely, for example, interpretation creeps into the descriptive metadata; as my friend and colleague Michael Buckland has pointed out, the changes in practice in the construction and assignment of subject headings over the past century are a window into many social changes that have taken place during that period.

But I think that we can identify a series of trends that may lead us to a world of digital collections - databases of relatively raw cultural heritage materials, for example - and then layers of interpretation and presentation built upon these databases and making reference to objects within them. Probably we'll see interpretations that draw from many digital collections, and single digital collections contributing materials to many different interpretations. While I think that libraries, archives, museums, and the higher education community will be among the major creators of digital collections, the creators of presentations and interpretations of materials from these collections will be much more numerous and diverse.

The implications of this dichotomy between raw materials and interpretation seem quite wide reaching. Let me explore just a few of them.

One implication is that learning materials, interpretation and presentation seem to me typically - or at least often - to have shorter lifespans than the primary source materials that they draw upon. If you look at the processes of scholarship they include a continual reinterpretation of established source material (as well as the continued appraisal of new source materials). Source material persists and generation after generation of scholars and students engage it, yet we typically rewrite textbooks every generation or so at least.

What I've just said is a generality, of course. Presentation and interpretation can themselves make the transition to new primary material, and at least some of this material is of lasting interest to some audience of scholars and researchers and perhaps other people - but generally a much smaller and more specialized audience than it was originally designed for. Think about a history of art written in 1900, essays on Shakespeare from 1870, or elementary science books that say that "someday a man may walk on the moon." While these can be valuable learning materials when placed in context - re-interpreted, if you will - as a part of a new set of learning materials being prepared today, they are much less likely to be valuable directly as learning materials in their own right for most of today's audience.

If we agree that interpretation and presentation are relatively short-lived in some sense, I think this raises some troublesome questions about sustainability (a topic I will return to several times in this talk). Every digitization project that I know of, every funder of digitization projects that I know of, is acutely sensitive to this issue of sustainability, of trying to avoid the dilemma where we fund the creation of materials that we cannot economically sustain in the long run. But the nature and economics of sustaining primary material, once digitized, seem to me to be quite different from sustaining presentations and packagings and interpretations of it. For primary material, often the main ongoing cost is preserving the digital content and operating access systems; for interpretive materials we face all of those costs plus the costs of intellectually refreshing the interpretations periodically if they are going to remain relevant and responsive to their primary intended audiences. We may need very different economic models for these two classes of materials, and it may be unrealistic to expect libraries, archives and museums to take on primary roles in sustaining interpretative material (though presumably they will continue to be heavily involved in ensuring its preservation). Recognize that there's a disconnect here: this result is at least somewhat at odds with the developing public expectations about the benefits that they are likely to see by underwriting the digitization of cultural heritage materials. Many of the learning materials and curated exhibitions may come later, building upon the digitized collections being created in the public interest, and the interpretations and learning materials may not be free to the public, even if the underlying digitized collections are available "for free" (in the sense of being publicly underwritten).

We also need to understand how presentation and interpretation actually make the transition to new primary material in a world of digital collections. There are mechanics and collection development decisions implied here. We know how the transition has happened in the world of print publication, but how it happens in a digital world full of new practices of authorship and new roles for cultural heritage institutions is anything but entirely clear.

We know that we want our digital collections to be reusable, though I suspect that there is little consensus on what reusability really means. I think that we believe that collections of lasting value have the characteristics of reusability. Part of reusability or re-purposing clearly is the ability to contribute, over time, to a large array of interpretations or presentations of materials for many different audiences and purposes within the context I've just described. In essence, it's the ability to have collections be overlaid in various ways. We have very limited experience with reusability and repurposing today. And right now our thinking about overlays is still in its infancy: we think about union catalogs, cross collection finding aids, new teaching or analytical works that make reference to objects in digital collections. As I'll discuss later, however, I think we are beginning to get a glimpse of much more sophisticated re-use and re-purposing that has deep implications for both the markup of digitized objects and the metadata that accompanies them. Indeed, accommodating overlay may be too limited a way to describe the full range of repurposing that we'll want to facilitate.

Our digitization projects create databases of material and make it available to the public, but I don't see that much information moving between databases yet. That's kind of interesting - perhaps surprising - in a world where storage and bandwidth are getting cheaper and more plentiful every year. Part of the problem has been a lack of standards for moving information around, and here things are changing as we're starting to get technologies deployed like the open archives metadata harvesting protocol (see www.openarchives.org). What that metadata harvesting protocol really is, fundamentally, is a way for metadata and pointers to data to migrate from one system to another. It's a piece of plumbing, if you will, that's designed to encourage metadata reuse and, through metadata reuse, data reuse and re-purposing. It's designed to permit copying and amalgamation and refinement. As metadata - and objects themselves - become more recombinant we may begin to get a broader perspective on reuse and repurposing.
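To make the "plumbing" image a little more concrete, here is a minimal sketch, in Python, of what the harvesting side of such an exchange might look like. It assumes version 2.0 of the protocol (the `ListRecords` verb, the `oai_dc` Dublin Core format, and resumption tokens); the repository address, the sample response and the function names are hypothetical and exist only for illustration. It builds a request URL and pulls the metadata, and the pointers back to the objects, out of a response, without doing any actual network traffic.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper name)."""
    if resumption_token:
        # A resumption token continues an earlier, partially delivered list
        # and is sent on its own, without the metadataPrefix argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, next_token) extracted from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            # Identifier of the metadata record within the repository.
            "oai_id": rec.findtext("oai:header/oai:identifier",
                                   default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            # Pointers (often URLs) back to the described object itself.
            "pointers": [e.text for e in rec.iterfind(".//dc:identifier", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# A made-up response from an imaginary repository at example.org.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Harbor photograph</dc:title>
          <dc:identifier>http://example.org/items/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_records(SAMPLE)
```

The point of the sketch is only that what moves between systems is a description plus a pointer; the harvesting system is then free to index, merge or refine those records alongside records gathered from other collections.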

Finally, let me just note that in some of the work on mapping the architecture of digital preservation that is taking place as part of efforts such as the Library of Congress program to design a national digital preservation strategy, the distinction between digital collections, and interfaces which provide access (in the broadest sense, think interpretation as well as access here) to digital collections, looks like it is going to be very important. And the digital collection rather than the digital library or access interface is the locus of stewardship for digital preservation. You'll be hearing more about this thinking in the coming months.

So, that's an initial set of comments that I wanted to make about digital collections.

Now, another thing that I think comes across very clearly when we think about digital collections in this light, as part of this very complicated layered environment, is the need to really focus on infrastructure. We are getting some pieces of infrastructure in place rather quickly, like the open archives metadata harvesting protocol I just mentioned, but it's clear that we're going to have needs for other infrastructure components which are not as mature or widely deployed today.

I remain, for example, surprised at the relatively slow progress in the deployment of persistent identifier systems, which seem to me to be an absolute cornerstone of designing digital collections that are overlayable, reusable and repurposable. I would also note that I think in this area we have over-emphasized the engineering problems and failed to focus sufficiently on the hard intellectual problems such as the nature and identity of the objects we want to identify.

We also, I think, need to conceptualize infrastructure not just in the computer science or network engineering sense but in a more general intellectual sense - infrastructure related to the management, interpretation and use of our content. For example, it seems clear to me that as we construct and work with digital collections and digital libraries that we're going to need an infrastructure that encompasses digital versions of tools like gazetteers, dictionaries, vocabularies and vocabulary mappings. And these tools and infrastructure components need to become building blocks - network services for access by computer systems and structured data for interpretation by computer programs - that will facilitate their incorporation into a multiplicity of systems.

Let me turn now to the question of digital libraries in contrast to digital collections. As I indicated already, I find myself thinking now of digital collections as things close to raw content (perhaps with some limited interpretive materials - it's hard to completely isolate interpretation from raw materials; interpretation creeps in everywhere, for example in descriptive metadata that is part of the digital collection, as discussed earlier) and digital libraries as the systems that make digital collections come alive, make them usefully accessible, that make them useful for accomplishing work, and that connect them with communities.

I'm starting to believe that collections - at least many collections based around cultural heritage materials - don't really have natural communities around them. In fact, one of the things that we learned over and over again by anecdote at the meetings I talked about earlier, and I think this has been borne out a hundred times in other settings, is that digital materials find their own unexpected user communities. That when you put materials out there, people you would never have expected find these materials from sometimes very strange and exotic places that you wouldn't have imagined, and sometimes make extraordinarily creative or unpredicted uses of that material. So perhaps we should avoid over-emphasizing pre-conceived notions about user communities when creating digital collections, at least in part because we are so bad at identifying or predicting these target communities.

But I think that digital libraries are somehow the key construct in building community, making community happen and exploiting community. Indeed, much of what we have learned about designing successful digital libraries emphasizes the discipline of user-centered design. Effective digital libraries are designed both for purpose and audience, very much in contrast to digital collections. And I want to underscore two aspects of digital libraries that I find myself thinking about a lot these days.

The first is that if we think of digital libraries as a collection of tools that make content alive, that help you to find it, that allow you to manipulate it, analyze it, annotate it, comment on it then digital libraries attract, they create, they define a community. But they also let the members of that community talk to each other. This conversation happens in explicit ways which we certainly are well aware of and have exploited in the sense that people who are working together on common interests find each other, they begin to talk to each other, we see digital libraries stretch into systems like collaboratories where there is active group annotation and analysis and creation of new knowledge happening.

But digital libraries can also enable and facilitate implicit communication. My favorite example of implicit communication, which has not been much exploited yet, is recommender systems, where basically the digital library system becomes a mechanism for reflecting the behavior patterns of members of the community to other members of that community in a controlled and useful way. The trivial example of this, of course, is what we see in commercial systems like Amazon.com: "here are things that people with interests very similar to yours have been looking at (purchasing) lately, and I notice you haven't looked at (purchased) this one yet, perhaps you'd be interested." What Amazon.com does, using purchasing patterns as a surrogate for user evaluation, is a fairly simple example, but I believe that some focused exploration of the observation that digital libraries let members of the community talk to each other not just explicitly but through their history of actions and behaviors will lead us to some very interesting new things we can do. And it becomes even more interesting if we can do this in a distributed fashion, if in an environment of collaborating organizations concerned with the advancement of teaching and learning and scholarship rather than competitive commercial advantage, we can find the right framework of standards, technologies and social practices to permit controlled sharing of history and behavior between digital libraries rather than only within single digital libraries [6].
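For readers who want to see how little machinery the simple version of this requires, here is a small illustrative sketch in Python. It is not a description of how Amazon.com or any real digital library does this; the user names, item names and the scoring rule (count how many items a user's history shares with each other user, and credit that overlap to the items the user has not yet seen) are all assumptions chosen to keep the example short.

```python
from collections import Counter


def recommend(histories, user, top_n=3):
    """Suggest unseen items favored by users whose histories overlap.

    histories maps a (hypothetical) user id to the set of item ids
    that user has looked at.
    """
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)  # crude measure of shared interest
        if overlap == 0:
            continue  # nothing in common, so this user tells us nothing
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties are broken alphabetically so the
    # ordering is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Made-up viewing histories for four imaginary users.
histories = {
    "ana": {"map-1850", "diary-12", "photo-7"},
    "ben": {"map-1850", "diary-12", "letter-3"},
    "cho": {"photo-7", "poster-9"},
    "dev": {"atlas-2"},
}

suggestions = recommend(histories, "ana")  # ["letter-3", "poster-9"]
```

Note that the only input is the record of who looked at what, which is exactly why the words "controlled" and "social practices" matter in the paragraph above: the sketch says nothing about what histories may be kept, or shared between institutions, in the first place.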

The other fascinating aspect of digital libraries which we haven't thought about very much and which I think needs to be a new focus - and, if I'm right here, is going to have some very significant implications for the construction of digital collections as well as digital libraries - is that the aggregation of materials in a digital library can be greater than the sum of its parts. I think this is a very interesting and exciting possibility - though it's a bit hard to talk about because the ideas are still emerging, and imprecise, as much still impressionistic and speculative as actually proven out in implementation practice. But if this possibility proves out it will take us very, very far away from traditional practice in physical world libraries and archives. Perhaps one underlying intuition is that as a scholar reads, absorbs, and integrates a body of primary materials and works written by other scholars, the collection of knowledge in his or her head goes beyond the simple sum of what has been read. Our digital libraries can assist, amplify, and to some extent reify this activity and allow the results to be more readily communicated, shared, and further advanced by entire communities.

Now, one of the things that we can do in the digital world is we can move away from the historic notion of fixed editions where it's appropriate. And I don't want to suggest it's always appropriate. How we structure intellectual dialogue, how we document things, how we choose to conduct scholarly discourse - these kinds of things clearly need to include channels where there is some discipline of fixed editions. We can also create many more editions, revise and update much more frequently, in the digital environment. There's been much speculation about this characteristic of the digital world and its implications both for libraries and for practices of communication, and many people - certainly going back to the elegant and eloquent writings of great thinkers like Ithiel de Sola Pool, for example - have emphasized this as perhaps a defining difference of the digital environment. While I believe this is an important difference, surely, I don't think it's the defining one. At the same time that units of information can undergo more editioning and more updating, we can perform computational activities across collections of information that make them more than the sum of their parts, regenerating the computations as the collections grow and change, and this may be a much greater difference in the long run.

I want to note we have a powerful, deeply established cultural bias from our physical world models of publishing and libraries and our construction and honoring of the practices of authorship that tends to think of works of the intellect such as books as independent, as standing alone, as the product of individual voices and individual minds; while authors may quote or cite the works of others, outside of certain acts of appropriation in the contexts of the creative rather than scholarly arts, and the relatively recent practice of building and curating community databases of scientific knowledge, each work stands proudly alone. In our libraries, these works are placed on shelves in a systematic order. We put overlays on them that organize the material, make it accessible, group similar or related works together. But we don't mess with the work once it gets to the library. Rather we honor the preservation of the integrity of each individual work, just as we honor the act of authorship. Recognize that part of what is at issue here if digital libraries are going to be more than the sum of their individual constituent content objects is the need to become more flexible in thinking about the integrity of works and authorship, of figuring out how to balance our need to honor this integrity while also being able to integrate huge numbers of such individual works.

I'll give you three glimpses - just provocative snapshots, really - of what this may mean for the future of digital libraries. One is the really wonderful work that is going on at the Perseus Project at Tufts University. If you've not looked at this, you really must (see perseus.tufts.edu). One of the things that they are doing is they are computationally linking multiple resources together. So you take things like a biographical dictionary or dictionary of place names, you link it to maps, you link it to mentions of place names or people in literary works. What happens is through computation, plus the contribution of additional intellectual effort by the designers and curators of the digital library, you begin growing a corpus that is more than the sum of its parts, that evolves over time, and grows richer and richer. Adding a new work and computationally integrating it may enrich other works that are part of the digital library.
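
As a toy illustration of this kind of computational linkage, the following Python sketch links entries in a small place-name dictionary to mentions of those names in texts. It is only a sketch under loud assumptions: it is not how Perseus is implemented, the function name `link_mentions`, the work identifiers and the sample sentences are hypothetical, the coordinates are approximate, and exact string matching ignores the real difficulties of ambiguous and variant names.

```python
import re


def link_mentions(texts, gazetteer):
    """Return {place name: [(work id, character offset), ...]}.

    `gazetteer` maps a place name to whatever the reference work knows
    about it (here, rough coordinates that a map display could use).
    """
    links = {name: [] for name in gazetteer}
    for work, text in texts.items():
        for name in gazetteer:
            # Whole-word, case-sensitive match on the exact name only.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                links[name].append((work, match.start()))
    return links


# Hypothetical reference work and hypothetical literary texts.
gazetteer = {
    "Athens": {"lat": 37.98, "lon": 23.73},
    "Sparta": {"lat": 37.07, "lon": 22.43},
}
texts = {
    "work_1": "He sailed from Athens to Sparta.",
    "work_2": "Athens mourned.",
}
links = link_mentions(texts, gazetteer)

# Adding a new work and recomputing enriches the entries already there.
texts["work_3"] = "The envoys returned to Sparta."
links_after = link_mentions(texts, gazetteer)
print(links_after["Sparta"])
```

The point of the last three lines is the one made above: the dictionary entry for a place becomes richer every time a new work is added and the links are regenerated, without anyone editing the entry itself.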

Closely related to the work going on at Perseus is the broader emerging group of technologies called data mining. Data mining is now a very popular and important activity in the scientific community and in industry, where they're applying it to all kinds of demographic, consumer and marketing data. It's also starting to come into play in very significant ways in areas like understanding public health issues. Basically the idea here is you amass huge amounts of data and then you apply computational resources to look for patterns and relationships in it. The more data you can amass, the more computational resource you can apply, the more likely it is you may be able to unearth interesting and novel patterns and relationships. This is a very powerful model. It's not exactly clear to me how it extends into all of the rich array of cultural heritage information. Certainly you can think of some of what Perseus is doing as data mining. But I think of it as much as linkage creation as data mining. There's an inter-dependency and inter-relationship there which is not fully explored, as with so many other things in this area.

Let me just tell you one interesting anecdote about data mining. Think about astronomy. Now, what astronomers used to do is spend their lives trying to get telescope time and look for interesting phenomena, all the while worrying about bad weather that can interfere with their scarce observational time slots. More recently, we have started putting telescopes on the Net so that at least astronomers don't have to get travel funds to go to where the instruments are, but they now have to worry about whether the network is up when they finally get their scheduled slots of time on the instruments.

But what's happened now is astronomers are starting to compile large amounts of observational data into this sort of virtual sky database [7]. And all of a sudden there's a new kind of astronomy research showing up, which is not about taking new observations, but rather about applying data mining and pattern recognition technologies against this virtual sky database. Now, obviously if astronomy stops observing and only goes to a mode of mining existing data, that's not going to be healthy for the field. We need a balance between the two. But I think that this underscores how data mining ideas are changing the way in which science is done. To the extent that digital libraries house scientific data and provide tools to work with that data, we can think of the virtual sky environment as a digital library, or at least a big component of one.

Let me just give you one other glimpse of how the whole can be greater than the sum of its parts which intrigues me but also makes me less comfortable than the two examples I've just covered. There is a charming statement which I had originally attributed to Alan Kay, though I've since been corrected that it apparently comes from Marvin Minsky: a statement proposing someone in the future saying something like, "Can you imagine there was a time when the books in a library didn't talk to each other?" [8]. While I first heard this probably fifteen years ago, this quote has been coming back to haunt me in the past few years. And it's coming back to haunt me for two reasons. One, the positive vision, is we really can have digital libraries now where the books could in some sense talk to each other to make the library greater than the sum of its individual books. What do they say?

I think that one of the things they "say" is what we code into them with mark-up. Really good deep mark-up that exposes intellectual and semantic structure, that exposes content for linkage and data mining, and computation, is part of the language they're going to talk. I believe that efforts and ideas as diverse as the Text Encoding Initiative, some of the work with ontologies, with XML and particularly the development of XML schemas to support scholarly communication, and with the Semantic Web effort all reflect the importance of mark-up as a means of structuring information for reuse in a computational environment. And I think that suggests things to us about some of our strategies for digitization and building digital collections and in particular for the need to really talk with scholars, with teachers, with data miners, with digital library builders, with computer and information scientists and computational linguists and many other disciplines in a continuing way about what the appropriate mark-up structures are and how to implement them in our digitizing programs. And, perhaps most importantly, to recognize that appropriate mark-up is going to be an evolving area. It's going to evolve both in terms of what we want to mark up and how, and also in terms of what mark-up we can actually afford to do economically.

This has a lot of ramifications. For instance, I think there's a mental picture that many of us have that digitization is something you do and you finish, in the sense that when you digitize a photograph this is a finite, one-time process and as a result you have an image file, or you convert a book into marked-up text. But when we consider objects with mark-up, I'm starting to think that we're going to need to revisit this mark-up periodically as our understanding of mark-up evolves, and our capabilities to apply mark-up economically also evolve. There are going to be layers of mark-up. In fact, we may need to be thinking about representations for things like contingent or speculative mark-up, mark-up with confidence levels and provenance.

Many researchers are developing computer programs that can parse natural language, both spoken and written. They can do things like identify proper names and decide whether they're people or places, or organizations. Unfortunately they don't always do it right. Sometimes they do it really wrong. On the other hand, when you look at some of the experimental results, a lot of these systems get it right a lot more often than they get it wrong. Until the humans come, until some intellectual analysis can be done, which means finding and/or funding some human being to apply or review mark-up, is it useful to run these systems against digitized materials and put in preliminary or unevaluated tagging from automated analysis systems? I think when we think about these visions of digital libraries, probably the answer is yes. But we need to think about it in such a way that it can happen in an evolutionary fashion. When next year's program comes out that's better than this year's, you'd like to rerun it and take out the old mark-up and bring in the new mark-up. And there are many programs, representing different approaches to the problem; we have multiple research groups developing these programs, and they all work sometimes and they all screw up occasionally. Maybe a good strategy is to look at the places where lots of these programs agree, and figure those are probably the easy cases and we should assign them a little bit more credence. And of course similar computer programs can generate metadata as well as mark-up, so we also need to consider similar developments in the metadata area.
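
One way to picture layered, provisional mark-up is as stand-off annotations that each carry their provenance and a confidence level, so that one program's layer can be swapped out and the places where several programs agree can be found. The Python sketch below is only an illustration under assumptions: the `Annotation` fields, the function names and the three taggers (`tagger_a`, `tagger_b`, `tagger_c`) are hypothetical, and requiring exact agreement on span and label is a deliberately simple rule, not an established practice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int         # character offset where the span begins
    end: int           # character offset just past the span
    label: str         # e.g. "person", "place", "organization"
    source: str        # provenance: which program (or person) asserted it
    confidence: float  # the source's own estimate, from 0.0 to 1.0


def replace_layer(annotations, source, new_annotations):
    """Take out one source's old mark-up and bring in its newer output."""
    kept = [a for a in annotations if a.source != source]
    return kept + list(new_annotations)


def consensus(annotations, min_sources=2):
    """Return the (start, end, label) claims made by enough distinct sources."""
    votes = defaultdict(set)
    for a in annotations:
        votes[(a.start, a.end, a.label)].add(a.source)
    return sorted(k for k, srcs in votes.items() if len(srcs) >= min_sources)


# Hypothetical output of three taggers on the text "Paris visited Paris."
layers = [
    Annotation(0, 5, "person", "tagger_a", 0.7),
    Annotation(14, 19, "place", "tagger_a", 0.9),
    Annotation(0, 5, "place", "tagger_b", 0.6),
    Annotation(14, 19, "place", "tagger_b", 0.8),
    Annotation(0, 5, "person", "tagger_c", 0.8),
    Annotation(14, 19, "place", "tagger_c", 0.9),
]
agreed = consensus(layers)

# Next year's tagger_b arrives: rerun it and swap its layer in place.
updated = replace_layer(layers, "tagger_b", [
    Annotation(0, 5, "person", "tagger_b", 0.9),
    Annotation(14, 19, "place", "tagger_b", 0.9),
])
unanimous = consensus(updated, min_sources=3)
print(agreed, unanimous)
```

Because the annotations sit beside the text rather than inside it, replacing a layer never touches the digitized work itself, and a human reviewer can later be recorded as just another source with higher confidence.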

People, places and organizations are just the beginning. There are also efforts to pick out dates, mentions of genes or proteins or species in the scientific literature, to automatically classify citations as one of a number of different types, even to parse the structure of entire articles in some very rigid scientific article genres. Further into the future is the promise of other computational tools that will extend our reach even further: speech to text conversion, image analysis and feature recognition, and video analysis software, for example.

Let me also just note that the digitization of reference works of various genres, such as dictionaries or encyclopedias of various sorts, or the authoring of new reference works for the digital medium represents both a high priority and low-hanging fruit for extensive mark-up. A high priority because of their extensive potential as interlinking mechanisms among other works (as demonstrated by Perseus, for example), and low-hanging fruit because the semantics of the information stored in these kinds of works is often relatively simple, homogeneous, well understood, and highly structured (when contrasted with more general scholarly communication).

So, that's part of the attractive picture of books talking to each other. Now, I can't resist sharing the perhaps less attractive version and that takes us into a commercial or consumer realm.

One of the things that I'm very conscious of is that we are starting to see a set of technologies evolve that basically provide people with individual portable libraries. This is something that libraries, higher education and the digital library community all have yet to really come to grips with. But I think it's something that is part of the environment of digital library systems that we better start thinking about. It's starting to get quite reasonable to think about people running around with a couple of thousand digital books on their laptop. There are a number of consumer products around that let you tote around thousands of songs to listen to. This is not a substitute for a portable audio CD player, though it has a similar form factor; instead, you're talking about taking your entire music collection with you. And if you've got a big music collection, no, it won't fit this year, but we know that disks should just keep on getting bigger and keep on getting cheaper, and I promise you in a couple of years it really will fit. They're making disks bigger faster than they are inventing new music or writing new books. So, personal portable digital library technology will quickly catch up, I think, to even the needs of the most acquisitive individuals.

Now, what happens when people start amassing these kinds of large personal collections of digital materials and the objects compare notes with each other? I can readily imagine a situation where you add a new book in your personal digital library and it runs around and takes inventory of what else you have in there and it either consults its built-in catalog of other books from its publisher and starts bombarding you with ads, or, if it can talk to the outside world (think of some digital rights management type scenarios here), it reports it back to its publisher and says, "Oh, well, you should send this person new books announcements and special offers based on what I've just found in his personal digital library".

That's part of the more disturbing - or at least disconcerting - side of books talking to each other. They're not just going to talk to each other but also to other external programs and organizations and people, potentially. That's going to happen. It's going to happen in a variety of different ways. I just gave you the sort of commercial version of it, which is disturbing because it's potentially annoying and invasive and raises some privacy issues. But it's also going to happen in more benign and helpful ways, particularly if we can sort out ways to give the user effective control over his or her personal information. Think back to my description of recommender systems in a distributed environment, and recognize that a part of that environment may be one's personal portable digital library. Acquiring a book may become sort of an invitation to an ongoing communication among people and information sources and we haven't even begun to scratch the surface of the implications of that. Just as a quick example, think about what this might imply in the future about the ongoing responsibilities of authors, particularly in scholarly settings.

Further out, approaching the fringes of speculative fiction, we can also imagine fact extractors that try to create databases of knowledge by mining evidence out of a continually growing corpus of books and articles; this gets much easier if data in these works is marked up in ways that facilitate extraction without having to fight the battles of natural language parsing and understanding.

But enough speculation about digital libraries. I present these possibilities because I really want to stress the distinction between digital libraries as active environments of communities and engagement and analysis and interpretation and computation and the equally important but rather separate activities of creating digital collections that can feed and support digital library systems. My belief is increasingly that these are not the same. It's helpful to separate them, if for no other reason than it illuminates dozens of interesting and often difficult questions that we should be exploring. These are not just about technology, but about organizational practices and roles and responsibilities, about sustainability, about economic strategies, about a whole range of different things.

Now, let me just move on to my last two points before opening this up for some conversation.

I know that many of us here are specifically concerned with cultural heritage information. Now that I've separated out the effort of building digital collections of this material from the work of constructing digital libraries as community use environments for cultural heritage information, one central question is where the interpretation, annotation, and reuse of these materials occurs. I've argued that these activities refer to content in digital collections but largely remain separate from them. There is a territory beyond digital libraries, at least as we have usually built them up to now, which includes scholarly communication and publishing, the construction of teaching materials, the authoring of new works that build upon primary materials. How much of this happens within digital libraries, and how much beyond the "walls" of the digital library? What are the relationships here? How do digital libraries relate to the work of presentation and interpretation discussed earlier? And, in particular, we should recognize that one of the strengths of digital libraries is their ability to construct, exploit and amplify community - yet this is to some extent at odds with historical practices of largely individual and individualistic authoring that characterizes much interpretation and presentation. We need a much deeper understanding of these issues.

As a case in point, with the availability of substantial numbers of digital images from museums, we are seeing universities employing these collections both for teaching and research. But we don't seem to be seeing the dialogue I would have expected between the scholars in the university who study this material and teach it, and the people in the museums who curate and exhibit it. (I do recognize there are some long-standing cultural divides here.) I don't think we have seen much change in the shape and practices of the scholarly literature that uses and interprets these images, or the teaching materials that build on them. Is it just too soon, or perhaps are there still not enough (or the right) images available to create fundamental change? Are we still missing the digital libraries that complement the digital collections in this area? An analysis of the situation here might be a useful case study.

We're in a world now where there is a tremendous amount of information being created in digital form. When we think about creating digital cultural heritage collections we tend to focus very heavily, in large part because of copyright constraints, on material that is old - and thus on digitizing material that started life as a physical artifact. More recent creative works are often off limits because of copyright barriers, and if they are available at all they are available as part of licensed commercial offerings. At the same time though, I think we need to also be very conscious that there are living databases that are the raw material of a lot of our cultural, political, legal, and social record. These are parts of active organizational records or operational systems within government agencies of various kinds. Cultural heritage information is much more than just creative works; there's a lot of history in it as well. Many of these materials are not encumbered by copyright. (Though they may be encumbered for other reasons, such as privacy. The massive availability of digital material, if you just think of areas like public records right now, has taken us to the point where we're confronting a new set of questions about what are we comfortable having public. There's public and there's really public. There's a difference between things that are public for inspection if you go down to the courthouse and do a lot of digging around, as opposed to digital documents that pop up on Google when you're killing time plugging people's names into the search box.)

These materials can legitimately be looked to as content for digital cultural heritage collections. We need to recognize that indeed there's an intellectual continuity between the cultural materials, the historical materials, the social materials of the past and those of the present. And we need to start thinking about how to make these more of a unity as we build digital cultural heritage collections. The fact that these records are now born digital is going to change the nature of our intellectual record going forward. These materials are an important part of the new digital cultural heritage collections, but they're different because they're born digital and they will follow very different paths, perhaps, from their creators to our cultural heritage institutions. We do not have time to explore this topic today, but it's important not to limit our thinking about digital cultural heritage collections to only out of copyright materials that we have digitized.

So, I hope that these speculations - and frankly many of these comments are speculations - are helpful in at least framing questions, in providing ways to think about some of the projects you're doing. My sense is that we're starting to see some maturity now in the practice of developing digital collections. It's very impressive to me how we can point to not just useful standards and useful experience, but to a large body of good practice. We have enough experience, we have enough exemplars now that if you're doing a digitization program we can point you at guides and educational programs that help you think about how to construct the digital collections, how to plan and budget for a digitization project, how to select appropriate standards and technologies, and even to some extent how to think about sustainability.

In contrast, digital libraries seem to me in some sense to be enormously more complex, enormously more open-ended. While there's been a tremendous amount of good work done there, I don't think we're anywhere near being able to point with confidence to good practice in how to construct a digital library (though we can probably tell you at least a few things it's a good idea not to do). Digital libraries are as rich as our visions about how we can use and reuse digital information. They're as rich as the conversations we can imagine between books. They're as powerful as the linkages we can imagine creating as we amass material and in the digital world it becomes greater than the sum of its parts. And perhaps it's good that we're not ready to produce a best practices guide for true digital libraries (as opposed to digital collections masquerading under the label of digital libraries), that everything is still very much open to trying new things and exploring new ideas in this area. In digital libraries we still want to keep our options open, not to limit our thinking prematurely. I think that recognizing this distinction between digital libraries and digital collections, and the vastly different levels of maturity, is very valuable. All of the uncertainty and all of the promise around digital libraries - and the exciting glimpses of the future that we can see in today's path-blazing digital library projects - also helps us with our understanding about how to continue to evolve our thinking about digital collections and our practice in creating and maintaining them.

Thanks.

I actually promised to finish in a timely fashion and I'm told we really do have time for some questions, comments or discussion, and I would welcome that.

Dr. Hastings: Clifford, would you talk a little bit about finding aids and specifically the work that Brewster Kahle is doing with his Wayback Machine.

Dr. Lynch: Finding aids and Brewster Kahle's Wayback Machine. I can certainly talk about both those things; I'm not sure I can connect them especially. (Laughter.)

Brewster Kahle's Wayback Machine, for those of you who haven't seen it, is a system that essentially lets you navigate his archives of Web pages captured from the Web by time. So you can say I'm interested in the Web page at this URL and he comes back and says, "I've got one from March of 1998, another from April of 1999, etc." And you can pick the one you want. It's a fascinating and wonderful tool for trying to get a handle on some of the historic material around the Web that he's captured in the Internet Archive. The name is from the Sherman and Peabody cartoons that were part of the Rocky and Bullwinkle show, which demonstrates Brewster's exquisite taste in culture (which I very much share). In the context of my talk, I'd characterize it as much more of a digital collection than a digital library at this point.

Finding aids and where those fit? I didn't talk much about finding aids. Finding aids are certainly part of organizing collections. Sometimes we make them because they're the best we can afford to do. Sometimes we make them because even though we've been able to invest in item level description, they still provide a kind of road map to material, a higher level of abstraction that's useful. I think that this is an area that we really need to think a lot harder about. A lot of our practice in this area really comes out of historic practice that's aimed at small communities of highly skilled scholars. And while these finding aids are very useful to scholars, when we look at the multiplicity of audiences now engaging a great deal of the enormously more accessible digital materials we're making available, I'm not sure what the right metadata structures are to facilitate the creation of a wide range of presentation interfaces, and how helpful finding aids are in this context. I know Howard Besser, sitting next to you, has done some looking at this question of repurposing for different audiences.

I think this is actually part of a much bigger issue about collection description. You can think of finding aids as one element of collection description. There's a lot of interest now in describing collections as we've created this incredible constellation of collections, of pools of information, accessible through the Net. And people can't find which pool to go look in. My sense is that we still have an awful lot to learn about this, and that collection description is really a fundamentally hard problem. Neither the traditional bibliographic approaches, nor the sort of computational approaches that have been tried by computer science research groups at places like Stanford University (Hector Garcia-Molina and colleagues there built a system called GLOSS) and the University of Virginia (see the work of Jim French and Allison Powell there) to solve what I might characterize as the "what database should I search in?" problem, really capture all of what we need from collection description. I think this is an area that really merits some very, very serious investigation.

I'd make an even broader statement which is that I think that one of the things that the richness and the proliferation of information in the digital world is forcing us to revisit generally is levels and hierarchies of abstraction. We kind of got locked into a single picture of the world, for example in bibliographic practice, over the past decades, and now I see these assumptions being revisited with efforts like the IFLA FRBR work, which tries to recognize multiple levels of abstraction, of work, edition, manifestation. I see this question in debates about whether one should harvest up item level records or collection level records as part of metadata harvesting. It seems to me that the problem we need to be thinking about is giving people a more flexible view through sameness and distinction, to let them be able to say, "I'd like a picture of this that really is sort of clustering at a pretty gross level, don't bother me with details about editions, I just want to survey the landscape", or "I'm very interested in precise distinctions". We don't have systems and we don't really have descriptive practices that give us as much flexibility there as it's starting to look like people trying to navigate these vast oceans of information really need.

Dr. Goodrum: Moving a step beyond the description of the collection level to thinking about inventing intelligence in objects, it's really a shift in how we thought in the last century about how we organize information to be found, whether it's in the library, archives or museum. And I would suggest that it's going to require a real shift in our thinking to be able to sort of let go, to imbue an object with enough intelligence that it can then be sent off on its own to sort of find out who it is and find its own audience. And we're not really prepared for that.

Dr. Lynch: Well, let me make a distinction here and I think it's an important one. You spoke about imbuing objects with intelligence; I'm very cautious about the notion of imbuing objects with intelligence for a number of reasons. One is that I think it's fundamentally hard. I think that the objects run the risk of being unpreservable. Certainly the more intelligence you put in an object, the more you up the complexity of preserving it. And I think that the security problems in the broad sense may be intractable or very difficult at least. The way I'm thinking about things these days is that you don't so much put intelligence in objects as in their environment. So, objects are intelligent in the sense that they represent structured knowledge. The inference around them happens in the digital library system, not with objects reaching out (be it in a random or purposeful way) and communicating with other objects. Maybe this is just going to be sort of a transitional step to really intelligent objects, but I guess I feel like actively intelligent objects are beyond our ability to safely engineer and deploy on a production scale at this point. I could be wrong.

I think we're out of time. Thank you.

About the Author

Clifford Lynch is the Director of the Coalition for Networked Information (CNI).
E-mail: cliff@cni.org

Notes

1. Eventually a meeting report will be available at CNI's Web site at http://www.cni.org.

2. Declan McCullagh, "What the Hollings' Bill Would Do," at http://www.wired.com/news/politics/0,1283,51275,00.html, accessed 1 May 2002.

3. Broadband: Bringing Home the Bits, at http://books.nap.edu/html/broadband/na_statement.html, accessed 1 May 2002.

4. C.L. Borgman, 1999. "What are digital libraries? competing visions," Information Processing and Management, volume 35, number 3 (January), pp. 227-243.

5. "Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials," at http://www.ninch.org/programs/practice/, accessed 1 May 2002.

6. Clifford A. Lynch, "Personalization and Recommender Systems in the Larger Context: New Directions and Research Questions," Second DELOS Network of Excellence Workshop on Personalisation and Recommender Systems in Digital Libraries, Dublin, Ireland, 18-20 June 2001; at http://www.ercim.org/publication/ws-proceedings/DelNoe02/CliffordLynchAbstract.pdf, accessed 5 May 2002.

7. See, for example, "The Two Micron All Sky Survey at IPAC," at http://www.ipac.caltech.edu/2mass/index.html, accessed 1 May 2002.

8. See Raymond Kurzweil, 1990. The Age of Intelligent Machines. Cambridge, Mass.: MIT Press, p. 328.

Editorial history

Paper received 17 April 2002; accepted 18 April 2002; revised version received 4 May 2002.

Copyright ©2002, First Monday

Digital Collections, Digital Libraries and the Digitization of Cultural Heritage Information by Clifford Lynch
First Monday, volume 7, number 5 (May 2002),
URL: http://firstmonday.org/issues/issue7_5/lynch/index.html