Mass book digitization: The deeper story of Google Books and the Open Content Alliance
by Kalev Leetaru
First Monday

The Google Books and Open Content Alliance (OCA) initiatives have become the poster children of the access digitization revolution. With their sights firmly set on creating digital copies of millions upon millions of books and making them available to the world for free, the two projects have captured the popular imagination. Yet such scale comes at a price, and certain sacrifices must be made to achieve this volume. Because of its greater visibility, Google Books has been the focus of most studies, which address limitations of its image and metadata quality. Yet there has been surprisingly little comparative work on the two endeavors, exploring the relationship between these two peers and their deeper similarities, rather than their obvious surface differences. While the academic community has lauded OCA’s “open” model and condemned the proprietary Google, all is not always as it seems. Upon delving deeper into the underpinnings of both projects, we find that Google achieves greater transparency in many regards, while OCA’s operational reality is more proprietary than often thought. Further, significant concerns are raised about the long–term sustainability of the OCA rights model, its metadata management, and its transparency that must be addressed as OCA moves forward.


Perhaps the two best known digitization projects, by virtue of their near–constant coverage in the mainstream press, are the Google Books and Open Content Alliance (OCA) initiatives. Google pioneered the modern mass access digitization project with its 2004 debut of Google Books — then known as Google Print (“Google Book Search,” n.d.). Prior to its announcement, the library and information science profession, with its high–cost, low–volume preservation mindset, had paid little attention to the emerging trend of lower cost access digitization. In keeping with its innovative approach to large–scale problems, Google developed the necessary set of enabling technologies, proving quite spectacularly that access digitization’s time had indeed arrived. However, once it had been shown that such mass access campaigns were not only possible, but could be quite successful, the academic community reacted by forming its own initiative, the Open Content Alliance (OCA), which was announced in October 2005 (Kahle, 2005). Unease about corporate ownership over the digital copies of historic works led the OCA to declare that all of its digitized works would be available for unrestricted use. It debuted to a fanfare of openness and transparency about its processes, with many calling it the final democratization of the world’s knowledge.

The popular press, as well as the academic community far and wide, has portrayed the two projects as polar opposites, criticizing Google’s shroud of secrecy over its book scanning operation while lauding the OCA as a model of transparency and openness. The purpose of this study is not to criticize or defend either initiative’s practices, but rather to evaluate them as peers and illustrate that the reality may be more nuanced than often believed. In the meantime, Google Books continues to lead all other access digitization projects in sheer volume of material digitized. In December 2006, the OCA announced it had reached its 100,000th digitized book (“Internet Archive Receives Grant,” n.d.), while by September 2008, Google had reached over seven million books in its holdings, with almost three million of those published prior to 1980. By September 2008, only 31,000 of the works published after 1980 offered unrestricted online browsing. The remaining entries offer either a bibliographic entry, snippet–based search results, or limited online browsing with a “displayed by permission” statement indicating the publisher had agreed to allow inclusion of their work for indexing, but not complete online browsing or download (“Date:1500–2008,” n.d.). The large number of restricted modern books illustrates the growing trend of publishers contributing their online catalogs to Google in order to increase the visibility of their product lines, much as they provide copies to electronic booksellers. Through its high visibility as a Google product, the Google Books project has become a centerpiece of the digital text revolution.




The role of access digitization

Before one can compare the two projects, it is important to first realize that both are really only access digitization projects, despite the common assertion that OCA captures constitute preservation digitization. Neither initiative uses an imaging pipeline or capture environment suitable for true preservation scanning. The OCA project outputs variable–resolution JPEG2000 files built from lossy camera–generated JPEG files. A consumer area array digital camera is used to produce images at resolutions ranging from 600 dpi for 4.5 inch x 7 inch works down to just 300 dpi for 11 inch x 14 inch works. To reduce camera bandwidth and storage requirements, digital captures are output from the camera in JPEG format, a notoriously artifact–laden format when produced directly by a camera (post–conversion from TIFF to JPEG using desktop software can generate significantly higher–quality JPEG files). Google guarantees archival–resolution output files, but given its use of area array capture technology, it too must acquiesce to variable capture resolution, relying on software image processing to convert to archival files.
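The relationship between sensor size, page size, and effective resolution can be checked with simple arithmetic. The sketch below assumes a roughly 16.7–megapixel (4992 x 3328 pixel) area array sensor, the specification of the Canon EOS–1Ds Mark II camera discussed later in connection with the Scribe; it is an illustration of the scaling effect, not a description of either project's actual framing.

```python
# Back-of-envelope check of the resolution figures quoted above, assuming a
# 4992 x 3328 pixel area-array sensor (the Canon EOS-1Ds Mark II spec) and a
# page framed to fill the sensor. Real capture setups leave margins, so these
# numbers are only indicative.

SENSOR_PX = (4992, 3328)  # long edge, short edge (assumed sensor resolution)

def effective_dpi(page_w_in: float, page_h_in: float) -> float:
    """Pixels per inch when a page of the given size fills the sensor frame."""
    long_edge, short_edge = max(page_w_in, page_h_in), min(page_w_in, page_h_in)
    # The binding constraint is whichever page edge receives the fewest pixels per inch.
    return min(SENSOR_PX[0] / long_edge, SENSOR_PX[1] / short_edge)

for w, h in [(4.5, 7.0), (8.5, 11.0), (11.0, 14.0)]:
    print(f"{w} x {h} in  ->  ~{effective_dpi(w, h):.0f} dpi")
```

For an 11 inch x 14 inch work the result comes out at roughly 300 dpi, consistent with the figure cited above; smaller pages receive proportionally more pixels per inch.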

Preservation digitization, on the other hand, uses linear array technology or extremely specialized medium–format area array units with very exacting quality specifications. Images are output in RAW or extended TIFF format, preserving all of the detail captured by the image sensor. Lens assemblies are designed to very precise optical standards and, in many cases, custom designed for a specific type of imaging application. Lighting and color are sampled at every frame and, depending on material characteristics, variable–source lighting is used to image the work from multiple angles and using multiple spectra. Once it is understood that these projects are access digitization initiatives and not preservation endeavors, the quantity of technical documentation published by each group becomes less important. Only in preservation digitization is it truly necessary to fully understand the exact technical details of the digitization process in order to understand its impact on the faithfulness of the digital surrogate. In preservation digitization, the focus is on the entire process and not just the end product. In access digitization, however, the focus is on the end product only, and certain limitations and shortcuts are accepted to achieve the high levels of efficiency necessary. It therefore becomes clear that a full and complete knowledge of the workings of the two projects is not necessary: the purpose of both projects is to produce usable, but not preservation–grade, digital surrogates to offer mass access to the materials being digitized.

Access, however, is not defined merely by the quality of the image scan, but also by the way in which digital materials are made available to their target communities. The underserved that can benefit the most from such archives often lack the high–bandwidth Internet connections capable of downloading and viewing high–resolution color imagery. Google’s use of bitonal imagery and its interactive online viewing client significantly decrease the computing resources required to view its material. Even those with relatively modern computing hardware may notice a substantial difference in the viewing experience between the PDF files generated by the two services. The use of full–color JPEG2000 imagery in OCA PDF files allows them to achieve relatively small file sizes while still offering color reproduction, but at the expense of significantly increased computational requirements to render each page. On a late–generation Windows XP desktop with a 3.2 GHz dual–core processor, 2 GB of RAM, and a 256 MB nVidia graphics card, an average OCA PDF cannot be interactively explored. Scrolling forward or backward introduces a significant lag as the computer rerenders the complex page. Google Books’ bitonal page images, on the other hand, render nearly instantly, permitting real–time interactive exploration of works.

Of course, there have been numerous studies focusing on scattered inconsistencies and image quality concerns within the Google Books project. Such problems are tenets of any large access digitization project, especially one which operates at such a significant operational scale. One must also consider the density of such errors, rather than the anecdotal discovery of a particularly problematic work. With over three million digitized works compared to just 100,000 completed by the Open Content Alliance, even a vanishingly small error rate will be magnified into a very large number of works. As a tenet of the genre, access digitization projects accept a certain degree of loss and instead focus on the average case, trying to ensure that the majority of works are error–free, while even those with errors are largely useful. It should be noted that OCA has placed a very high emphasis on quality control in order to reduce such errors even further. This mandate comes at a steep cost, however, requiring substantial human intervention and the associated slowdown in time spent reviewing each captured page.

Perhaps the difference of greatest importance between the two projects, however, relates to search. Both projects offer the ability to search within a particular work, but only Google offers the ability to search across its entire collection. A search across the OCA archive only searches titles and description fields, not the full text of works. The OCA system thus offers a document–centric model, while Google offers both document and collection–based models, allowing broad exploratory searches of its entire holdings: the equivalent of being able to “full text search” a library. The importance of this difference cannot be overstated in terms of the limitations it places on the ability of patrons to interact with the OCA collections.

The Google Books Project

The mass digitization of books (and use of advanced technology to deliver it usefully online) has strong ties to Google’s past. Its co–founders Sergey Brin and Larry Page were Stanford University graduate students in 1996, developing enabling technologies for digital libraries. Their work focused on the use of citation analysis to rank the relevance of digitized books to specific queries. The resulting technology was applied to Web site archives as an experiment, and the rest, as they say, is history. However, large–scale digitization efforts remained a significant interest of the two, and in 2002 Google began the early planning stages of its own mass book digitization program (“Google Book Search,” n.d.). Over the last three years the project has grown by leaps and bounds, but relatively little is known about the technology behind the Google Books of today due to the shroud of secrecy kept around the project. However, while an official report may not be available, an analysis of media reports, its contracts with library partners, and even the digital files it makes available for download all provide insights into its operations.

To begin with, the agreement between the University of Michigan and Google Books (its first institutional partnership) offers several important details regarding the output format of the project. In its U–M Library/Google Digitization Partnership FAQ (“UM Library/Google,” 2005), dated August 2005, the University notes that pages without illustrations are provided to Michigan as G4–compressed bitonal TIFF images at 600 dpi, while pages with significant illustrations are provided as 300 dpi JPEG2000 images. The Michigan Library cites the December 2002 Digital Library Federation’s (2002) Benchmark for Faithful Digital Reproductions of Monographs and Serials report, with its recommendations of 600 dpi bitonal scans for text–only and line drawing material and 300 dpi scans for illustrations like photographs, as evidence that these scans are faithful digital reproductions. Indeed, as final products, these images meet most current standards for archival/preservation digital image files, but it is important to examine the workflow used to produce the images to determine whether the entire process meets preservation standards.

The process starts with collection identification, in which the library and Google determine an appropriate subset of available holdings to digitize, taking into consideration condition of material, value, and other factors. Once a collection has been identified for digitization, the process of transporting materials to and from the Google digitization facility begins. Library staff prepare a batch of books and scan their barcodes, checking them out to Google for the duration of the scanning process. The books are shipped across campus to the scanning facility, where they are handed over to Google personnel (Said, 2004). At this point, they enter a large room or series of rooms filled with numerous stations, each outfitted with a specialized Google–developed scanning unit. A human operator staffs each scanning unit (Helm, 2006), manually turning the pages of each book as twin overhead digital cameras photograph each page. The immediate output is a collection of full color variable–resolution camera images downloaded from the camera directly to an accompanying PC via a FireWire or USB connection and uploaded to a server farm at Google’s headquarters for further processing.

Since the majority of out–of–copyright books do not have color photographs or other substantial color information, Google decided early on that it would be acceptable to trade color information for spatial resolution. While it is true that book covers, especially leather ones, may contain a certain amount of color information, this information is rarely important (Langley and Bloomberg, 2007). Much as historic albumen or ferrotype photographic prints are traditionally scanned in grayscale for preservation digitization with their irrelevant color information discarded, Google believed that the same could be done for printed materials. Using interpolation techniques, it is possible to reduce the bit depth of an image and convert that information into enhanced spatial resolution. Indeed, the Digital Library Federation’s report notes that 600 dpi bitonal images may be interpolated from 400 optical dpi 8–bit images (Digital Library Federation, 2002). If Google uses digital camera technology similar to the Open Content Alliance, it can easily achieve 400 dpi scanning for most materials at 24–bit color depth and convert those to 600 dpi bitonal images for textual material while meeting archival standards.
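The general idea of trading bit depth for nominal spatial resolution can be sketched in a few lines. The example below, using the Pillow imaging library, upsamples an 8–bit grayscale capture and thresholds it to a 1–bit image saved with G4 compression; it is a minimal illustration of the interpolation principle described in the Digital Library Federation benchmark, not Google's actual (unpublished) conversion pipeline, and the file names and threshold value are placeholders.

```python
# Illustrative sketch only: interpolate an 8-bit grayscale scan up to a higher
# nominal resolution, then threshold it to a bitonal (1-bit) image, trading
# bit depth for spatial resolution.
from PIL import Image

def grayscale_to_bitonal(src_path: str, dst_path: str,
                         src_dpi: int = 400, dst_dpi: int = 600,
                         threshold: int = 160) -> None:
    img = Image.open(src_path).convert("L")              # 8-bit grayscale source
    scale = dst_dpi / src_dpi
    new_size = (round(img.width * scale), round(img.height * scale))
    img = img.resize(new_size, resample=Image.LANCZOS)   # interpolate upward
    # Simple global threshold; production systems use adaptive thresholding.
    bitonal = img.point(lambda v: 255 if v > threshold else 0).convert("1")
    bitonal.save(dst_path, format="TIFF", compression="group4",
                 dpi=(dst_dpi, dst_dpi))

# grayscale_to_bitonal("page_400dpi.tif", "page_600dpi_bitonal.tif")
```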

Each full–color page image is subjected to specialized algorithms that examine the page and divide it into text and illustration/photograph regions. Text regions are dispatched for OCR to allow full–text searching. As region identification is completed for each page, a master version is saved for storage at Google and transmission back to the contributing library. Images which have been identified as containing only text are saved as 600 dpi bitonal TIFF images with G4 compression, while those containing one or more image zones are saved as 300 dpi JPEG2000 images (“UM Library/Google,” 2005). Once master versions have been created, the pages are forwarded to a pipeline that separately compresses each region and packages them all together into a single PDF file for download. From the beginning, Google identified high image quality with lowest possible file size as a priority for its digitization project, ensuring that even those in underserved communities with lower bandwidth connections to the Internet could make use of their materials. Towards this end, Google realized it was necessary to use different compression algorithms for text and image regions and package them in some sort of container file format that would allow them to be combined and layered appropriately. It quickly settled on the PDF format for its flexibility, near ubiquitous support, and its adherence to accepted compression standards (JBIG2, JPEG2000) (Langley and Bloomberg, 2007).
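The master–file decision described above reduces to a simple rule once the (proprietary, unpublished) segmentation step has labeled each region. The following sketch illustrates only that decision logic, with a hypothetical Region type standing in for whatever representation Google's pipeline actually uses.

```python
# Simplified sketch of the master-file decision: pages whose regions are all
# text become 600 dpi bitonal G4 TIFF masters; any page with an illustration
# zone becomes a 300 dpi JPEG2000 master. The Region type is hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Region:
    kind: str      # "text" or "illustration"
    bbox: tuple    # (left, top, right, bottom) in pixels

def choose_master_format(regions: List[Region]) -> dict:
    if regions and all(r.kind == "text" for r in regions):
        return {"format": "TIFF", "compression": "G4", "dpi": 600, "mode": "bitonal"}
    return {"format": "JPEG2000", "dpi": 300, "mode": "color"}

page = [Region("text", (100, 120, 2400, 900)),
        Region("illustration", (100, 1000, 2400, 2200))]
print(choose_master_format(page))   # -> JPEG2000 master at 300 dpi
```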

In keeping with its tendency to innovate rather than make do with the status quo, Google devoted substantial resources to improving the state–of–the–art in compression technologies for digitized textual page scans. It conducted numerous experiments with three of the major compression algorithms used for bitonal text scans: G4, FLATE, and JBIG2. For one selected book, FLATE yielded files with an average storage requirement of 120KB/page, while G4 reduced consumption to 52KB/page and JBIG2 achieved 26KB/page. To apply JBIG2 to its incredible volume of material, Google needed to substantially increase the compression ratio and execution speed of JBIG2. In the end, Google wrote its own JBIG2 compression algorithm, which Google engineers published in the literature, along with a detailed analysis of their findings (Langley and Bloomberg, 2007).
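A rough version of the FLATE versus G4 comparison can be reproduced with stock Python libraries, as sketched below. JBIG2 is omitted because no encoder ships with the standard Python ecosystem, which is part of the gap Google's own encoder filled; the file name is a placeholder and the measured sizes will include small container overheads, so the numbers are indicative rather than exact.

```python
# Rough reproduction of the FLATE vs. G4 size comparison for one bitonal page.
import io
import zlib
from PIL import Image

def compressed_sizes(bitonal_tiff_path: str) -> dict:
    img = Image.open(bitonal_tiff_path).convert("1")

    # FLATE (zlib/deflate) applied to the packed 1-bit raster.
    flate_size = len(zlib.compress(img.tobytes(), level=9))

    # CCITT Group 4, as produced by the TIFF writer (includes TIFF header overhead).
    buf = io.BytesIO()
    img.save(buf, format="TIFF", compression="group4")
    g4_size = buf.tell()

    return {"FLATE_bytes": flate_size, "G4_bytes": g4_size}

# print(compressed_sizes("page_600dpi_bitonal.tif"))
```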

In an effort to determine how these results compare with the PDF files Google makes available for download, a work was selected which the author was familiar with, The University Studies – Illinois Libraries, by Katharine Sharp, published by the University of Illinois in 1908 (Sharp, 1908). Four types of pages were selected at random and compared: text–only (page 243), text with two medium–sized photographs (page 668), photograph with line drawing and caption (page 673), and page–length photograph with caption (page 816). The PDF container for each page was examined, with the image stream extracted and decoded for analysis. The resulting analysis provides several insights into the Google Books post–processing workflow. The base image layer of all pages is a bitonal 600 dpi image of the page, including text and line drawings, but with whitespace in the place of any photographs. Photographs are embedded in a second layer as 8–bit grayscale JPEG2000 images at 150 dpi. The separation of the base text layer and the photographic layer allows Google to maximize the resolution of the text to facilitate reading, while minimizing file size through the higher compression and decimation of photographs.
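This kind of layer inspection can be performed with off–the–shelf tools. The sketch below uses the PyMuPDF library to list each image XObject on a downloaded page along with its pixel dimensions, colorspace component count, and stored format; it is offered only as an illustration of the technique, not as the specific tooling used for the analysis above, and the file name and page number are placeholders.

```python
# List the embedded image streams of one PDF page to see how the bitonal text
# layer and the grayscale photograph layer are stored.
import fitz  # PyMuPDF

def describe_page_images(pdf_path: str, page_number: int) -> None:
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    for xref, *_ in page.get_images(full=True):
        info = doc.extract_image(xref)
        print(f"xref {xref}: {info['width']}x{info['height']} px, "
              f"{info['colorspace']} color component(s), stored as .{info['ext']}")

# describe_page_images("google_books_download.pdf", 242)  # zero-based index for page 243
```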

As Google completes each batch of books, they are returned to library staff, who check them back in to the library catalog system and place them back on the shelf. Library personnel additionally inspect a random sample of the digital files returned by Google to ensure they meet necessary quality controls (Karle–Zenith, 2007). Google provides an access mechanism for the University to download all digital files created from its material and grants the institution permission to redistribute those files through its own online services for limited personal use, without substantial restriction. While commercial and mass redistribution are prohibited, these restrictions do not fall substantially outside the norms of the standard terms of use enforced by state–sponsored educational institutions across all of their Web properties. The University of Michigan has a nearly 20–year history of digital initiatives and already had both an extensive interest in digital services and an established network of systems in place to utilize the digitized content returned to it by Google. Since its partnership with Google, the Library has been rapidly developing new services to take advantage of the previously unfathomable volume of incoming digital material (Karle–Zenith, 2007).

Google, of course, makes all its digitized material available on its own Web site as well. While many digital library systems either do not permit online viewing of digitized works, or force the user to view the book a single page at a time (called flipbook viewing), Google has developed an innovative online viewing application. Designed to work entirely within the Web browser, the Google viewing interface mimics the experience of viewing an Adobe Acrobat PDF file. To minimize bandwidth requirements, images displayed online are limited to 575 pixels in width and, as an online service, users are required to maintain an active Internet connection during the entire viewing session. To enable printing and offline viewing, Google permits its out–of–copyright materials to be downloaded as PDF files (Langley and Bloomberg, 2007) (and, more recently, as plain ASCII text files so that speech readers can be used by the visually impaired).

While most services take advantage of the linearized PDF format, Google made a conscious decision to avoid it. Linearized PDFs use a special data layout to allow the first page of the file to be loaded immediately for viewing while the remainder of the file downloads in the background. While this may seem like an ideal mechanism for online file delivery, Google found several shortcomings with the format that place an untenable burden on the backend server. Furthermore, they note that users wishing to view works online will simply use the Google online viewing application, suggesting that the majority of PDF downloads are from users wanting to view the entire work offline or print it. For these users, linearized PDFs provide no benefit (Langley and Bloomberg, 2007).

To ensure users are aware of the legal restrictions on their use of Google materials, the Google Books service dynamically inserts one or two legal notices into the beginning of each downloaded PDF file. The first legal notice is inserted in the language of the current user, based on the interface language in which he or she is using Google Books. If the user has set Google Books to display its interface in French, for example, a French version of the legal notice will be inserted at the start of the file. This initial notice is followed by one in the source language of the work, under the assumption that a file may be forwarded on and the eventual user may not share the same language as the original downloading user. By including a version in the language of the work itself, any user who can read the work can also read the notice. A French–speaking user downloading a German work will receive a PDF file with both French and German editions of the Google legal notice. If the downloading user’s language is the same as the work’s language, only a single copy of the notice is embedded to conserve space (Langley and Bloomberg, 2007).
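The selection rule described above is simple enough to state in a few lines of code. The sketch below merely restates that published behavior (interface language first, work language second, de–duplicated when they coincide); it is an illustration, not Google's implementation.

```python
# Minimal sketch of the notice-selection rule: one notice in the downloading
# user's interface language, one in the work's language, collapsed to a single
# notice when the two are the same.
def notice_languages(interface_lang: str, work_lang: str) -> list[str]:
    langs = [interface_lang]
    if work_lang != interface_lang:
        langs.append(work_lang)
    return langs

print(notice_languages("fr", "de"))  # ['fr', 'de'] -> two notices
print(notice_languages("en", "en"))  # ['en']       -> a single notice
```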

With its army of human page–turners, ability to leverage Google’s existing server farms, and the reliance on an almost entirely automated backend processing environment, the Google Books project is able to scan books at a cost of just US$10 each. When the project first began in 2004, it was estimated that each operator could scan 50 books a day (Said, 2004). Before its partnership with Google, the University of Michigan Library estimated it could scan 5,000 to 8,000 books a year, meaning it would take more than 1,000 years to scan its entire collection (“John Wilkin,” 2006). By the start of 2007, its partnership with Google was scanning several times that number each week, with a total time–in–flight for each book (the total duration from being picked off the shelf until being returned to the shelf) of five to eight days on average (Karle–Zenith, 2007).
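A quick back–of–envelope calculation makes the scale of the shift concrete. The sketch below assumes a collection of roughly seven million volumes and an illustrative five–fold weekly multiplier; neither figure is stated in the sources cited above, so they serve only to show how the cited "more than 1,000 years" estimate arises and how dramatically the partnership compresses it.

```python
# Back-of-envelope throughput comparison (collection size and Google-era
# multiplier are assumptions for illustration only).
COLLECTION_VOLUMES = 7_000_000

for books_per_year in (5_000, 8_000):
    years = COLLECTION_VOLUMES / books_per_year
    print(f"Pre-Google, {books_per_year:,}/yr -> {years:,.0f} years")

# "Several times that number each week": assume ~5x the old annual rate, per week.
weekly_rate = 5 * 8_000
print(f"Google-era, ~{weekly_rate:,}/week -> "
      f"{COLLECTION_VOLUMES / (weekly_rate * 52):.1f} years")
```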

Open Content Alliance

The Open Content Alliance (OCA) was launched in October 2005 as an open academic consortium to counter the commercially oriented nature of the Google Books project. It is based around a partnership of libraries and corporate sponsors under the administration of the Internet Archive. Libraries make their collections available to the Alliance for scanning, and corporate sponsors or the Internet Archive provide the funding to digitize them. The overall workflow of the OCA initiative follows that of the Google Books project very closely. Each library works with the OCA to develop a digitization plan, selecting a collection of works to be digitized under the project. The OCA provides the scanning equipment and digitization staff, and operates under an outsourcing agreement. Library personnel deliver materials to the onsite OCA facility, where they are scanned by OCA staff and then delivered back to the library. At no time are non–OCA personnel permitted to operate the equipment, but, unlike Google, OCA does permit physical inspection of its facilities.

Despite initial exploration of robotic book scanning technology, the OCA settled on a manual page turning model, developing a custom scanning system called the Scribe. The unit shares a similar design with the Kirtas Technology APT BookScan line and relies on the same Canon EOS–1Ds Mark II consumer digital cameras for its imaging system. A Scribe operator turns each page of a book and releases a foot pedal that lowers a V–shaped glass plate onto the book to flatten the pages for photographing. Each captured page is checked for image quality and manually adjusted as necessary, with the average Scribe operator completing 350 pages per hour (one page every 10 seconds) [1].

The full–color digital images are downloaded from the imaging cameras by USB cable to an accompanying computer and transmitted back to OCA’s primary server farm for further processing. While OCA originally experimented with outputting RAW images from the cameras to maximize detail capture, the file sizes were simply too large for its systems to handle, and it switched to outputting JPEG compressed imagery. Like Google Books’ workflow, this highly compressed lossy JPEG imagery is then converted to lossless archival JPEG2000 imagery for further use. However, converting a lossy format to a lossless one does not restore the data lost when the file was first saved or reduce the compression artifacts. In fact, when examining the raw JPEG2000 image scans for many OCA works (available via the FTP download link), JPEG DCT block artifacting is clearly visible, an unfortunate consequence of in–camera JPEG production.
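The point that lossless re–encoding cannot recover information is easy to demonstrate. The sketch below, assuming a Pillow build with OpenJPEG support and placeholder file names, re–encodes an in–camera JPEG as lossless JPEG2000 and confirms the pixels are carried over bit–for–bit, DCT block artifacts included; it illustrates the principle rather than OCA's actual conversion tooling.

```python
# Demonstration that a lossy JPEG re-encoded as lossless JPEG 2000 keeps its
# pixels (and therefore its artifacts) exactly. Requires OpenJPEG support in
# Pillow; file names are placeholders.
import numpy as np
from PIL import Image

src = Image.open("camera_capture.jpg")             # lossy in-camera JPEG
src.save("archival_copy.jp2", irreversible=False)  # reversible (lossless) JPEG 2000

roundtrip = Image.open("archival_copy.jp2")
identical = np.array_equal(np.asarray(src), np.asarray(roundtrip))
print("Pixels identical (DCT block artifacts carried over):", identical)
```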

Despite billing itself as the transparent alternative to Google Books, the Open Content Alliance has unfortunately released even less technical information about its operations than its corporate twin. While it offers numerous photo opportunities with its Scribe scanners, the backend technical infrastructure underlying the initiative remains largely unavailable for inspection. Some front–end components, such as the Scribe scanner operating software, have been released to the open source community [2]. However, at the time of this writing, the latest available version — September 2006 — included mainly binary files and negligible documentation. Short text files with basic installation and execution commands were provided, but it is left to the user to decipher the underlying processing pipeline. The downloadable software is designed specifically for the hardware of a Scribe scanning system, but no information is included in the package documenting the specifications of such a system. Users are left on their own to find information on building their own Scribe–like system. Further confusing users, the package includes utilities for working with Canon RAW image files, which the OCA no longer uses in its production workflows.

The backend infrastructure is almost entirely obscured from view. Through an analysis of the output PDF files, it is known that OCA uses the commercial LuraTech PDF Compressor product [3], with Abbyy FineReader providing OCR support, but the overall data management workflow remains largely unpublished. The Scribes themselves also remain somewhat elusive, with onsite personnel usually unable to answer most technical questions about the equipment. OCA has described its lack of published technical documentation as simply a matter of limited resources, a result of a small technical staff who cannot afford the time to generate detailed technical documentation for publication [4]. Nevertheless, extremely detailed documentation, including workflows, exacting hardware and software specifications, and other key documentation is circulated internally among its engineering staff [5]. It remains unclear why OCA has not published substantive information regarding its operations, even while Google engineers offer considerable detail in the field literature. However, the reasoning behind why both initiatives chose not to publish exhaustive technical information is not truly relevant. Whether it is selective dissemination or a simple lack of time, the end result is the same: precious little technical information is available to those interested in pursuing their own large–scale book digitization campaigns.

Transparency and openness

A common comparison of the Google Books and Open Content Alliance projects revolves around the shroud of secrecy that underlies the Google Books operation. However, one may argue that such secrecy does not necessarily diminish the usefulness of access digitization projects, since the underlying technology and processes do not matter, only the final result. This is in contrast to preservation scanning, in which it may be argued that transparency is an essential attribute, since it is important to understand the technologies being used so as to understand the faithfulness of the resulting product. When it comes down to it, does it necessarily matter what particular piece of software or algorithm was used to perform bitonal thresholding on a page scan? When the intent of a project is simply to generate useable digital surrogates of printed works, the project may be considered a success if the files it offers provide digital access to those materials.

An interesting contrast between the two initiatives may be found simply by examining the contracts they sign with partnering institutions. Google was initially extremely secretive regarding its contracts with partnering libraries, but several argued that as state–funded universities their legal agreements were public record and eventually published their contracts online. The company has subsequently made an about–face in this regard and now maintains a page on its official Google Books site with a link to the materials published by each of its partnering libraries, including the contents of the formerly secret contracts. While OCA maintains that its contracts are public and open, the author was unable to find any partnering library that published its contract online and at least one institution contacted declined to provide a copy of its contract. Others have similarly had little luck accessing the contracts, though at least one library has now come forward with its contract (Hirtle, 2007). While much is now known about the specific terms Google Books negotiated with its partners, precious little is known about OCA’s partnerships.

Public domain vs. rights restricted

The two initiatives have very different approaches to the often thorny issue of digitizing materials still protected by copyright or whose protection status is unknown. Google has taken the stance that in–copyright materials may be digitized under the allowances of fair use as long as they are available to users only as short snippets of text. Users may freely search the material and view results as two–three sentence blurbs showing where matches occurred, making it possible to find works that otherwise may never have been seen. OCA, on the other hand, has not entirely discarded the notion of digitizing in–copyright material, but has taken the approach of proactively working with publishers to secure permission to reproduce those works on its site before they are published online.

Beyond copyright, however, is the question of restrictions placed on what may be done with the digital files themselves. Even though copyright protection on the original printed work may have lapsed long ago, the digital files themselves are subject to a variety of legal protections. The Open Content Alliance makes considerable mention of its materials being available without restriction to public access and enjoyment (“Open Content Alliance,” n.d.). Google, meanwhile, offers a very similar motto for its project, touting at its latest partnership that its service allows readers around the world [to] view, browse, read, and even download public domain material (DeBonis, 2007). While both projects share a common allowance of personal viewing, downloading, and printing, use beyond that of a single individual varies considerably between the two projects, and even among the works within them.

The Google Books project is overseen by a single company with a single set of rules governing the use of its files. Any out–of–copyright work may be readily downloaded for personal use and enjoyment, and restrictions are placed only on the redistribution of Google–produced works. Permission is granted to partnering libraries for selected distribution of digitized versions of their contributed works, but other uses must be approved by Google. According to the legal notice at the beginning of all downloaded PDF files, Google encourages large–scale use of its materials for research and asks interested parties to contact it for further information. As a consortium of independent partners, however, OCA allows each of its member organizations to set their own distinct rights policies, rather than enforcing a single global set of rights. This forces users to inspect the rights statement of every single work they find to determine their rights to the content (such as whether it can be redistributed). Further complicating matters, the OCA entry for most works does not provide information on the rights restrictions for a given work, only the copyright status of the original printed material. To determine the rights status of any particular work, a user must follow the link in both the Digitizing Sponsor and Book Contributor metadata fields to view the rights restrictions enforced by the two organizations. A book contributor, such as an academic library, may place restrictions on its works regardless of the organization scanning it, while the scanning body (the sponsor) may place restrictions on works it scans, regardless of source. Hence, the burden falls to the user to read and legally interpret the rights statements of both entities before being able to determine the rights status of any particular work and whether a particular usage would be permitted. For example, the University of California Library system offers any out–of–copyright works it scans to the public domain, while works from its collections scanned by other agencies fall subject to those organizations’ rights restrictions (“Internet Archive,” n.d.). Some works, whose publication dates would automatically place them under copyright, include no copyright or rights status information whatsoever, such as the 1995 Microsoft RPC programming guide, contributed by O’Reilly & Associates, Inc. [6]

For works scanned by vendors which have substantially restrictive rights policies that would impact many common uses of a work (as opposed to more subtle restrictions), an additional Usage Rights link is usually provided. As an example, while Microsoft was an OCA partner, it changed its rights policies on materials it scanned to be considerably more restrictive on what may be done with them. Effective 1 November 2006, MSN–scanned materials may only be used for non–commercial purposes and may not be redistributed in whole or part by any commercially oriented service. Users are further encouraged to contact OCA to verify whether any particular intended use of the material would be permitted. The complexities of this relationship between book contributor and digitizing sponsor may be seen in the case of two out–of–copyright books contributed to OCA for scanning by the University of California Library system. The digital files for The adventures of Tom Sawyer, an 1876 book scanned by Yahoo! [7], are available in the public domain for any and all possible uses [8], while those for Within the golden gate: a souvenir of San Francisco Bay, an 1893 work scanned by Microsoft [9], may not be used for any commercial purposes. Even a service offering at–cost paperback reprints of OCA files to underserved community libraries would be prohibited from including the latter work without special permission from Microsoft [10]. Hence, two works from the same library digitized by OCA may have drastically different rights restrictions. With vendors changing their rights policies on new works over time, the OCA model unfortunately presents an extremely complex and ever changing legal landscape of ownership and rights over these digital files.

OCA further complicates rights determination by improperly marking some public domain works as being subject to copyright protection. Consider, for example, the 1929 work entitled Honduran mosses collected by Paul C. Standley [11] by Edwin B. Bartram (1878–1964), published by the Field Museum of Chicago in its series Fieldiana: Botany as volume 4, number 9. For nearly 80 years after its initial release, this publication has remained in print; it is currently available from the Field Museum’s distributor of Fieldiana, Fortsas Books of Chicago, for US$8.00. The Possible Copyright Status field claims the work to be IN_COPYRIGHT, while the scanning operator clearly indicates there is no visible notice of copyright. With a publication date of 1929 and publication in Chicago, this work fell subject to the U.S. Copyright Act of 1909, which unequivocally placed works published without notice of copyright into the public domain immediately upon publication. There is no ambiguity surrounding such works: they are clearly and without contest in the public domain. Ironically, while failing to clearly note rights restrictions on works with significant use limitations, the Open Content Alliance improperly classifies works genuinely in the public domain as restricted.
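The rule applied in this example can be expressed as a small decision function. The sketch below is a deliberately simplified treatment covering only the 1909 Act notice rule for works published in the United States; real copyright determinations involve renewal records, term lengths, and many other cases that this illustration ignores.

```python
# Much-simplified sketch of the 1909 Act rule used above: a U.S. work
# published before 1978 without a copyright notice entered the public domain
# immediately upon publication. Anything else needs fuller analysis.
def us_public_domain_under_1909_act(pub_year: int, has_copyright_notice: bool) -> bool:
    if pub_year < 1978 and not has_copyright_notice:
        return True    # published without notice -> public domain at publication
    return False       # requires further analysis (renewal, term length, etc.)

print(us_public_domain_under_1909_act(1929, has_copyright_notice=False))  # True
```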

A digital work does not need to include a clear statement of rights to enforce use restrictions, but many online services choose to provide clear rights information to proactively counsel users in acceptable uses. While it fails to include other types of metadata in its digital files, Google does take great pains to include a language–targeted legal notice at the beginning of all its downloadable PDFs, delineating acceptable uses of the work. Even though all out–of–copyright materials on the Google Books site fall subject to the same common rights restrictions, Google recognizes that files are often renamed and forwarded on, posted on bulletin boards, or otherwise disseminated beyond their original intended audience. To ensure that all eventual recipients of its works are aware of restrictions governing its content, Google provides one or more legal notices at the beginning of each work, one in the language of the user who originally downloaded the work, and one in the language of the work itself (though if both are the same, only one copy of the notice is included). The Open Content Alliance, on the other hand, embeds no legal notification of any kind within its files, even those governed by substantially restricted rights.

As an example of this difference, both the Google Books and OCA digital editions of the out–of–copyright Within the golden gate: a souvenir of San Francisco Bay carry restrictions prohibiting their use in any commercial capacity [12] [13]. The Google edition clearly states that commercial use is prohibited on the first page of the PDF file. The OCA edition, on the other hand, carries absolutely no notification of rights restriction of any kind. Given that no other metadata is included with either file, if the OCA edition is renamed and forwarded on, the eventual recipient will be entirely unaware of its restricted state. If the user is aware that the file was originally downloaded from the OCA site, he or she may even believe that the work is public domain, given the preference of OCA towards unrestricted works. Combined with OCA’s general pattern of making the rights status of its materials difficult and time–consuming to determine, significant questions are raised regarding long–term access and management of its digitized content.

The OCA’s stance on rights–restricted content

It is worth exploring the Open Content Alliance’s stance on rights restricted content a bit further, given the common comparison of its open content model to the more commercially oriented one used by Google. When the Open Content Alliance was first announced, its founder and chief evangelist Brewster Kahle gave a number of speeches on the ways in which the Alliance would revolutionize access to knowledge. Touting his project’s vision of making digitized books open and free, Kahle affirmed the Alliance’s guiding principle of offering its digitized works to the public domain (Kahle, 2005). However, he tempered this enthusiasm by noting the reality of digital ownership in that content donors would be allowed to place restrictions on the large–scale redistribution of their content. In his words, Kahle saw it as simply being fair to allow the organizations who funded the digitization to be able to control who could distribute those materials (Kahle, 2005). He saw no problem even with preventing other libraries from redistributing OCA material for academic purposes. If one library paid to digitize its materials under the OCA project, it should not be forced to allow another library to offer the same content on its Web site (Kahle, 2005). This mandate follows the general trend of academic libraries to prohibit the redistribution of materials they digitize, even by research groups within their own institutions. Microsoft took advantage of this rights environment when it changed its policy in November 2006 to prevent the commercial use of materials it digitizes for OCA.

While holding to a vision of an idealized world in which all knowledge would be held in the public domain, the Open Content Alliance was cognizant of the reality of the publishing world and provided for the inclusion of copyrighted works from the beginning. Its Collection Policy notes that it will initially concentrate on materials in the public domain or available under a Creative Commons license (“Next Steps,” n.d.), but its Call to Participate allows contributors to set the terms and conditions under which their content may be used (“Participate,” n.d.). For example, the publisher O’Reilly & Associates, Inc. was an early partner of the Open Content Alliance and has contributed a number of works to the project. These works are licensed under the Creative Commons Founders’ Copyright (“Founders’ Copyright,” n.d.), which grants traditional copyright protection to a work, but shortens the time span before a work reverts to public domain to just 14 years, with an option for a single additional 14–year extension. Some O’Reilly–submitted works, however, carry no rights information at all, leaving their status unknown [14]. Submission of copyright–protected works is not restricted solely to OCA’s corporate partners, however. Academic libraries have also contributed in–copyright material to the Open Content Alliance, such as a series of conference proceedings digitized by the University of Illinois [15]. The printed proceedings themselves were published under the Copyright Act of 1976 or carried notices of copyright, and so are subject to copyright. The University supported their digitization under the Open Content Alliance initiative and granted permission for their distribution through the OCA Web site for personal use only.

Metadata

Metadata is obviously a crucially important attribute of any large–scale digitization effort, with special care necessary to track the many attributes of each work through its lifecycle in the digital library. Both the Google Books and Open Content Alliance initiatives provide basic metadata regarding each work online, including title, author, publication date, publisher, and length. Google uses automated algorithms to further identify the title, table of contents, and copyright pages for all works, allowing one–click access to additional information about the work. The Open Content Alliance provides a considerable amount of additional metadata for its works, but only regarding internal operational information, such as the date the material was scanned, the name of the scanner operator, the unique identifier of the scanning computer system, etc.

Of particular interest is the decision by both initiatives to forgo the inclusion of metadata in PDFs made available for download. In any digitization project that allows downloading of its materials, it is extremely common for works to be forwarded on countless times, with filenames changing completely from their originals. By the time a file has been forwarded on to the third or fourth person, with not even a filename to go by, it may be difficult or even impossible to locate the original source of the work. Even a scholar downloading a copy to his or her personal computer may come back to a file a year later and be unable to recall its original source. Large corporate content repositories often embed metadata in their PDF files that include all of the basic attribution fields (author, date published, etc.), along with a locator URL that points the user back to the original source. It is especially surprising that the Open Content Alliance, an academic–led project driven largely by the library community, has failed to include any form of metadata in its downloadable files. As noted earlier, Google Books actually includes limited metadata in its files regarding rights restrictions, while OCA files licensed under terms prohibiting commercial use fail to include any such notifications in the files themselves.
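Embedding such attribution metadata is a small amount of work with standard tooling. The sketch below uses the pypdf library to write basic attribution fields and a locator URL into a downloaded PDF's document information dictionary; the custom /SourceURL key, the URL, and the file names are hypothetical and serve only to illustrate the practice described above, not either project's workflow.

```python
# Illustration of embedding attribution metadata and a locator URL in a PDF so
# the file remains identifiable after it is renamed and forwarded on.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("digitized_work.pdf")
writer = PdfWriter()
writer.append(reader)                      # copy all pages unchanged
writer.add_metadata({
    "/Title": "Illinois libraries. Part 5: Buildings, sources, publications",
    "/Author": "Katharine L. Sharp",
    "/Subject": "Digitized 1908 monograph",
    # Custom key pointing back to the source record (hypothetical field and URL):
    "/SourceURL": "https://example.org/catalog/record/12345",
})
with open("digitized_work_with_metadata.pdf", "wb") as fh:
    writer.write(fh)
```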



Conclusion: Comparing the two

While on their surface, the Google Books and Open Content Alliance projects may appear very different, they in fact share many similarities:

  • Both operate as a black box outsourcing agent. The participating library transports books to the facility to be scanned and fetches them when they are done. The library provides or assists with housing for the facility, but its personnel are not permitted to operate the scanning units, which must be staffed by personnel from either Google or OCA.

  • Neither publishes official technical reports. Google engineers have published in the literature on specific components of their project, which offer crucial insights into the processes they use, while talks from senior leadership have yielded additional information. OCA has largely been absent from the literature and few speeches have unveiled substantial technical details. Both projects have chosen not to issue exhaustive technical reports outlining their infrastructure: Google due to trade secret concerns and OCA due to a lack of available time.

  • Both digitize in–copyright works. Google Books scans both out–of–copyright books and those for which copyright protection is still in force. OCA scans out–of–copyright books and only scans in–copyright books when permission has been secured to do so. Both initiatives maintain partnerships with publishers to acquire substantial in–copyright digital content.

  • Both use manual page turning and digital camera capture. Large teams of humans are used to manually turn pages in front of a pair of digital cameras that snap color photographs of the pages.

  • Both permit libraries to redistribute materials digitized from their collections. While redistribution rights vary for other entities, both the Google Books and OCA initiatives permit the library providing a work for digitization to host its own copy of that digitized work for selected personal use distribution.

  • Both permit unlimited personal use of out–of–copyright works. Any user may view, download, and print out–of–copyright works from either service for his or her own personal use; restrictions apply only to uses beyond that of a single individual.

  • Both enforce some restrictions on redistribution or commercial use. Google Books enforces a blanket prohibition on the commercial use of its materials, while at least one of OCA’s scanning partners does the same. Google requires users to contact it about redistribution or bulk downloading requests, while OCA permits any of its member institutions to restrict the redistribution of their material.

When compared side–by–side, Google Books and OCA are actually not that dissimilar: both focus on volume scanning of books using a black box outsourcing model and make them freely available for personal consumption. When it comes to redistribution of their digital files, Google and at least one OCA partner prohibit commercial redistribution, but make allowances for select academic redistribution. While many libraries point to the OCA’s more liberal policies towards academic distribution (Google only allows contributing libraries to redistribute, with certain restrictions, materials it digitizes from their collections), one question that has thus far not been asked is why other libraries would want to distribute these materials. Certainly libraries may point to internal research projects that can benefit from large holdings of digitized books (which Google actually makes allowances for), or the myriad digitization portals being constructed by libraries around the country, but ultimately the issue that is not being considered is user experience. Google has continually set the standard for making digital materials accessible with technologies and interface design that appeal to the masses. Even a user with a low–bandwidth Internet connection can interactively explore Google’s carefully optimized digital content. The academic community has thus far failed to match Google’s success at creating powerful, yet extremely intuitive, interfaces for their digital content and at mastering the ability to harness the latest in technological advances to optimize the transfer of knowledge between producer and consumer.

A common argument in favor of library redistribution is that it allows those institutions to store their own copies of the digital files to ensure long–term preservation, whereas files hosted by Google could someday become unavailable. This is certainly a legitimate argument, especially in light of Microsoft’s recent withdrawal from book digitization, but it fails to take into consideration that, as discussed earlier, both projects are access digitization projects and do not generate preservation–grade materials. In terms of providing wide access through redistribution, an access digitization project is measured by the success of bringing its digital materials to its user communities, not merely on its ability to collect large amounts of material together. Similarly, the long–term availability of any digital collection is heavily dependent on the underlying funding that supports it. In the case of Google Books, advertisements are displayed alongside digitized content. While this does “commercialize” the user experience, it provides a critical revenue stream that supports the operations of the site, making it self–sufficient and providing the site with long–term financial viability. OCA, on the other hand, is exclusively dependent on foundation and corporate contributions, and requires libraries to heavily subsidize the digitization of their material. For example, both Yale University and the Boston Library Consortium will have spent many millions of dollars to support the digitization of their works (Kaplan, 2007), which would have been free under the Google system. Beyond these library subsidies, OCA is entirely dependent on its donor agencies. The Alfred P. Sloan Foundation provides a substantial amount of OCA’s operational funding (“Internet Archive Receives Grant,” n.d.), and if it ever decided to change its priorities, as Microsoft did earlier this year, OCA does not have the same diversity of funding sources to fall back upon that Google does.

One final point worth exploring in the library versus commercial distribution model is the ethical issue of sending users to a commercial site that logs the items they look for and the terms they search for. This is especially troubling given libraries’ historical commitment to privacy in their physical establishments. Google has recently announced significant enhancements in its privacy policy to address many of these kinds of concerns, significantly reducing the amount of time it retains user records. On the other hand, such arguments from the academic sphere fail to take into consideration the fact that all web servers log the actions of visitors, including those at their own institutions. Many universities furthermore have very different privacy policies in effect for their Web resources than librarians realize. In fact, numerous university Web sites actually embed within their pages Google Analytics tracking technology to help them better understand their visitor traffic (including the main homepage of at least one Big Ten university). In the end, for patrons of many university Web sites, Google will still track their every action, whether they were referred to Google directly, or whether the university’s Web staff use Google’s server log analysis technology behind the scenes.


About the author

Kalev Leetaru is Coordinator of Information Technology and Research at the University of Illinois Cline Center for Democracy, where he established and oversees its mass digitization center, Chief Technology Advisor to the Illinois Center for Computing in the Humanities, Arts, and Social Science, and Center Affiliate of the National Center for Supercomputing Applications. Among his research areas is the intersection of digital technologies and information management and he has recently completed a book manuscript on commodity access digitization.
E–mail: leetaru [at] uiuc [dot] edu



1. Personal conversation with Jae Mauthe of the Internet Archive.

2. See

3. See

4. Personal conversation with Jae Mauthe of the Internet Archive.

5. Personal conversation and forwarded information from Robert Miller, Director of Books for the Internet Archive.

6. See

7. See

8. See

9. See

10. See

11. See

12. See

13. See

14. See

15. See



“Date:1500–2008 — Google Book Search,” at–2008, accessed 9 October 2008.

Laura DeBonis, 2007. “Keio University Joins Google’s Library Project” (10 July), at–university–joins–googles–library.html, accessed 9 October 2008.

Digital Library Federation, 2002. “Benchmark for faithful digital reproductions of monographs and serials” (December), at, accessed 9 October 2008.

“Founders’ Copyright — Creative Commons,” at, accessed 9 October 2008.

“Google Book Search — News & Views — History,” at, accessed 9 October 2008.

Burt Helm, 2006. “Life on the Web’s factory floor: Who do you think turns all those words into an easy click?” Business Week (22 May), pp. 70–71, at, accessed 9 October 2008.

Peter Hirtle, 2007. “How open is the Open Content Alliance?” LibraryLaw Blog (30 October), at, accessed 9 October 2008.

“Internet Archive Receives Grant from Alfred P. Sloan Foundation to Digitize and Provide Open Online Access to Historical Collections from Five Major Libraries,” 2006. (20 December), at, accessed 9 October 2008.

“Internet Archive: University of California Library,” at, accessed 9 October 2008.

“John Wilkin on the U–M — Google Digitization Project,” 2006. OptimizationWeek (25 April), at, accessed 9 October 2008.

Brewster Kahle, 2005. “Announcing the Open Content Alliance,” Yahoo! Search blog (2 October), at, accessed 9 October 2008.

Thomas Kaplan, 2007. “Microsoft contracted to digitize library in 2008,” Yale Daily News (9 November), at, accessed 9 October 2008.

Anne Karle–Zenith, 2007. “Google Book Search and The University of Michigan” (20 February), at, accessed 9 October 2008.

Adam Langley and Dan S. Bloomberg, 2007. “Google Books: Making the public domain universally accessible,” In: Xiaofan Lin and Berrin A. Yanikoglu (editors). Proceedings of Document Recognition and Retrieval XIV (30 January–1 February, San Jose, Calif.), Proceedings of Electronic Imaging Science and Technology, volume 6500, pp. 65000H1–65000H10; version at, accessed 9 October 2008.

“Open Content Alliance” (OCA), at, accessed 9 October 2008.

Open Content Alliance (OCA), “Next Steps,” at, accessed 9 October 2008.

Open Content Alliance (OCA), “Participate,” at, accessed 9 October 2008.

Carolyn Said, 2004. “Revolutionary chapter — Google’s ambitious book–scanning plan seen as key shift in paper–based culture,” San Francisco Chronicle (20 December), p. F1.

Katharine L. Sharp, 1908. Illinois libraries. Part 5: Buildings, sources, publications. University studies of the University of Illinois, volume 2, number 8.

“UM Library/Google Digitization Partnership FAQ,” 2005. At, accessed 9 October 2008.


Editorial history

Paper received 31 January 2008; revised 3 October 2008; accepted 6 October 2008.

Copyright © 2008, First Monday.

Copyright © 2008, Kalev Leetaru.

Mass book digitization: The deeper story of Google Books and the Open Content Alliance
by Kalev Leetaru
First Monday, Volume 13 Number 10 - 6 October 2008
