The rise of reading analytics and the emerging calculus of reader privacy in the digital world
First Monday

by Clifford Lynch



Abstract
This paper studies emerging technologies for tracking reading behaviors (“reading analytics”) and their implications for reader privacy, attempting to place them in a historical context. It discusses what data is being collected, to whom it is available, and how it might be used by various interested parties (including authors). I explore means of tracking what’s being read, who is doing the reading, and how readers discover what they read. The paper includes two case studies: mass-market e-books (both directly acquired by readers and mediated by libraries) and scholarly journals (usually mediated by academic libraries); in the latter case I also provide examples of the implications of various authentication, authorization and access management practices on reader privacy. While legal issues are touched upon, the focus is generally pragmatic, emphasizing technology and marketplace practices. The article illustrates the way reader privacy concerns are shifting from government to commercial surveillance, and the interactions between government and the private sector in this area. The paper emphasizes U.S.-based developments.

Contents

1. Introduction: Who’s reading what, and who knows what you’re reading?
2. Collecting data
3. Exploiting data
4. Some closing thoughts

 


 

1. Introduction: Who’s reading what, and who knows what you’re reading?

There is a long and ugly history of efforts by various authorities to monitor and control what people read. Historically, going back many centuries, this control usually operated through a few fundamental tactics: kicking in the door to find out what materials people had in their possession; tracking what materials individuals obtained (through sales records from bookstores when available, or monitoring library use); and, controlling what materials people could obtain through various censorship and regulatory schemes focused on content supply channels. (Intensive interrogation and torture are also perennially popular ways to encourage people to confess to reading forbidden materials, or to incriminate others, and to exploit and reinforce the connections authorities like to build between such reading and other crimes). The enthusiasm with which authorities can employ these tools varies over time and place based on legal and constitutional controls, political winds and the level of acceptance of the rule of law, and social norms. My focus here is on monitoring, not simply as a pathway to control a populace, but as an avenue both for commercial and governmental activities broadly.

Surveillance of reading is both an end in itself and a means to an end. Sometimes the very act of possessing or reading certain materials constituted a crime or a sin; in other cases, reading the wrong things was viewed as (often conclusively damning) evidence of a wide variety of other crimes: terrorism, apostasy, heresy, assorted forms of perversion and degeneracy, treason, etc. Across the centuries, people have suffered and sometimes died for seeking knowledge or understanding through reading.

One striking development that I will highlight is the growth in commercial (as opposed to governmental) surveillance of reading behaviors, and the very flexible boundaries between the public and private sectors in the examination of this information. It’s also important to note that traditional, text-based long-form reading is a diminishing part of the overall consumption of various media. Private sector interests have become leaders in the surveillance of reading and the integration of this into a broader world of data, the tracking of trends and interests, of biases and affinities, and with this the goals shift: marketing, profiling, and manipulating are the new order of the day.

I concentrate here on the United States, where the First Amendment to the Constitution certainly helps to ensure the freedom to read, despite constant attacks and efforts to criminalize publishing or reading certain materials, or at least using reading choices as a way of targeting intensive scrutiny by the intelligence, law enforcement, or judicial systems. There’s no clear and broadly acknowledged legal right to privacy about one’s reading, at least at the federal level [1]. And commercial, as opposed to governmental, data collection and trafficking is largely unregulated.

1.1. A roadmap for this paper

This is a long paper; it covers a great deal of diverse territory, and is intended as a broad survey of a little-investigated area that calls for a number of disparate threads to be woven together in order to understand the terrain. At the same time, it clearly exposes several important trends and developments. These include the rise of the author as a not only interested but also increasingly empowered party in the data collection and distribution process. We’ve moved from collecting data primarily on what is acquired to also capturing what is actually read, and the amount of information about what’s being read has become massively more detailed, intimate, and timely; the paper goes into considerable detail about the nature of this data, how it is collected, who controls it, and how it is shared. It also considers how this newly available data might actually be used by various parties and to various ends; this is an issue that is less often examined than the simple fact that data is being collected. It raises questions about how much loss of privacy readers will ultimately be willing to tolerate, particularly in light of these uses.

At another level, the paper can be read as a case study of the way that computers and telecommunications networks broadly, and most recently and powerfully the Internet, have led to the astounding explosion in data about individual behaviors, and the connection of this behavioral data back to individuals, the arc from collecting data about sales of objects to the collection of individual purchasing (and other interaction) patterns.

Two detailed case studies are revisited repeatedly throughout the paper: electronic books, primarily mass-market consumer e-books, and academic and scholarly journals. These two very different worlds provide some illuminating contrasts.

Here, broadly, is the roadmap for the rest of the paper. The rest of this introductory section makes some general observations about the shifts in the landscape over the last century, and the societal role of libraries within this landscape. In the remainder of this paper, I’ll explore some of these developments, and try to provide a framework, or calculus, that may be helpful for further analysis. I’ll look at how data is collected, in terms of what’s being read, who’s reading it, how they found what they are reading, and how this information is, or potentially can be, exploited by various parties.

Section 2 studies the collection of data: what is being read, who is reading it, and how readers are finding what they are reading. The first question leads, in Section 2.1, into a history of the collection and reporting of sales data at various levels of aggregation, and of usage data for journals and journal articles. We then examine the shift from sales data to more detailed information about actual reading. Section 2.2 examines how readers find what they are reading, and the ways in which these activities can be captured and repurposed; the systems supporting discovery have changed enormously over the past three decades, and the implications of these changes are not well understood. The emphasis here is on user-driven discovery, as opposed to broad-based advertising. Section 2.3 concentrates on who is doing the reading, and the association of reading data with identifiable individuals. In order to really understand the situation in this area, we delve into how identity management is conducted in academic and public library settings, the former emphasizing journals and the latter e-books. Section 2.4 looks briefly at the way information about reactions to reading (as opposed to the simple act of reading) can be captured and shared; Section 2.5 looks at the collection of information on aggregations of reading material such as personal reading histories and the contents of personal digital libraries.

With a good understanding of what information is, or may be, available, Section 3 turns from collection to exploitation, examining the interested parties and how they might make use of data if they can get it. Section 3.1 is a brief, specific study of digital textbooks and some of the challenges and opportunities these present. While the first part of Section 3 focuses on the commercial sphere of publishers, authors, readers, and various intermediaries, Section 3.2 turns to government interests in the cornucopia of data.

Finally, Section 4 offers some closing thoughts, putting these developments in the context of much broader social shifts in understandings about privacy, and outlines directions for future research and inquiry. The footnotes are extensive, and not only provide references but explore many topics in additional detail, or point to related developments that may be of interest.

1.2. A shifting landscape: Authors, readers, publishers, libraries and retailers

Particularly in the latter part of the twentieth century and continuing into the present, one of the most profoundly important social functions of libraries was actually to offer readers such privacy as part of their broad commitment to intellectual freedom, most notably by ensuring that circulation records were not retained over the long term, and thus not available for the authorities, but also by declining to cooperate with authorities by spying and informing on their patrons absent legal compulsion, and by facilitating anonymous use of material within the physical boundaries of the library [2]. And, of course, by building broad and diverse collections that incorporated controversial topics and points of view. They served as intellectual sanctuaries. Many booksellers (and particularly those offering controversial materials) also strove to protect the privacy of their patrons, though their goals and, in some cases, their capabilities were more limited [3]. Post 9/11 legislation (such as the notorious USA PATRIOT Act) and the courts (when consulted at all) typically assigned little respect or value to reader privacy, and were more than willing to sacrifice it to the interests of the state. Those trying to protect reader privacy gradually realized that the best guarantee of such privacy was to collect as little data as possible, and to retain what had to be collected as briefly as possible. The hard won lesson: if it exists, it will ultimately be subpoenaed or seized, and used against readers in steadily less measured and discriminating ways over time. Situations in which this very private data can be demanded rapidly deteriorate from presumably extraordinary national security considerations involving existential threats or thousands of lives to routine and mundane local law enforcement and even civil litigation [4]. 
As we will see, today there is considerably more information about individuals’ reading choices and behaviors than has been available historically, and much of it is being collected for new reasons and by new parties, but we need to consider the extent to which it represents an attractive new target for various agencies of government.

For the past century or so, we have seen simultaneous growth in the extent and sophistication of data collection and aggregation by both publishers and the various players that serve as their primary channels to their customers, including bookstores, bookstore chains, fulfillment companies (“book jobbers,” subscription agents, etc.), and others. These players have changed over the years, with the rise of large national bookstore chains and their subsequent decline in the face of competition from Amazon in the consumer marketplace; in the academic marketplace serials jobbers play a much smaller role than they have historically. Both publishers and market channel participants have steadily sought to understand more about the demographics, geographic distribution, affiliations, and even the identities of their customers when possible. This has been historically complicated by the disconnections between publishers and end consumers in markets ranging from consumer books to scholarly journals, the anonymity of cash transactions, the diversity of channels, and other factors.

Historically, the authors of published books generally received almost no information about who was reading (or at least purchasing) their work, and how those readers responded to their work: perhaps some not-very-timely details of sales data as part of reports or royalty statements from their publishers, book reviews, the odd appearance of their work on a best seller list [5], fan mail, and speaking invitations. There was even less information available for academic papers: authors might find some citation counts from Science Citation Index in the scholarly world starting in the 1960s, and there were reviews as well as speaking invitations at academic conferences; occasionally there would be discussion in the popular press. Historically, also, to the extent that reading behavior was shared, it was shared voluntarily and socially with friends, colleagues, and neighborhood booksellers and libraries, and not aggregated.

Today, however, the transformation of all kinds of publishing, including scholarly articles and monographs, mass-market fiction and nonfiction books, newspapers, etc., from traditional print markets to the digital world is well advanced (although a perhaps surprising portion of the material actually persists in print as well as digital forms, meaning, among other things, that the picture provided by the digital versions is incomplete and possibly distorted). Scholarly journal publishing is rapidly leaving physical print behind, and digital books have captured a steadily growing share of mass-market book publishing (though this proportion now seems to have leveled off, at least for the time being, outside of certain kinds of genre fiction [6]). One of the byproducts of this transformation is a major restructuring of ideas and assumptions about reader privacy in light of the availability of information about what is being read, who is reading it, and (a genuinely new development) exactly how it is being read, including the end to frustrating reliance upon purchase, borrowing, or downloading as surrogate indicators for actually reading the work in question. The ability to capture, control and share this information is being exploited as a competitive advantage by various players in the new digital ecosystem, and we will examine this phenomenon in more detail later. There are interesting research questions here as well: the assumption always seems to be that more detailed and comprehensive data is better, but one might wish for more than sparse anecdote on the ways and extents to which very detailed data on how a given book is (or is not) read, and by whom, actually benefits the various interested parties: authors, publishers, retailers, platform providers, and even readers.

The nature of the way that electronic content is acquired, distributed, and read also calls into profound question the ability of libraries to continue to serve effectively in their role as intellectual sanctuaries.

I want to note a shift in the public rhetoric that underscores the changes that are taking place. Historically, most of the language has been about competing values and how they should be prioritized and balanced, using charged and emotional phrases: “reader privacy,” “intellectual freedom,” “national security,” “surveillance,” “accountability,” “protecting potential victims” and more recently “personalization” perhaps. These conversations are being supplanted by a sterile and anodyne, value-free discussion of “analytics:” reader analytics, learning analytics, etc. These are presented as tools that smart and responsible modern organizations are expected to employ; indeed, not doing analytics is presented as suggesting some kind of management failure or incompetence in many quarters. The operation of analytics systems, of course, implies and assumes the collection and retention of data of various kinds, and tends to shift discussions from whether data should be collected to what we can do with it, and further suggests that if we can do something with it, we should. I believe that we should be very cautious in accommodating such new rhetoric.

There has been a welcome renewed interest very recently in roles libraries can and should play in supporting patron privacy in this age of digital resources and expectations about personalized information systems (see, for example, the spring 2015 issue of Information Standards Quarterly from the National Information Standards Organization and the June 2015 E-Content Supplement to American Libraries). While these issues are certainly considered here, my focus is mainly on the reader, as consumer as well as library patron, and I also consider the interests of the full range of parties involved in the path from author to reader in both consumer and library-mediated scenarios.

 

++++++++++

2. Collecting data

2.1. What is being read, and how widely?

Data about what is being sold, at least at an aggregate level, has long been collected in forms such as best seller lists, publisher book sale data, and subscription information (which was made partially public by the forms that had to be filed for periodicals that enjoyed discounted postal rates). Bookstores tracked what they were selling, and, as large national bookstore chains emerged, and computerized inventory management and point of sale systems became more sophisticated, they aggregated this data as well as faceting it geographically. In library markets, book jobbers and subscription agents also functioned as data aggregators; it’s unclear how much of this information was shared back to publishers (and whether this occurred frequently or in a timely manner). Little of this data was publicly available and some of it was likely viewed as proprietary; probably magazine, journal, and newspaper publishers had the most solid figures on actual sales. Book publishers had to deal with the (sometimes lengthy) delays introduced by returns from the retail channels that actually sold almost all of their books. A book author would get a royalty statement from his or her publisher, but the delays here would be substantial and the data sometimes questionable, and typically would not include details such as where the book was selling best [7].

Keep in mind that all of this data was about sales; it only suggests what was being read to the extent that a purchase can be used as a surrogate for actual reading. This was especially problematic for publications like magazines and journals that aggregate content frequently read at the article rather than the issue level. Over time, particularly in the scholarly world, additional measures, such as citations to articles, came to complement journal subscription counts as ways of estimating how much a given article was being read, and more broadly, a given article’s level of influence. And, truth be told, there has always been a certain amount of uneasiness, cynicism, and speculation about how often, and how extensively, certain popular books and widely-praised classics are actually read as opposed to merely purchased for display on living room bookshelves [8]. In the digital world, it seems that making a personal statement through the acquisition and prominent display of books that may not be much read by their owner may actually become largely a thing of the past [9].

In the digital environment the situation has changed substantially. Sales data in consumer markets has become much simpler to understand in some ways since there are no retailer returns on electronic publications. There has been so much concentration of the consumer market into a very small number of retailing channels (e.g., Amazon, or, to a lesser extent, iTunes) that, for electronic publications, the retailers know as much as or more than the publishers, particularly in terms of timely data [10]. Reporting of anything beyond the most basic fact of the purchase (such as purchaser demographics) is presumably the subject of intense and sometimes acrimonious negotiation between the retail channel and the publishers. Sometimes, the retail channel is now also the publisher (Amazon is moving aggressively in this area, for example).

A history of the recent developments in tracking mass market printed and e-book sales is very complex and obscure, and beyond the scope of this paper, but I’ll attempt a brief and imperfect summary here (I do not know of an authoritative and comprehensive source on this topic; an analysis of the e-book marketplace in particular has become a byzantine blood sport). In 2001 (an incredibly late date, when one considers that a parallel database for recorded music sales, SoundScan, goes back to 1991), Bowker launched a database called BookScan, which used national point of sale data to track mass-market book sales [11]. This database powers the Wall Street Journal bestseller lists, for example. There are some well-understood limitations about classes of sales not covered by the database (such as library sales). BookScan was sold to the Nielsen Corporation (of Nielsen TV ratings fame, but by then already becoming a powerhouse in tracking consumer interactions with and across all media) in 2013. BookScan covered print sales; in 2011 an additional service called PubTrack was launched which covered e-book sales; as best as I can tell this data is based on publisher (not platform) reporting about sales, so it includes, for example, major mass market publisher sales via platforms like Amazon and Apple, but not Amazon independent authors [12] (as a marketing matter it seems to have now been subsumed into BookScan). One key and contentious question is how much of the e-book marketplace is controlled by independent authors (as opposed to authors with traditional publishers) using Amazon, Apple, and a host of much smaller sales channels to sell their works directly; this appears to be very large (at least for certain mass market genre areas), perhaps three-quarters of the marketplace, and apparently not represented fully in reporting to BookScan’s e-book coverage components [13].
Based on late 2015 data, very roughly speaking, Amazon dominates the e-book marketplace, with perhaps 74 percent of unit sales, followed by Apple with 11 percent, the Barnes & Noble Nook at eight percent, Kobo at three percent, Google at two percent and everyone else at noise levels; the picture by revenue is a bit different, with Amazon at 71 percent, Apple at 12 percent, B&N Nook at nine percent, etc. [14]

Some sales data (again, think of Amazon sales rankings here) has become much more public and timely, though one might reasonably worry that it is being selected and shaped by the retail channels with their own goals in mind. Publishers have opened up their sales data to the authors and the public as well by contributing timely data.

Not all sectors have become more transparent. For example, if one looks at the academic journal marketplace, the role of subscription agents has substantially diminished, and the deals are struck between individual publishers and major research libraries or consortia of such libraries. We have seen the rise of the “big deal” (bulk licensing of huge numbers of titles from the largest publishers, such as Elsevier) and of confidential contracts (though the days of such contracts may be coming to an end) [15]. It’s no longer clear who’s subscribing to what, or what they are paying. Major publishers often hold this data as proprietary, and the gap between knowing the number of subscribers (in the sense of potential readers authorized by institutional site licenses) and understanding the actual use and impact of various journals remains problematic.

The connection between what’s being sold and what’s being read has become very interesting in the digital environment. One can think of the activity of reading as splitting into two spheres: monitored and unmonitored. At a high level, think of unmonitored environments as those that are private: you import a work into one of these environments (by downloading it, for example, or by printing it) and what subsequently takes place between you and the work is invisible to other parties. By contrast, in a monitored environment, your interactions with the work are subject to continuing detailed data collection and reporting. These are probably not the descriptive terms that publishers, retailing channels, and delivery platforms would choose, but I will use them here. In addition, the players in the delivery chains, who prefer the monitored environments, would stress the “opportunities” for social interaction such as recommendations and annotation (including shared annotation) that are part of that environment, as well as affordances such as synchronization across multiple devices (though it should be simple enough to synchronize without passing all this information to a central publisher site in unencrypted form).

In some situations readers really don’t have much choice. If you “purchase” (actually license) an e-book from Amazon, unless you are technically sophisticated and ready to do some work (perhaps of dubious legality under the Digital Millennium Copyright Act [DMCA]), you are only going to be allowed to read it in a monitored environment: on a Kindle e-book reading appliance, or on a Kindle software app on your laptop, tablet or smartphone. Amazon can know what pages you looked at, and when you made transitions from one page to another. There are what would seem to be very good heuristics building on this data that would indicate whether you finished the book, or at what point you gave up on it; perhaps a little less convincingly, how quickly you moved through different segments of the book. If you choose to make annotations or underline passages, this is not only recorded but potentially shared with other readers (leading to an interesting recent experiment of estimating when readers “give up” on books based on this publicly available data on the number of highlight annotations; the density of such annotation in the text is used as a weak surrogate for the reading trail data that Amazon holds internally but does not make public [16]). In fact, in 2015 Amazon announced a plan to pay some authors publishing directly through them by the number of pages read [17].
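As an illustration of the kind of heuristic described above, the following Python sketch estimates how far a reader got in a book and how quickly he or she moved through it, given only timestamped page transitions. The input format, the function name `summarize_reading`, and the 90 percent completion threshold are all assumptions made for illustration; this is not a description of how Amazon or any other platform actually computes such measures.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# A page turn: (time the page was opened, page number). Hypothetical format.
PageTurn = Tuple[datetime, int]


def summarize_reading(turns: List[PageTurn], total_pages: int,
                      finished_threshold: float = 0.9) -> Dict[str, object]:
    """Estimate progress through a book from page-transition records.

    finished_threshold is an assumed cutoff: a reader who reached at least
    this fraction of the book is treated as having finished it.
    """
    if not turns or total_pages <= 0:
        return {"furthest_page": 0, "finished": False, "seconds_per_page": None}

    ordered = sorted(turns)                      # chronological order
    furthest = max(page for _, page in ordered)  # furthest point reached

    # Average time between consecutive page transitions, as a rough pace.
    gaps = [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    pace = sum(gaps) / len(gaps) if gaps else None

    return {
        "furthest_page": furthest,
        "finished": furthest / total_pages >= finished_threshold,
        "seconds_per_page": pace,
    }


# Three page turns in a ten-page work: the reader stops at page 3.
turns = [
    (datetime(2016, 1, 1, 20, 0, 0), 1),
    (datetime(2016, 1, 1, 20, 1, 0), 2),
    (datetime(2016, 1, 1, 20, 3, 0), 3),
]
summary = summarize_reading(turns, total_pages=10)
```

The pace figure here simply averages the gaps between transitions, so a pause of a day between reading sessions would swamp it. That weakness is one plausible reason the speed estimate is less convincing than the estimate of where a reader stopped.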

The body of academic research on the interaction of readers with mass-market publishing in digital form is very limited. Broussard and Doty [18] offer a helpful recent high-level survey. Access to data, particularly in U.S. markets, has been very limited. There’s an interesting recent example of analysis based on data from Russia by Braslavski, et al. [19]

In other situations, the reader can genuinely choose to interact with material in a monitored or unmonitored environment. (As far as I know, there is little public data about which choices readers make, or why they make these choices, though there is plenty of anecdote: I frequently still hear the “I print out articles for intensive reading” assertion about scholarly articles, though almost never “I want to hide my interactions with a work.” And there is plenty of proprietary data about the actual choices readers make, of course.) A scholarly publisher might present an on-screen abstract for an article, and then a series of choices that included downloading a PDF (and thus moving further interactions into an unmonitored environment, since there are general-purpose PDF viewers that are widely believed not to spy on their readers; though in a post-Snowden world, the level of confidence here may not be what it once was [20]). For such downloaded PDFs, reading (or even printing) rates are unknown; there’s anecdotal evidence that the traditional academic office full of dusty preprints and article copies presumably absorbed by osmosis has metamorphosed into a laptop full of ever more downloaded article PDFs the owner really genuinely intends to read, but never quite gets to.

There is a rich and extensive literature about the extent to which scholars read articles from the literature, but almost all of this relies on data gathered by surveying readers. See the International Association of Scientific, Technical and Medical Publishers (STM) report [21] for a good survey of this work. Carol Tenopir and Donald King, with various colleagues, have conducted a very extensive longitudinal series of studies about such practices. Indeed, in a recent survey, Tenopir, et al. argue that download counts underestimate article impact, again based on self-reported data rather than actual data on reading patterns [22]. A very important and unusual data point here is the Nicholas and Clark study, which offers data suggesting that about half of the downloads of articles are never subsequently read [23].

Another thread of analysis has explored and questioned downloading statistics, leaving aside the downloading-reading correlation, particularly in the context of institutional repositories. Such statistics can both overestimate and underestimate use: the overestimation problem involves filtering out downloads by robots; the underestimation problem involves accounting for pathways to PDF files that aren’t counted by standard analytic tools [24].
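The overestimation side of this problem can be illustrated with a minimal Python sketch that discards downloads whose user-agent string looks like a robot before counting them. The log format (pairs of article identifier and user-agent), the marker substrings, and the function names are hypothetical; real tools typically rely on maintained robot lists and additional signals rather than a short list of substrings.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical substrings treated as signs of an automated client.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def looks_like_robot(user_agent: str) -> bool:
    """Crude check: does the user-agent contain a known robot marker?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def count_human_downloads(log: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count downloads per article, skipping apparent robots.

    Each log entry is an (article_id, user_agent) pair (assumed format).
    """
    counts: Counter = Counter()
    for article_id, user_agent in log:
        if not looks_like_robot(user_agent):
            counts[article_id] += 1
    return dict(counts)


log = [
    ("article-1", "Mozilla/5.0 (Windows NT 10.0)"),
    ("article-1", "ExampleBot/2.1"),
    ("article-2", "Mozilla/5.0 (Macintosh)"),
    ("article-2", "example-spider"),
    ("article-2", "Mozilla/5.0 (X11; Linux)"),
]
human_counts = count_human_downloads(log)
```

A filter of this kind addresses only overestimation. It does nothing about downloads that never reach the log in the first place, which is the underestimation problem.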

Other choices offered might include reading the article online, which would keep the reader in a monitored environment, and include the ability to capture not just the kinds of basic viewing trails one can capture on an e-book reader, but also activities like skimming illustrations, following references, navigating links (from, say, a protein name in an article to an entry in the Protein Data Bank) or employing other browsing or analysis tools offered by the online user interface. Digital critical editions of texts are likely best supported in monitored environments, as these grow more sophisticated and more inherently networked with other reference sources.

It is also worth noting that as we move away from digitally rendering scholarly articles based on centuries-old conceptual models of print pages to much more complex, multidimensional, interlinked, networked and interactive objects, the boundaries between the monitored and unmonitored environments become more complex [25], the reader’s ability to move to an unmonitored environment may become more constrained for a specific research object (to use the current popular generic term), and the interaction records between reader and article may become much more complex. It’s likely that new “value-added” environments are going to be monitored and, increasingly, public environments. Recognize that, at least in scholarly contexts, the values and goals infusing many of these emerging communications environments are really about sharing, collaboration, open science, and ensuring replicability and reproducibility much more than about protecting reader privacy. Also, the various metrics of “scholarly impact” that are being promoted widely (particularly new ones, the so-called “altmetrics”) go hand in hand with the ability to track behavior; capturing such measurements is an explicit goal of the transformation of scholarly communications and implicit in many new communications environments [26].

Broadly speaking, in these more interactive reading environments, the privacy that readers have will be about withholding identity details from platforms (itself a tradeoff in terms of functionality and personalization, and relying on mediated scenarios with a purchasing agency like a library rather than the individual licensing access where necessary), and, most importantly, about policies pertaining to privacy and data collection, and the extent to which delivery channels can be trusted to adhere to these policies. It will not be based on making data collection technically impossible, because this will be beyond the reader’s capabilities. And recall here the fundamental rules discussed earlier: if the data can be captured, it will be, at least by commercial interests [27], and if it exists, the government will always ultimately demand it if they think they can exploit it.

We can think, then, of a record of the interaction of a reader and a text: it would involve what parts are viewed, navigation with timestamps, and perhaps additional actions such as underlining, annotation, or saving or following links. All book reading hardware platforms and software emulating such devices on other platforms like laptops can, in theory at least, capture these records. As far as I know, there is little public information about how systematically and extensively these are actually captured, collected, aggregated, stored, and shared by the delivery platform and channel providers [28]. Third parties can also get in on the act: consider the analog of a laptop or desktop keylogger for an e-book reader, smartphone, or similar platform, or “man-in-the-middle” type attacks on network connections.
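To make the notion of an interaction record concrete, here is a minimal sketch in Python of what such a record might contain and how dwell time could be derived from it. The field names, the set of actions, and the `seconds_per_location` helper are illustrative assumptions, not a description of what any actual platform stores or computes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadingEvent:
    """One hypothetical entry in a reader-text interaction record."""
    work_id: str                     # identifier of the text being read
    location: str                    # e.g., a page or chapter reference
    action: str                      # "view", "underline", "annotate", "follow_link"
    timestamp: float                 # seconds since some reference time
    reader_id: Optional[str] = None  # None models an anonymous record


def seconds_per_location(events):
    """Estimate time spent at each location from gaps between consecutive views."""
    views = sorted((e for e in events if e.action == "view"),
                   key=lambda e: e.timestamp)
    totals = {}
    # The final view has no successor, so its duration cannot be estimated.
    for current, following in zip(views, views[1:]):
        elapsed = following.timestamp - current.timestamp
        totals[current.location] = totals.get(current.location, 0.0) + elapsed
    return totals
```

Even with `reader_id` left empty, a sequence of such events reveals how a text is actually read; attaching an identifier turns the same sequence into a personal reading history.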

It’s also important to recognize the range of personal identifiability in data collection, which might go from a series of anonymous interaction records about a specific text at one extreme, all the way through a deep record (really, a database) about a specific customer of a given service that aggregates everything he or she has read through that service for a decade or more, with detailed interaction records for each individual work, as well as the searches or other mechanisms used to discover what he or she selects for reading. It is clear, however, that the sharing of such information with traditional publishers, much less authors, cannot be assumed: it’s a matter of contract and negotiation. I will return to the important questions about what this data is actually good for, and to whom, later in this paper.

There’s something strange happening here. Amazon and Apple, both major controllers of the e-book marketplace, can exercise very tight control over both their online “bookstores” and their follow-on reading environments; yet these sites are heavily populated by third-party trackers of various kinds [29]. Amazon, at least, will find “you” on many other sites on the Web. The data interchange agreements and flows between the third parties and the primary publishing platforms seem to be totally opaque here, other than to note that the privacy “policies” on the host sites explicitly recognize this type of data exchange.

Note that while some genres of material (scholarly journals, the sciences in particular being an excellent example) have now moved almost entirely to digital formats, and thus understanding what’s happening in the digital marketplace essentially tells the whole story, this is far from true for other genres, notably mass-market books and even scholarly monographs. Here matters are very complicated indeed: there is very different data, and data availability, for the physical print and electronic markets for the same works, and there is the additional complication that there are at least two partially distinct reader communities that may be expressing rather different reading patterns and behaviors with regard to those works. Pricing differences between printed and electronic versions may further confuse matters. An author (or publisher) trying to operate in both worlds faces genuine challenges trying to make sense of the various datasets and data sources available.

2.2. How do readers discover or choose what they read?

Publishers and authors have always wanted to know how readers discovered and chose the books that they purchased (and presumably read). Historically, very little was known about this question except for occasional, laborious reader surveys that were largely focused on understanding the effectiveness of advertising campaigns, the impact of choices in cover art, or on trying to demonstrate the importance of retail outlets featuring and promoting books appropriately. Word of mouth, book reviews, reading clubs, and similar routes also played a role in discovery, though typically almost impossible to measure case by case. There really wasn’t very much data to share, though the data that did exist was generally privately held; I do not know the extent to which data about the effectiveness of a publisher’s advertising campaign, for example, might historically be shared either with an author or a bookseller (though certainly the level of investment in an advertising campaign is traditionally a subject of negotiation between publisher and potential author, and between publisher and bookstore chains in placing pre-orders for a new title).

The information discovery behavior of scholars has been repeatedly studied. It’s clear that word of mouth among colleagues is very important; scanning tables of contents for current literature (at least for what are perceived as key journals) as they are published is also important. Faith in technology-based solutions such as recommender systems is very limited, though the hope of startups springs eternal as they try to find the right mix of social networking/community systems and recommenders.

For scholarly materials, one could also count citations and hope that this might be one route to discovery by readers, or try to estimate the general impact of a journal as a debatable proxy for a discovery mechanism; these were third-party functions, so the data was broadly available (though not free in most cases) but perhaps not fully transparent or readily reproducible independently [30]. Matters became more interesting as electronic abstracting and indexing services, and then, later, full text searching became important (indeed, primary) methods of discovering articles: here the platform provider knew what search terms the user employed to discover the article in question. Historically these abstracting and indexing services were distant both from publishers and authors; think of a system like Dialog in the 1970s [31], or the programs to mount various abstracting and indexing (A&I) databases as extensions to large-scale online catalogs like the University of California’s MELVYL system in the 1980s and early 1990s [32]. In this situation there was no real path for search terms used in discovery to be propagated back to either publishers or authors. Savvy authors were, of course, mindful that the choices of keywords and of words appearing in titles and abstracts were important in making their articles visible to or discoverable by their desired audiences. Of course, this was in a simpler time, when retrieval was essentially deterministic term matching, before relevance ranking algorithms, page rank, and search engine optimization.

Matters have become more complex as these proprietary A&I services have been largely supplanted by free specialized tools like Google Scholar or Microsoft Academic, as well as more general purpose services like the basic Google or Bing search engines. Some long-standing A&I databases have gone out of business or been acquired and consolidated. In some cases, it is possible for the site hosting materials of interest to obtain the search terms that brought the user to this site through information in the HTTP access request (the Referer header field). Policies at hosting sites with regard to recording and sharing this information doubtless vary widely, and are not very transparent (even for non-profit scholarly societies) with the exception of some government sites like the National Library of Medicine’s PubMed. Certainly also broad commercial sites like Clarivate Analytics’ (formerly Thomson Reuters’) Web of Science or Elsevier’s ScienceDirect and Scopus, as well as more specialized sites like the American Mathematical Society’s Mathematical Reviews, capture a great deal of information about search terms used for discovery, but it’s unclear whether this information is passed back to publishers, and additional upstream propagation to authors seems very unlikely. One potential special case is where the publisher is the platform operator, and can readily gather this information and share it with authors.
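As an illustration of how a hosting site might recover discovery terms from the HTTP Referer header, the following Python sketch parses a referring URL and looks for a query parameter. The parameter names in `QUERY_PARAMS` and the example URL are assumptions for illustration only; whether a given search service actually exposes the query string in the referring URL varies, and many do not.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical parameter names under which a referring search page
# might carry the user's query; real services differ.
QUERY_PARAMS = ("q", "query", "p")


def search_terms_from_referer(referer):
    """Return the search terms embedded in a Referer URL, or None if absent."""
    if not referer:
        return None
    params = parse_qs(urlparse(referer).query)
    for name in QUERY_PARAMS:
        if name in params:
            # parse_qs decodes "+" and percent-escapes for us.
            return params[name][0]
    return None
```

For example, a request arriving with the header value `https://search.example/search?q=reader+privacy` would yield the string `reader privacy`, which the hosting site could then log alongside the article requested.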

In the consumer world, Amazon, of course, knows a vast amount about how its customers discover and make choices about potential purchases (through mechanisms like recommendations and reviews as well as traditional search), and it has now established a very substantial historical database; in some cases Amazon also knows actual reading patterns for these purchases. I don’t know what, if any, of this information is shared with publishers, or with authors that use Amazon as a publisher. It is doubtful that the data is being passed along to the traditional publishers using Amazon as a retailing channel; it is also unlikely that the information is sent, ultimately, to authors. There are some very interesting and hard research questions about how one might model and abstract user interactions within a very rich and history-dense discovery and purchase environment such as Amazon in ways that might be usefully shared, transferred, or aggregated across such environments; even in the hugely improbable situation where Amazon was willing to make this available, it’s not clear how they would actually do so. Though I have seen little in the literature, presumably there is industry-based research on the very significant problem of how to model, represent, and structure this kind of deep, complex personal interaction history information (think of the user-related data held by organizations like Google, Facebook, and Amazon) for transfer at least across versions of internal systems over time (if not across organizational boundaries or, more probably, internal system silos).

A final set of points should be made here regarding disaggregation. For scholarly journals, there has long been a market in individual articles, and virtually all of the search and discovery mechanisms disaggregate journals into collections of articles, from which individual articles of interest are selected. In the consumer marketplace, perhaps with the rare exception of short story collections and a few nonfiction works, books are not subdivided: one purchases or licenses the entire book as a unit. In the scholarly arena, however, there has long been interest in being able to discover and obtain individual book chapters. As these monographs finally become e-books, and collections of such e-books form searchable databases, discovery of individual chapters or even smaller units like paragraphs becomes very feasible alongside the traditional discovery of works at the monograph level; some university presses are already selling individual chapters as well as complete monographs. Discovery terms at these more granular levels are likely to be of some significant interest to both authors and publishers of academic monographs, and sharing of these queries may well become subjects of negotiation between publishers and monograph platform providers on one hand, and between publishers and their authors on the other.

2.3. Who is doing the reading?

In the digital world, the most common case, particularly in consumer environments, is that the platform provider knows exactly who is doing the purchasing, the downloading, and the reading, at least to the point where material is exported from the commercial platform and into an unmonitored reading and use environment. They have a name and user ID, a credit card number, and perhaps a customer history; they can make commercial arrangements with other organizations to obtain additional information about the customer if they want it. Sometimes consumers provide more information than they realize: for example, a seemingly innocuous act like using a corporate rather than personal credit card, or providing a workplace delivery address, can be quite revealing, particularly when aggregated [33].

For material that is available for free public access on the Internet, there’s at least the hope (optimistic dream?) of anonymous access, though this is far harder to genuinely accomplish than generally realized. Many free and public sites actually apply extensive and sophisticated technologies to track and identify visitors; reader privacy requires tools to defeat these technologies. And then there are various forms of third-party monitoring or redirection of network traffic, or even government subversion of hosting sites. We will not pursue the broad question of the possibility of genuinely anonymous access to open networked information resources further here other than to recognize that it is not as easy as one might think [34].

The mythic possibility of anonymity of retail in-person cash transactions with a (presumably large, busy, unfamiliar) big-city bookstore is almost impossible to re-create on the Internet, and the online channels really don’t want to facilitate the re-creation anyway. They want to know all about you, and as part of this offer many new amenities (such as discounts, recommendations, pre-ordering, etc.) that are predicated on knowing your identity. And honestly, routine interactions with one’s local bookstores have been a bit more suspect in terms of genuine privacy through anonymity for a while: credit card records, computer-based store sales records, temptations of frequent reader loyalty programs and their discounts, closed-circuit television (CCTV) tapes, clerks who remember you and might even get to know your interests.

Historically, the essential institution that has protected the privacy of readers is the library, perhaps most notably the public library, but also the research libraries that primarily serve specific academic communities and may also support important broader secondary communities of citizens at large, particularly in the case of research libraries that are part of public state universities and thus have an explicit policy function serving as the final line of support for the statewide public library system in conjunction with the state library. But, as discussed earlier, they were mainly trying to protect readers from government, not from other actors within the publishing ecosystem [35]. Corporate and similar special libraries also, of course, have concerns about maintaining confidentiality overall, sometimes for competitive reasons involving corporate strategy, patent filings, etc. Research libraries also have to manage confidentiality concerns related to researcher precedence in discovery and publications.

As materials have shifted to electronic form [36], however, much more interesting and complex developments have taken place that redefine the library’s abilities (and inabilities) to serve as a protector of patron privacy. The situation is still unfolding, but we’ll explore two cases in point here: the relatively mature case of research libraries and their role in providing access to electronic scholarly journals in academic settings, and the very rapidly evolving situation where (primarily public) libraries are trying to provide patrons with access to mass-market e-books. In my descriptions of both cases I’ve suppressed a lot of details, special cases, and counter-examples, but I hope that the simplified summaries here provide a good sense of the overall trajectory of developments in each arena.

2.3.1. Identity management: Research libraries and scholarly journals

As scholarly journals became available electronically, they were first mounted locally by various research libraries (in experimental or prototype contexts; see for example the University of California system-wide IEE (Institution of Electrical Engineers)/IEEE (Institute of Electrical and Electronics Engineers) collaborations, the multi-institution TULIP project with Elsevier, Carnegie Mellon’s Project Mercury, Red Sage at the University of California San Francisco, or other experiments that took place in the mid to late 1990s [37]). In this context the libraries had complete control over authenticating users and logging user activity (though contractually they might be obligated to maintain effective authentication mechanisms and to report certain use data to the publishers). With a very small number of exceptions, like Los Alamos National Laboratory or a consortium in the Canadian province of Ontario, this model was quickly abandoned because of high costs and poor scalability. Journals moved to publisher or third-party access platforms on the Internet that served the entire research and higher education communities.

If we try to survey the landscape today (particularly with regard to various metrics about market share, such as number of journals hosted, revenue, etc.), specific data is very elusive, but some broad contours can be established. The largest journal publishers have internally developed systems (for example Elsevier’s ScienceDirect or Springer Nature platforms; Elsevier still uses Atypon as a platform for some of the society journals they publish). The smaller journal publishers aggregate their journals primarily on Atypon or HighWire’s platforms, with Silverchair as an additional significant player. Interestingly, in summer 2016, Wiley purchased Atypon and announced its intention to transition their own online journals to the platform, raising some very interesting questions about “Chinese walls” and the potential flow of data from all Atypon users to Wiley, among other issues; a full analysis of this issue goes beyond the scope of this essay (Wiley and Atypon, of course, said that they would maintain the historic confidentiality of Atypon customer data).

As the marketplace restructured over time, the requirement for platform operators was to decide whether a given “user” should have access to a particular journal that was licensed by his or her institution through the research library. The first approach to this question was to look at the originating Internet protocol (IP) address, and for the platform operators to maintain tables of IP address ranges that belonged to specific organizations. This method worked, but had significant limits in both functionality and privacy. For fixed workstations on campus, access could be well correlated with individuals. From an access perspective, it was a very coarse-grained means of control; furthermore, it did not accommodate off-campus access by legitimate users without the introduction of some form of proxy mechanism so that the user appeared to the platform provider as originating from an authorized IP address. Ultimately the alternative solution that evolved was largely based on a mechanism called Shibboleth and inter-organizational trust and identity federations [38]; the idea here is that the user would authenticate to his or her home institution, and then that home institution (through a trusted server) would release “attribute” information for the user to a publisher or platform provider which would allow access decisions to be made within the context of a pre-established trust relationship.
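The IP-based approach described above amounts to a table lookup. The following Python sketch, using the standard `ipaddress` module, shows the idea; the `LICENSED_RANGES` table, the institution name, and the address ranges (drawn from blocks reserved for documentation) are hypothetical, and an actual platform would maintain far larger tables.

```python
import ipaddress

# Hypothetical table mapping each licensed institution to the address
# ranges it has registered with the platform operator.
LICENSED_RANGES = {
    "University X": ["192.0.2.0/24", "198.51.100.0/25"],
}


def institution_for(ip, table=LICENSED_RANGES):
    """Return the institution whose registered ranges contain ip, or None."""
    addr = ipaddress.ip_address(ip)
    for institution, ranges in table.items():
        for cidr in ranges:
            if addr in ipaddress.ip_network(cidr):
                return institution
    return None
```

The sketch also makes the limitations visible: the decision depends only on where the request appears to originate, so an off-campus user is refused unless a proxy makes the request appear to come from a registered range, and every user behind that proxy then looks identical to the platform.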

Depending upon the complexity of the contractual provisions, the attributes released could vary considerably, i.e., “this is an authorized user covered under the current contract between University X and Publisher Y,” or “this user is a current member of the University X faculty,” or even “this user is a current member of the Law School Faculty.” There might also be a session identifier or a user identifier which was opaque to the publisher, but could be used in communicating with the university for debugging or investigating various types of improper access (credential theft, mass downloading [39], etc.). Contracts and internal university policy decisions would determine how long logs mapping these identifiers to actual internal identities were retained, and the circumstances upon which such a mapping would be revealed.
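A minimal Python sketch of attribute release with an opaque identifier follows. It is not Shibboleth’s actual implementation or API; the policy structure, attribute names, and the keyed-hash construction of the identifier are all assumptions chosen to illustrate the idea that the home institution decides what each publisher learns.

```python
import hashlib
import hmac


def opaque_id(secret, user_id, publisher):
    """Derive a per-publisher identifier that only the institution can map back.

    Assumption: a keyed hash (HMAC-SHA256) over publisher and user; the
    institution keeps the secret and any logs needed to reverse the mapping.
    """
    message = f"{publisher}!{user_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def release_attributes(user, publisher, policy, secret):
    """Release only the attributes the (hypothetical) policy allows for this publisher."""
    allowed = policy.get(publisher, [])
    released = {name: user[name] for name in allowed if name in user}
    released["opaque_id"] = opaque_id(secret, user["id"], publisher)
    return released
```

Under a policy such as `{"publisher-y": ["affiliation"]}`, the publisher would learn that the user is, say, a faculty member, together with an identifier that is stable for that publisher but differs across publishers, so it cannot by itself be used to correlate a reader’s activity between platforms.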

It has been unclear how and in what mix the IP (and proxy) based authentication, and the attribute passing approaches, were being applied for licensed content resources, and what attributes were being passed. In June-July 2016, my own organization, the Coalition for Networked Information (CNI), conducted a survey of its academic members to gather some actual data about this [40]. Briefly, this survey strongly suggests that attribute passing to external content resource providers is in extremely limited use indeed, and that most access is handled by proxies and IP-based authentication. In cases where attributes are released, there are a large variety of practices about what attributes are being released, and whether personally identifiable attributes are passed. Incidentally, there are also only rather limited efforts being made to contractually control data collection, retention, and reuse.

Note that DRM technology, such as that found in the mass-market publishing world or (now, thankfully, almost entirely historically [41]) in digital music markets, never gained much of a foothold in the world of scholarly journals. The licenses for scholarly journals are normally for unlimited concurrent use by a given institutional user community and also permit downloading of unencumbered PDFs (though not necessarily in bulk).

Here, licensing by the library has provided a fairly good outcome in terms of reader privacy; even if the publishers or platform operators log actions, they cannot routinely associate them with individuals. One might like to see more widespread routine Secure Sockets Layer/Transport Layer Security (SSL/TLS) encryption between readers and publishers/platform operators to minimize public disclosure of reader behavior via third-party eavesdropping; routine publisher best practices and university-publisher license agreements should at least include provisions for HTTPS as well as HTTP access to publisher sites. Relatively standardized and widely used contractual language about data collection, retention, and reuse would be desirable.

There are, of course, pathways to short-circuit these barriers to propagating user identity to the publisher, but perhaps they are mostly the result of explicit user choices, such as signing in to get additional services or personalization, or allowing cookies or other persistent tokens to be set on the user’s machine (though this method is perhaps less transparent and hence less explicit to most users).

In fact, there are three predominant methods for re-identifying users at various levels: the first and best known is through active identification, via cookies or other identifiers, as I have just discussed; a second is passive identification, via browser fingerprinting; the third is via third-party data sharing agreements and trackers [42]. In the first instance, it is clear (and quickly verified) that cookies are being used; as for passive identification, there is no easy way to know about fingerprinting techniques that are being deployed. Finally, regarding the third strategy, we know nothing about the data that’s moved between platform providers and third-party trackers, neither about the agreements, nor about how or where the data is moved. We do know, however, that these third parties are abundant in watching interactions with scholarly publishers [43].
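To indicate why passive identification is hard for a reader to detect, here is a deliberately simplified Python sketch of fingerprinting. It assumes the site can observe a handful of browser characteristics (the attribute names below are hypothetical) and simply hashes them; techniques deployed in practice draw on many more signals, and nothing here describes any particular publisher’s practice.

```python
import hashlib


def fingerprint(attributes):
    """Combine observed browser characteristics into a single stable identifier."""
    # Sort keys so the same characteristics always produce the same value.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Nothing is stored on the reader’s machine: the same combination of, for instance, user agent string, language preference, screen size, and time zone yields the same value on every visit, so clearing cookies does not change it.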

And there are other gaps: for example, site licensed commercial resources that are part of university collections but whose primary market is not the higher education community, such as the New York Times or the Financial Times; these kinds of information providers are trying to force users to set up personal accounts and self-identify, in addition to making heavy use of other techniques. The dependence on advertising revenue for these publications makes this strategy attractive (even though they are also collecting site license fees from the host research libraries).

Almost all publishers invite readers to create IDs or profiles, often linked to e-mail accounts, so that they can obtain recommendations, receive tables of contents for new issues of journals as they appear, post comments on articles, and gain access to a variety of other services; these choices create cookies or other persistent data. This direct publisher-reader relationship bypasses the library-brokered relationship, or, perhaps more correctly, short-circuits it. For convenience, cookies are placed in browsers so that returning users are automatically recognized. Many readers seem to want these amenities, and short-circuit the anonymity that the library-provided access offers. Often readers seem comfortable trusting the publisher with their tracking data, and they presume that the publisher will use that data in reasonable and responsible ways, and indeed with results that are useful to the reader; additionally, there’s a sense that the risk to readers (and perhaps the reader’s host institutions) is minimal. (There are exceptions: intelligence community users [44], or researchers involved in research with commercial or competitive implications, for example.) Publisher sites usually have privacy policies, though a reading of a sampling of these suggests that they explicitly recognize not only re-identification of users, and all that that implies, but also participation in third-party data collection and interchange. It would be very interesting to know the proportion of the visitors to major scholarly publisher Web sites that are individually identifiable by the site via these short-circuits or by other means; this information would help provide a view into user behavior. Publishers have this data, but I am not aware of any that have made it public. It is also interesting that in the 2016 CNI survey mentioned above, we found virtually no attempts to address this kind of re-identification of users and what is subsequently done with the data. 
It’s important to recognize that questions about re-identification are essentially orthogonal to, and decoupled from, questions about attribute sharing (though they are certainly potentially in-scope for contractual negotiation).

Another perspective on the privacy calculus is equally important here. I’ve discussed how a university library might seek to protect patron privacy by anonymizing authorization and controlling attribute release. The idea here is to limit what the publisher (or access platform provider) can know about readers. But libraries making choices about resource allocation and justifying licensing decisions also need to know about who is reading what, so that they can demonstrate value in undergraduate education or capturing increased research grant funding, enlist individual departments or professional schools to help support and even subsidize these licenses, and perhaps, speculatively, even to make local authors happy by providing audience information. So libraries are motivated to control reader identity information flowing to publishers but perhaps at the same time want as much information as possible about the behavior of these readers back from these same publishers or platform providers [45]. Frameworks like NISO’s SUSHI work (see www.niso.org/workrooms/sushi/faq/general) and Project Counter (www.projectcounter.org) provide reporting intersection points between what’s being read and who’s doing the reading, but it’s clear that at least some publishers are willing to do additional de-aggregation on one or both dimensions. Further, libraries are starting to realize that if they are willing to integrate information from multiple local sources (such as proxy access logs and wireless access points) they can gain an immense amount of detailed information about users and their access patterns [46]; fortunately the sources for this information integration aren’t readily available to third parties (publishers, government agencies, commercial trackers) except under some fairly paranoid scenarios.

2.3.2. Identity management: Public libraries and mass-market e-books

The story of reader privacy in the context of university-publisher relationships for scholarly journals is primarily a positive one; other than some serious contention about prices and profit margins, there’s actually a strong (though frequently unacknowledged), well-aligned set of fundamental values that have made discussions and joint library-publisher action on issues like privacy and provisions for credible preservation of digital journals quite productive.

Contrast this to the discussions between major commercial publishers and public libraries [47]. The opening proposition from the publishers on this discussion was that libraries have always been thieves, stealing sales from publishers by circulating books [48], and finally, at long last, as the world transitions to e-books, to licensing in lieu of purchase, to DRM controls, publishers can regain control of the situation and find justice. Thus, several major publishers variously took positions that they would simply not make e-books available to libraries under any conditions; that they would price library e-books much higher than those in the commercial market; that they would make library e-books self-destruct or expire after a certain number of uses or a certain period of time, etc. [49] Over a period of years and under public relations pressure that included a major campaign by the leadership of the American Library Association from around 2012 to 2015, the major trade publishers moderated their positions a bit, from not making e-books available at all to agreeing to make them available under often economically punitive conditions. The compromises demanded from the libraries went beyond economic terms to conditions on delivery and use environments: e-books in the library environment would come with tight and cumbersome DRM constraints, minimal and hard to verify reader privacy, and very awkward connections to the ecosystem of e-book reading that was developing in the consumer marketplace. (I will omit a discussion of other problems with the publisher positions, most notably the refusal to recognize and honor the role of libraries collectively in preserving the cultural record, which will ultimately wreak great damage on our society, as these are peripheral to the issues under consideration here.)

After a considerable amount of public shaming and pleading for content on any basis, no matter how punitive, more (but far from all) of the mass-market books are starting to arrive in public libraries in greater quantity. The compromises that public libraries have had to accept to obtain these materials are horrible. But the public libraries often felt themselves to be under extreme pressure to demonstrate continuing relevance and responsiveness to their user communities by providing circulating e-books, to challenge a deliberate campaign to marginalize public libraries, cast them as obsolete institutions, and ultimately defund them.

It is also vital to note that while considerable progress was made with the major traditional publishers, there are many, many smaller additional publishers who were not part of the discussion and still make little place for libraries. Also, critically, Amazon (and Apple, a much smaller player) made very little accommodation.

I believe that libraries (and readers) need to take stock of the technical delivery mechanisms and licensing agreement provisions for reader privacy here. It’s not reader-friendly, and it’s opaque. Public libraries have focused on price negotiations in light of a certain sense of genuine desperation about being able to offer anything to patrons; other terms and conditions, such as privacy protections, have generally received much lower priority. Without going into a lot of (constantly changing) detail, it seems clear that OverDrive, the predominant provider in this area, is leaking a great deal of information, and it’s not clear what information it’s collecting [50]. The reader privacy problems with the Adobe Digital Editions (DE) platform (used in conjunction with OverDrive, though it is really a quite separate problem) have been well publicized recently (and already discussed in detail; see note [28]). Amazon’s Kindle environment, because of its increasingly dominant market share and relatively good ease of use in the consumer market, has been able to set its own terms for technical integration with library-circulated e-books and terms and conditions regarding privacy.

While I am hopeful that over time the technical integration of reading platforms and circulating e-books will improve, as long as publishers insist on aggressive DRM, it seems likely that privacy will be very weak and unverifiable in the sense that publishers and perhaps others along the delivery chain will know a lot about reader choices and behavior, and we’ll have to take their word for the specifics and for the terms and conditions under which they will sell, share or otherwise disclose this information. Protecting this information against eavesdroppers while in transit is an easier technical problem, and one that is easier to externally evaluate; this will definitely improve over time if readers and libraries continue to press the issue. We should also remember that it’s been demonstrated recently, again and again, how difficult it is to protect confidential corporate or other institutional information about anything: products, customers, or anything else. It’s only a matter of time (and probably not long) before we see a major data breach dealing with customer purchasing and use of e-books.

As a final note here, it would seem that from a governmental perspective, the level of interest in tracking reading of mass-market books is actually pretty low. Almost by definition, reading of mass-market books isn’t very dangerous, and valuable only in the occasional really heavy handed purge, or for embarrassing people, or for helping to identify potential targets of interest; and, if the need arises, there are plenty of commercial data sources that can provide data, for a fee. These days, records of access to more esoteric information (scholarly journal articles on synthetic biology, epidemiology and the genetic evolution of infectious diseases, or nuclear weapons engineering; or Web sites or YouTube videos glorifying jihad and beheading) are probably of much greater and more urgent interest to the state than your reading of Fifty Shades of Grey or even older classics like Mein Kampf or the Communist Manifesto; a lot of people look at the Quran for one reason or another. But from a commercial basis (including private or politically-driven investigations), or from the perspective of readers embedded in a very controlling or restricting social environment (for example, adolescents dealing with over-controlling parents, religious institutions, local communities or schools), there’s plenty of interest in digging into public library circulation records and individual reading behaviors. Plus, of course, it helps to target advertising more precisely, and thus to generate profits.

2.4. Reactions to reading and a brave new world

Publicized reactions to reading in the form of book reviews have been around for centuries [51], but in the last 20 years they have become highly democratized. Book reviews in traditional mass media, such as newspapers, have become the exception rather than commonplace as many papers have dropped or outsourced book reviews overall, or eliminated Sunday book review sections. But commercial retailers/aggregators such as Amazon, as well as Web sites like Goodreads and innumerable bloggers, have greatly democratized the ability to write and share book reviews. Perhaps most importantly Amazon and Apple make reviews widely visible as part of the book (and other content) selection and acquisition process, as well as abstracting and dumbing down these activities into summarized star ratings. There’s a lot of evidence that these democratized reviews and comments matter, at least for mass-market books, including the efforts that authors and publishers are making to solicit and engage them, and the pleas that authors regularly include in their e-books for favorable reviews. It is much less clear what effects experiments in scholarly publishing with open peer review, post-publication peer review and commentary, and related activities, are actually having on choices readers make about their consumption of articles.

Reading pathways (which I’ll subsequently use as a generic term for data on how something is read in a monitored environment, along with “reading trails”) also hint at reader response. Rapid page turning, very lengthy periods of engagement with an e-book (to the extent of reading an e-book in one long sitting), or abandoning a book after the third chapter, not to return for many months (if ever): all these suggest reader reactions to what is being read, though these actions need to be interpreted through many assumptions and guesses. Some of these interpretations become less speculative in the presence of “big data,” of many reader pathways documenting encounters with a work, and the statistical correlation of these pathways with other external indicators (reviews, rankings, ongoing purchase histories or engagement with other works).

The future will likely get creepier. The “Internet of Things” — perhaps more descriptively termed the “Internet of Insecure Things that We Don’t Control” — is moving into our homes and our lives very rapidly, whether we want or need it or not. Consider for a moment the current convergence that is being promoted between personal communications devices (such as smart phones) and personal health sensors (Apple Watch, Fitbit, etc.). Apple is apparently a great proponent of this vision. Under what circumstances might we see sensor data connected to the reading experience? [52] Can we imagine e-book reading appliances incorporating or networking with health monitoring sensors (think particularly here about the elderly or infirm who are often avid readers, frequently exploiting adaptive technical affordances such as print enlargement or spoken text, and who are also the target of very detailed tracking of various vital signs)? What sorts of incentives might authors or publishers offer for an “instrumented” reading trail from platform providers in these contexts? Might we see caregivers censor material available to these readers because it upsets or over-excites them?

We are seeing suggestions of how computers (perhaps embedded in other objects, such as robotic pets, dolls, etc.) will help to both engage and care for the lonely elderly in a society that does not have enough human caregivers. Imagine an application that’s part of an e-book reading environment that talks to the reader about how he or she liked the book, and runs voice stress analysis on the conversation as an extra bonus, or perhaps correlates to medical sensor telemetry. This is yet another stream of data surrounding reading experiences. And the elderly are not alone: for children we have a new generation of “smart toys” like the notorious “Hello Barbie” coming into the marketplace [53]. How long will it be before a plush toy offers to read to children interactively? Note also the recent emergence of various general-purpose, voice-activated devices for the home (Amazon Echo, Google Home, etc.) [54], voice-driven “assistants” like Apple’s Siri or Microsoft’s Cortana, and specialized voice-controlled devices like the Samsung television offering, which provide yet one more pathway for surveillance and function as controllers to various forms of media (music, video, but not yet audio-books, apparently) [55], and the recent interest in using transmissions at inaudible frequencies to let independent devices (or sites, or applications using those devices) handshake and realize that they are both monitoring the same user in different contexts. Also worth noting is the recent settlement concerning sex toy data collection [56].

Some forms of genre fiction are famous for lavish, clichéd advertising rhetoric that describes “pulse-pounding” action and suspense, “steamy” romance and sex, heartbreak; books that keep you up all night, hold you on the edge of your chair, make you jump at shadows. We are perhaps near to the time when it’s feasible to measure and document these effects, and correlate them very closely to the reading of texts (or the viewing of videos). There are numerous parties that will have an interest in this kind of documentation of reading or viewing experiences at scale and in the wild, rather than in, at best, a few individuals wired up to sensors as part of an explicitly organized focus group.

2.5. Collecting contextual data on reading

In the 1980s, Marvin Minsky famously said, “Can you imagine that they used to have libraries where the books didn’t talk to each other?” [57] Clearly, data about what other books you own, and what books keep company with other books, are of great interest; the first implies that the reader is identifiable, and the latter is about co-occurrence and does not require specific individual identification of the reader. Understanding the nature and composition of personal digital collections or digital libraries, and indeed the ability to do various kinds of computation over their contents and/or the metadata characterizing them, is very powerful. Recognize that both dedicated (e.g., Kindles, Nooks) and shared multi-purpose reading platforms (e.g., tablets, iPhones, computers) can be busy, complicated and crowded places [58], where content can actually be added from multiple sources; I am not confident that most users of these devices have firm control over how effectively and aggressively each individual application is sandboxed and constrained. Note that one of the key distinctions here is that this information transcends the purchase histories that are maintained by individual content suppliers, and indeed at least potentially captures not only “published” content but various kinds of private or semi-private materials. The ability to propagate this information back to various constituent content providers contributing to a personal digital library is very attractive indeed, at least to some commercial players and perhaps to some readers.

Content providers are not the only interested parties here. Readers also want to better understand the characteristics of the personal digital libraries that they have amassed and curated over the years, and what that might tell them about what they know, the evolution of their intellectual interests, and unknown materials that may be of interest to them. LibraryThing, for example, has made a business of providing a platform for people to document personal collections and to do various kinds of comparisons and sharing with others.

The role of new environments for sharing, discovering and reading scholarly literature such as Mendeley (now owned by Elsevier), academia.edu (a for-profit company despite its misleading name [59]), ResearchGate, SlideShare or figshare, is also going to be a fascinating situation to study. Not only do these environments promote themselves as platforms for sharing materials and places to build personal digital libraries for readers, but they are increasingly recognizing that they can capture (and potentially monetize) unique system-specific reader data (now being promoted as “altmetrics”) within these “walled gardens” [60].

In the 2014 Adobe DE debacle discussed earlier, it is interesting to note that allegations (to the best of my knowledge never fully substantiated) abounded that the software was actually enumerating a catalog of all e-books on the user’s reading device (from whatever source, including public domain repositories) and reporting this catalog back to Adobe’s central servers. I believe this is an issue that will re-emerge; there are already interesting echoes in, for example, Apple’s efforts to match local music collections on a personal computer with holdings in their cloud services under the dual justifications of saving local storage and improving quality of the sound files.

 

++++++++++

3. Exploiting data

Thus far we have looked at the current state and prospects of data availability and collection, mostly by the organizations in the most privileged positions: platform providers, and sometimes publishers (most commonly when simultaneously playing roles as platform providers, though we have also looked at the peculiar case of distributed knowledge in institutional site licensing scenarios). We’ve explored what these players can know. Now we need to look at the organizations these primary data collectors might share data with, under what terms, and, most importantly, what actual use the various parties can make of the different kinds of data in play within the ecosystem.

It’s worth enumerating the interested parties: there are the platform providers themselves (the Elsevier Scopus and ScienceDirect systems, various publisher Web sites, Amazon, Apple iBooks, etc.). There are the authors. There are readers. There are (usually more traditional) publishers that sit between platforms and authors, and work with authors. There are libraries that connect platforms and readers, and work with readers (but also with authors); in the case of research libraries, there is also typically the parent university, which is not necessarily aligned with the library due to various additional competing interests (risk management, student analytics, fund raising, etc.). And there are various third parties that provide marketing and business intelligence to the players in the ecosystem: market research firms, collectors of metrics of research impact, and the like. A second and very shadowy group of third parties collects general information on consumers, and it’s very unclear what information they can get, and how they integrate this information about reading interests in various contexts into the constant, omnipresent, broad-based profiling, dossier building, and reselling of information about consumers and their broad behavior (for example, making inferences about possible travel destination interests, financial challenges, medical conditions [61], charitable causes, etc., based on what they can learn about reading behavior [62]). These third parties are, of course, often highly interconnected with various advertising delivery networks; in some cases they are one and the same. The relationships between publishers, platforms, and these third-party information collectors are also totally opaque; it’s growing increasingly clear that this third-party tracking data is being widely integrated, and that feedback circuits abound [63].
And, of course, there are government instrumentalities that may demand data or simply steal available data anywhere in the ecosystem for whatever purposes they may deem justifiable (and we should never forget that not only can they obtain access to it, but national or foreign government instrumentalities can also fabricate such data). The distinctions between the commercial third parties and governmental players get very hazy at times, as it is very clear that governmental agencies regularly exploit this kind of commercial data. Also, commercial players exercise control of platforms (or complexes of interconnected and cooperating platforms) that function as effective monopolies, and thus further control the marketplace through their comprehensiveness.

In fact, the platform providers face a genuine dilemma: they have too much data. If they share all they have with authors (or publishers, or librarians for that matter), their readers will likely rebel. Similarly, it is possible that they can offer readers more data than they need or want. Nobody knows where the acceptable boundaries lie in various markets today, or how stable these boundaries are from one genre to another. The degree of overlap between readers and authors (or authors-in-training, also known as graduate students) in various marketplaces is an additional important factor that we need to understand. The platform providers can exploit the data at their disposal internally to the (hopeful) benefit of both readers and authors but they have to be very careful about how to do this without making everyone uncomfortable [64].

In the consumer world there are many more readers than authors, and those readers typically stand at a distance from the authors they read. In contrast, the academic world often features a high overlap and degree of interaction between the community of authors and the community of readers. In this latter case, it’s clear that some of the major industry players are viewing their dual role as publishers and platform providers as a genuine strategic advantage and opportunity both in attracting authors and adding value for readers. If readers effectively access papers as identifiable individuals (as opposed to being shielded by the relative anonymity of an institutional library subscription), then the service operator/publisher knows exactly who they are, and perhaps also contextual information (such as what they have themselves published, how they got to the paper they are reading, how frequently they read papers, etc.); they may even get a reading trail if it’s read online. As scholarly identity and professional biography get linked closely to bibliography, this passes a lot of information. A great deal of infrastructure, such as the various standards for name identifiers and the related databases (ORCID, ISNI, and the like), and software platforms for sharing research events such as publications (VIVO, the various Current Research Information Systems [CRIS] that are emerging, SHARE, etc.), is rapidly enabling progress in this area, which can provide the author with more information about his or her readers. Very large publishers like Elsevier are already exploring how to exploit these positions [65].

To the extent that libraries want to function as publishers, or to integrate the functions of university presses, or that scholarly societies want to compete with commercial journal publishers, they need to consider the extent to which their values and operational constraints (such as institutional policies or even state laws governing public institutions), as related to both authors and to readers, will allow them to provide competitive levels of information about readers to their prospective authors. If they take conservative or perhaps principled positions here, will this represent a competitive disadvantage? This will create some very interesting tensions with the traditional library policy positions on privacy.

The issues aren’t just limited to the academic world, with its characteristic high author/reader overlap and social interconnection. Consider the recent controversy where Amazon, a platform provider with fairly extensive information about authors and readers (but perhaps more limited knowledge of the relationships between the two than one might find on an academic platform), has tried to police the submission of book reviews in cases where the reviewer knows the author, claiming this protects the integrity of the reviews. The specific details of what Amazon is doing seem to be very sparse, but this is a case study that demands some careful consideration [66]. For those familiar with the academic publishing world, parallels with evergreen problems about crony networks of reviewers and even fake reviewer identities exploited by authors to undermine the peer review process should be evident.

While it’s commonplace to talk about reader interests in the context of privacy, of who else is using data about them, we should recognize that readers, increasingly, also will have an interest in access to their own data and the ability to easily move that data from system to system in forms that are easily re-used and re-combined. The most trivial cases are simply having a record of one’s own reading, or one’s purchase history. But there are potentially more ambitious goals when these histories, collectively, are viewed as representations of key elements in the digital records of people’s intellectual lives: a lifetime of personal reading history most probably of value in the aggregate, rather than as a series of isolated records of transactions, and perhaps most interesting when connected to other personal data such as calendars or diaries. Norms, best practices, or even possibilities here are still in their infancies, though this will be hugely important to future scholars [67]. Such records are of value also for various kinds of personalization, or as data that personal search agents might exploit. The documentation of the evolution and scope of personal libraries has long concerned archivists, biographers and other scholars [68]; as we move towards a future where personal libraries are increasingly digital in nature these questions of not just transactional reading activities but documenting the broader collecting and curation context will, I think, become increasingly important to individuals, to archives that may acquire and care for their records, and to scholars and biographers studying them [69].

3.1. Pathways to payoffs

Roughly speaking, we can segment the opportunities involved in the exploitation of all the various different types of data related to reading into three distinct sub-areas: understanding the audience for existing content; revising content to be more effective; and making choices about investments in new content (predictive analytics).

There is a high level of interest in understanding audience and audience reaction for existing content (sometimes in actionable ways, and perhaps in correlating these behaviors with other data about the members of the audience). This is the realm of intelligence-gathering or market research about an author’s readers, re-orienting advertising campaigns, gathering analytics of various sorts, and system tuning; often aggregated data in this area is good enough for user experience/user interface optimization, for example. Geographic aggregation may be helpful in planning book tours, readings, and promotional campaigns, but it is perhaps of more interest to publishers than to authors [70]. In some cases specifics really matter: a candidate for a faculty position may be very interested in exactly who is reading his or her work, for example, though there’s no thinking about how this work might be revised (other than the context in which it might be presented, such as a document describing such a faculty candidate’s evolving research agenda). The special case of textbooks is discussed in more detail below.

Of course authors, publishers and platform providers all want to build up contact lists and communities of readers for future marketing purposes, and to own and control these relationships. It’s even better if they can tell if the reader actually liked the work and might be interested in participating in such an ongoing relationship.

A second area is using this data to guide revisions to actual existing content. This strategy might include an author changing his or her work in response to reader behavior; focus group previews, test readers, and director’s cuts in audio or video products are good examples of where revising practices are well established. There is a real difference, however, in both the amount of detailed data potentially available, and the opportunity to sample behavior across a much broader group of people; conducting controlled preview experiments with volunteer test audiences (who may explicitly waive various forms of privacy, and who may or may not reflect real audience or potential audience demographics [71]) is quite distinct from actual data collection based on real mass audiences, in the real world, following commercial release. And the lines between the two are shifting.

The importance of data to revise works is very closely tied to genre culture and practices, which in turn are closely connected to genre economics (movies provide a really good contrast to mass-market novels here). Obviously, it needs to be technically and economically feasible to make revisions and re-issue a work. There are questions of agency and responsibility (a work produced by one or two authors with the help of an editor, as opposed to the huge teams that produce major movies, with all the debates about creative control, who selects and authorizes the final cuts that are distributed for various channels, for example). There are issues about the level of investment to produce a work of a given sort and the need to ensure a good reception for it, and of the work’s purpose: artistic or scholarly success, economic return, etc. Authors may prize the fixity and closure of a completed work, or they may relish the ability to revise and update at will.

For textbooks, it’s common to do this every few years, both to gratuitously obsolete older editions (which may be eating into publisher sales through growing availability on the used book market) to preserve profits [72], and, hopefully, to improve the text itself by reflecting both experience teaching from it and new knowledge in the field where relevant. For journal articles, post-publication revision is almost unheard of (except for scandals such as retractions of articles, or supplements to articles noting errata), though changes due to preprint servers and shifts in culture and distribution are starting to reconfigure this landscape [73]. For scholarly monographs, it’s currently rare (usually only in a second or later edition) because the economics don’t support it. Overall, the new landscape of scholarly communications that is developing looks like it may well be more accommodating for author revision cycles.

For traditional mass-market works, revision and re-publication has also been rare, except in the case of a revised edition when a book moves from hardcover to paperback. Self-published e-books are also changing the assumptions here, as it is easy to put out a version 1.1 or 2.0 of an e-book, and this is not uncommon.

The last area for exploitation is in making choices about the creation and marketing of new content based on the reception and reading experiences surrounding existing content; this is perhaps particularly true for relatively mature and now final-form, existing material. In other words, the information is in effect used to predict the reception of future works. Here, successes and failures, however defined (economically, artistically, intellectually, etc.), shape choices about investments (including author time) in creating and disseminating new content, and in selections about which articles or monographs to accept for publication. Authors and publishers often care about what sells, and what’s likely to sell in the future. Exactly how this issue changes from one market sector to another within publishing is an interesting and poorly understood question.

One question we don’t seem to have engaged very well, given these various uses of data, is what authors actually want, on a practical basis, in various sectors of publishing, and how much they care: what information actually really matters, as opposed to being merely “interesting.” For academics, will reader data availability shift the balance from one journal to another as the preferred destination for a paper? To what extent will a mass-market author trade royalties for data? We don’t seem to know very much about these questions.

Picture yourself as an author, academic or mass-market: there’s a publisher-provided dashboard in front of you showing real-time displays of download numbers, page turns, geographic and demographic characterizations of readership for your book or article, perhaps even lists of purchasers and/or readers that you can click on to find out their names, when they read your work, what else they have read recently, what they do, etc. Do you want to publish with the publisher that can provide the best dashboard? How much do you care about the quality, timeliness and comprehensiveness of information? Why do you care? How do you balance the availability of a very rich dashboard against other factors, both those that are relatively traditional (impact factors, prestige of the journal, reputation of the publisher), as well as those that are more pragmatic (policy on open access to publications, speed of review and publication, support for marketplace visibility, promotional investments, royalty arrangements)?

3.1.1. A case study: Tracking student interactions with textbooks in the digital age

While most reader telemetry is of speculative value to authors (at best), there are special cases where the situation is much clearer and the value proposition is more obvious. One such case involves interactions with textbooks. This is a large, profitable, and competitive marketplace at the K-12 and college (particularly undergraduate and community college) levels, and one where the ability to combine information about reading patterns with other data (typically captured and held by the school or university) about the student’s background and preparation, academic performance, class attendance, participation, scores on specific homework or test questions, and interactions with online learning systems of various kinds, creates a powerful tool both for refining subsequent editions of a textbook (and related teaching materials), and as a sales and marketing tool. Note that the real money here is in textbooks and related learning materials that are used in introductory classes taught every year at many, many institutions to very large numbers of students.

A full analysis of this issue moves into a thicket of questions about the rights of students (and their parents), and the appropriate roles for academic institutions and faculty at various levels within the primary, secondary, and post-secondary education systems with regard to education data [74]. These issues also combine with related questions about data coming from online courses, particularly large scale ones such as massive open online courses (MOOCs), adaptive learning systems, institutional learning management systems and other technology-based teaching and learning tools that are becoming increasingly commonplace. Teachers are also interested in whether each student has done the reading before a given class, how much time he or she spent on the problem sets, etc.

The case of data about how textbooks are read, particularly if this data is individually identifiable by reader, is of very clear and compelling financial interest, particularly if it can be connected to or integrated with a growing mass of other data about student outcomes: not just how well they did in a given class, but very detailed information on which specific quiz questions they did well or poorly on; telemetry from MOOCs or other interactive teaching tools; patterns of interaction with learning management systems; background information about student demographics; performance in prior courses, etc. There is a strong interest in improving textbooks in an iterative fashion (now perhaps as part of a more holistic iterative improvement of teaching and learning practices, content, and technological environments) based on experience. Textbooks are routinely and regularly revised, and indeed commercial publishers have financial motivations to do this every couple of years to eliminate profits lost to the resale of used textbooks (though this may be moot as textbooks move to the electronic realm).

I heard a sales presentation a few years ago (and I’m paraphrasing here, in part to protect the guilty) pitched to higher-education faculty. “Imagine, Professor, you walk into your Tuesday morning class and pull up your dashboard when you come into class.” You have your class roster (with pictures for all those students whose names you can’t remember), along with all the other data about how your students are doing; you have information on whether and when they have read today’s assignment, and if so, what parts they have read and how much time they spent reading it. This will help you decide who to call on in discussions, or whether it’s time for an unannounced quiz. I will leave the reader to decide if this is a vision of heaven or hell [75].

In electronic textbooks, the lack of transparency about the collection, use and retention of student data, and what data is being shared outside the institution, the overall legal framework, implicit or explicit coercion of student readers, etc., are all major issues. Should higher education be held to a higher standard than the consumer marketplace, particularly as these higher education institutions adopt institutional positions and strategies about the deployment and adoption of electronic textbooks, learning management systems, and related delivery platforms?

I will simply note in passing that while I am mainly familiar with the higher education framework, K-12 electronic textbooks and learning management systems share most of the same issues. Adoption decisions there are often made at a statewide (or at least school district) level rather than on an institutional basis (and are thus higher-stakes), and far less attention is paid to the rights (of any kind) of the students; at the same time, there is greater parental sensitivity about how student data is used and the risk of ancillary data breaches or re-uses.

3.2. Government interests in the new data about reading and readers

This essay began with a historical discussion of reader privacy and freedom to read in the face of government scrutiny, framed in the context of a conflict of very long standing. But, as we have described, the new trails of reading in the digital environment are being collected and shared primarily within the private sector author-delivery platform/channel-reader ecosystem; though make no mistake, this data is readily available to interested government parties under a wide range of scenarios ranging from outright theft through various kinds of court orders to comfortable commercial licensing arrangements [76]. It’s appropriate to conclude by trying to assess the prospects for the uses of this new data by the state. To the extent that such data exists, it’s of course available for subpoena, seizure, hacking, or sometimes just upon request by various government agencies.

I suspect, however, that this new explosion of more exquisitely detailed data may offer more nuance than the government can be troubled to bother with, at least on a systematic basis for national security purposes, or at least as long as very large scale collection and computational data mining on reading records at the same level as, say, telephone calls, remains infeasible [77]. Though, no doubt, it will be very welcome as a source in individual investigations once people are identified by other means, or in forensic analysis. While authors and publishers are passionately interested in understanding if and how purchased materials are read, most government purposes are happy to regard purchase or downloading as equivalent to actual reading, and to assign suspicion or guilt on that basis; the same would be true for private investigators seeking embarrassing or controversial behavior. We might even see a few more court cases that argue that people should perhaps be judged innocent because they hadn’t actually read things they acquired or downloaded and in fact didn’t even understand the nature of the materials in question, or that challenge the traditional argument that reading habits represent indisputable and conclusive evidence of various crimes. But these are details: so much of what happens in the government sphere is about guilt by association, publicity, or broad behavioral patterns, or efforts to identify targets for other investigations, prosecutions, or persecutions, legal or extralegal. Exploiting nuance may be profitable in the private sector, but government increasingly doesn’t seem to need to bother most of the time. We can view massive surveillance as a genuine public-private partnership today [78].

For the vast majority of the population today, movie and video viewing through sites like YouTube, Web surfing, social media activities and even TV-watching habits (now more accessible than ever with smart cable boxes) reveal a lot more that’s of interest to both government and commercial profiling than old-fashioned reading activities. Further, there’s a sense in which government (at least in places like the United States) is interested today in individual quests for relatively esoteric knowledge, rather than gathering evidence to sort, say, one major religious group from another in preparation for large scale purges and pogroms of various kinds. Or at least I’d like to hope it is.

It’s about beacons, classifiers and differentiators, signatures, finding the “bad guys,” allocating relatively scarce intelligence and surveillance resources, and doing the “panoptic sort” [79] efficiently. While reading has historically had a sacrosanct aura because of the written word’s role in the transmission of both ideas broadly and also religious thought in the form of sacred texts, this is changing as literacy shifts and textual literacy diminishes; the aura seems to be fading rather than translating to more broadly relevant contexts today such as video viewing. Yet textual communication is still the primary mode of communication for science and technology. Debates about consumer and citizen privacy in seeking information (reading) may well shift increasingly, over the next few decades, to a more general conversation about privacy in use of media of all kinds, with a focus on exchanges in settings more ambiguous than formal, open-to-the-world publications (e.g., social media). We are already seeing the focus on tracking and understanding recruiting and radicalization of potential terrorists shifting from textual materials to video sources, social media and related Web sites.

I would suggest that most of the disruption caused by the emerging calculus is contained within the author-reader ecosystem, the commercial and cultural sphere; it is about how comfortable readers are with authors, publishers, and other entities knowing of their behaviors and even identities. Note that there’s been a lot more discussion about opportunities for authors to connect with readers (blogs, Web sites, mailing lists, previews of new works, etc.) than there has been about the ways in which platform providers can broker or shape this relationship, or what various sectors of the reader community for a given work actually want. Much of what’s happening is very speculative: readers may care about what authors think of them or they may not; publishers and platform operators may be mainly interested in just offering good recommendations. Perhaps the most truly novel and interesting thing developing here is the renegotiation of anonymity and knowledge of behavioral data between the reader and the author, with intermediaries along the pathways also playing important roles. There’s a lot of uncertainty about when the right point of departure is opt-in or opt-out in various relationships. We are challenged to move from abstract moral outrage to understanding genuine harm or vulnerability.

 

++++++++++

4. Some closing thoughts

At some point it’s worth asking what readers of various kinds will actually tolerate before the creepiness factor becomes overwhelming and repulsive. Suppose, as a thought experiment, that Amazon said it would share every purchaser’s e-mail address with the author of books they purchased? (Pick your own opt-in or opt-out boundary conditions). How about sharing this information with the book’s publisher? Would a discount on the purchase price or some other reward make most readers more comfortable? What conditions (enforceable or not) might be imposed on the author or the publisher regarding reuse of this purchaser information, and would this make any difference to readers’ comfort levels? What choices would customers make in such situations?

Many mass-market authors, particularly (but far from exclusively) those who are bypassing the traditional publishers for various self-publishing arrangements, are intensively building e-mail lists of devoted readers by other means: invitations to subscribe in the front or back material in their books (e-books or otherwise), which may include free short stories, announcements, preview chapters from new books, and the like. They are also making similar outreach through author-managed websites. What proportion of their readers care enough or are comfortable enough to join these lists, or create other, harder to quantify connections such as Facebook or Google Plus connections with authors, follow author Twitter streams, etc.? There’s the old tradition of author tours, readings, and autographing opportunities in bookstores and other venues like universities; now many of these events are moving to virtual environments, and it’s not clear how the sociology of participation is changing. If publishers or platform operators become complicit in forming these connections, how many readers will want to opt out?

All of this re-negotiation of reader needs must be considered against the backdrop of a huge shift in public expectations about privacy, particularly taking place along generational lines. It’s obvious that privacy is under serious pressure from many angles, though it’s not evident how widespread the public understanding of this transition is. If the position of the U.S. body politic (as opposed to their elected “representatives”) may be ambiguous, other nations are increasingly comfortable with historically astounding levels of government surveillance: consider for example the very recent enactment of the U.K. “Snooper’s Charter.” [80] Recent generations, raised on social media, the Web, mass surveillance, and personalization by both commerce and government, view this world differently. This is a rich and hugely complex topic that is far beyond the scope of this paper, but is critical. One caution: the commonplace notion that the “younger generation” doesn’t care about privacy, or believes that there isn’t any, is clearly (at best) massively oversimplified (see, for example, danah boyd’s work with teenagers and privacy expectations [81]). A very careful and nuanced analysis is needed here.

There are very interesting parallels to what publishers (or journals) rather than authors are trying to do in the scholarly context, offering free articles, early access to selected articles, journal tables of content, notifications about comments on articles, and similar amenities. The next generation of advanced scientific information management and research support environments (which of course interconnect to the literature) emphasize collaboration, personalization, and similar features. Some dream of replacing or augmenting pre-publication peer review in scholarly communication with an ongoing post-publication commentary and discussion; I doubt this will actually happen due to the scarcity of attention, but curiously, it seems to be taking hold for books on Amazon (including self-published genre fiction). It would be very informative to be able to compare behaviors across the consumer and academic marketplaces in this area.

One more counterpoint: some organizations, looking at the landscape of unending data breaches, are realizing that huge internal databases of user behavior are both an asset and a liability; if they are penetrated and these internal databases are stolen and re-sold, or (perhaps worse) publicly divulged, they may face a possibly existential loss of trust from their customer or user base. Older data, of ever-diminishing value, takes on characteristics of toxic waste when retained indefinitely. Perceptions of risk/reward balances are beginning to change, and may shift further away from massive data retention in future. If we don’t know enough about data collection and retention, data disposal is even more opaque.

Finally, remember that knowledge of actual reading activity rather than simply knowing what texts have been accessed or acquired still does not guarantee understanding of the values, beliefs, opinions, or intentions within a given human mind. We can only hope that governments, and commercial data collectors and exploiters, know this as well.

It’s striking how often phrases like “I don’t know,” or observations that extensive data exists but is proprietary, appear in this paper. The new publishing ecosystems are being shaped by an astounding number of ever-changing opaque, proprietary and secretive commercial arrangements and closely held data resources. There are dozens of master’s and doctoral theses, and important research papers just waiting to be written in this area, as well as some timely investigative journalism compiling anecdote and documentation (both on and off the record) to help illuminate some of these marketplace practices.

I have used the twin examples of mass-market e-books and academic journals as case studies. In my view, valuable insights could be gained by similar analyses that can and should be done for other genres and marketplaces, such as video games, music, movies, and the like [82], though I do not pursue the prospects for such efforts here other than to note that major cable companies have been avid collectors of customer data [83] and recent shifts to streaming services create unprecedentedly rich new data collection environments for audio and video content [84]. There are also case studies worth doing of additional genres within the publishing world that I have not covered here. For example, news and newspapers migrating to the digital environment is a very fruitful area to analyze, particularly in light of the important role of advertising revenue, personalization, and the very rapid content creation and consumption cycles.

 

About the author

Clifford Lynch has been the Director of the Coalition for Networked Information (CNI, https://www.cni.org) since July 1997.
E-mail: cliff [at] cni [dot] org

 

Acknowledgements and historical notes

I first thought this would be a short paper that would find its way to completion and publication quickly. The more I’ve explored these issues, however, the more complex it’s become, and, as with so many things involving electronic publishing, the world keeps changing even as I keep revising the paper. I started working on this in very early 2014. It’s now early 2017.

I have benefited greatly from several opportunities to explore these ideas with different audiences, at the Fiesole Collection Development Retreat held in Cambridge, England in April 2014, and at the Buckland-Larsen-Lynch “Friday Seminar” at the University of California, Berkeley’s School of Information on a number of occasions over the past two years, and most recently at the Jisc/CNI joint meeting in Oxford, England in July 2016. I’m deeply grateful to those who participated in these discussions. I’d also like to thank Jean-Gabriel (JG) Bankier of bepress and Jacob (Jake) Hartnell (a participant in the spring 2014 Berkeley seminar) for some very valuable and thought-provoking conversations connected to these topics. Many people made tremendously helpful comments on the draft, notably Don Waters, Elliott Shore, Joan Lippincott, J.-G. Bankier, Michael Buckland, Bernie Riley, David S.H. Rosenthal, Cecilia Preston, and Roger Schoenfeld. Additional thanks to Gary Price for his insightful reporting and advocacy on many issues discussed here, and Vivien Petras of Humboldt University for being so gracious and understanding when this paper outgrew both the scope and timelines involved with a festschrift that she was preparing. Finally a special thanks to Diane Goldenberg-Hart for both her comments and her extensive editorial help as this paper went through seemingly innumerable versions.

 

Notes

1. This is not intended to be a systematic legal review and evaluation, and I am not an attorney. There’s substantial law review literature arguing for a federal right to privacy in reading deriving, at least in part, from constitutional considerations. Places to start would include Bradley Schaufenbuel, “Revisiting reader privacy in the age of the e-book,” 45 J. Marshall Law Review 175 (2011), http://repository.jmls.edu/lawreview/vol45/iss1/8/, in particular p. 179 ff., and, of course, the seminal Julie E. Cohen, “A right to read anonymously: A closer look at ‘copyright management’ in cyberspace,” 28 Conn. Law Review 981 (1996) 1003–19, http://scholarship.law.georgetown.edu/facpub/814/. A more recent study, particularly relevant to the issues covered here, is “Confidentiality and the problem of third parties: Protecting reader privacy in the age of intermediaries,” BJ Ard, Yale Journal of Law and Technology, 16:1 (2014), http://digitalcommons.law.yale.edu/yjolt/vol16/iss1/1/. But there is a critical distinction between the arguments and analyses of legal scholars and actual case law precedents or clear legislation. There is more clarity and stronger rights in some state (as opposed to federal) law, though federal statutes like the USA PATRIOT Act trump state legislation. Note also that, in fact, some recent legal developments suggest that attempting to conceal or destroy records of one’s reading history or other trails of access to information is now itself a very serious federal crime, according to some aggressive prosecutors, based on (of all things) a provision in the Sarbanes-Oxley legislation aimed at cleaning up corporate accounting and auditing fraud. See Juliana DeVries, “You can be prosecuted for clearing your browser history,” Nation (2 June 2015), http://www.thenation.com/article/208593/you-can-be-prosecuted-clearing-your-browser-history.
There isn’t much real case law here, either, and one might hope that the First, Fourth and Fifth Amendments might provide at least some protection. Finally, I’ll note the peculiar status of videotape circulation records (no longer much of an issue), which are protected under federal law as a result of the exposure of proposed Supreme Court Justice Bork’s borrowing records during the Reagan administration.

2. It should be noted that special collections and archives have always kept detailed access records, but they are mediating public access either to rare and valuable material or semi-private materials that may be constrained by various embargo and confidentiality agreements; they have a very different purpose and set of motivation and risk considerations.

3. Notable pre-9/11 milestones here include Kenneth Starr’s attempt to subpoena records from Kramerbooks & Afterwords in Washington, D.C., as part of his investigation of President Bill Clinton in 1998, and the attempts in 2000 by the Drug Enforcement Agency (DEA) and the Denver police to subpoena records from the Tattered Cover Book Store and Borders bookstore for narcotics investigations.

4. A full review of this issue would call for a very lengthy article in its own right, but here are a few illustrations: Charlie Savage, “Obama Administration set to expand sharing of data that NSA intercepts,” New York Times (25 February 2016), http://www.nytimes.com/2016/02/26/us/politics/obama-administration-set-to-expand-sharing-of-data-that-nsa-intercepts.html, and Radley Balko, “Surprise! NSA data will soon routinely be used for domestic policing that has nothing to do with terrorism,” Washington Post (10 March 2016), https://www.washingtonpost.com/news/the-watch/wp/2016/03/10/surprise-nsa-data-will-soon-routinely-be-used-for-domestic-policing-that-has-nothing-to-do-with-terrorism, give some insight into current developments in the United States on a policy basis, though there is also a long series of cases playing out in the courts about the U.S. DEA and other agencies using intercept data and concealing its use. See also Tim Cushing, “Appeals court affirms NSA surveillance can be used to investigate domestic criminal suspects,” Techdirt (19 October 2016), https://www.techdirt.com/articles/20161009/10195135753/appeals-court-affirms-nsa-surveillance-can-be-used-to-investigate-domestic-criminal-suspects.shtml. Abroad, these examples are too good to miss: Paul Farrell, “Lamb chop weight enforcers want warrantless access to Australians’ metadata,” Guardian (18 January 2016), https://www.theguardian.com/world/2016/jan/19/lamb-chop-weight-enforcers-want-warrantless-access-to-australians-metadata and Mike Masnick, “UK councils used massive surveillance powers to spy on ... excessively barking dogs and illegal pigeon feeding,” Techdirt (29 December 2016), https://www.techdirt.com/articles/20161227/23480736353/uk-councils-used-massive-surveillance-powers-to-spy-excessively-barking-dogs-illegal-pigeon-feeding.shtml.

5. The history of the best-seller list is very interesting. Until fairly recently, when many lists shifted to being based on actual large-scale sales data (such as the Wall Street Journal which uses BookScan, which I’ll discuss later), these were derived by phoning around to a small number of sales outlets to gather what was fundamentally anecdotal evidence; even worse, the identity of the sales outlets was carefully concealed to avoid authors and publishers attempting to “game the system.” The New York Times Best Sellers list (see, for example, Michael Korda, Making the list: A cultural history of the American bestseller list 1900–1999, New York: Barnes and Noble, 2001) is the poster child for this latter behavior. There are some very interesting parallels here with what happened to sound recording sales data, where the early Billboard top lists were based on similar anecdotal information from a few radio stations and record stores; later they were converted to make use of the SoundScan lists (based on national sales data in the early 1990s), and this transition totally restructured the perception of the music industry and the public about popular genres and performers. I’ll also return to the question of identifying best-selling e-books later. See Laura Hazard Owen, “Taking on e-book bestsellers” (3 February 2011), http://www.publishingtrends.com/2011/02/taking-on-e-book-bestsellers/, for a snapshot of this situation.

6. See Association of American Publishers, “U.S. publishing industry’s annual survey reveals nearly $28 billion in revenue in 2015” (11 July 2016), http://newsroom.publishers.org/us-publishing-industrys-annual-survey-reveals-nearly-28-billion-in-revenue-in-2015/, but note carefully that these are numbers reported by traditional publishers and count sales rather than the actual amount of reading; they do not include, for example, e-books published through Amazon directly. See, for example, Dan Cohen’s analysis at http://www.dancohen.org/2016/07/12/whats-the-matter-with-ebooks-an-update/ for additional perspective. Note also the tendency to confuse market share as a portion of revenue with the share of the unit book or e-book sales. This is important because in recent years, with the establishment of agency pricing, which allows the publisher rather than the retailer (like Amazon) to control the price of e-books, traditional mainline publishers have been raising the price of e-books to the point that they are often more expensive than print hardcover editions (an obviously bad deal for consumers given the restrictions on electronic books, such as the impossibility of resale and the fact that the publishers have essentially no “copy cost” for digital materials), so, not surprisingly, the (unit) sales of e-books have declined but revenue has stayed stable or gone up. Independently published e-books typically sell for much lower prices, and the unit sales numbers have gone up, though the details of this are complex and a bit obscure. I will have more to say about e-book sales data later.

7. Today, this has all changed. Amazon-published independent authors, for example, get access to BookScan data; other authors can purchase access to this database for reasonable fees. See below for a discussion of the sources and history of this database.

8. This actually goes back, mutatis mutandis, to ancient Rome. See Frank Furedi, “Bookish fools,” aeon (20 October 2016), https://aeon.co/essays/are-book-collectors-real-readers-or-just-cultural-snobs.

9. Ironically, “coffee table” books, with their emphasis on beautiful visual reproductions and attractive artifacts, but also playing a role as symbols of cultural sophistication, seem to continue to persist and indeed to thrive as physical artifacts.

10. There’s a rather pathetic history of individual mass-market publisher efforts to make much more extensive direct connections to readers and to provide reader platforms that give them direct and primary access to reader data; for example HarperCollins has built an e-book reader app for both Android and Apple’s iOS and tried to leverage a collaboration with News Corporation (e.g., Wall Street Journal) to drive traffic to it, but single-publisher platforms seem doomed to fail, particularly in the absence of highly compelling financial benefits for readers (which doesn’t seem to be the case here). It’s interesting to note the argument that publisher insistence on attaching digital rights management (DRM) to their content has been key in creating and protecting the controlling intermediary niche that Apple and particularly Amazon now occupy. Also worth noting here is the work of the reading analytics firm Jellybooks (see www.jellybooks.com and Alexandra Alter and Karl Russell, “Moneyball for book publishers: A detailed look at how we read,” New York Times (14 March 2016), http://www.nytimes.com/2016/03/15/business/media/moneyball-for-book-publishers-for-a-detailed-look-at-how-we-read.html) which provides data to publishers by making instrumented advanced reading copy e-books of their publications available to test readers.

11. Kurt Andrews and Philip M. Napoli, “Changing market information regimes: A case study of the transition to the Bookscan audience measurement in the U.S. book publishing industry,” Journal of Media Economics, volume 19, number 1, pp. 33–54 (2006). This also has a good summary of the impact of the earlier SoundScan system on the music industry.

12. The publicly available data on this is astoundingly limited: I invite readers to inspect the impressively uninformative pages on the Nielsen site, for example. Nielsen seems open to offering private communications from time to time, however; see for example the very informative Getting the best from BookScan, Linda Rosen, December 2015, IBPA independent Web site (www.ibpa-online.org); it’s anyone’s guess whether what’s reported here is totally accurate, or still true.

13. The data here includes the recent Bowker report (see Self-publishing in the United States, 2010–2015, Print and Ebook, http://media.bowker.com/documents/bowker-selfpublishing-report2015.pdf) on self-published books and e-books (only covering those that have ISBNs): they count 727K new titles in 2015 and 600K in 2014. A very major part of these come from independent publishing platforms aimed at Amazon: CreateSpace, Smashwords, Lulu, etc. For speculations on the extent of Amazon and Apple dominance of genre markets in e-publishing, see the periodic reports at www.the-digital-reader.com and www.authorearnings.com. Amazon seems to be establishing increasing control over genre marketplaces, with perhaps three quarters of the market.

14. See “October 2015 — Apple, B&N, Kobo, and Google: A look at the rest of the ebook market,” AuthorEarnings (9 October 2015), http://authorearnings.com/report/october-2015-apple-bn-kobo-and-google-a-look-at-the-rest-of-the-ebook-market. I have not seen more recent figures. Note that if you restrict to the established publishers and ignore self-published independent authors, Amazon’s dominance is considerably diminished.

15. Theodore C. Bergstrom, Paul N. Courant, R. Preston McAfee, and Michael A. Williams, “Evaluating big deal journal bundles,” Proceedings of the National Academy of Sciences of the United States of America, volume 111, number 26 (1 July 2014), pp. 9,425–9,430, http://www.pnas.org/content/early/2014/06/11/1403006111.

16. See Jordan Ellenberg, “The summer’s most unread book is ...,” Wall Street Journal (3 July 2014), http://www.wsj.com/articles/the-summers-most-unread-book-is-1404417569; see also Francine Prose, “They’re watching you read,” New York Review of Books (13 January 2015), http://www.nybooks.com/daily/2015/01/13/reading-whos-watching/.

17. See Peter Wayner, “What if authors were paid every time someone turned a page?” Atlantic (20 June 2015), http://www.theatlantic.com/business/archive/2015/06/amazon-publishing-authors-payment-writing/396269/.

18. Ramona Broussard and Philip Doty, “Towards an understanding of fiction and information behavior,” Proceedings of the 79th ASIS&T Annual Meeting: Creating Knowledge, Enhancing Lives through Information & Technology (ASIST ’16, 14–18 October 2016), Copenhagen, Denmark (article number 66), American Society for Information Science, Silver Spring, Md., 2016, http://dx.doi.org/10.1002/pra2.2016.14505301066.

19. Pavel Braslavski, Valery Likhosherstov, Vivien Petras, and Maria Gäde, “Large-scale log analysis of digital reading,” Proceedings of the 79th ASIS&T Annual Meeting: Creating Knowledge, Enhancing Lives through Information & Technology (ASIST ’16, 14–18 October 2016), Copenhagen, Denmark (article number 44), American Society for Information Science, Silver Spring, Md., 2016, http://dx.doi.org/10.1002/pra2.2016.14505301044.

20. There are, certainly, open source PDF viewers. Recently there has been considerable controversy about the collisions between DRM technologies, particularly those embedded in the W3C HTML5 standard (which are inherently not open source) and open source software, which raise interesting tensions here. One also hears interesting discussion about the existence of very large archives of very international (anonymized) reader behavior from Bluefire, a provider of “white-label” e-book reading software for many channels.

21. Mark Ware and Michael Mabe, The STM report: An overview of scientific and scholarly journal publishing, Fourth Edition, March 2015, p. 52, http://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf.

22. Carol Tenopir, Lorraine Estelle, Wouter Haak, Suzie Allard, Lisa Christian, David Nicholas, Anthony Watkinson, Hazel Woodward, Peter Shepherd, Robert Anderson, and Suzan Ali Saleh, “The secret life of articles: From download metrics to downstream impact,” Proceedings of the Charleston Library Conference (2015), http://dx.doi.org/10.5703/1288284316325.

23. David Nicholas and David Clark, “‘Reading’ in the digital environment,” Learned Publishing, volume 25, number 2, pp. 93–98 (April 2012), http://ciber-research.eu/download/20120328-Reading_in_the_digital_environment.pdf.

24. See Patrick OBrien, Kenning Arlitsch, Leila Sterman, Jeff Mixter, Jonathan Wheeler, and Susan Borda, “Undercounting file downloads from institutional repositories,” Journal of Library Administration, volume 56, number 7, pp. 854–874 (2016), http://dx.doi.org/10.1080/01930826.2016.1216224, and Patrick OBrien, Kenning Arlitsch, Jeff Mixter, Jonathan Wheeler, and Leila Belle Sterman, “RAMP: The Repository Analytics and Metrics Portal: A prototype Web service that accurately counts item downloads from institutional repositories,” Library Hi Tech, volume 35, number 1 (2017), pp. 144–158, http://dx.doi.org/10.1108/LHT-11-2016-0122.

25. Clifford A. Lynch, “The shape of the scientific article in the developing cyberinfrastructure,” CT Watch, volume 3, number 3 (August 2007), pp. 5–11, http://www.ctwatch.org/quarterly/articles/2007/08/the-shape-of-the-scientific-article-in-the-developing-cyberinfrastructure/. See also the ongoing work of the Force 11 “Beyond the PDF” programs (https://www.force11.org/meetings/beyond-pdf-2).

26. For a comprehensive discussion of the efforts to measure scholarly impact, see Yves Gingras, Bibliometrics and research evaluation: Uses and abuses, Cambridge, Mass.: MIT Press, 2016.

27. There is actually an important distinction here. On the one hand, the government can demand existing private data, like logs or unpurged circulation records; this is the framework of subpoenas and to some extent National Security Letters. But there are also parallel legal provisions that allow the government to demand that engineering changes and special “backdoors” be made to systems to capture additional data or to allow real-time monitoring of users (the tradition here is of wiretapping and related approaches); these tendencies are also reflected in legislation like the Communications Assistance for Law Enforcement Act (CALEA) and certain provisions of the Patriot Act. In recent times, as content and platform have converged, particularly in large social media environments like Twitter or Facebook, we see both approaches being applied in parallel.

28. In mid-2014, we had a very interesting case study in the scandal over security and privacy problems in a release of an Adobe software product called Digital Edition (DE); see Gary Price, “New and old: Serious reader privacy concerns both inside and outside the library,” INFOdocket (7 October 2014), http://www.infodocket.com/2014/10/07/new-and-old-serious-reader-privacy-concerns-both-inside-and-outside-the-library/, and April Glaser and Alison Macrina, “Librarians are dedicated to user privacy. The tech they have to use is not,” Slate (20 October 2014), http://www.slate.com/blogs/future_tense/2014/10/20/adobe_s_digital_editions_e_book_software_and_library_patron_privacy.html. Briefly, Adobe DE is used as the local DRM enforcement tool for public library patrons who want to download e-books to most platforms other than Kindles; OverDrive, the main e-book delivery system used by public libraries, relies on it. In a genuinely impressive failure of programming and release quality control, Adobe shipped a new version of this software that reported information back to central Adobe servers in the clear, so that anyone who could monitor these transmissions could see what information was being reported about, not only an individual’s reading habits, but apparently also about digital e-book collections on their reading devices. People were aghast at the amount of information being reported, as well as the failure to protect this information from third-party inspection by neglecting to encrypt the channel back to Adobe’s servers. After some denial and a good deal of confusion, Adobe responded to the situation in two ways. One was to release a new version of the software that encrypted transmission back to Adobe; the good news here is that reader privacy is protected against eavesdroppers, though perhaps not from third parties that license the data from Adobe, but the bad news is that we now have no idea what information the DE software is passing back to Adobe central. 
The other response was a few stunningly pusillanimous press releases that argued, in effect, that they were only collecting data in support of the DRM and licensing arrangements that support any sort of business model they could imagine a potential publisher might possibly want to employ, without providing much more detail about retention, resale, or other issues related to this data collection. This situation highlights a number of problems; as one example, it would be very desirable to have an audit mechanism to provide some genuine transparency into what data Adobe DE is collecting and transmitting back to Adobe on an ongoing basis. The public only gained some transient insight into this issue due to a security failure that has since been (appropriately) patched.

29. I invite the reader to install a browser extension like Ghostery to take a snapshot of what’s actually happening at various sites in your own experience. The trackers seem to be abundant and omnipresent.

30. Historically this has been a monopoly created by the information scientist Eugene Garfield, who founded the Institute for Scientific Information (ISI) in the 1960s and developed first print and then later computer-based citation indices. ISI was sold to Thomson (later Thomson Reuters) in 1992, and then resold to a group of private equity interests in 2016 who operate the enterprise under the umbrella of Clarivate Analytics. In the early 2000s, Elsevier created Scopus, which has developed into a formidable competitor to what’s now called Web of Science; some argue that it is both more extensive in its scientific coverage and better in quality. Other competitors, though somewhat different in character, include Google Scholar and Microsoft Academic Research. Given the growing importance of various quantitative impact factors to evaluating both individual scholar and institutional research impacts, there has recently been a great deal of concern expressed about the lack of reproducibility and the possible shortcomings of the metrics computed on these databases, leading organizations such as Jisc in the U.K. to develop their own open citation databases, though these of course lack the historical depth of Web of Science or even the more recently developed Scopus.

31. See Charles P. Bourne and Trudi Bellardo Hahn, A history of online information services, 1963–1976, Cambridge, Mass: MIT Press, 2003.

32. See Clifford A. Lynch and Michael G. Berger, “The UC MELVYL MEDLINE System: A pilot project for access to journal literature through an online catalog,” Information Technology and Libraries, volume 8, number 4 (1989), pp. 371–383.

33. In Amazon’s early days, before Kindles and e-books, it offered a feature where it would prominently display book titles that were heavily ordered for shipment to specific zip codes. Since universities (or even units within universities like law schools) and large corporate facilities often have their own zip codes, this turned into “what they are reading this week at Intel’s corporate headquarters” or “what they are reading at Harvard Law School.” Such a feature sits at the borderline between advertising and recommendation, and also is of real interest to at least some authors, publishers, and other third parties; needless to say, the feature was quickly discontinued, apparently over corporate secrecy concerns.

34. One illuminating case to consider is a technologically sophisticated person who maintains a personal Web site. To be specific, let’s think of a postdoc or junior faculty member, or perhaps an information technologist, looking for a job. He or she may study the logs from that personal Web site very carefully and gain invaluable intelligence to help their job search. Contrast this to the access tracking data that the same person might get back from their institutional repository about readership, which is trying to balance reader privacy and author interests. From the author’s perspective, the case for a personal Web site with no meaningful privacy policy is compelling.

35. The actual details of the history and timeline here are complex; an excellent reference is Joan Starr, “Libraries and national security: An historical review,” First Monday, volume 9, number 12 (2004), http://dx.doi.org/10.5210/fm.v9i12.1198 or http://firstmonday.org/article/view/1198/1118. It’s worth noting that during World War II the general consensus in the library community seems to have been to sacrifice privacy for victory; in the early post-war period much of the focus seems to have been on the right to read and freedom of speech broadly, as opposed to the right to read privately. It’s really in the late 1980s, largely in response to federal snooping and perhaps building on the broader social distrust of government rooted in the aftermath of Watergate that reader privacy emerges as a major policy agenda item for libraries. Also important to note here, but not really covered in Starr’s survey, is the puzzling relationship between the changes in practices surrounding circulation records and the evolving privacy agenda. Historically, at least in some U.S. research libraries, individual, specific copies of books came with documentation about the history of their readers (in the form of circulation cards with names and dates in the back of the books); in the late 1960s, and on an increasingly more widespread basis throughout the 1970s, libraries replaced these with various kinds of automated circulation management systems involving punch cards and, later, barcode labels (and now radio frequency identification [RFID] tags) on books, along with new and evolving best practices about the retention of data in these computer-based systems. The early systems in U.S. public libraries were mostly protective of privacy, apparently; in addition, in the U.K. a privacy-preserving system called the Browne system was in wide use. 
I am indebted to Michael Buckland for his insights here; it would also be very helpful to see some detailed and rigorous scholarly analysis and literature review on this evolution. Arguably, the driver of changes in the U.S. was the availability of more cost effective technology to manage circulation, though it’s attractive to make a revisionist argument that the transition was based on a new policy agenda about reader privacy. I am indebted to both Jerome McGann and Cecilia Preston for illuminating discussions on aspects of this issue. Finally, it is worth noting that library special collections are a major remaining bastion of record-keeping about reader activities, though arguably with good reasons related to risk-reward tradeoffs.

36. Without going into detail, I want to simply recognize here that patron anonymity is more and more at risk even in consulting physical books within the library (or bookstore): CCTV (and networked video cameras), radio-frequency identification (RFID), cellphone micro tracking of various types, and similar technologies, all create opportunities to identify and track library patrons; some involve balancing library and librarian safety with patron privacy.

37. For more information about some of these projects, see: Library Hi Tech, volume 13, number 4 (1995), most of which is devoted to TULIP; William Y. Arms, Thomas Dopirak, Parviz Dousti, Joseph Rafail, and Arthur W. Wetzel, “The design of the Mercury Electronic Library,” Educom Review, volume 27, number 6 (1992), pp. 38–41; Richard E. Lucier and Peter Brantley, “The Red Sage Project: An experimental digital journal library for the health sciences,” D-Lib Magazine (August 1995), http://www.dlib.org/dlib/august95/lucier/08lucier.html.

38. For additional information on these methods see: R.L. Morgan, Scott Cantor, Steven Carmody, Walter Hoehn, and Ken Klingenstein, “Federated security: The Shibboleth approach,” EDUCAUSE Quarterly, number 4 (2004), pp. 12–17, http://er.educause.edu/~/media/files/article-downloads/eqm0442.pdf, and Shelton Waggener, “Federated identity as an enabler for education,” EDUCAUSE Review (January/February 2015), pp. 58–59, https://er.educause.edu/~/media/files/article-downloads/erm1517.pdf. The major research and higher education federation in the U.S. is InCommon (see www.incommon.org).

39. While there have been many earlier incidents of mass downloading, the recent Sci-Hub scholarly literature pirate site has given this issue new prominence; Joseph Esposito, “Postscript on Sci-Hub: The University Press Edition,” Scholarly Kitchen (20 April 2016), https://scholarlykitchen.sspnet.org/2016/04/20/postscript-on-sci-hub-the-university-press-edition/, provides a good overview of some of the coverage and discussion. The work of Balazs Bodo, readily available via Google query, is also very informative here.

40. Clifford A. Lynch, Report on the CNI Authentication and Authorization Survey 2016 (August 2016), https://www.cni.org/go/report-authentication-survey-2016, and Clifford Lynch, “Report on the 2016 CNI Authentication and Authorization Survey,” EDUCAUSE Review (12 December 2016), http://er.educause.edu/articles/2016/12/report-on-the-2016-cni-authentication-and-authorization-survey.

41. Though DRM is implicitly coming back as streaming subscription services supplant digital “purchase” (e.g., download under license) offerings from sources like Amazon, Apple, Pandora, Spotify, and the like (see Bill Rosenblatt, “The myth of DRM-free music,” Copyright and Technology (31 May 2015), http://copyrightandtechnology.com/2015/05/31/the-myth-of-drm-free-music/, and perhaps more importantly “The myth of DRM-free music, revisited” (16 February 2017), https://copyrightandtechnology.com/2017/02/16/the-myth-of-drm-free-music-revisited/, which suggests that DRM based content has now eclipsed DRM-free downloads).

42. For an overview of the technical mechanisms here see for example Artur Janc and Michal Zalewski, “Technical analysis of client identification mechanisms,” http://www.chromium.org/Home/chromium-security/client-identification-mechanisms. Another recent example would be Yinzhi Cao, Song Li, and Erik Wijmans, “(Cross-)browser fingerprinting via OS and hardware level features,” NDSS Symposium 2017, https://www.internetsociety.org/sites/default/files/ndss2017_02B-3_Cao_paper.pdf. For a sense of the actual deployment of these technologies, see Steven Englehardt and Arvind Narayanan, “Online tracking: A 1-million-site measurement and analysis,” https://webtransparency.cs.princeton.edu/webcensus/. Eric Hellman, on his blog www.go-to-hellman.blogspot.com, has some specific studies of major scientific journal publishing sites and their use of third-party trackers, in particular “16 of the top 20 research journals let ad networks spy on their readers” (12 March 2015), https://go-to-hellman.blogspot.com/2015/03/16-of-top-20-research-journals-let-ad.html; see also a very disturbing past posting, “Sci-Hub, LibGen, and total information awareness” (21 March 2016), https://go-to-hellman.blogspot.com/2016/03/sci-hub-libgen-and-total-information.html.

43. Several of the postings just mentioned at Eric Hellman’s blog (https://go-to-hellman.blogspot.com/) deal with this issue.

44. There are very interesting anecdotes about the propensity of the intelligence community to replicate data within its own security perimeters, sometimes at very substantial expense, rather than expose their queries either to external hosting services or to potential third-party eavesdroppers.

45. There are two very distinct approaches here; both require attribute passing, but they use different attributes. One is to pass attributes to the platform provider that should be returned to the library as part of reporting and statistical aggregation (e.g., this is a physics graduate student); here there’s a shared or common collection of statistics known to both content provider and the library. The other is to pass a one-time unique ID to the platform operator, maintain a local database that links this ID to attributes, and demand that the platform operator pass the ID back; the library can then reconnect the attributes and do whatever aggregated statistical analysis it wishes. In this second approach, the content provider gets very little data about users. Obviously, the information technology infrastructure demands on the library to meet the requirements of the latter scenario are much more challenging.
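
As a rough illustration of the second approach in note 45, the following minimal sketch shows a library issuing one-time IDs, keeping the ID-to-attribute mapping locally, and rejoining attributes once the platform returns usage keyed by ID. Everything here is hypothetical: the `PseudonymBroker` class, its method names, the attribute names, and the shape of the usage report are assumptions made for illustration, not a description of any deployed library or publisher system.

```python
import secrets
from collections import Counter


class PseudonymBroker:
    """Hypothetical library-side service (illustrative only)."""

    def __init__(self):
        # Local database linking each one-time ID to patron attributes.
        # This mapping never leaves the library.
        self._attributes_by_token = {}

    def issue_token(self, attributes):
        """Create a random one-time ID to pass to the platform operator."""
        token = secrets.token_urlsafe(16)
        self._attributes_by_token[token] = dict(attributes)
        return token

    def aggregate(self, usage_report, attribute):
        """Rejoin attributes to returned IDs and count uses per group.

        usage_report is an assumed format: (token, item_id) pairs
        passed back by the platform operator.
        """
        counts = Counter()
        for token, item_id in usage_report:
            attrs = self._attributes_by_token.get(token)
            if attrs is None:
                continue  # ID not issued by this library; ignore it
            counts[(attrs.get(attribute, "unknown"), item_id)] += 1
        return counts


broker = PseudonymBroker()
t1 = broker.issue_token({"department": "physics", "status": "graduate"})
t2 = broker.issue_token({"department": "physics", "status": "faculty"})
t3 = broker.issue_token({"department": "history", "status": "graduate"})

# What the platform operator sees and returns: opaque IDs and items only.
usage_report = [(t1, "article-A"), (t2, "article-A"), (t3, "article-B")]
by_department = broker.aggregate(usage_report, "department")
```

In this sketch the platform never learns that a given ID belongs to a physics graduate student; only the library can make that connection. The cost is that the library must operate and secure the mapping itself, which is the infrastructure burden the note describes.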

46. A splendid example of this is the work of Sam Kome at the Claremont Colleges; see “Patron activity monitoring and privacy protection” (20 November 2016), http://scholarship.claremont.edu/library_staff/51.

47. It’s very important to recognize that while public libraries have been on the front lines of these problems, they are going to be very serious issues for research libraries as well in the coming years. Research libraries need to be able to acquire, provide access to, and, essentially, preserve, very large segments of the mass-market book literature. Also note that, unlike scholarly publishers, who immediately recognized that they could not attract authors without a credible preservation strategy, at least some mass market publishers seem to be utterly unaware or indifferent about the central problem of preserving the scholarly and broader cultural record in a world that is moving to increasingly digital-only content. In the discussion that follows, I’ll focus only on the access issues related to mass-market e-books, though in the long run, the preservation related ones might be even more serious.

48. This is very much a U.S. centric view. In the U.K., Australia, Canada, and many European nations there are provisions for what are called Public Lending Rights (PLR). These arrangements, which in some cases date back to the late 1940s, reimburse authors (and sometimes also publishers) for the use of works by public libraries, counted either by holdings or by actual circulation transactions. The details are complex and vary greatly from nation to nation. In some cases they are linked to national copyright law; in others they are largely about supporting national authors and literatures, particularly in limited markets like the Scandinavian nations, and some programs, notably in Germany, establish social welfare funds for authors in need. Authors, not publishers, have been the greatest public advocates of these arrangements, and generally get most of the revenues; in the U.K. the late John Fowles was a great advocate. A good intellectual and political (not narrowly legal) history of development in this area is Thomas Stave, “Public lending right: A history of the idea,” Library Trends, volume 29, number 4 (1981), pp. 569–582, http://hdl.handle.net/2142/7164. Daniel Y. Mayer, “Literary copyright and the public lending right,” Case Western Reserve Journal of International Law, volume 18, number 3 (1986), pp. 483–500, http://scholarlycommons.law.case.edu/jil/vol18/iss3/4, provides a very helpful but more legal analysis, including details on the arrangements in Germany, and an account of the abortive U.S. efforts to establish similar arrangements (in direct conflict with the first sale doctrine of copyright, if one links PLR to copyright, rather than to subsidy and support programs for authors).

49. Clifford A. Lynch, “Ebooks in 2013: Promises broken, promises kept, and Faustian bargains,” Digital content: What’s next, Supplement to American Libraries (May 2013), http://www.cni.org/wp-content/uploads/2013/05/ALA-Ebooks-Paper.pdf.

50. The path breaking and best work on the privacy issues around OverDrive is that of Gary Price; see his postings on www.infodocket.com, as early as 27 September 2011 and “A note on library patron and student privacy” (4 October 2011).

51. There are many issues related to reviewing that I won’t address here: the qualifications and nature of reviewers and the purpose of the reviews (adding value and context and critical analysis of a work as opposed to simply helping a purchase/reading time allocation decision) and how these two activities distribute across various media channels, but for two short introductions to these questions see Jane Hu, “A short history of book reviewing’s long decline,” The Awl (15 June 2012), https://theawl.com/a-short-history-of-book-reviewings-long-decline-9293ee01a1a6#.7vjsi55yv, and Sarah Fey, “Book reviews: A tortured history,” Atlantic (25 April 2012), http://www.theatlantic.com/entertainment/archive/2012/04/book-reviews-a-tortured-history/256301/, and the citations therein.

52. We are already seeing a number of disturbing developments where corporations are issuing Fitbit-type devices and linking them to health insurance and “wellness” programs (as well as amusing efforts to game these developments: see Rachael Backman, “Want to cheat your Fitbit? Try a puppy or a power drill,” Wall Street Journal (9 June 2016), http://www.wsj.com/articles/want-to-cheat-your-fitbit-try-using-a-puppy-or-a-power-drill-1465487106). Of particular interest here are the Fall 2015 developments at Oral Roberts University, where Fitbits were issued to students and connected to learning outcomes data. See Dian Schaffhauser, “How data from your LMS can impact student success,” Campus Technology (2 December 2015), https://campustechnology.com/Articles/2015/12/02/How-Data-From-Your-LMS-Can-Impact-Student-Success.

53. See Karl Bode, “Mom, my Barbie needs a better firewall,” Techdirt (2 December 2015), https://www.techdirt.com/articles/20151130/12190132946/mom-my-barbie-needs-better-firewall.shtml, and also James Vlahos, “Barbie wants to get to know your child,” New York Times Magazine (20 September 2015), http://www.nytimes.com/2015/09/20/magazine/barbie-wants-to-get-to-know-your-child.html. More recently, Karl Bode, “Another lawsuit highlights how many ‘smart’ toys violate privacy and aren’t secure,” Techdirt (8 December 2016), https://www.techdirt.com/articles/20161206/05111336202/another-lawsuit-highlights-how-many-smart-toys-violate-privacy-arent-secure.shtml. For a broader look at this issue, see Melanie Bates, “Kids & the connected home: Privacy in the age of connected dolls, talking dinosaurs and battling robots” (1 December 2016), Future of Privacy Forum, Family Online Safety Institute, https://fpf.org/2016/12/01/kids-connected-home-privacy-age-connected-dolls-talking-dinosaurs-battling-robots/. It’s not just Barbie: see for example the CloudPets scandal (Lorenzo Franceschi-Bicchierai, “Internet of Things Teddy Bear leaked 2 million parent and kids message recordings” (27 February 2017), https://motherboard.vice.com/en_us/article/internet-of-things-teddy-bear-leaked-2-million-parent-and-kids-message-recordings, or Troy Hunt, “Data from connected CloudPets Teddy Bears leaked and ransomed, exposing kids’ voice messages” (28 February 2017), https://www.troyhunt.com/data-from-connected-cloudpets-teddy-bears-leaked-and-ransomed-exposing-kids-voice-messages/), or Kimiko de Freytas-Tamura, “The bright-eyed talking doll that just might be a spy,” New York Times (18 February 2017), on the banning of the Cayla Doll in Germany as an eavesdropping device. For a broader look at child reading in the cloud, see also the important paper by Eric M. Meyers, Lisa P. Nathan, and Casey Stepaniuk, “Children in the cloud: Literacy groupware and the practice of reading,” First Monday, volume 22, number 2 (2017), http://dx.doi.org/10.5210/fm.v22i2.6844 or http://firstmonday.org/article/view/6844/5845.

54. We’ve just had what I think is the first legal case about access to data collected by these kinds of in-home always-listening “microphones”: see Russell Brandom, “How much can police find out from a murderer’s echo?” The Verge (6 January 2017), http://www.theverge.com/2017/1/6/14189384/amazon-echo-murder-evidence-surveillance-data. At least at present, these devices usually use a keyword of some sort to recognize that you want their attention, and then upload your speech to network-based servers for processing; this article has some interesting information about retention and access policies for this uploaded information.

55. See for example BBC News, “Not in front of the telly: Warning over ‘listening’ TV” (9 February 2015), www.bbc.com/news/technology-31296188 and “Samsung tweaks television policy over privacy concerns” (10 February 2015), https://bits.blogs.nytimes.com/2015/02/10/samsung-tweaks-television-policy-over-privacy-concerns/. Another particularly outrageous example of this is the recent Vizio scandal; see Sapna Maheshwari, “Vizio collected viewer data without consent: What you should know,” New York Times (8 February 2017), p. B3.

56. Kimiko de Freytas-Tamura, “Maker of ‘smart’ vibrators settles data collection lawsuit for $3.75 million,” New York Times (14 March 2017), https://www.nytimes.com/2017/03/14/technology/we-vibe-vibrator-lawsuit-spying.html.

57. I have a long personal history with this quote, which I first used in “The battle to define the future of the book in the digital world,” First Monday, volume 6, number 6 (2001), http://firstmonday.org/article/view/864/773 or http://dx.doi.org/10.5210/fm.v6i6.864. I believe that I heard this statement made by Alan Kay in a keynote address at an EDUCOM conference. I’m indebted to Edward J. Valauskas for an actual reference to the same statement in print, though attributing it to Minsky rather than Kay. See Ray Kurzweil, The age of intelligent machines, Cambridge, Mass.: MIT Press, 1990, p. 328.

58. See the discussion in “The battle to define the future of the book in the digital world,” op. cit., about the critical misunderstanding that confuses e-books and their reading devices (as successors for individual books) with personal digital libraries.

59. Kathleen Fitzpatrick, “Academia, not edu,” Planned Obsolescence (26 October 2015), http://www.plannedobsolescence.net/academia-not-edu/.

60. See Roger C. Schonfeld, “Who has all the content?” (23 February 2017), http://scholarlykitchen.sspnet.org/2017/02/23/who-has-all-content/. See also Jeffrey Perkel, “‘Kudos’ promises to help scientists promote their papers to new audiences,” Nature, volume 536, number 7614 (4 August 2016), pp. 113–114, http://dx.doi.org/10.1038/536113a.

61. One can only imagine the data bonanza that is offered, for example, by a completely out of historical profile search for rather specific medical information by a member of the broad public, and the potential value of such data to advertisers and many other parties.

62. For more information on this poorly understood sector, see “Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers,” U.S. Federal Trade Commission (March 2012), https://www.ftc.gov/reports/protecting-consumer-privacy-era-rapid-change-recommendations-businesses-policymakers (see also “FTC’s privacy report: Balancing privacy and innovation,” https://www.ftc.gov/news-events/media-resources/protecting-consumer-privacy/ftc-privacy-report and the investigative reporting done by the Wall Street Journal (“What they know,” http://www.wsj.com/public/page/what-they-know-digital-privacy.html)). Page 187 of Schaufenbuel (op. cit.) contains a provocative enumeration of creative abuses of consumer reading information by third parties.

63. An important inflection point here is Google’s October 2016 low-key announcement that it was dismantling the “wall” between the database maintained by DoubleClick (which it acquired in 2007) and its other data collection tools in the interest of providing more targeted advertising. See Julia Angwin, “Google has quietly dropped ban on personally identifiable Web tracking,” ProPublica (21 October 2016), https://www.propublica.org/article/google-has-quietly-dropped-ban-on-personally-identifiable-web-tracking.

64. See Phil Jones, “Outliers and the importance of anonymity: Usage data versus snooping on your customers,” Scholarly Kitchen (16 February 2015), https://scholarlykitchen.sspnet.org/2015/02/16/outliers-and-the-importance-of-anonymity-usage-data-versus-snooping-on-your-customers/ (see also comments on this post) about the mediating role and balance that publishers play between reader privacy and author added value in the scholarly setting.

65. Wouter Haak, “The world is changing for the researcher: What can we do?” presentation given at the LIBER 43rd Annual Conference, Riga, Latvia (2–5 July 2014), http://liber2014.lnb.lv/programme/sponsor-presentations/abstracts-and-biographies/#WouterHaak; “Journey to research impact,” The Hunter Forum 2015 at the American Library Association Midwinter Conference, Chicago (31 January 2015), https://libraryconnect.elsevier.com/articles/live-event-journey-research-impact-ala-midwinter. Unfortunately, neither talk seems to have been recorded, or at least I cannot find them anywhere online as of late 2016.

66. See Bruce Schneier, “Amazon is analyzing the personal relationships of its reviewers,” Schneier on Security (8 July 2015), https://www.schneier.com/blog/archives/2015/07/amazon_is_analy.html, and Chris Morran, “Amazon is data mining reviewers’ personal relationships,” Consumerist (6 July 2015), http://consumerist.com/2015/07/06/amazon-is-data-mining-reviewers-personal-relationships/, for what seems to be known about this.

67. For example: how effectively can we preserve the digital record of a personal calendar across time?

68. A few have even been preserved intact, such as the library of King George III of England at the British Library, or of John Adams at the Boston Public Library. Inquiries about the contents and subsequent fate of the libraries of important figures have long been of interest to scholars and book collectors; a nice recent example is the exhibition on the library of the Elizabethan figure John Dee in 2016 at the Royal College of Physicians in London (https://www.rcplondon.ac.uk/events/scholar-courtier-magician-lost-library-john-dee) or Phillip Ball, “Archive of wonders,” Nature, volume 529 (28 January 2016), p. 464, http://dx.doi.org/10.1038/529464a.

69. See Clifford A. Lynch, “The Future of Personal Digital Archiving: Defining the Research Agendas,” In: Donald T. Hawkins (editor). Personal archiving: Preserving our digital heritage. Medford, N.J.: Information Today, 2013, pp. 259–278, https://www.cni.org/wp-content/uploads/2013/09/Personal-Digital-Archiving-Cliff-Lynch-Oct-29-2013.pdf.

70. While authors (as opposed to musicians) usually don’t make much money touring, it’s interesting to see that the growing adoption of music streaming services like Spotify offers very timely and detailed information about where, geographically, material from a specific musician is in high demand. In an environment where poorly compensated musicians gain much of their income from live performance, this kind of market data is likely to be of real interest, and would perhaps help to counterbalance the meager royalties that musicians gain from streaming. Note that this trend builds on a series of developments unfolding in the music world since the late 1990s, as performance revenues began to dominate revenues from recording sales as a primary source of income for a large number of musicians, leading musicians to invest in a variety of electronic and social media based connections with interested fans around the world who represented potential members of live audiences.

71. I simply note here the difference between trying to understand the reception of a work by an intended or expected audience and trying to understand the nature of the actual audience that a work finds. For pre-release reader groups, there are also questions about the size and composition of these test-reader groups that are not well explored, and the biases that they may introduce. One should also be aware of built-in bias against complex, uncomfortable, or troubling works, or, simply, works considered unlikely to appeal to very many people that may ultimately find, perhaps, a substantial audience. Also, we should recognize that there are a number of e-book platform and service providers targeted at preview readers, such as Jellybooks (www.jellybooks.com).

72. As just one example, see State Public Interest Research Groups, RIPOFF 101: How the publishing industry’s practices needlessly drive up textbook costs, second edition (February 2005), http://www.studentpirgs.org/reports/ripoff-101-2nd-edition.

73. The arXiv preprint server (https://arxiv.org) at Cornell University keeps all versions of a preprint, and it is very common to see authors post several versions over time as they revise the preprint.

74. For an introduction to the issues here and extensive pointers to the literature, see the work of Ithaka S+R and Stanford University: Sharon Slade, “Applications of student data in higher education: Issues and ethical considerations,” ITHAKA S+R (6 September 2016), https://doi.org/10.18665/sr.283891, and Rayane Alamuddin, Jessie Brown, and Martin Kurzweil, “Student data in the digital era: An overview of current practices,” ITHAKA S+R (6 September 2016), https://doi.org/10.18665/sr.283890. For a look at some of the publisher-university negotiations, see Carl Straumsheim, “Is ‘inclusive access’ the future for publishers?” Inside Higher Ed (31 January 2017), https://www.insidehighered.com/news/2017/01/31/textbook-publishers-contemplate-inclusive-access-business-model-future.

75. There is some question about whether passive monitoring and inference about student reading by tracking how long they have texts on screen, scrolling patterns, and the like are really sufficient. After all, the students might learn to deceive. So there are firms working on what’s variously described as “active engagement” or “skimming prevention” by embedding various questions or polls into complex texts, forcing readers to explicitly interact with this framework as part of the reading process. See, for example, the ForClass offering described in Carl Straumsheim, “Antiskimming software,” Inside Higher Education (15 July 2015), https://www.insidehighered.com/news/2015/07/15/business-school-software-doubles-skimming-prevention-tool.

76. Indeed, one of the convenient structural arrangements between the government and the private sector that seems to be gaining in popularity is asking (or requiring) the private sector to hold various collections of monitoring data so that the government, theoretically under various legal controls, can use these collections when needed; this is argued to be better than simply having the government archive the data internally.

77. There is another way to think about this: what’s the additional differential information gain from reading analytics for various government purposes? While there are lots of interesting edge cases where it’s particularly helpful, in general, I think that the gains are limited in the context of large-scale surveillance. Of course, for specific, in-depth prospective or forensic intelligence, or criminal investigations, this data can offer extensive new insights into what people are doing and how their thinking may be evolving. And one must never underestimate the dangers of overly aggressive and overreaching prosecutors and investigators, often driven more by random prurient interests, ideology, or political ambitions than by any concern with genuine national security interests or even reasonable law enforcement agendas. In these hands personal reading data can be incredibly dangerous.

78. For an excellent, broader look at this, see Maciej Cegłowski, “What happens next will amaze you,” Fremtidens Internet, Copenhagen, Denmark (September 2015), http://idlewords.com/talks/what_happens_next_will_amaze_you.htm; I’m indebted to David S.H. Rosenthal for the pointer to this talk.

79. This is Oscar Gandy’s unforgettable metaphor; see Oscar H. Gandy, The panoptic sort: A political economy of personal information, Boulder, Colo.: Westview Press, 1993. Gandy, of course, repurposed Jeremy Bentham’s idea of the panopticon, which was returned to broad discourse by Foucault’s work Discipline and punish: The birth of the prison (New York: Pantheon Books, 1977). But as far as I know the “panoptic sort” comes from Gandy.

80. See, for example, Alan Travis, “‘Snooper’s Charter’ bill becomes law, extending UK state surveillance,” Guardian (29 November 2016), https://www.theguardian.com/world/2016/nov/29/snoopers-charter-bill-becomes-law-extending-uk-state-surveillance.

81. This has been widely studied, of course, in many different dimensions. Good places to start might include the Pew Internet and American Life surveys (http://www.pewinternet.org/) and danah boyd, It’s complicated: The social lives of networked teens, New Haven, Conn.: Yale University Press, 2014, http://www.danah.org/books/ItsComplicated.pdf, to name only two sources.

82. The somewhat broader question about what to do with really good customer or audience data is arising in a surprising number of settings. Consider the efforts to track, in very high resolution detail, the movements of retail store customers through their mobile phones, or, perhaps more relevant, the conversations about the growing ability of museums to track visitor behavior and what uses should be made of this data (see Ellen Gamerman, “When the art is watching you,” Wall Street Journal (11 December 2014), http://www.wsj.com/articles/when-the-art-is-watching-you-1418338759, for an overview of developments).

83. See, for example, Karl Bode, “Consumer groups say AT&T, Comcast violate privacy law by hoovering up cable box data without full user consent,” Techdirt (16 June 2016), https://www.techdirt.com/articles/20160609/12091134667/consumer-groups-say-att-comcast-violate-privacy-law-hoovering-up-cable-box-data-without-full-user-consent.shtml.

84. A useful introduction here is Justin Fox, “Why streaming services are so secretive,” Bloomberg View (3 August 2015), https://www.bloomberg.com/view/articles/2015-08-03/why-streaming-services-are-so-secretive-about-users.

 


Editorial history

Received 14 January 2017; revised 24 March 2017; accepted 28 March 2017.


Creative Commons License
This paper is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

The rise of reading analytics and the emerging calculus of reader privacy in the digital world
by Clifford Lynch.
First Monday, Volume 22, Number 4 - 3 April 2017
http://journals.uic.edu/ojs/index.php/fm/article/view/7414/6096
doi: http://dx.doi.org/10.5210/fm.v22i4.7414





A Great Cities Initiative of the University of Illinois at Chicago University Library.

© First Monday, 1995-2017. ISSN 1396-0466.