e-Exclusion and Bot Rights: Legal aspects of the robots exclusion standard for public agencies and other public sector bodies with Swedish examples

by Nicklas Lundblad



Abstract
Public sector use of the robot exclusion standard raises interesting questions about transparency, availability of public sector information and the principle of public access to information. This paper explores both actual examples of how public sector agencies in Sweden use the standard and an analysis of the legal problems related to use of the standard.

Contents

1. Introduction
2. The robots exclusion standard
3. The principle of public access to official documents
4. The Public Sector Information Directive
5. Use of the robots exclusion standard in the Swedish public sector
6. Discussion
7. An argument for conscious, careful and competitively neutral use of the RES
8. Conclusion

 


 

1. Introduction

The characterization of our society as an information society is unfounded. There are, indeed, vast and ever–growing information repositories around the globe, in Web sites, databases and other digital forms, but this by itself does not provide the foundations for an information society. Instead, we are facing a noise society, where the information we need is always just out of grasp, lost in the exploding noise.

Under these circumstances search becomes an essential societal function. Between the sea of noise and the information society stand filters of different kinds, and search engines provide one very special kind of filtering tool. These filters allow us to cut through much of the information overload and find what we need, quickly and with some accuracy.

The growing utility of search also makes it important to ask questions about search engines and their practices. It becomes essential to examine the mechanisms of search, the consequences of being searchable as well as not being searchable and the long–term changes search policies can effect. The design of our search architecture becomes an important regulative fact (Lessig, 1999; Biegel, 2001).

This paper will focus on one very narrow aspect of this growing subject area, the relationship between the public sector and search, especially in terms of transparency. Public sector agencies and other bodies, such as public universities or research institutions, are under legal obligations to establish a certain amount of transparency. But to what extent can they actually choose not to be searchable? Not indexed or not crawled by popular search engines?


And specifically: to what extent can a public agency or body use the so–called robots exclusion standard to escape being indexed? What legal aspects bear on public sector use of the standard? Is it possible for a public agency to exclude only one search engine and welcome others? Can a publicly funded research institution exclude some parts of its Web site from indexing because researchers or their donors request this? Or is there a general obligation for public bodies to be searchable and indexed by as many robots and other agents as possible?

The paper focuses on a Swedish context, which adds particular value since Sweden has a strong principle of public access to information. It concentrates on two legal aspects: the principle of public access in Sweden and the Public Sector Information Directive. A survey of different uses and practices at Swedish public agencies and other public sector bodies is also presented.

 

++++++++++

2. The robots exclusion standard

2.1 How it works

The robots exclusion standard is very simple to understand [1]. It consists of a plain text file placed in the root directory of the server that one wants to protect. The text file is named robots.txt and defines a policy for search bots, signalling whether or not a certain directory may be indexed by a robot.

The policy is formulated in a simple way, and the simplest possible application of the standard is a file that stops all bots. Such a file would be formatted thus:

# go away
User-agent: *
Disallow: /

This, put in a robots.txt file, signals that the site owner does not want the site to be indexed at all. The end result is a very simple standard that can be used by anyone, containing only two different functions: “User-agent” and “Disallow”. These two functions take as arguments different user–agents, that is, different bots from different search providers, and different directories or files. The “Disallow” function can be global, as in the example, or local, listing the directories that the site owner wants exempted from indexing.
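As an illustration of how a well–behaved crawler consumes such a policy, the sketch below uses Python’s standard urllib.robotparser module on the “go away” file above. The bot name and path are hypothetical; in practice a crawler would fetch the policy from the server root with set_url() and read() before asking the same question.

# Minimal sketch of how a compliant crawler consults a robots.txt policy before fetching.
# The policy text is the "go away" example above; the bot name and path are hypothetical.
import urllib.robotparser

policy = """
# go away
User-agent: *
Disallow: /
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(policy.splitlines())

# A compliant bot asks before fetching; here everything is disallowed.
print(robots.can_fetch("ExampleBot", "/any/path/"))  # prints False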

2.2 Why use the robots exclusion standard?

Another important point is that using the robots exclusion standard is quite legitimate. There are cases where indexing becomes burdensome on server capacity, or where one simply does not want certain folders indexed by public search engines. The method is not fool–proof, but it offers an opportunity to control access to information and to manage server load.

2.3 Status

2.3.1 Formal standard, de facto standard or neither

The robots exclusion standard is not a formal standard. If anything it has become a de facto standard, respected by many search engines and effectively implemented in many pieces of search software. The Web site holding the standard notes [2]:

This document represents a consensus on 30 June 1994 on the robots mailing list (robots–request@nexor.co.uk) [Note the Robots mailing list has relocated to WebCrawler. See the Robots pages at WebCrawler for details], between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www–talk@info.cern.ch). This document is based on a previous working draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.

There exist some extensions and modifications of the standard, but these will not be mentioned in detail in this paper [3].

2.3.2 Other robot–management strategies

Aside from using the robots exclusion standard it is also possible to use so–called metatags on a page to stop bots from indexing the page or following the links published within it. This method poses many of the same questions as the robots exclusion standard, but will not be dealt with here [4].
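For completeness: the metatag approach typically means placing a robots directive in the page header, along the lines of <meta name="robots" content="noindex, nofollow">, which asks compliant bots not to index the page and not to follow its links. The exact attribute values honoured vary between search engines, so this is given only as an illustration.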

2.4 Legal implications of self–declared bot policies

2.4.1 Acceptable use and digital trespass

What, then, are the legal implications of using a robots.txt file on a Web server? Is there a legal obligation for a bot to respect a self–declared standard that is neither formal nor codified by any of the larger standards organizations?

The short answer is that no one knows. However, there are examples that can be thought of as analogous and that can be used to explore the issue. One close analogy can be found by examining “acceptable use” policies on Web sites and their legal value.

An “acceptable use” policy is any kind of policy document available on a Web site, declaring how a specific site and its content may be used.

In a number of cases, at least in the United States, courts have signalled that they take these policies seriously, with important legal consequences. In particular, these policies have become associated with a legal notion sometimes called “cyberspace–as–place”: the idea that a visitor to a Web site may well be, in some sense, a trespasser (Hunter, 2002; Lemley, 2002; Cohen, 2007).

There are, of course, many differences between an acceptable use policy and a robots.txt file, not the least of which is that it is possible that a human visitor to a Web site never notices the robots.txt file (even if this might apply to some acceptable use policies as well, if they are hidden in strange places on the site). But these differences are not so large as to negate the possibility of an argument ex analogia where the robots.txt file is seen as an acceptable use policy for bots and crawlers of different kinds.

The important question, of course, becomes whether a bot can be seen as trespassing, or whether you have to be human to be able to trespass. Could a train trespass? A car? What, then, about a bot? These seem almost childish questions, but they illustrate the great difficulties in solving problems related to the new technological architecture we are facing. Curiously, the idea that bots could trespass seems to strengthen the legitimacy of the cyberspace metaphor, and of the image of cyberspace as a separate domain of which the early cyberlibertarians were so enamoured (Barlow, 1996; Post and Johnson, 2001).

Since in at least one case a bot was seen as trespassing (Lemley, 2002 supra note 22), it seems reasonable to assume that ignoring the robots.txt file could be seen as digital trespass, even if this has not been established, at least not in Swedish law.

2.4.2 Self–declared compliance

The question of the legal status of self–declared bot policies depends on two parties: the Web site being indexed and the search engine. Up until now we have assumed that the party behind the indexing crawler has declared nothing about its position on the use of the robots exclusion standard. But what if a search engine has explicitly recommended the use of the standard and publicly said that it will respect it?

This is the case with, for example, Google. In their advice to webmasters they clearly state [5]:

To remove your site from search engines and prevent all robots from crawling it in the future, place the following robots.txt file in your server root:
User-agent: *
Disallow: /
 
To remove your site from Google only and prevent just Googlebot from crawling your site in the future, place the following robots.txt file in your server root:
User-agent: Googlebot
Disallow: /
 
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you’d use the robots.txt files below.
 
For your http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
 
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /

Does this affect the legal analysis of the status of the standard? Is this support page a binding promise from Google to respect the robots exclusion standard? If this is assumed to be the case, a host of new questions become important: how soon after a change in the bot–policy of a Web site must this change be reflected in Google's index? And is this self–declared compliance binding over time? Or can Google at any time change its compliance?

It is reasonable to assume that this support page has some legal relevance. A unilateral declaration may seem a weak basis on which to build the legal effects of the robots exclusion standard, but it remains a possible step in establishing that Google, and other search engines which commit themselves in the same way, actually assume a legal obligation to respect the standard when they point to it as an important measure available to Web sites unwilling to be indexed.

2.4.3 Binding or not?

In the end it will be up to courts to decide what the status of a robots.txt file should be. In Sweden such legal precedents are lacking, and we are left to guesswork. This need not be a problem. In this paper we will assume that the use of the robots exclusion standard legally binds a search engine or other search software. This premise allows us to concentrate on the true subject of our research: whether public agencies and other public sector bodies can indeed use the standard and, if so, in what ways.

The question is not limited to the robots.txt file. Any other machine–readable, self–declaring policy will have to be examined in much the same way. One example could be the envisaged standard DPRL, the Digital Property Rights Language, designed to express rights in particular digital objects (DPRL, 1998; Stefik, 1998).

Any other new standard that expresses rights and obligations would have to be examined specifically for its use in the public sector.

 

++++++++++

3. The principle of public access to official documents

3.1 History and background

A short introduction to the principle of public access to official documents is probably a good starting point for the later discussion. Readers already familiar with the Swedish regulation may well skip this section.

Sweden has one of the longest legal traditions in the world when it comes to public access to official documents. The first constitutional law containing the principle was crafted and put into force in 1766. This law contained a right of public access to official documents that has since been expanded and renewed. In the current legislation it is found in the first paragraph of the second chapter of Tryckfrihetsförordningen, the Freedom of the Press Act, which governs the freedom of the press in Sweden:

To encourage the free exchange of opinion and availability of comprehensive information, every Swedish citizen shall be entitled to have free access to official documents.

The principle is very extensive and has a wide scope. There are several precedents and a large body of literature on the application of the principle.

3.2 Legal rules on public access

The principle is fairly simple to explain, but the devil is in the details. Somewhat simplified, all Swedish citizens have the right of access to official documents, unless these have been classified or releasing them would constitute an infringement of privacy. A public official who refuses to give out information must show clearly on what exact legal grounds the decision is founded. A failure to comply with the principle is a crime, punishable under criminal law.

Much of the law and precedent hinges on the definition and understanding of what actually constitutes an official document or act. This is sometimes difficult to decide, but the concept has been applied broadly. It is uncontroversial to note that documents and other information held on a computer are encompassed by the definition. The content of a Web site is normally to be regarded as an official document.

The exemptions allowed for in the Freedom of the Press Act are narrow, and the most practically important exemptions, besides classified information and certain privacy infringements, are draft documents and memoranda.

It would take us too far to delve more deeply into the principle and the associated literature here. Strömberg (1999) offers one of the basic texts (in Swedish) and provides a thorough introduction.

3.3 Public access and digital formats

There is no obligation to allow access to information or official documents in a certain format. It does not follow from the principle of public access that a public agency must employ the broadest possible means of information access to all its information. But if information is available in a certain format, it normally means that a citizen has the right to acquire it in that format.

It follows that there is no obligation for public sector information holders to convert, scan or otherwise manipulate data in order for it to be available on the Web. This is an important fact to keep in mind when discussing the question of a duty to be searchable.

 

++++++++++

4. The Public Sector Information Directive

4.1 History and background

The European Commission has stated that it believes that public sector information could be used by entrepreneurs to open up a whole new market for public sector information in various forms. To ensure that states make information available, and to safeguard private enterprise against unfair competition from public sector actors that sell information, the Commission has crafted a directive on the re–use of public sector information. The result, Directive 2003/98/EC of the European Parliament and of the Council of 17 November 2003 on the re–use of public sector information, is now being implemented throughout the European Union [6].

4.2 Content of the Directive

The Directive applies to all public sector bodies, but exempts a number of different sectors. Research, archives and public service broadcasters, for example, do not have to comply with the Directive, at least not to the extent that they act within these roles and do not themselves engage in commercial re–use of their own information [7].

The general principle is found in Article 3:

Member States shall ensure that, where the re–use of documents held by public sector bodies is allowed, these documents shall be re–usable for commercial or non–commercial purposes in accordance with the conditions set out in Chapters III and IV. Where possible, documents shall be made available through electronic means.

The Directive then sets out specific rules on requests and conditions for reuse, as well as non–discrimination and fair trading. These rules lay down principles on prompt reply to requests, fair conditions, and competitive practices that all aim to create the preconditions for entrepreneurial innovation in the public sector information field. We will return to the individual articles later, in the discussion.

 

++++++++++

5. Use of the robots exclusion standard in the Swedish public sector

5.1 Selection, survey and intention

The general question of this paper could have been discussed and examined without empirical evidence. The empirical evidence, however, helps to suggest different categories of use of the robots exclusion standard, and gives good examples to work with in the discussion. Therefore a short empirical investigation was initiated.

A number of public agency Web sites were searched for the robots.txt file. Where such a file was found it was copied and printed out. The different files were then examined and categorized.
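As an aside, such a survey is easy to automate. The sketch below, in Python, fetches robots.txt from a list of hosts and keeps whatever answers; the domains shown are hypothetical placeholders rather than the surveyed sites, and the actual survey was done by hand.

# Illustrative sketch of the survey method: fetch robots.txt from a list of sites.
# The host names below are hypothetical placeholders, not the surveyed agencies.
import urllib.request
import urllib.error

hosts = ["www.example-agency.se", "www.example-university.se"]

found = {}
for host in hosts:
    url = "http://%s/robots.txt" % host
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            found[host] = response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        pass  # no robots.txt file, or the site could not be reached

for host, policy in found.items():
    print("===", host, "===")
    print(policy)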

The survey was not exhaustive, nor was the intention to make firm statements about the frequency of use of the robots exclusion standard. The collection of samples is merely intended to give a rough idea of different practices and formulations of the robots.txt file. The public sector bodies examined and found to have a robots.txt file were:

Uppsala University
Riksdagen — National Parliament
Lantmäteriet — manages the Swedish cadastral system
Regeringen.se — Government official Web site
Riksarkivet — National archives
Statistiska Centralbyrån — Statistics Sweden
ESV — Swedish National Financial Management Authority
Konkurrensverket — Swedish Competition Authority
KTH — Royal Institute of Technology
KI — Karolinska Institute
HSV — Swedish National Agency for Higher Education
Arbetsmiljöverket — Swedish Work Environment Authority
Bolagsverket — Swedish Companies Registration Authority
Datainspektionen — Swedish Data Inspection Board
Försvarsmakten — Swedish Armed Forces

These public sector bodies all used a robots.txt file in their root directory. It should be noted that I do not imply that the use of the robots.txt file has been decided upon by the leadership at these agencies and bodies. Indeed, I think that many have not even reflected on its use at all.

5.2 Complete exclusion

Some of the surveyed Web sites completely disallow any use of robots. Examples of complete exclusion were found at the Swedish National Financial Management Authority and the Swedish National Agency for Higher Education. Interestingly, none of these attempts at complete exclusion was correctly formatted; when checked with a simple tool they turned out to be defective [8]. The erroneous file looked like this in both cases:

User-agent:
Disallow:

For the file to be formatted correctly, the functions would have to have been given arguments (* and /) signifying that all content should be exempt from indexing, as in the sketch below.
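A correctly formatted complete exclusion would thus repeat the pattern already shown in section 2.1:

User-agent: *
Disallow: /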

5.3 Partial exclusion of content

Many of the surveyed Web sites excluded parts of their sites. What was excluded varies.

The Swedish Competition Authority has excluded a directory called “Beslut”, which in English translates to “Decisions”:

# robots.txt for http://www.kkv.se/
User-agent: *
Disallow: /beslut

Uppsala University has excluded a number of different directories, for sometimes unclear reasons [9]:

User-agent: *
Disallow: /Titlepage/
Disallow: /Education/Program/
Disallow: /Education/Kurser/
Disallow: /coop/skandia/
Disallow: /Forskning/
Disallow: /oktober/
Disallow: /Internt/old/
Disallow: /Library/
Disallow: /navbar/
Disallow: /Postcards/

Another example of a research facility is the Karolinska Institute. The Institute has a remarkably old robots.txt file, amply commented:

# robots.txt for http://www.ki.se/
# Ulf Kronman 18 jul 95 - 97

User-agent: * # all robots
Disallow: /SFgate/ # WWW-WAIS scripts
Disallow: /cgi/ # scripts
Disallow: /cgi-bin/ # scripts
Disallow: /sgi/ # scripts
Disallow: /form/ # mallar
Disallow: /demo/ # not for external use
Disallow: /gif/ # pictures - nothing to index
Disallow: /it/kurs/test/ # course testing - not for external use
Disallow: /it/select/ # KI software - not for external use
Disallow: /it/torget/ # KI software - not for external use
Disallow: /sys/ # system docs - not for external use
Disallow: /templ/ # templates - not for external use
Disallow: /test/ # testing - not for external use
Disallow: /db/ # databases - not for external use
Disallow: /alex/ # not for external use
Disallow: /usage/# not for external use
Disallow: /usage2/# not for external use
Disallow: /ADSL/ # not for external use
Disallow: /webbval/# not for external use
Disallow: /statistik/# not for external use
Disallow: /webstat/# not for external use
Disallow: /kemi2/# not for external use
Disallow: /kemi1/# not for external use
Disallow: /static/
Disallow: /php/
Disallow: /styrelseval/

The last directory excluded — board election — is interesting. It points to a series of test elections for the board. It is unclear why they were excluded.

Regeringen.se — the government’s Web site — excludes all downloads:

User-agent: *
Disallow: /download/

The reason may be that much of the content consists of large PDF files, and that indexing these consumes too much bandwidth, but the exclusion is not commented upon.

Riksdagen.se — the National Parliament — excluded all URLs containing the phrase Media=Print by using the following code:

User-agent: *
Disallow: /*Media=Print

This, however, earns a warning in the syntax checker. The warning is interesting, because it makes clear that Riksdagen is not using the basic version of the robots exclusion standard, but a version commonly associated with Google:

Disallow: /*Media=Print
The "*" wildchar in file names is not supported by (all) the user–agents addressed by this block of code. You should use the wildchar "*" in a block of code exclusively addressed to spiders that support the wildchar (Eg. Googlebot).

This implies that the robots.txt file at the Parliament has been designed specifically for Googlebot and other bots that accept that dialect of the robots exclusion standard.
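Following the checker's advice, one way to silence the warning would be to confine the wildcard rule to a block addressed only to agents known to support it. A minimal sketch (not the Parliament's actual file) might read:

User-agent: Googlebot
Disallow: /*Media=Print

# other user-agents, which may not support the wildcard, are left unrestricted
User-agent: *
Disallow: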

Riksarkivet, the National Archives, excludes a number of different directories:

User-agent: *
Disallow: /sok/
Disallow: /prog/
Disallow: /javascript/
Disallow: /ra/nad/

The last excluded directory is the directory of the National Archives’ database, which is commercially available through that URL.

Statistiska Centralbyrån, Statistics Sweden, has an extensive robots.txt file:

###############################
#
# disallow all folders but
# /gemensamma_filer/
# /grupp/
# /Statistik/
# /templates/
#
User-agent: *
#
# list folders robots are not allowed to index
#
Disallow: /admin/
Disallow: /aspnet_client/
Disallow: /backup/
Disallow: /bin/
Disallow: /CustomEdit/
Disallow: /Databaser/
Disallow: /edit/
Disallow: /help/
#Disallow: /images/
Disallow: /js/
Disallow: /lang/
Disallow: /stats/
#Disallow: /styles/
Disallow: /system/
Disallow: /upload/
Disallow: /wap/
#
#
###############################

The excluded directories again contain a database directory, but the main statistics directory is left open to indexing.

Bolagsverket, the Swedish Companies Registration Authority, has a very simple file, excluding some templates:

User-agent: *
Disallow: /_notes/
Disallow: /adm/
Disallow: /Templates/

It is far more interesting to examine the contents of the robots.txt file at Datainspektionen, the Swedish Data Inspection Board:

User-agent: *

Disallow: /cgi-bin
Disallow: /bilder
Disallow: /IMS
Disallow: /ims
Disallow: /javascript
Disallow: /sokhjalp
Disallow: /Templates
Disallow: /tomcat4-webapps/enkat
Disallow: /tomcat4-webapps/sok
Disallow: /om_datainspektionen/om_datainspektionen.shtml
Disallow: /fragor_svar/fragor_svar.shtml
Disallow: /temasidor/temasidor.shtml
Disallow: /lagar/lagar.shtml
Disallow: /lattlast
Disallow: /puo/puo.shtml
Disallow: /phpBB
Disallow: /phpBB
Disallow: /lists
Disallow: /phplist-2.8.6
Disallow: /cgi-bin
Disallow: /poll
Disallow: /lists
Disallow: /toppbanner
Disallow: /sokhjalp/sokhjalp.shtml
Disallow: /webbkarta
Disallow: /utbildning_konferenser/utbildning_konferenser.shtml
Disallow: /utbildning_konferenser/anmalan_inkasso2.shtml
Disallow: /nyhetsarkiv/nyhetsarkiv.shtml
Disallow: /anmalan_forhandskontroll/anmalan_forhandskontroll.shtml
Disallow: /tillstand_tillstandshavare/tillstand_tillstandshavare.shtml
Disallow: /info.shtml
Disallow: index4.shtml
Disallow: stilmall.css
Disallow: stilmall2.css
Disallow: fel.shtml

Why on earth would the Board want its FAQ to be excluded from search indexing? Or a page with laws on privacy? Or educational resources? Without commenting on the Data Inspection Board directly, one could note that it is not too far–fetched to imagine a search engine company that sells search software. How would you improve your search results, say compared to Google's? One way would be to fix the competition: disallow the pages that you want your own search solution to excel at finding, and then compare.

“Did Google find the FAQ? No, but our search software did.” — But only because it ignores the robots.txt file. This scenario is fictional, but not implausible.

5.4 Partial exclusion of content and of different crawlers

Försvarsmakten, the Swedish Armed Forces, have a simple robots.txt file:

User-agent: *
Disallow: /attachments/
Disallow: /images/

User-agent: sitecheck.internetseer.com
Disallow: /

The interesting thing is not that it excludes attachments and images, but rather that it excludes one particular search bot. The agent excluded comes from a site that offers monitoring services [10]:

The monitoring systems remotely check your website from several geographic monitoring stations at selected intervals. If the monitoring system is unable to reach the site, an email, cell phone or pager alert is sent to notify you of the problem.

It is unclear why this user agent is excluded from all directories, but perhaps there is a security concern here: the Swedish Armed Forces may not want a third party to monitor when its site is down, since this could reveal maintenance schedules and other information security weaknesses. Again, this is guesswork.

Another example of a Web site that excludes different agents is the Royal Institute of Technology. The Institute also has by far the largest robots.txt file in the survey:

User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-perl
Disallow: /internt
Disallow: /info
Disallow: /kthprog
Disallow: /kthcd
Disallow: /student2
Disallow: /kthnytt
Disallow: /kth-nytt
Disallow: /utbildning/vidareutbildning01/
Disallow: /utbildning/vidareutbildning02/
Disallow: /utbildning/vidareutbildning03/
Disallow: /utbildning/vidareutbildning04/
Disallow: /html-files
Disallow: /src
Disallow: /smartsieve

User-agent: googlebot
Disallow: /kthprog
Disallow: /kthcd
Disallow: /student2
Disallow: /kthnytt
Disallow: /kth-nytt
Disallow: /html-files

User-agent: kth-crawler
Disallow: /student2
Disallow: /html-files

User-agent: SLCrawler
Disallow: /kthprog
Disallow: /kthcd
Disallow: /internt
Disallow: /info
Disallow: /organisation/institutioner.html
Disallow: /organisation/departments.html
Disallow: /utbildning/utlandsstudier/utbytesuniversitet.html
Disallow: /education/partneruni.html
Disallow: /education/contact_persons.html
Disallow: /utbildning/studinfo.html
Disallow: /kthnytt
Disallow: /fakulteter
Disallow: /conferences
Disallow: /aktuellt/kalendarier
Disallow: /forskning/safari/avd
Disallow: /utbildning/schema
Disallow: /organisation/uf/
Disallow: /kth-allm/karta
Disallow: /organisation/centra.html
Disallow: /organisation/centra-eng.html
Disallow: /organisation/utbildningsnamnder.html
Disallow: /student2
Disallow: /html-files

User-agent: Scooter
Disallow: /

This extensive exclusion file distinguishes between different user–agents, from internal to external crawlers. It goes so far as to exclude one user–agent, Scooter (the old AltaVista bot, rarely seen nowadays), from the Web site altogether. It also has a hierarchy of exclusion: first a default block for all user–agents that are not otherwise named, and then extensive, specific rules for some identified user–agents, among them Googlebot. The file contains no comments on the directories excluded, and it excludes both directories and single files.
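The practical effect of such a hierarchy can be tested with a short script. The sketch below uses Python's standard urllib.robotparser module on a trimmed, hypothetical policy of the same shape as the KTH file; it illustrates how a compliant parser resolves the per-agent blocks, not how any particular crawler actually behaves.

# Sketch: how a per-agent hierarchy in a robots.txt file plays out for different crawlers.
# The policy below is a trimmed, hypothetical version of the KTH file, not the file itself.
import urllib.robotparser

policy = """
User-agent: *
Disallow: /internt
Disallow: /info

User-agent: Googlebot
Disallow: /kthcd

User-agent: Scooter
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(policy.splitlines())

# A bot obeys the most specific block that names it; otherwise the default (*) block applies.
print(parser.can_fetch("SomeOtherBot", "/internt/"))  # False: caught by the default block
print(parser.can_fetch("Googlebot", "/internt/"))     # True: Googlebot has its own block
print(parser.can_fetch("Googlebot", "/kthcd/"))       # False
print(parser.can_fetch("Scooter", "/anything"))       # False: excluded from the whole site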

 

++++++++++

6. Discussion

6.1 Use of the RES and the principle of public access

Is use of the robots exclusion standard possible under the principle of public access to official documents? Let us examine the arguments for and against.

The argument against use of the standard would consist of several steps.

In the information society the explosion of content has made search engines absolutely necessary for finding anything on the Internet. All documents on at least the public part of a public sector body’s Web site must be presumed to be a) official documents; and, b) accessible to Swedish citizens under the principle of public access. The use of the robots exclusion standard effectively makes documents, all or a selection of them, unavailable. This is in direct conflict with the principle.

Furthermore, using the standard selectively has the same effect as classifying some documents, since they will not be found by anyone if they are not indexed. This means that selective use of the standard cannot be allowed without support in legislation allowing the documents in question to be classified as secret.


In addition, use of the robots exclusion standard is a practice that makes it more difficult for citizens to exercise the right of review and control intended by the legislator in framing the Freedom of the Press Act. It circumvents not only the actual rules but also the original intent of the legislation. No public Web site should be allowed to use the robots exclusion standard for any reasons other than purely technical ones. Webmasters for these sites should then be required to declare publicly, on the site, which directories and files have been excluded from search engines and on what basis.

One could make several arguments to defend the exclusion of certain digital public documents.

Firstly, the principle of public access means a right of access to certain public documents, not to every document produced by an agency. It does not follow from the principle of public access that a given agency must keep a well–ordered and simple–to–search archive for every citizen who wants to pursue a general investigation into the agency’s activities.

Secondly, the principle applies only to Swedish citizens. There exists no obligation to provide public access to official documents to, essentially, the rest of the world. Indeed, search engines sometimes make information available in what European privacy law calls third countries, and there is then an obligation not to transmit personal data across borders if the level of data protection cannot be ascertained to be equivalent to that within the European Union.

Thirdly, there exists no right to demand access to official documents in a given format. Nothing — except the sheer absurdity — stops agencies from providing access to a given site in printed form. There is no obligation for agencies to provide access to their sites in a certain specific form or format.

Fourthly, the principle of public access to official documents is not a right of access to structured information. It is an access right, not a presentation right. Some assume that public documents should be easily searchable, but there is no legal obligation for any agency to optimize its official public documents for searchability.

In summary, the selective use of the robots exclusion standard is a means for public agencies and departments to protect some information from global access: information for internal use, information related to national security, or material in draft or incomplete form.

6.2 Use of the RES and the PSI Directive

What, then, of the Public Sector Information (PSI) Directive? Could it be argued that there are obligations not to use a robots exclusion standard or to use it only in certain ways?

One could argue that the Directive makes it impossible to use the robots exclusion standard because searchability is an essential component in fair trading and equal treatment.

Let’s assume that a certain public agency has a lot of desirable map data, and that the agency sells this information on its own Web site. Further, it uses a proprietary search engine developed within the agency for searching and analysing geographic data. This search engine is strong and efficient, and gives excellent results. A company trying to compete with the agency is given access to the agency’s data — but not to its search engine [11]. This competitor is instead forced to use an external search engine approved by the agency, since the agency blocks all other search engines, including Google.

Would this be in conflict with the provisions of the Directive? It seems at least possible to argue that it would, based on the principle of fair trading.

Another possible argument against the use of the standard can be found in Article 5, part 1 of the Directive, which states that:

Public sector bodies shall make their documents available in any pre–existing format or language, through electronic means where possible and appropriate. This shall not imply an obligation for public sector bodies to create or adapt documents in order to comply with the request, nor shall it imply an obligation to provide extracts from documents where this would involve disproportionate effort, going beyond a simple operation.

Note the obligation to use electronic means where “possible and appropriate.” This could at least be taken to imply that unnecessary and unwarranted use of the robots exclusion standard is problematic.

Furthermore, it seems possible to use the prohibition of exclusive arrangements in Article 11, part 1 to make the argument that the robots exclusion standard — used selectively — discriminates:

The re–use of documents shall be open to all potential actors in the market, even if one or more market players already exploit added–value products based on these documents. Contracts or other arrangements between the public sector bodies holding the documents and third parties shall not grant exclusive rights.

An agency that excludes certain companies — because they use proprietary indexing of publicly available data — should at least have a good argument for doing so under the Directive.

Overall, it seems doubtful whether excluding some user–agents but not others can be justified.

In summary, there are numerous cases where the Directive could come into conflict with the robots exclusion standard. The outcome of such a conflict would depend on the interpretation not only of specific articles in the Directive, but also of the role that searching and indexing play where re–use of public sector information is significant.

6.3 Other legal aspects of the use of the RES

6.3.1 Digital rights management and robots exclusion standard

Should the robots exclusion standard and the robots.txt file be categorized as a technological measure as defined by European copyright law (WIPO, 1996)? If we accept such a categorization it follows that disobeying the robots.txt file would constitute illegal circumvention.

What about copyright? Let us assume that public agencies retain copyright to their information. Then the task becomes one of deciding whether the robots exclusion standard and the robots.txt file constitute rights–management information or a technological measure. The Directive defines these two technical terms as follows [12]:

For the purposes of this Directive, the expression “technological measures” means any technology, device or component that, in the normal course of its operation, is designed to prevent or restrict acts, in respect of works or other subject–matter, which are not authorised by the rightholder of any copyright or any right related to copyright as provided for by law or the sui generis right provided for in Chapter III of Directive 96/9/EC. Technological measures shall be deemed “effective” where the use of a protected work or other subject–matter is controlled by the rightholders through application of an access control or protection process, such as encryption, scrambling or other transformation of the work or other subject–matter or a copy control mechanism, which achieves the protection objective.

Rights management information is defined as [13]:

For the purposes of this Directive, the expression “rights–management information” means any information provided by rightholders which identifies the work or other subject–matter referred to in this Directive or covered by the sui generis right provided for in Chapter III of Directive 96/9/EC, the author or any other rightholder, or information about the terms and conditions of use of the work or other subject–matter, and any numbers or codes that represent such information.

The first subparagraph shall apply when any of these items of information is associated with a copy of, or appears in connection with the communication to the public of, a work or other subject–matter referred to in this Directive or covered by the sui generis right provided for in Chapter III of Directive 96/9/EC.

From one perspective, an interpretation of “technological measures” suggests that it might be possible to argue that the robots exclusion standard fulfils the requirements of the Directive. However, the robots exclusion standard seeks to prevent acts relating not to a specified set of works, but rather to a repository of works (if the Web site is not seen as a unitary work), and there is certainly the issue of the effectiveness of the measure.

The definition of “rights–management information” seems to provide a better option, particularly in conjunction with the phrase “information about the terms and conditions of use of the work or other subject–matter.”

One objection is that the robots exclusion standard does not (normally) refer explicitly to works in any way; it refers simply to directories where works can be found. This may seem a slight difference, but it is meaningful. It is hard to find analogies, but we could ask whether a lock on the door of a house is equivalent to a “no trespassing” sign. It is then possible to argue that a “no trespassing” sign is neither a technological measure nor “rights–management information”, since it does not refer or relate to any specific property or work. The sign merely prohibits entry. It says nothing about what you may find on entry, should you disregard the sign.

Certainly this is a tenuous objection, but it could have some relevance. On the other hand, one could argue that the robots exclusion standard sometimes does refer to specific works: it is quite possible to disallow indexing of specific files, and thus of specific works. An analysis of this use of the standard is even more complicated.

The matter of copyright and public information raises the question of whether the provisions of the Directive cover only works that are copyright–protected. Is it possible to protect content contractually where the copyright term has expired? Is it illegal to circumvent rights–management information or technical measures protecting such works? Rice (2001) discusses the contractual extension of protection; even if a public agency does not have copyright to its works, it may — in some cases — be illegal to circumvent rights–management information.

6.3.2 Intelligent agents and search bots — differences?

Does the robots exclusion standard apply solely to one kind of robot or crawler? One possibility is to limit the applicability of the standard to simple search crawlers, catalogued and self–declared as “user–agents.” This would then exempt more intelligent [14] agents from the standard (Groot, et al., 2003). Another possibility would be to extend the standard to all automated software traversing the Internet.

If the standard applies only to simpler bots, then certain search engines (or search engines in advanced search modes) could allow the user to ignore robots.txt files. This would then become, to some extent, a user choice. It presupposes, however, a different kind of search and indexing technology than the one used today: indexing takes time and cannot be done on the fly.

 

++++++++++

7. An argument for conscious, careful and competitively neutral use of the RES

7.1 Legal uncertainty, risk and new technologies

Using the robots exclusion standard is, in a sense, no different from using other new and interesting technologies. It carries with it the risk of running afoul of legal rules interpreted in novel ways in a new environment. There are certain tactics that can be employed to minimize the risks of these new technologies and their associated legal uncertainties. Using new technologies consciously, carefully and in a competitively neutral way might be a good start.

7.2 Conscious use

The use of the robots exclusion standard should be a conscious choice. Even where it is a purely technological choice, it should be brought to the attention of management. The reason for this is twofold: management needs the information to dispel any notion that the standard is being used for the wrong purposes, and informing management reduces the risk that the standard is used in unacceptable ways (for example, to boost search software installed locally).

Conscious use should be documented and, if possible, communicated clearly on the Web site. One possibility that should not be excluded is that public sites post clear acceptable use policies, with use by bots covered in a special section.

7.3 Careful use

The robots exclusion standard should be used carefully in the sense that the organisation should be informed about its use and all possible disadvantages examined beforehand. Feedback from external parties should be welcomed. Consulting with technical expertise would eliminate some of the uncertainties associated with any use of the standard.

Considering the status of the standard and its unclear legal situation, it should frankly not be used unless there are compelling reasons to do so.

7.4 Competitively neutral use

Any use of the standard must be competitively neutral and must not affect the market for commercial actors in any way. This applies not only to those agencies that compete in commercial information markets, but to all agencies. Should it become known that one search engine is regularly excluded but another is not, this would affect traffic to a given site and, in the end, financially harm the excluded search engine.

7.5 Further research

Further research is obviously needed. The use of filters of different kinds will become more of a necessity as the amount of public sector information grows. The design and use of these filters will pose many of the questions already asked in this paper.

The question of the robots exclusion standard is ultimately a question of how the public sector can limit access to and searchability of public information.

 

++++++++++

8. Conclusion

The fine–tuning of future search engine architectures will decide the shape and form of future democracy to some extent. The future may require not so much a principle of public access as a principle of public searchability.

 

About the author

Nicklas Lundblad is chief of staff at the Stockholm Chamber of Commerce. He is also a member of the ICT policy taskforce of Eurochambres and a member of the Swedish ICT standardisation commission as well as a member of the Swedish government’s newly established ICT Council. He is a previous member of the e–Europe advisory group and a columnist in several Swedish trade magazines as well as the author of three books and several articles on ICT and law.
E–mail: nicklas [at] skriver [dot] nu

 

Notes

1. See http://www.robotstxt.org/wc/norobots.html, accessed 30 April 2007.

2. See http://www.robotstxt.org/wc/norobots.html, accessed 30 April 2007.

3. See http://www.conman.org/people/spc/robots2.html#format.directives.disallow, accessed 1 May 2007.

4. See http://www.google.com/support/webmasters/bin/answer.py?answer=61050.

5. See http://www.google.com/support/webmasters/bin/answer.py?answer=35302, accessed 30 April 2007.

6. The European Commission recently saw itself forced to open infringement proceedings against a number of countries that have not yet implemented the Directive in national law.

7. This is a controversial point. One may argue that the exemptions are absolute and that archives, for example, are given a carte blanche to engage in unfair competition whenever they want to in selling information. This, however, seems an extreme interpretation of the exemptions.

8. I used http://tool.motoricerca.info/robots-checker.phtml to check the syntax of the robots.txt file.

9. The directory /oktober/ seems to contain an old backup of somebody called Pelle and a lot of old information.

10. http://www.internetseer.com/home/index.xtp;jsessionid=akTzxfN-93z9.

11. This would be quite possible under the assumption that the search engine is not itself a publicly or openly accessible means of indexing and searching public information, which seems a reasonable interpretation of the Directive.

12. Art 6, part 3 Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society Official Journal L 167, 22 June 2001 P. 0010 — 0019.

13. Art 7, part 2 Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society Official Journal L 167, 22 June 2001 P. 0010 — 0019.

14. This would presume that we agree on a technique to ascertain the cognitive ability of agents.

 

References

John Perry Barlow, 2001. “A Declaration of the Independence of Cyberspace,” In: Peter Ludlow (editor). Crypto Anarchy, Cyberstates and Pirate Utopias. Cambridge, Mass.: MIT Press, pp. 27–30.

Stuart Biegel, 2001. Beyond Our Control? Confronting the Limits of Our Legal System in the Age of Cyberspace. Cambridge, Mass.: MIT Press.

M.L. Boonk, D.R.A. de Groot, F.M.T. Brazier, and A. Oskamp, 2005. “Agent Exclusion on Websites,” LEA 2005 — The Law of Electronic Agents, pp. 13–20, and at www.iids.org/publications/Agent%20Exclusion%20Clauses.pdf.

Digital Property Rights Language: Manual and Tutorial — XML Edition Version 2.00 (13 November 1998), at http://xml.coverpages.org/DPRLmanual-XML2.html.

Dan Hunter, forthcoming. “Cyberspace as Place, and the Tragedy of the Digital Anticommons,” California Law Review, at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=306662.

Greg Lastowka, 2006. “Decoding Cyberproperty,” Indiana Law Review, at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=913977.

Mark A. Lemley, 2003. “Place and Cyberspace,” California Law Review, volume 91, and at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=349760. http://dx.doi.org/10.2307/3481337

Lawrence Lessig, 1999. Code and Other Laws of Cyberspace. New York: Basic Books.

Michael J. Madison, 2003. “Rights of Access and the Shape of the Internet,” Boston College Law Review, at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=346860.

David Post and David Johnson, 2001. “Law and Borders: The Rise of Law in Cyberspace,” In: Peter Ludlow (editor). Crypto Anarchy, Cyberstates and Pirate Utopias. Cambridge, Mass.: MIT Press, pp. 145–196; see also David Johnson and David Post, 1996. “Law and Borders: The Rise of Law in Cyberspace,” First Monday, volume 1, number 1 (May), at http://www.firstmonday.org/issues/issue1/law/.

David A. Rice, 2001. “Copyright as Artifact: Foundation for Regulation of Circumvention Technologies and Contractual Circumvention of Copyright Limits,” BILETA 2001, Edinburgh 9–10/4.

Håkan Strömberg, 1999. Tryckfrihetsrätt och annan yttrandefrihetsrätt. Lund: Studentlitteratur.

World Intellectual Property Organisation (WIPO), 2003. “Web site,” at http://www.wipo.org, accessed 30 March 2003.

World Intellectual Property Organisation (WIPO), 1996. “WIPO Copyright Treaty,” at http://www.wipo.int/treaties/en/ip/wct/trtdocs_wo033.html.

 


 

Editorial history

Paper received 1 June 2007; accepted 10 July 2007.



This work is licensed under a Creative Commons Attribution-Share Alike 2.5 Sweden License.

e–Exclusion and Bot Rights: Legal aspects of the robots exclusion standard for public agencies and other public sector bodies with Swedish examples by Nicklas Lundblad
First Monday, volume 12, number 8 (August 2007),
URL: http://firstmonday.org/issues/issue12_8/lundblad/index.html




