Evaluating quality control of Wikipedia's feature articles
First Monday

by David Lindsey



Abstract
The purpose of this study was to evaluate the effectiveness of Wikipedia’s premier internal quality control mechanism, the “featured article” process, which assesses articles against a stringent set of criteria. To this end, scholars were asked to evaluate the quality and accuracy of Wikipedia featured articles within their area of expertise. A total of 22 usable responses were collected from a variety of disciplines. Of the 22 articles assessed, only 12 were found to pass Wikipedia’s own featured article criteria, indicating that Wikipedia’s process is ineffective. This finding suggests both that Wikipedia must take steps to improve its featured article process and that scholars interested in studying Wikipedia should be careful not to naively believe its assertions of quality.

Contents

Introduction
Research methodology
Results
Conclusions and recommendations

 


 

Introduction

Since its founding in 2001, the online encyclopedia Wikipedia has produced a phenomenal quantity of articles, including more than three million in its English language version. Wikipedia has also garnered some recognition for articles of high quality, most notably in 2005 when Nature declared that Wikipedia was “nearly as accurate as Britannica” on many scientific topics [1]. On the other hand, Wikipedia is often plagued by tremendous failures of quality. For example, in 2005, Wikipedia’s biography of the journalist John Seigenthaler alleged that Seigenthaler had been involved in the assassination of John Kennedy (Seelye, 2005), and in 2009, fake quotes inserted into the biography of composer Maurice Jarre were picked up by mainstream media organizations and included in several obituaries (Fitzgerald, 2009). Other articles in Wikipedia are simply made up. For example, an article on the entirely imagined Baldock Beer Disaster (purportedly a tragic brewery accident) was briefly featured on Wikipedia’s main page (“Recent additions,” 2005 and “Articles for deletion,” 2007).

The featured article process is characterized by a complex bureaucracy and set of criteria (with four major requirements and seven sub–requirements). In order for an article to become “featured” it must be evaluated by a number of other Wikipedia contributors on the basis of those criteria. If a group of Wikipedia contributors agree that the article is of “featured” quality, the director of the process or one of his deputies officially awards “featured” status, and the article is recognized with a small bronze star in the upper left–hand corner. Featured articles are also eligible to be prominently displayed on the site’s main page as “Today’s featured article.”

The standards for featured articles are quite exacting. For example, the article must be “a thorough and representative survey of the relevant literature on the topic”, and its prose must be “engaging, even brilliant, and of a professional standard” [2]. As a result of these stringent criteria, at the time of this writing only 2,615 articles (fewer than one in 1,000) had achieved featured status on Wikipedia. Because of the perceived strengths of the featured article process, many earlier authors, including Huberman and Wilkinson (2007), Poderi (2009), and Blumenstock (2008), have accepted for the purpose of their research that featured articles are of high quality. Like nearly all aspects of Wikipedia, however, the featured article process relies on anonymous volunteers, and there is no reason to assume that these individuals are in any way qualified to judge whether or not an article is in fact of high quality.

All of this, of course, raises a question: does the featured article process actually work?

 

++++++++++

Research methodology

In order to assess the effectiveness of Wikipedia’s featured article process, I contacted a number of subject matter experts by e–mail and asked each of them to assess a featured article on Wikipedia. The articles to be assessed were selected randomly, though I discarded those articles for which no qualified expert could be found (most such articles were those which lie far outside the boundaries of traditional academic inquiry). Each expert was asked to comment on the general quality and accuracy of the article he or she was assessing, to comment specifically on whether it satisfied Wikipedia’s own featured article criteria, to compare the article to other materials, and to rate the article on a scale of one to 10 (where 10 is best). The numerical rating was not directly connected to Wikipedia’s own criteria; instead, expert reviewers were asked to rate, in their own opinion, the “overall” quality of each article.

In all, I contacted 160 experts and received 22 usable evaluations of Wikipedia articles, spanning a wide variety of disciplines. All of the reviewers were initially contacted during the months of August and September 2009 and all responses were received by the beginning of October 2009.

 

++++++++++

Results

The evaluation results demonstrate a fundamental unevenness of quality among Wikipedia’s featured articles. Some of the articles assessed proved to be quite excellent. For example, Charles Esdaile (Professor of History at the University of Liverpool and author of several books on the Napoleonic Wars, including Fighting Napoleon: Guerrillas, Bandits and Adventurers in Spain, 1808–1814) wrote of the article on the Battle of Barossa, “I am glad to say that it is a very complete account which contains no obvious errors and is certainly more detailed than [its] rivals; moreover, it is both well written and accessible.” On the other hand, Grigory Ioffe (Professor of Geography at Radford University and the author of Understanding Belarus and How Western Policy Misses the Mark), wrote of the article on Belarus, “This is a piece of immature writing unusual even for the Wikipedia.”

Overall, of the 22 articles assessed, the expert reviewers found that 12 (54.5 percent) clearly passed Wikipedia’s criteria for a featured article. Another seven (31.8 percent) clearly failed the criteria, and the remaining three (13.6 percent) were borderline cases.
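The shares above follow directly from the reported counts. As a minimal sketch (the variable and function names are illustrative, not part of the study), the arithmetic can be checked as follows:

```python
# Counts reported in the text: 22 articles assessed in total.
counts = {"passed": 12, "failed": 7, "borderline": 3}
total = sum(counts.values())


def share(n, total):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Percentage share for each outcome category.
shares = {label: share(n, total) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} of {total} ({pct} percent)")
```

Because each share is rounded separately, the three percentages need not sum to exactly 100.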

As for the numerical quality score I requested, the scores assigned ranged from one to nine (no article received a perfect 10) and averaged roughly seven. In the table below, articles are ordered by the score they received in these assessments.

 

Table 1: Distribution of scores for selected “featured articles” in Wikipedia.

| Score | Number of articles | Article titles |
|---|---|---|
| 1 | 1 | Max Weber (http://en.wikipedia.org/wiki/Max_Weber) |
| 2 | 0 | |
| 3 | 2 | Toru Takemitsu (http://en.wikipedia.org/wiki/Toru_Takemitsu) |
| | | California Gold Rush (http://en.wikipedia.org/wiki/California_Gold_Rush) |
| 4 | 1 | Belarus (http://en.wikipedia.org/wiki/Belarus) |
| 5 | 1 | Alzheimer’s Disease (http://en.wikipedia.org/wiki/Alzheimer's_Disease) |
| 6 | 0 | |
| 7 | 2 | Ten Commandments in Roman Catholicism (http://en.wikipedia.org/wiki/Ten_Commandments_in_Roman_Catholicism) |
| | | The Swimming Hole (http://en.wikipedia.org/wiki/The_Swimming_Hole) |
| 7.5 | 1 | Introduction to General Relativity (http://en.wikipedia.org/wiki/Introduction_to_general_relativity) |
| 8 | 8 | Four Times of Day (http://en.wikipedia.org/wiki/Four_Times_of_Day) |
| | | Maiden Castle, Dorset (http://en.wikipedia.org/wiki/Maiden_Castle,_Dorset) |
| | | Charles Darwin (http://en.wikipedia.org/wiki/Charles_Darwin) |
| | | George Kennan (http://en.wikipedia.org/wiki/George_F._Kennan) |
| | | Economy of India (http://en.wikipedia.org/wiki/Economy_of_India) |
| | | Funerary Monument to Sir John Hawkwood (http://en.wikipedia.org/wiki/Funerary_Monument_to_Sir_John_Hawkwood) |
| | | Global Warming (http://en.wikipedia.org/wiki/Global_Warming) |
| | | Poliomyelitis (http://en.wikipedia.org/wiki/Poliomyelitis) |
| 8.5 | 1 | Georg Forster (http://en.wikipedia.org/wiki/Georg_Forster) |
| 9 | 5 | History of Solidarity (http://en.wikipedia.org/wiki/History_of_Solidarity) |
| | | Rachel Carson (http://en.wikipedia.org/wiki/Rachel_Carson) |
| | | Parallel Computing (http://en.wikipedia.org/wiki/Parallel_Computing) |
| | | Sonatas and Interludes (http://en.wikipedia.org/wiki/Sonatas_and_Interludes) |
| | | Battle of Barossa (http://en.wikipedia.org/wiki/Battle_of_Barossa) |
| 10 | 0 | |
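
The summary figures quoted before Table 1 (scores ranging from one to nine, averaging about seven) can be recomputed from the table's score counts. The following is a minimal sketch; the dictionary is transcribed from Table 1, and the variable names are illustrative only.

```python
# Score -> number of articles, transcribed from Table 1.
distribution = {
    1: 1, 2: 0, 3: 2, 4: 1, 5: 1, 6: 0,
    7: 2, 7.5: 1, 8: 8, 8.5: 1, 9: 5, 10: 0,
}

# Total number of assessed articles.
n_articles = sum(distribution.values())

# Weighted mean: each score counts once per article that received it.
mean_score = sum(score * count for score, count in distribution.items()) / n_articles

# Range of scores actually awarded (ignoring scores no article received).
awarded = [score for score, count in distribution.items() if count > 0]
lowest, highest = min(awarded), max(awarded)

print(f"articles: {n_articles}, mean: {mean_score:.2f}, range: {lowest} to {highest}")
```

The weighted mean comes to 155/22, or about 7.05, which is consistent with the reported average of seven.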

 

It is worth noting that many of the articles assessed did score quite well, proving that Wikipedia’s contributors can produce very good articles. The articles receiving lower scores, however, show quite convincingly that Wikipedia’s attempt at quality control is failing. Even among those articles that scored highly, there was room for improvement. For example, David Archer (Professor of Geophysical Sciences at the University of Chicago and the author of Global Warming: Understanding the Forecast), scored the article on global warming at an eight and wrote that it was “very concise and clear”, but remarked that he could tell “it was not written by professional climate scientists” and noted an error in the way the article explained how clouds are included in climate models. Similarly, Jan Kubik (Associate Professor of Political Science at Rutgers and the author of The Power of Symbols Against the Symbols of Power: The Rise of Solidarity and the Fall of State Socialism in Poland) delivered a favorable review of the article “History of Solidarity”, scoring it at a nine, but noted three small errors in it.

Among the articles that did not score as well, several of the expert reviewers compared the articles to the work of high school students or university undergraduates. For example, Malcolm Rohrbough (Emeritus Professor of History at the University of Iowa and author of Days of Gold: The California Gold Rush and the American Nation) wrote that the article on the California Gold Rush was “written at about the level of a junior in high school.” Several others also pointed to the problems associated with non–expert authors, noting that the sources used were poorly selected and not representative of the broader