Manypedia: Comparing language points of view of Wikipedia communities
First Monday

Manypedia: Comparing language points of view of Wikipedia communities by Paolo Massa and Federico Scrinzi



Abstract
The three million articles of the English Wikipedia have been written in a collaborative fashion by more than 14 million volunteer editors. In each article, a community of editors strives to reach a neutral point of view, representing all significant views fairly, proportionately, and without biases. However, besides the English one, there are more than 270 editions of Wikipedia in different languages and their relatively isolated communities of editors are not forced by the platform to discuss and negotiate their points of view. So the empirical question is: Do communities on different language Wikipedias develop their own diverse Linguistic Points of View (LPOV)? To answer this question we created and released as open source Manypedia, a Web tool whose aim is to facilitate cross–cultural analysis of Wikipedia language communities by providing an easy way to compare automatically translated versions of their different representations of the same topic.

Contents

1. Introduction
2. Points of view and neutrality in Wikipedia
3. Language Wikipedia communities and their points of view
4. Manypedia
5. Possible uses of Manypedia and future work
6. Conclusion

 


 

1. Introduction

Wikipedia is becoming one of the most accessed Web resources for information needs. As of May 2010, 53 percent of American Internet users look for information on Wikipedia, up from 36 percent in February 2007 [1]. A survey found that 88 percent of 2,318 university students use Wikipedia during a course–related research process, even if an instructor advised against it [2]. According to Alexa, Wikipedia is the sixth most visited site of the entire Web [3].

It is hence clear that a large share of people rely on Wikipedia for forming their representations of facts, of what is true and what is not. This fact is even more interesting considering that every single word on which so many people rely could have been added by anyone. In fact Wikipedia’s slogan is “the free encyclopedia anyone can edit” and indeed the more than three million articles of the English Wikipedia, since its inception in 2001, have received more than 453 million edits by more than 14 million registered users [4]. It is even possible to edit Wikipedia without logging in with a personal username, and hence to edit the encyclopedia anonymously. Despite this ultimate openness, the quality of Wikipedia articles is relatively high. A 2005 investigation by Nature found that “Wikipedia comes close to Britannica in terms of the accuracy of its science entries”. It also reported that more than 70 percent of Nature authors consult Wikipedia on scientific topics [5].

Given its importance in shaping the wisdom and world view of so many people, we believe it is important to raise awareness of who the people editing Wikipedia are. This concern is fully shared by the Wikipedia community itself: for instance on Wikipedia there is a page in the “Wikipedia” namespace, the one devoted to policies and rules, titled “Wikipedia:Systemic bias” [6] which states “the Wikipedia project suffers from systemic bias that naturally grows from its contributors’ demographic groups, manifesting an imbalanced coverage of a subject, thereby discriminating against the less represented demographic groups.” The page clearly lists the main biases: “The average Wikipedian on the English Wikipedia is a male, technically inclined, formally educated, an English speaker (native or non–native), European–descent, aged 15–49, from a majority–Christian country, from a developed nation, from the Northern Hemisphere, and likely employed as a white–collar worker or enrolled as a student rather than employed as a labourer”. There is even a project, described and coordinated at the page “Wikipedia:WikiProject Countering systemic bias”, which lists what Wikipedians can do in order to counter this important issue.

In this paper we are not interested in the biases internal to a specific wiki such as the English Wikipedia; rather, we focus on the existence (or absence) of different biases in different language communities of Wikipedia. In fact, while the largest and oldest Wikipedia is in English, there are currently more than 270 editions of Wikipedia in as many different languages, ranging from several with more than 700,000 articles, such as the German, French, Polish, Italian, Japanese and Spanish ones, down to smaller ones in languages such as Wolof, Catalan, Piedmontese, Latin, Esperanto, Tibetan, Haitian and more.

So the motivating question for this contribution is “do people who self–elect for editing the page about Palestine in the English Wikipedia have and represent the same points of view as people who self–elect to edit the counterpart article on the Arabic Wikipedia or on the Hebrew Wikipedia?” We call this lens of investigation Linguistic Point of View (LPOV).

This paper is structured as follows. First, in Section 2, we describe Wikipedia editing policies and in particular the neutral point of view (NPOV) policy and, in Section 3, we highlight the richness of the multi–cultural phenomenon represented by the communities of the different language Wikipedias. Then Section 4 is devoted to presenting Manypedia, the Web tool we have created and deployed whose aim is to facilitate the comparison and analysis of different points of view represented in the equivalent articles about the same topic as they appear on two different language Wikipedias. We conclude by introducing examples of comparisons that show how the tool can be used for research and scientific investigation purposes and for the maintenance of Wikipedia as a healthy and balanced cross–cultural project.

 

++++++++++

2. Points of view and neutrality in Wikipedia

A project as open and inclusive as Wikipedia would hardly be possible without precise rules and guidelines. In fact, over the years, a complex and vast set of rules was developed by the Wikipedia community through the distributed contributions and negotiations of thousands of people, just as content articles were. Policies and rules are pages in the namespace “Wikipedia:”.

Among the most important rules, there are the three core content policies reported on the pages “Wikipedia:Neutral point of view”, “Wikipedia:Verifiability” and “Wikipedia:No original research”.

The first one, neutrality, is the policy defining Wikipedia itself or, as Roy Rosenzweig in “Can History Be Open Source? Wikipedia and the Future of the Past” [7] puts it, the “founding myth” of Wikipedia. Its definition in a nutshell is “Editors must write articles from a neutral point of view, representing all significant views fairly, proportionately, and without bias.” The neutral point of view (NPOV) policy “says nothing about objectivity” and “in particular, the policy does not say that there is such a thing as objectivity in a philosophical sense — a ‘view from nowhere’ [8], such that articles written from that viewpoint are consequently objectively true.” “Rather, to be neutral is to describe debates rather than engage in them. In other words, when discussing a subject, we should report what people have said about it rather than what is so.” Having defined the goal, the page goes on to ponder the feasibility of such a task: “is it possible to characterize disputes fairly? This is an empirical issue, not a philosophical one: can we edit articles so that all the major participants will be able to look at the resulting text, and agree that their views are presented accurately and as completely as the context permits? It may not be possible to describe all disputes with perfect objectivity, but it is an aim that thousands of editors strive towards every day” (quotations from page “Wikipedia:Neutral point of view/FAQ”).

The two other core content policies are important as well. Verifiability refers to the fact that material written on Wikipedia must be attributed to a reliable, published source so that readers can check that the specific material has already been published; again, the goal is not truth but verifiability.

The “no original research” content policy states that “Wikipedia does not publish original thought: all material in Wikipedia must be attributable to a reliable, published source. Articles may not contain any new analysis or synthesis of published material that serves to advance a position not clearly advanced by the sources.”

Of course, writing “without bias” is “difficult” since “all articles are edited by people” and “people are inherently biased” [9]. This is attested by the many “edit wars” on Wikipedia pages [10]. An edit war occurs when two or more users who disagree about the content of a page repeatedly override (revert) each other’s contributions, rather than trying to resolve the disagreement by discussion.

In fact, consensus is the primary way in which editorial decisions are made on Wikipedia with the goal of establishing and ensuring neutrality and verifiability. Usually consensus is reached as a “natural and inherent product of editing; generally someone makes a change or addition to a page, then everyone who reads it has an opportunity to leave the page as it is or change it”. However, “when editors cannot reach agreement by editing, the process of finding a consensus is continued by discussion on the relevant talk pages” (page “Wikipedia:Consensus”). In fact, each article page, for example “Palestinian territories”, has an associated discussion page in the “Talk” namespace, such as “Talk:Palestinian territories”, where editors can discuss changes and improvements.

Rosenzweig argues that the most frequent debate topic on discussion pages is whether the article adheres to the NPOV and cites the “Armenian genocide” page as one example of the fact that “those debates can go on at mind–numbing length, such as literally hundreds of pages” [11].

There are articles, as we will see in the following, on which it seems harder to reach consensus and hold civil discussions. Many of these articles are linked from a page titled “List of controversial issues” and are often flagged with a warning message, signaling that at least one Wikipedian believes the page is not neutral; sometimes these pages are even blocked from editing. In fact, users with additional powers (administrators) can block a page of Wikipedia, in order to stop edit wars and cool down discussions, during periods in which consensus is not reached and discussions are particularly heated. It is interesting to note again that editing of articles and talk pages can be performed even by anonymous users, identified only by their Internet address.

Despite the apparent theoretical difficulty of reaching consensus among hundreds of editors, sometimes very vocal about a certain article topic, perhaps surprisingly the social process works quite well: the community has so far been able to regulate itself, and edit wars are very limited considering the size of the active community.

After this short description of the main collective process happening around each page and each edit, we would like to go back to the neutral point of view concept. “Neutrality requires that each article or other page in the mainspace fairly represents all significant viewpoints that have been published by reliable sources, in proportion to the prominence of each viewpoint” (“Wikipedia:NPOV”). Each article should “accurately indicate the relative prominence of opposing views. Ensure that the reporting of different views on a subject adequately reflects the relative levels of support for those views, and that it does not give a false impression of parity, or give undue weight to a particular view. For example, to state that ‘<According to Simon Wiesenthal, the Holocaust was a program of extermination of the Jewish people in Germany, but David Irving disputes this analysis>’ would be to give apparent parity between the supermajority view and a tiny minority view by assigning each to a single activist in the field”.

In fact, by giving different relative prominence to specific points of view, it is possible to write different histories. As George Orwell wrote in 1944, during World War II, “a Nazi and a non–Nazi version of the present war would have no resemblance to one another, and which of them finally gets into the history books will be decided not by evidential methods but on the battlefield” [12].

Interestingly in 2011 we are in a different situation: paraphrasing Orwell, we could argue that what eventually gets into Wikipedia is decided, not on the battlefield, but in a bottom–up collaborative fashion by millions of people, through discussions and negotiations of different points of view.

This bottom–up freedom can be considered undesirable by some governments and central institutions which might want to give more emphasis to “official” top–down points of view or even censor some points of view. This is hard to do on Wikipedia since every article is the result of the negotiations among the points of view of the people who self–elect for editing it. An interesting attempt to detect which organization is behind anonymous changes to Wikipedia pages which can be classified as propaganda is Wikiganda [13].

Along a similar line, there are reports about states censoring Wikipedia and making it unreachable, possibly in an attempt to control the spread of unwelcome information. It is not easy to track when a Web site is blocked in a country and whether the block applies to every page or just to some pages. An interesting example of this is China, and an obviously partial report of the situation over the years is written at the Wikipedia page “Blocking of Wikipedia by the People’s Republic of China”. Sometimes the block is reported to be only partial, on selected articles such as “Falun Gong” (a persecuted religious practice) and “Tiananmen Square protests of 1989”. In China, possibly also because of these blocks, there are two wiki–based knowledge repositories that are larger than the Wikipedia in Chinese language, at least according to the numbers reported by them: Hudong.com has more than four million articles and Baidu Baike has almost three million articles, while the Chinese Wikipedia has around 350,000 articles. As we will see, on the Chinese Wikipedia there is no evident bias towards information coming from the government point of view.

Similar motivations might have pushed the government of Cuba to launch its own online encyclopaedia, Ecured, “with the goal of presenting its view of the world and history” [14]. Interestingly the wiki clearly mentions that Ecured is built “from a decolonizer point of view”. The entry on the United States, for example, describes it as the “empire of our time, which has historically taken by force territory and natural resources from other nations, to put at the service of its businesses and monopolies” and that “it consumes 25% of the energy produced on the planet and in spite of its wealth, more than a third of its population does not have assured medical attention” [15]. These quotations are a clear example of specific points of view and of the different prominence they can get in articles of online encyclopedias.

The question of what the major point of view is, which are the minor ones, and how much relative prominence each should receive is one of the most discussed topics on Wikipedia talk pages [16], especially because it is hard for someone holding a certain POV to be neutral and balanced.

Moreover, as the Cuba example above illustrates, relevant viewpoints can be different for different communities and surely the prominence of each viewpoint can be very different. Anarchopedia (http://anarchopedia.org) and Conservapedia (http://conservapedia.com) are two online encyclopedias which constitute themselves as alternatives to Wikipedia. Their “founding myth” [17] is in fact a specific point of view, respectively the “anarchistic point of view” and the “conservative viewpoint”. While they gathered communities that are much smaller than Wikipedia’s, they exemplify a phenomenon: different histories are written depending on the point of view the community chooses to adopt.

On the other hand, Wikipedia has neutrality as its “founding myth” and “it is an aim that thousands of editors strive towards every day” (“Wikipedia:Neutral point of view/FAQ”).

Some people are skeptical that neutrality can be reached. For example, Larry Sanger, one of the two founders of Wikipedia, who left it in 2002, argues that “over the long term, the quality of a given Wikipedia article will do a random walk around the highest level of quality permitted by the most persistent and aggressive people who follow an article” [18].

Whether one is optimistic or pessimistic about the fate of Wikipedia in the long run, we believe the simple act of discussing is central to democracy, a healthy global society and the peaceful coexistence of different points of view. Rosenzweig considers that “those who create Wikipedia’s articles and debate their contents are involved in an astonishingly intense and widespread process of democratic self–education” and reports that the classicist James O’Donnell has argued that the benefit of Wikipedia may be greater for its active participants than for its readers: “A community that finds a way to talk in this way is creating education and online discourse at a higher level” [19].

These levels of discussion and interaction are indeed admirable, but in this paper we ask whether the negotiation and self–education that happen within each Wikipedia community also happen, at similar levels, between Wikipedia communities. Is it the case that users who strive to reach consensus on the page “Palestine” in the Arabic Wikipedia discuss and try to balance their views with users who self–elect for editing the equivalent page “Palestine” in the Hebrew Wikipedia? As we will see, this process is not particularly encouraged by the current socio–technical platform powering Wikipedia. So our empirical question is “will relatively isolated language communities of Wikipedia develop their own divergent representations for topics? Their own Linguistic Point of View (LPOV)?” This question has important implications for the cross–cultural mutual understanding and peaceful coexistence of world communities.

To this end, in the next section we report on the richness of the different language Wikipedia communities, while in Section 4 we present Manypedia, the Web mashup we have created as a tool for helping to answer the previous question.

 

++++++++++

3. Language Wikipedia communities and their points of view

There are more than 270 language editions of Wikipedia. Some of the most active ones are reported in Table 1. The largest one is the English Wikipedia which, started in 2001, currently counts more than three million articles which have received more than 453 million edits by more than 14 million registered users, as of April 2011. The users who performed at least five edits in the past month, considered the active part of the community, number 154,633.

 

Table 1: Statistics of some of the most edited Wikipedia by language
(http://meta.wikimedia.org/wiki/List_of_Wikipedias accessed on 4 April 2011). Active users contributed five times or more in the past month.

| Language | Prefix | Articles | Edits | Users | Active users |
|---|---|---|---|---|---|
| English | en | 3,599,764 | 453,559,344 | 14,269,619 | 154,633 |
| German | de | 1,209,199 | 90,801,134 | 1,199,256 | 25,118 |
| French | fr | 1,085,031 | 67,718,052 | 1,030,810 | 17,006 |
| Polish | pl | 790,763 | 27,374,358 | 423,589 | 5,505 |
| Italian | it | 787,217 | 43,207,289 | 617,294 | 8,412 |
| Spanish | es | 745,114 | 48,057,767 | 1,778,500 | 15,837 |
| Japanese | ja | 741,577 | 37,677,940 | 507,562 | 10,639 |
| Russian | ru | 695,103 | 34,992,236 | 646,756 | 12,797 |
| Dutch | nl | 679,497 | 25,218,039 | 379,565 | 5,203 |
| Portuguese | pt | 679,467 | 25,065,627 | 848,698 | 5,882 |
| Swedish | sv | 391,820 | 14,450,578 | 219,734 | 3,550 |
| Chinese | zh | 350,550 | 16,399,255 | 978,924 | 5,630 |
| Catalan | ca | 314,921 | 7,242,015 | 84,160 | 1,728 |
| Turkish | tr | 157,716 | 9,826,582 | 336,027 | 2,558 |
| Arabic | ar | 145,327 | 7,858,304 | 366,007 | 2,500 |
| Persian | fa | 120,695 | 6,300,171 | 216,032 | 1,859 |
| Volapük | vo | 118,852 | 2,342,302 | 11,057 | 64 |
| Malay | ms | 116,923 | 1,667,632 | 72,437 | 347 |
| Hebrew | he | 116,361 | 10,945,514 | 139,462 | 1,996 |
| Total | | 18,290,491 | 1,096,374,928 | 28,345,171 | |

 

The next largest Wikipedia communities assemble around languages that are widely spoken in the world, especially in countries with a good level of Internet penetration, such as German, French, Polish, Italian, Spanish and Japanese. On the other hand, a language spoken by more than a billion people, such as Chinese, has a relatively smaller community, but we have already reported how there are two other online encyclopedias in Chinese which gathered more users.

For our purposes it is especially interesting to look at the column of users and in particular of active users in Table 1, referring to the current “force load” of the community. In fact, different language editions of Wikipedia started at different times and it is important to understand the current situation of the community of Wikipedians. For instance, the Wikipedia in Catalan can count on a relatively small number of very dedicated users and was able to create the thirteenth Wikipedia in terms of number of articles. The case of Volapük is similarly interesting: a tiny group of speakers of this constructed language created a large number of articles. These examples show that a small community of really dedicated users can generate a large number of articles, especially when they care significantly about their language and, probably, their cultural heritage and world view.
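The point about small but dedicated communities can be made concrete with a little arithmetic on Table 1. The following Python sketch divides the number of articles by the number of currently active users for a few editions. The ratio is only a rough illustrative indicator (an assumption of this example, not a measure defined in this paper), since articles accumulate over the whole life of an edition while active users are counted over the past month only.

```python
# Figures copied from Table 1: (articles, active users) per language edition.
TABLE_1 = {
    "English": (3_599_764, 154_633),
    "Catalan": (314_921, 1_728),
    "Volapük": (118_852, 64),
    "Hebrew": (116_361, 1_996),
}


def articles_per_active_user(articles: int, active_users: int) -> float:
    """Crude indicator: existing articles divided by currently active users."""
    return articles / active_users


ratios = {
    language: articles_per_active_user(articles, active)
    for language, (articles, active) in TABLE_1.items()
}

# Print the editions from the highest to the lowest ratio.
for language, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{language}: {ratio:.1f} articles per active user")
```

Under this crude ratio, Volapük and Catalan stand well above the English edition, which is consistent with the observation that a small group of dedicated users can produce a large number of articles.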

Each Wikipedia has its own history and, partially, its own community. In fact, each language edition of Wikipedia is an independent installation of the MediaWiki server software. A relatively new feature of Wikipedia, “Unified login”, allows users to use the same username on all Wikipedias, as long as it is not already used by someone else. But, of course, this feature is used mainly by Wikipedians who know at least two languages and who are confident in contributing in both of them.

Different language Wikipedias are connected mainly, if not only, by interwiki links. In fact it is possible to link the article about, for example, “Palestine” in the Hebrew Wikipedia with its equivalent in Arabic Wikipedia simply by inserting an interwiki link of the form [[language code:Title]], for example [[ar:فلسطين]]. The Wikipedia server interprets this interwiki syntax and offers links to the equivalent page in the other language Wikipedia, on the left hand side of each Wikipedia page under a “Language” menu. These interwiki links must be inserted manually (or with the help of semi–automated programs called bots) by users who, at least in theory, know both the source and target language.
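As an illustration of the interwiki syntax described above, the following Python sketch extracts links of the form [[language code:Title]] from a fragment of wikitext. The regular expression and the function name are assumptions of this example: it simply treats any short lowercase prefix as a language code, which is a simplification of what the Wikipedia server actually does.

```python
import re
from typing import Dict

# Assumption: a language code is two or three lowercase letters, optionally
# followed by hyphenated suffixes (for example "zh-yue"). Prefixes such as
# "Category:" start with a capital letter and are therefore not matched.
INTERWIKI_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def extract_interwiki_links(wikitext: str) -> Dict[str, str]:
    """Map each language code found in the wikitext to the linked title."""
    return {code: title.strip() for code, title in INTERWIKI_RE.findall(wikitext)}


sample = "Some article text.\n[[Category:Geography]]\n[[ar:فلسطين]]\n[[en:Palestine]]\n"
links = extract_interwiki_links(sample)
print(links)
```

In this sample the category link is ignored, and the two interwiki links are returned keyed by their language code.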

The page “Wikipedia” on Wikipedia reports that “translated articles represent only a small portion of articles in most editions”, “in part because automated translation of articles is disallowed.” While this claim should be empirically validated, it is surely interesting that one policy warns against automated translations of articles and hence it is expected that each article, in each language edition of Wikipedia, is written by a human who knows, at least partially, the language. There are many examples of article topics that are present in many different language communities: for example, according to the interwiki links present in the English Wikipedia article, the page about “Osama Bin Laden” is present in 98 language Wikipedias and the page about “George W. Bush” in 134 Wikipedias.

An excellent analysis of diversity of knowledge represented in 25 different Wikipedias is presented in Hecht and Gergle (2010) and a surprisingly small amount of concept overlap is found between languages of Wikipedia, as over 74 percent of concepts are described in only one language and only 0.12 percent of them are described in all the 25 investigated language Wikipedias [20]. Moreover it has been found that each language Wikipedia exhibits a self–focus bias towards articles about regions where that language is largely spoken [21].

So the empirical question is: On articles that are present in different language Wikipedias and given also the fact that automatic translation of articles is discouraged, do different language communities develop very diverse versions of equivalent articles?

Actually the page “Wikipedia:Neutral point of view/FAQ” at the section “Anglo–American focus” states that

“Wikipedia seems to have an Anglo–American focus. Is this contrary to the neutral point of view? Yes, it is, especially when dealing with articles that require an international perspective. The presence of articles written from a United States or European Anglophone perspective is simply a reflection of the fact that there are many U.S. and European Anglophone people working on the project. This is an ongoing problem that should be corrected by active collaboration between Anglo–Americans and people from other countries. But rather than introducing their own cultural bias, they should seek to improve articles by removing any examples of cultural bias that they encounter, or making readers aware of them.”

And then “this is not only a problem in the English Wikipedia. The French Language Wikipedia may reflect a French bias, the Japanese Wikipedia may reflect a Japanese bias, and so on.” This is acknowledged also by Rosenzweig [22] when he states “but the largest bias — at least in the English–language version — favors Western culture (and English–speaking nations), rather than geek or popular culture.”

We call this phenomenon “Linguistic Point of View” (LPOV). The presence of diverse points of view on different language editions of Wikipedia would disprove the “global consensus hypothesis” which posits that “two articles about the same concept in two different languages will describe that concept roughly identically” [23].

On the page “Wikipedia:Describing points of view”, it is clearly written that “English language Wikipedia articles should be written for an international audience”. Two questions can arise from this aim. The first one is whether this is really what is happening in the English Wikipedia and in the other Wikipedias: Are they written for an international audience or do they reflect a specific Linguistic Point of View? The second question is whether this aim is good for our world or not: Are we going towards a globalized knowledge losing specificities and traditions, or do we risk moving towards a fragmentation of society into language–specific communities?

This paper focuses on the first question, in order to provide a tool which makes it easier to assess the current situation. Speculations and arguments about which path is better for the world can start as a natural consequence of an informed debate on the current situation.

Note that a few studies comparing different language Wikipedias have started to emerge. For example, Pfeil, et al. [24] compared French, German, Japanese and Dutch Wikipedia while Hara, et al. [25] analyzed English, Hebrew, Japanese, and Malay. Arabic, English, and Korean Wikipedias were compared by Stvilia, et al. [26]. These analyses were performed with manual content analysis of some article pages from different Wikipedias. Liao [27] focused on the Chinese Wikipedia, comparing regional differences of its contributors based on four regions of origin (Mainland, Hong Kong/Macau, Taiwan, and Singapore/Malaysia). Liao found that the main issue threatening the potential growth of the Chinese Wikipedia was not internal conflicts, nor external competition but the evolution of the newly established “Avoid Region–Centric Policy”. Nemoto and Gloor [28] compared English, German, Japanese, Korean, and Finnish Wikipedias, finding differences between egalitarian cultures, such as the Finnish, and quite hierarchical ones, such as the Japanese.

A specific analysis of the comparative cultural biases present in articles about famous persons in the English and Polish versions of Wikipedia was presented by Callahan and Herring [29]. Quantitative and qualitative content analysis revealed systematic differences related to the different cultures, histories, and values of Poland and the United States.

These selected studies, involving manual analysis of Wikipedia content, required knowledge of all of the involved languages in order to compare the products created by different language communities.

Hecht and Gergle published research that is closer to ours in scope [30]. In Hecht and Gergle (2010) [31], they found a surprisingly small overlap in concepts in different language Wikipedias. Moreover, when the same concept existed in two different editions of Wikipedia, they found that the overlap of sub–concepts, measured as overlap in links to other Wikipedia pages, was lower than expected. In Hecht and Gergle (2009) [32], each language edition of Wikipedia was characterized for its level of self–focus bias, operationalized as the number of links directed at articles located in the region of the world where that language is largely spoken. In both cases, their focus was at the level of characterizing the entire Wikipedia and the main considered element was the number of links to other pages and not the text on the page itself. On the other hand, our work aims at providing a tool for pairwise comparison at the level of single pages so that humans can investigate the presence (or absence) of different Linguistic Points of View and possibly improve, correct and discuss them.

In the next section we present Manypedia, a Web tool that, exploiting automated machine translation, aims at lowering the bar for cross–cultural studies and research on different language Wikipedia communities.

 

++++++++++

4. Manypedia

Through Manypedia it is possible to compare Linguistic Points of View of different communities of language editions of Wikipedia. Manypedia is accessible at http://www.manypedia.com. Precisely, Manypedia can be used to search for a page title in a specific Wikipedia, for example the English one (left side of Figure 1), and to compare it with the equivalent page from another Wikipedia, for example the Chinese one, but automatically translated into English (right side of Figure 1). In this way, even if automatic translation, powered by the Google Translate online service, is not perfect, the requirement of knowing the two languages for cross–language studies is removed. We believe that being able to “understand” the result of hundreds of edits by Wikipedians who edited a certain page in, for example, Chinese (without knowing Chinese) using a single pairwise Web interface is a great opportunity for cross–cultural studies. Every link in Wikipedia articles is transformed into a new comparison so that navigation can conveniently continue inside Manypedia. Currently 56 languages are supported in translation both as source and target language, ranging from English, Spanish, German to Yiddish, Tagalog, Catalan, Swahili and more.
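Finding the equivalent page in another language edition is the first step of any such comparison. This paper does not detail how Manypedia performs the lookup, so the following Python sketch is only one possible approach: it builds a request URL for the public MediaWiki web API, whose "langlinks" property returns the interlanguage links of a page. The function name is hypothetical and no network request is actually made.

```python
from urllib.parse import urlencode


def langlinks_query_url(source_lang: str, title: str, target_lang: str) -> str:
    """Build a MediaWiki API URL asking for the target-language equivalent of a page."""
    params = {
        "action": "query",
        "prop": "langlinks",   # interlanguage links of the page
        "titles": title,
        "lllang": target_lang,  # restrict results to one target language
        "format": "json",
    }
    return f"https://{source_lang}.wikipedia.org/w/api.php?{urlencode(params)}"


url = langlinks_query_url("en", "Palestinian territories", "ar")
print(url)
```

Fetching this URL and reading the returned title would give the page to load, and then translate, on the other side of the comparison.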

 

Figure 1: Manypedia comparison of “Palestinian territories” page on English and Arabic Wikipedia (http://www.manypedia.com).

 

In Figure 1, there is a screenshot of Manypedia comparing the page “Palestinian territories” from English Wikipedia (left) with the equivalent page, titled “فلسطين المحتلة”, in the Arabic Wikipedia (translated into English).

On top of the embedded Wikipedia pages (both left and right sides), Manypedia shows information that can help in forming an idea about the differences between the knowledge products created by two different language communities. First of all, Manypedia finds the images included in the two Wikipedia articles and displays them at the top of the page in order to give a first visual understanding of the points of view represented. A word cloud of the most frequent words is also presented in order to quickly spot the main textual differences between the two pages. Statistics about the pages are requested at runtime via Ajax from PHP scripts running on toolserver.org, where a copy of the Wikipedia databases is available. The statistics comprise the total number of edits received, useful for comparing the attention received by a page from several language communities, while taking into account that the English Wikipedia community is much larger than, for example, the Japanese Wikipedia community, which in turn is much larger than the Swahili one. The number of different editors who contributed to the page is shown as well. In general, noting few edits by one or two editors could warn a Manypedia visitor about the possibility that the article does not reflect an at least partially shared vision, but only the points of view of a few involved editors. On the other hand, if the page has received a large number of edits by a large number of editors, it is more plausible to assume that the current page is the up–to–date neutral result of the negotiation of all the significant viewpoints about the issue shared by the specific language Wikipedia community. Creation date and creator are shown as well, as evidence of how long a given document has existed and therefore been able to attract attention and diverse points of view. The date of the last edit allows us to ponder how much recent attention a given page has received from the community. Moreover, signs of vandalism or very biased points of view can be more easily found on pages edited very recently [33].
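As an illustration of the word cloud idea, the following minimal Python sketch counts the most frequent words in a page's text. It is not the code used by Manypedia: the function name top_words, the tiny stop–word list and the minimum word length are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real tool would use a much fuller list per language.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "by", "as", "for", "on", "it"}


def top_words(text, n=5):
    """Return the n most frequent words of text as (word, count) pairs.

    Words are lower-cased runs of letters; stop words and words of
    fewer than three letters are ignored.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)
```

Running the same function on the translated text of both pages and placing the two lists side by side gives a rough textual counterpart of the two word clouds.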

On top of the two pages, Manypedia also shows the five Wikipedians who edited a given page the most, along with the number of edits they contributed to the page and a link opening additional statistical data about them. This information is useful in order to understand whether there is one single user “owning” the page: Wikipedia policy clearly states that “you do not own articles”. Moreover, it is possible to get an idea of the relative influence exercised by the top editors of a page by comparing their edits with the total number of edits: again, a small percentage might indicate a more shared and neutral point of view. Even more interestingly, there might be cases in which the same Wikipedian is one of the most active editors of both articles in the two different language Wikipedias. All the statistics shown on top of article pages go in the direction of improving the transparency of Wikipedia pages by highlighting some important, but not so visible, aspects of the process involved in the creation and maintenance of a page by the community. This is similar to what the project Wikidashboard does with the goal of increasing social transparency [34].
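The comparison between the edits of the top editors and the total number of edits can be sketched as follows. This is only an illustrative Python sketch: the function name top_editor_shares and the input format (a dictionary mapping editor names to edit counts) are assumptions, not the interface of the scripts running on toolserver.org.

```python
def top_editor_shares(edit_counts, top_n=5):
    """Return (editor, edits, share) for the top_n most active editors.

    edit_counts maps an editor name to the number of edits that editor
    made to the page (a hypothetical input format). share is the
    fraction of all edits to the page made by that editor.
    """
    total = sum(edit_counts.values())
    if total == 0:
        return []
    # Sort by descending edit count; break ties alphabetically.
    ranked = sorted(edit_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, count, count / total) for name, count in ranked[:top_n]]
```

A page where the first editor's share is close to 1 would be a candidate for the “ownership” situation described above, while uniformly small shares suggest a more widely negotiated text.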

An additional automatic instrument for comparing the two pages is the concept similarity percentage. This is computed at runtime based on the sub–concept diversity index introduced in Hecht and Gergle (2010) [35]. The concept similarity is computed based on outlinks, or links in one Wikipedia article pointing to another article. The intuition is that “if two articles on the same concept in two languages define the concept in a nearly identical fashion, they should link to articles on nearly all the same concepts. If, on the other hand, there is great sub-concept diversity, these articles would link to very few articles about the same concepts” [36]. The measure is not meaningful for every comparison because many factors are involved in the differences in links to pages, such as cultural differences but also differences in linking behaviours (a page might refer to a concept without linking to it while the other one links to it) [37]. Work is ongoing with the aim of adding further comparisons at the level of the meaning of each sentence.
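The intuition behind an outlink–based similarity can be sketched in a few lines of Python. The sketch below uses a plain Jaccard overlap of the two sets of outlinks as a stand–in; it does not reproduce the exact formula of the sub–concept diversity index, and it assumes that the outlinks of both pages have already been mapped to shared, language–independent concept identifiers (for example through interlanguage links). The function name concept_similarity is hypothetical.

```python
def concept_similarity(outlinks_a, outlinks_b):
    """Return the overlap of two outlink collections as a percentage.

    Both arguments are iterables of concept identifiers that are
    assumed to be comparable across languages. The result is the
    Jaccard index (shared concepts / all concepts) times 100, used
    here as a stand-in for the measure described in the text.
    """
    a, b = set(outlinks_a), set(outlinks_b)
    union = a | b
    if not union:
        return 0.0  # neither page links to anything
    return 100.0 * len(a & b) / len(union)
```

For example, a page linking to four concepts and a page linking to three concepts, two of which are shared, cover five distinct concepts in total and obtain a similarity of 40 percent under this stand–in measure.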

Since Wikipedia articles are released under Creative Commons Attribution Share Alike License, anyone, including Manypedia, is allowed to copy, distribute, transmit and also remix the content as long as he or she attributes it to the authors and copyright holders: Manypedia does so by giving credit to the specific Wikipedia articles incorporated in each comparison specifying that the source is Wikipedia and linking to the specific article. As a consequence, the content of Manypedia is released under a Creative Commons Attribution–Share Alike License as well so that anyone, including researchers, can copy, redistribute and remix the content simply by citing Manypedia as a source.

The code powering Manypedia and the scripts running on toolserver.org, which extract statistics at runtime for each page and user, have been released as open source so that other researchers can build on them. They are available at https://github.com/volpino/.

 

++++++++++

5. Possible uses of Manypedia and future work

In this section we briefly highlight possible foreseen uses of Manypedia. We are not experts in cross–cultural studies, and carefully conducted investigations in this realm, about similarities and dissimilarities in how different communities represent the same concept, go beyond the scope of this paper and are left to future work.

The Manypedia interface provides (on the top right, see Figure 1) a list of featured comparisons, selected by hand by the authors, as well as a list of the latest comparisons performed by Manypedia users. It also provides access to the popular comparisons of the last 20 days in order to highlight recently investigated topics. These lists provide interesting starting points for cross–cultural investigations, considering that each link present in Wikipedia is transformed into a comparison inside Manypedia as well. We also plan to offer an additional list of comparisons recently performed by users in which the concept similarity percentage is smaller than, say, 10 percent while a large number of links is present in both pages.

An interesting starting point for investigation is the page “List of controversial articles”. This page, just like every Wikipedia page, can be analyzed using Manypedia. For example, the URL http://www.manypedia.com/#|en|List_of_controversial_articles|zh compares the page “List of controversial articles” from the English Wikipedia (en) with its counterpart in the Chinese Wikipedia (zh), translated into English [38].
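The structure of such a comparison URL can be sketched with a small Python helper. The pattern (source language, page title with underscores, target language, separated by “|” after the “#”) is inferred only from the single example URL given in the text; the function name comparison_url is hypothetical, and the handling of titles with non–ASCII or special characters is not addressed.

```python
def comparison_url(source_lang, title, target_lang):
    """Build a Manypedia comparison URL following the example in the text.

    Assumption: the fragment has the form #|<source>|<Title_with_underscores>|<target>,
    as in http://www.manypedia.com/#|en|List_of_controversial_articles|zh.
    """
    page = title.replace(" ", "_")
    return "http://www.manypedia.com/#|{}|{}|{}".format(source_lang, page, target_lang)
```

Called with "en", "List of controversial articles" and "zh", the helper returns the URL quoted above.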

It is possible to observe that the page from English Wikipedia (which groups the many controversial articles into 15 main classes such as Politics/Economics, History, Religion, Science/Biology/Health, Sexuality, Sports, Entertainment, Environment, Law and Order, Linguistics, Philosophy, Psychiatry, Technology, Media/culture, People/Public figures/Infamous persons) is slightly centered around topics important for U.S. and Western culture. On the other hand, the Chinese Wikipedia page lists pages such as “Anti–Japanese War”; “Nanjing Massacre”; “Taiwan”; “Human Rights in China”; “Falun Gong”; “Tiananmen Incident”; “Mao Zedong”; and, “List of sites blocked by China”. Many of the links contained in both pages are likely to be interesting starting points for cross–cultural comparisons. The same pattern is visible for most language communities; for example, the “List of Controversial articles” in the Catalan Wikipedia refers predominantly to issues about the terms “country” and “region” and the concept of the Catalan country itself.

Another interesting example is the page “Human rights in the United States” whose Chinese counterpart starts with “Most Americans think the U.S. is a free country” and then “U.S. double standards on human rights is hypocritical”.

In general, all topics related to recent history can be biased, especially if two or more warring nations are involved. We have already mentioned the article in which George Orwell, referring to the then ongoing World War II, argues that “a Nazi and a non–Nazi version of the present war would have no resemblance to one another, and which of them finally gets into the history books will be decided not by evidential methods but on the battlefield” and in which he reminds us that “history is written by the winners” [39].

Surely the interesting part of Wikipedia is that it can be edited and “fixed” in real–time. As Rosenzweig notes,

“like journalism, Wikipedia offers a first draft of history, but unlike journalism’s draft, that history is subject to continuous revision. Wikipedia’s ease of revision not only makes it more up–to–date than a traditional encyclopedia, it also gives it (like the Web itself) a self–healing quality since defects that are criticized can be quickly remedied and alternative perspectives can be instantly added.”

In fact recent work on the formation of collective memories of recent events exploits this feature of Wikipedia and the community strives to fairly represent recent events as they unfold [40]. This is especially interesting in the case of traumatic events such as, for example, the recent North African revolutions [41]. With regard to history, Manypedia offers for example a tool for comparing the different representations of the “Vietnam war” between the English Wikipedia and the Vietnamese one, or to get an understanding of the reception of “Abu Ghraib torture and prisoner abuse” by different language communities in Wikipedia.

Ongoing struggles over disputed states might also be represented in diverse ways, especially by the language communities more closely involved in specific issues. We have already mentioned Catalonia in the Catalan Wikipedia, but similar arguments can be made for Galicia in the Galician Wikipedia and for Taiwan and Tibet in the Chinese Wikipedia. Northern Cyprus is an especially interesting comparison where the Greek Wikipedia reports it “is under Turkish occupation since 1974 in violation of international legal norms” while the Turkish Wikipedia states it “is an independent state”. Editors of the English Wikipedia are possibly less involved and more neutral and claim Northern Cyprus “is a de facto independent state (...). Tensions between the Greek Cypriot and Turkish Cypriot populations culminated in 1974 with a coup d’état, an attempt to annex the island to Greece and a military invasion by Turkey in response. (...) Northern Cyprus has received diplomatic recognition only from Turkey.”

A paradigmatic example in this regard is the ongoing conflict between Israelis and Palestinians, which can be analyzed in terms of Linguistic Points of View on pages such as “Palestine”, “Israel”, “Israeli–Palestinian conflict” and dozens of other pages in the “Category:Israeli–Palestinian conflict” by comparing, for example, the Arabic and Hebrew Wikipedia representations. The page “Jerusalem” is a related example which is possibly even more controversial since it also involves issues related to religion. Religion is surely a topic which exhibits little neutrality: examine, for example, “Crusades”, “Islamofascism”, and “Polygamy”.

Moreover some knowledge areas might be more or less treated in relative terms by different language communities and hence reveal imbalanced coverage. For example, the English Wikipedia has an impressive coverage of topics related to sexuality both with regard to extreme practices and sexual orientation. Thanks to Manypedia, it is possible to check if other Wikipedias, such as the Arabic or Japanese ones, exhibit different coverage in relative terms and by number of edits and editors involved.

The feature of grouping all images of a Wikipedia page at the top can be particularly useful in this regard because it makes it easier to spot how many and, more importantly, which images are used to represent a specific concept. This can be done on generic pages such as “1970 year” or “Black people” and also on sex–related pages. Just as an intriguing example of this, we report that in 2010, Larry Sanger, cofounder of Wikipedia in 2001, reported the Wikimedia Foundation to the FBI for “knowingly distributing child pornography”. The suspect material consisted of 27 images in the “Pedophilia” and “Lolicon” categories on Wikimedia Commons [42]. This suggests that each language community probably faces a tension between representation and self–censorship of sensitive topics and must reach its own consensus on it.

The last examples we report here are the pages “Recent deaths” and “Portal:current events”. By comparing them across different language Wikipedias it is possible to quickly appreciate which deaths are considered significant by editors of a specific language site and which events are important enough to be reported in the portal. Is it possible to write about these pages with an international audience in mind? Is it reasonable to ask different language communities of Wikipedia to do it? We will address these questions and the broad implications of Manypedia as a tool for investigating Linguistic Points of View in the next section.

In this section, we reported a few examples of comparisons which can act as starting points for cross–cultural investigations made possible by Manypedia. Our future work involves developing automatic ways for highlighting differences at the sentence level and conducting case studies with cross–cultural researchers in order to empirically validate the utility of Manypedia.

 

++++++++++

6. Conclusion

In this paper we introduced Manypedia, a Web mashup which allows comparison of the same page on two different language Wikipedias. Manypedia exploits automatic machine translation and hence does not require knowledge of another language for the comparison. Moreover, the summary provided through images, most frequent words and statistics about the creation process of a Wikipedia page complements an investigation of the differences in representations of the same concept by different language Wikipedia communities.

As Wikipedia itself recognizes, there are systemic biases in its process which naturally arise from individuals writing and editing millions of articles. However, in this paper, we are not interested in biases within a specific Wikipedia but in differences between the representations of different Wikipedias: are there Linguistic Points of View in different language editions of Wikipedia?

As we have seen, Manypedia is a tool which allows a researcher to answer this question and makes it easier to conduct cross–cultural studies in Wikipedia. Moreover, Manypedia can be used to maintain balanced, coherent and convergent points of view across different language Wikipedias since the current Wikipedia socio–technical platform does not provide abundant opportunities for editors of different language Wikipedias to discuss and share points of view.

As a consequence of Manypedia providing a measure of the magnitude of Linguistic Points of View, it is possible to enter into more philosophical questions in order to speculate on the possibility of creating an internationally neutral point of view in every language Wikipedia.

Do we risk moving towards a globalized knowledge, losing the specificities and traditions of local cultures, or do we accept complete fragmentation of world society into language specific communities? In the first case, there is a tyranny of the majority in which minority views and diversity are not represented and only a few major points of view survive [43]. On the other hand, there are the so–called echo chambers [44]. These communities, identified by the language they speak or by the founding point of view they chose (as in the examples of Ecured, Anarchopedia and Conservapedia), develop their own representations of the world. These representations become more biased over time and diverge in such a way that fragmentation of society is reached, as described by Sunstein [45].

Which extreme shall Wikipedia encourage, tyranny of the majority or a multitude of echo chambers? What is the best balance? Our aim with Manypedia is to help start a global informed debate about these important issues.

 

About the authors

Paolo Massa is a researcher at Bruno Kessler Foundation, Trento, Italy.
Web: http://www.gnuband.org/
E–mail: massa [at] fbk [dot] eu

Federico Scrinzi, Bruno Kessler Foundation, Trento, Italy
Web: http://volpino.github.com/
E–mail: fscrinzi [at] fbk [dot] eu

 

Acknowledgements

Manypedia would not have been possible without Wikipedia, its community of volunteers and the Creative Commons license which encourages remix of content. We also thank Google for providing a real–time language translation API and the developers of the open source software that we used for developing Manypedia, in particular the Javascript library jQuery and its dozens of plugins.

 

Notes

1. Pew Research Center’s Internet & American Life Project, 2011. “Wikipedia, past and present” (13 January), at http://pewinternet.org/Reports/2011/Wikipedia.aspx, accessed 4 August 2011.

2. A.J. Head and M.B Eisenberg, 2010. “How today’s college students use Wikipedia for course–related research,” First Monday, volume 15, number 3, at http://firstmonday.org/article/view/2830/2476, accessed 1 January 2013.

3. http://www.alexa.com/siteinfo/wikipedia.org, accessed 1 January 2013.

4. The data reported in this paper are taken from the Wikipedia site as they appeared on 29 March 2011.

5. J. Giles, 2005. “Internet encyclopaedias go head to head,” Nature, volume 438, number 7070, pp. 900–901.

6. Since this article deals with Wikipedia pages, it cites many of them. In order not to clutter the paper with too many footnotes or citations, we simply report the title of the Wikipedia page in quotation marks, such as “Page title”. Wikipedia pages were accessed on 29 November 2011 and to get a version of any “Page title” at that date using the history feature of Wikipedia, the reader can visit http://en.wikipedia.org/w/index.php?action=history&limit=1&offset=20111129000000&title=Page_title where the offset parameter indicates the retrieval date.

7. R. Rosenzweig, 2006. “Can history be open source? Wikipedia and the future of the past,” Journal of American History volume 93, number 1, pp. 117–146.

8. T. Nagel, 1986. The view from nowhere. New York: Oxford University Press.

9. Rosenzweig, op. cit.

10. A. Kittur, B. Suh, B.A. Pendleton, and E.H. Chi, 2007. “He says, she says: Conflict and coordination in Wikipedia,” CHI ’07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–462; F.B. Viégas, M. Wattenberg, and K. Dave, 2004. “Studying cooperation and conflict between authors with history flow visualizations,” CHI ’04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 575–582.

11. R. Rosenzweig, 2006. “Can history be open source? Wikipedia and the future of the past,” Journal of American History volume 93, number 1, pp. 117–146.

12. G. Orwell, 1968. As I please 1943–1945. Collected essays, journalism and letters of George Orwell, volume 3. London: Secker & Warburg.

13. R. Chandy, 2008. “Wikiganda: Identifying propaganda through text analysis,” Caltech Undergraduate Research Journal, volume 9, number 1, at http://curj.caltech.edu/issues/articles/9-1/CURJ%20v9n1%20Wikiganda.pdf, accessed 1 January 2013.

14. BBC, 2010. “Cuba launches online encyclopaedia similar to Wikipedia” (14 December), at http://www.bbc.co.uk/news/world-latin-america-11989296, accessed 1 January 2013.

15. Op. cit.

16. Rosenzweig, op. cit.

17. Op. cit.

18. L.M. Sanger, 2009. “The fate of expertise after Wikipedia,” Episteme, volume 6, number 1, pp. 52–73, at http://www.euppublishing.com/doi/abs/10.3366/E1742360008000543, accessed 1 January 2013.

19. R. Rosenzweig, 2006. “Can history be open source? Wikipedia and the future of the past,” Journal of American History volume 93, number 1, pp. 117–146.

20. B. Hecht and D. Gergle, 2010. “The Tower of Babel meets Web 2.0: User–generated content and its applications in a multilingual context,” CHI ’10: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 291–300.

21. B. Hecht and D. Gergle, 2009. “Measuring self–focus bias in community–maintained knowledge repositories,” C&T ’09: Proceedings of the Fourth International Conference on Communities and Technologies, pp. 11–20.

22. Rosenzweig, op. cit.

23. Hecht and Gergle, 2010, op. cit.

24. U. Pfeil, P. Zaphiris, and C.S. Ang, 2006. “Cultural differences in collaborative authoring of Wikipedia,” Journal of Computer–Mediated Communication, volume 12, number 1, pp. 88–113, and at http://jcmc.indiana.edu/vol12/issue1/pfeil.html, accessed 1 January 2013.

25. N. Hara, P. Shachaf, and K.F. Hew, 2010. “Cross–cultural analysis of the Wikipedia community,” Journal of the American Society for Information Science and Technology, volume 61, number 10, pp. 2,097–2,108.

26. B. Stvilia, A. Al–Faraj, and Y. Yi, 2009. “Issues of cross–contextual information quality evaluation — The case of Arabic, English, and Korean Wikipedias,” Library & Information Science Research, volume 31, number 4, pp. 232–239. http://dx.doi.org/10.1016/j.lisr.2009.07.005

27. H.–T. Liao, 2009. “Conflictual consensus in the Chinese version of Wikipedia,” IEEE Technology and Society Magazine, volume 28, number 2, pp. 49–56.

28. K. Nemoto and P. Gloor, 2011. “Analyzing cultural differences in collaborative innovation networks by analyzing editing behavior in different–language Wikipedias,” Procedia — Social and Behavioral Sciences, volume 26, pp. 180–190.

29. E.S. Callahan and S.C. Herring, 2011. “Cultural bias in Wikipedia articles about famous persons,” Journal of the American Society for Information Science and Technology, volume 62, number 10, pp. 1,899–1,915.

30. Hecht and Gergle, 2010; 2009, op. cit.

31. B. Hecht and D. Gergle, 2010. “The Tower of Babel meets Web 2.0: User–generated content and its applications in a multilingual context,” CHI ’10: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 291–300.

32. B. Hecht and D. Gergle, 2009. “Measuring self–focus bias in community–maintained knowledge repositories,” C&T ’09: Proceedings of the Fourth International Conference on Communities and Technologies, pp. 11–20.

33. A. Kittur, B. Suh, B.A. Pendleton, and E.H. Chi, 2007. “He says, she says: Conflict and coordination in Wikipedia,” CHI ’07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–462; F.B. Viégas, M. Wattenberg, and K. Dave, 2004. “Studying cooperation and conflict between authors with history flow visualizations,” CHI ’04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 575–582.

34. B. Suh, E.H. Chi, A. Kittur, and B.A. Pendleton, 2008. “Lifting the veil: Improving accountability and social transparency in Wikipedia with wikidashboard,” CHI ’08: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1,037–1,040.

35. B. Hecht and D. Gergle, 2010. “The Tower of Babel meets Web 2.0: User–generated content and its applications in a multilingual context,” CHI ’10: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 291–300.

36. Op. cit.

37. Hecht and Gergle, 2010, op. cit.

38. Since this article deals with Wikipedia pages, it cites many of them. In order not to clutter the paper with too many footnotes or citations, we simply report the title of the Wikipedia page in quotation marks, such as “Page title”. Wikipedia pages were accessed on 29 November 2011 and to get a version of any “Page title” at that date using the history feature of Wikipedia, the reader can visit http://en.wikipedia.org/w/index.php?action=history&limit=1&offset=20111129000000&title=Page_title where the offset parameter indicates the retrieval date.

39. G. Orwell, 1968. As I please 1943–1945. Collected essays, journalism and letters of George Orwell, volume 3. London: Secker & Warburg.

40. M. Ferron and P. Massa, 2011. “Studying collective memories in Wikipedia,” Third Digital Memories Conference (Prague), at http://www.inter-disciplinary.net/wp-content/uploads/2011/02/ferrondmpaper.pdf, accessed 1 January 2013.

41. M. Ferron and P. Massa, 2011. “Collective memory building in Wikipedia: The case of North African revolutions,” Wikisym 2011, at http://www.slideshare.net/phauly/collective-memory-building-in-wikipedia-the-case-of-north-african-uprisings, accessed 1 January 2013.

42. C. Metz, 2010. “Wikifounder reports Wikiparent to FBI over ‘child porn’,” Register (9 April), at http://www.theregister.co.uk/2010/04/09/sanger_reports_wikimedia_to_the_fbi/, accessed 4 April 2011.

43. P. Massa and P. Avesani, 2007. “Trust metrics on controversial users: Balancing between tyranny of the majority,” International Journal on Semantic Web and Information Systems, volume 3, number 1, pp. 39–64.

44. Op. cit.

45. C.R. Sunstein, 2001. Republic.com. Princeton, N.J.: Princeton University Press.

 


Editorial history

Received 27 January 2012; accepted 1 December 2012.


Creative Commons License
This work is licensed under a Creative Commons Attribution–NonCommercial–NoDerivs 3.0 Unported License.

Manypedia: Comparing language points of view of Wikipedia communities
by Paolo Massa and Federico Scrinzi
First Monday, Volume 18, Number 1 - 7 January 2013
http://journals.uic.edu/ojs/index.php/fm/article/view/3939/3382
doi:10.5210/fm.v18i1.3939




