Individual focus and knowledge contribution
First Monday


Before contributing new knowledge, individuals must attain requisite background knowledge or skills through schooling, training, practice, and experience. Given limited time, individuals often choose either to focus on few areas, where they build deep expertise, or to delve less deeply and distribute their attention and efforts across several areas. In this paper we measure the relationship between the narrowness of focus and the quality of contribution across a range of both traditional and recent knowledge sharing media, including scholarly articles, patents, Wikipedia, and online question and answer forums. Across all systems, we observe a small but significant positive correlation between focus and quality.






The Internet is enabling an unprecedented number and variety of individuals to contribute knowledge, by authoring content individually or collaboratively and by helping one another directly in online forums. Traditionally, knowledge generation was the purview of scholars and other professionals, who would acquire pertinent domain knowledge through time–consuming study or extensive experience. Their contributions were published in domain–specific peer–reviewed venues. Today, anyone can easily contribute knowledge online across many domains, by editing a Wikipedia article or answering a question in any given category on an online question and answer site. Hence it is especially important to understand the role of breadth in the quality of contribution in highly inclusive online settings.

Most individuals tend to build their expertise in a small range of fields; and experiencing success in one subject area may, through positive reinforcement and the ability to obtain more resources, result in additional focus on that subject. On the other hand, there are myriad successful individuals who dabble in many sciences, inventors who invent diverse gadgets, and Wikipedia editors who edit pages on seemingly disparate topics. One might argue that their versatility is a reflection of superior abilities that can yield greater opportunities for varied collaborations and cross–pollination of ideas. In this paper, we present the first comprehensive, large–scale analysis of this relationship between individual focus and performance across a broad range of both traditional and modern knowledge collections.

Previous research has aimed to quantify the benefit of interdisciplinarity among researchers at the group level. A study of scholarly articles in the U.K., for example, found that research articles whose coauthors are in different departments at the same university receive more citations than those authored in a single department, and those authored by individuals across different universities yield even more citations on average (Katz and Hicks, 1997). Multi–university collaborations that include a top–tier university were found to produce the highest–impact research articles (Jones, et al., 2008). It has also been demonstrated that scholarly work covering a range of fields, and patents generated by larger teams of co–authors, tend to have greater impact over time (Wuchty, et al., 2007). Collaborations between experienced researchers who have not previously collaborated fare better than repeat collaborations (Guimerà, et al., 2005). In the area of nanotechnology, authors who have a diverse set of collaborators tend to write articles that have higher impact (Rafols and Meyer, 2010). Finally, diverse groups can, depending on the type of task, outperform individual experts or even groups of experts (Page, 2007).

All of this work is evidence of a benefit in bringing together diverse individuals. It does not demonstrate, however, whether diversity in research focus is beneficial at the individual level. One exception is a study of political forecasting that established that “foxes”, individuals who know many little things, tend to make better predictions about future outcomes than “hedgehogs” who focus on one big thing (Tetlock, 2005). Our work addresses knowledge contribution in a much broader context than forecasting and, more importantly, quantifies the relationship between individuals’ narrowness of focus and the corresponding quality of their contributions.




In order to cover a broad range of knowledge–generating activities, we study several collections of traditional scholarly publications in addition to recent, Web–based media collections. The traditional media we consider include patents and research articles. Our patent collection consists of 5.5 million patents filed with the U.S. Patent and Trademark Office (USPTO) between 1976 and 2006. We consider two sources of research articles: JSTOR and the American Physical Society (APS). The JSTOR corpus consists of two million articles from 1,108 journals in the natural sciences, social sciences, and humanities, and the APS data set we consider covers over 200,000 research articles in the single discipline of physics. We complement data from these traditional publication venues with data from two recent, online types of knowledge–sharing activity: Wikipedia and a collection of question–answering forums. Wikipedia is a collaborative online effort to document all of human knowledge in a systematic way in a popular Internet–based encyclopedia. The question and answer (Q&A) forums we study are Yahoo Answers (English) (Adamic, et al., 2008); Baidu Knows (Chinese) (Yang and Wei, 2009); and Naver Knowledge iN (Korean) (Nam, et al., 2009). On each of the sites, millions of questions are answered each year by individuals with a wide range of expertise.

Online knowledge–sharing activity includes not just those who specialize in knowledge generation and dissemination, i.e., professional researchers and scholars, but also others who gained their expertise through study and experience. In addition to providing data on different types of individuals, these data sets represent knowledge generation at different scales. Authoring a research article or patent in most cases involves weeks to years of research, culminating in a significant new result worthy of publication. In contrast, contributing a fact to Wikipedia or answering a question posed in an online forum may involve little more than a simple recall of previously attained knowledge — and a few minutes of the contributor’s time.

In evaluating focus across such a broad range of activities, we aimed to use a metric that captures three qualities: variety, or how many different areas an individual contributes to; balance, or how evenly their efforts are distributed among these areas; and similarity, or how related those areas are (Rafols and Meyer, 2010). We use the Stirling measure, which captures all three aspects (Stirling, 2007):

Equation 1

where p_i is the proportion of the individual’s contributions in category i and s_ij = n_ij/n_j is a measure of similarity between categories i and j, inferred from the number of joint contributors n_ij between the two categories. This metric assigns a narrower (higher) focus value to an individual who contributes to fewer, related areas than to someone who contributes in many unrelated areas.
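To make the measure concrete, the following Python sketch computes a focus value under the assumption that Equation 1 takes the similarity–weighted form F = Σ_ij s_ij p_i p_j, the counterpart of Stirling’s diversity index. The function name, the toy categories and similarity values, and the convention that a category is fully similar to itself are illustrative assumptions, not details taken from the analysis.

```python
from typing import Dict, Tuple


def focus(p: Dict[str, float], s: Dict[Tuple[str, str], float]) -> float:
    """Sum over all category pairs (i, j) of s[i, j] * p[i] * p[j].

    p maps each category to the share of a contributor's work in it.
    s maps ordered category pairs to a similarity in [0, 1]; pairs that
    are missing are treated as unrelated (similarity 0).
    """
    total = 0.0
    for i, p_i in p.items():
        for j, p_j in p.items():
            # Assumption: a category is fully similar to itself.
            s_ij = 1.0 if i == j else s.get((i, j), 0.0)
            total += s_ij * p_i * p_j
    return total


# Toy similarity values, invented for illustration only.
similarity = {("optics", "lasers"): 0.8, ("lasers", "optics"): 0.8}

specialist = {"optics": 1.0}
related = {"optics": 0.5, "lasers": 0.5}
unrelated = {"optics": 0.5, "poetry": 0.5}

for name, shares in [("specialist", specialist),
                     ("related", related),
                     ("unrelated", unrelated)]:
    print(name, round(focus(shares, similarity), 3))
```

Under these toy values the single–category contributor scores highest, the contributor split across two related categories scores slightly lower, and the contributor split across two unrelated categories scores lowest, which matches the ordering described in the text.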

The categories across which focus is measured differ by the type of knowledge–sharing medium. An inventor’s proportion p_c of contributions in subject c is proportional to the number of times the class c is assigned by the inventor or patent examiner to the inventor’s patents. Articles in the APS data set are classified according to the Physics and Astronomy Classification Scheme. For JSTOR articles, in the absence of a pre–defined category structure, we used unsupervised topic models on the full text of authors’ research articles (Blei, et al., 2003). Wikipedia articles are situated within Wikipedia’s category hierarchy, while answers provided in Q&A forums are sorted according to the hierarchy of categories of the corresponding questions.

For each data set, we sought a relevant, objective measure of quality of a contribution and evaluated it in the context of its peers. For research articles, we measured each article’s citation count relative to those of other articles in the same discipline and year (Valderas, et al., 2007). Likewise, patents’ citation counts were compared with those of other patents in the same patent classes and years. In doing this, we control for discipline–specific factors that can impact a publication’s citation count such as publication cycle length and number of publications in the discipline (Stringer, et al., 2008; Seglen, 1997). For Wikipedia contributions, we consider the percentage of words an author newly introduces to an article that survive subsequent revisions (Adler and de Alfaro, 2007). Finally, for Q&A forums, we rely on the asker’s rating of answers: a good contributor should have their answer selected as best more often than expected by chance.
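The peer–relative citation measure can be sketched as follows. This is a simplified illustration: the record layout and function name are hypothetical, and the baseline here is the mean citation count of all articles in the same area and year, including the article itself, which is an assumption made to keep single–article groups well defined.

```python
from collections import defaultdict
from statistics import mean


def relative_citations(articles):
    """Return {article id: citations / mean citations of its (area, year) group}.

    articles is a list of dicts with keys 'id', 'area', 'year', 'citations'.
    """
    groups = defaultdict(list)
    for a in articles:
        groups[(a["area"], a["year"])].append(a["citations"])

    scores = {}
    for a in articles:
        baseline = mean(groups[(a["area"], a["year"])])
        # Groups in which nobody is cited carry no signal; score them 0.
        scores[a["id"]] = a["citations"] / baseline if baseline > 0 else 0.0
    return scores


# Invented records for illustration only.
sample = [
    {"id": "p1", "area": "physics", "year": 2000, "citations": 10},
    {"id": "p2", "area": "physics", "year": 2000, "citations": 30},
    {"id": "h1", "area": "history", "year": 2000, "citations": 2},
    {"id": "h2", "area": "history", "year": 2000, "citations": 2},
]
scores = relative_citations(sample)
print(scores)
```

In this toy sample, a history article with two citations scores higher than a physics article with ten, because each is judged only against its own area and year.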


Figure 1: Mean quality as a function of focus of contribution.


Using these measures of focus and quality, we find that focus is weakly yet consistently positively correlated with quality across all types of knowledge contribution systems, as summarized in Table 1. The relationship between focus and quality is detailed further in Figure 1, which shows the variation in average quality for individuals grouped by their levels of focus. As focus increases, so does average quality; but the trend levels off or even reverses for extremely focused individuals. This is clarified by plotting average focus at a given level of quality (see Figure 2). While high–quality contributors are more narrowly focused than others on average, very poor contributors sometimes also dwell in a single area. In Q&A forums, we find further that narrowly focused users with poor track records of giving best answers tend to give answers that are significantly shorter than those of other users.


Figure 2: Mean focus as a function of quality of contribution.


We note that these data sets provide incomplete views of contributors’ activity. JSTOR archives over a thousand journals, but does not include many more. Inventors’ patents prior to 1976 are not captured in our data. Likewise, we parsed only a subset of the Wikipedia revision history; and, while our Yahoo Answers (YA) data set spanned the complete activity of a sample of users, our Baidu and Naver data sets covered only several months each.


Table 1: Pearson correlation between quantity, focus, and quality.
Note: All correlations are significant at p < 0.001.
| Type | ρ (log (quantity), focus) | ρ (log (quantity), quality) | ρ (focus, quality) |
| --- | --- | --- | --- |
| Research articles JSTOR | -0.055 | 0.058 | 0.112 |
| Research articles APS | 0.302 | 0.130 | 0.173 |
| Yahoo Answers | 0.150 | 0.116 | 0.084 |
| Baidu Knows | 0.095 | 0.083 | 0.111 |
| Naver KnowledgeIn | 0.066 | 0.102 | 0.169 |


Nevertheless, we believe our results to be robust. We expect that we would find equally strong or stronger correlation between focus and quality, if we had complete records of each individual’s contributions. An indication of this is that the correlations between focus and quality strengthen for individuals for whom we observe a higher number of contributions. For example, the correlation between focus and impact for inventors with 10–20 patents is just 0.079 ± 0.003, but for inventors with 50 to 100 patents it increases to 0.166 ± 0.014. Similarly, Wikipedia users editing between 10 and 20 pages display a correlation of 0.088 ± 0.065, while those editing between 50 and 100 pages display a correlation of 0.243 ± 0.038.
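The comparison across activity levels amounts to computing a Pearson correlation separately for contributors whose number of contributions falls in a given range. A minimal sketch follows; the record layout, the function names, and the inclusive treatment of the range bounds are assumptions, and the data are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def correlation_by_activity(records, lo, hi):
    """Correlate focus with quality for contributors with lo <= n <= hi.

    records is a list of (n_contributions, focus, quality) triples.
    """
    kept = [(f, q) for n, f, q in records if lo <= n <= hi]
    focus_vals, quality_vals = zip(*kept)
    return pearson(focus_vals, quality_vals)


# Invented records: (number of contributions, focus, quality).
records = [(12, 0.2, 1.0), (15, 0.4, 2.0), (18, 0.6, 3.0), (60, 0.9, 0.5)]
print(correlation_by_activity(records, 10, 20))
```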

We find our results to be robust in several other respects as well. Aside from being consistent across a wide range of media and performance metrics, our results hold when focus is measured at different levels of granularity, e.g., when using top–level patent, Wikipedia, and Q&A categories as opposed to subcategories in those data sets, and when we construct 250 as opposed to 100 topics in the JSTOR data set. While the distribution of focus shifts downward as we increase the granularity, the correlations between focus and other variables remain qualitatively similar.


Table 2: Pearson correlations between quality, focus, and quantity, when self–citations are removed.
Note: All correlations are significant at p < 0.001.
| Type | ρ (log (quantity), quality) | ρ (focus, quality) |
| --- | --- | --- |
| Research articles JSTOR | 0.051 | 0.095 |
| Research articles APS | 0.059 | 0.036 |


We also find our results to be consistent, though weaker for the paper and patent data sets, when self–citations, i.e., citations between two papers that share an author with the same last name, are removed. Table 2 summarizes these findings. Self–citations may inflate the impact of prolific and focused authors who have greater opportunity and justification to cite their own work. We also find consistent results using alternative focus measures such as Shannon entropy, $H = -\sum_i p_i \log p_i$ (Equation 2). Entropy captures the balance and variety of contribution, but not similarity, and is negatively correlated with focus. We find the results to be qualitatively consistent with those obtained using the Stirling measure. Table 3 and Figures 3 and 4 correspond to Table 1 and Figures 1 and 2 respectively, but present results using contributor entropy rather than focus.
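For reference, Shannon entropy over a contributor’s category shares can be computed as in the sketch below. The use of the natural logarithm is an assumption, since the base is not stated; changing the base only rescales the values.

```python
import math


def shannon_entropy(shares):
    """Entropy of a list of category shares that sum to 1 (natural log)."""
    return -sum(x * math.log(x) for x in shares if x > 0)


print(shannon_entropy([1.0]))              # a single category
print(shannon_entropy([0.5, 0.5]))         # two categories, evenly split
print(shannon_entropy([0.25] * 4))         # four categories, evenly split
```

Entropy is zero for a contributor active in a single category and grows as contributions spread over more categories more evenly, which is why it moves in the opposite direction to focus.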

One remaining concern is that focus and quality are both correlated with a third variable that holds greater explanatory power. One such potential variable is that of quantity. Quantity itself is positively correlated with quality, revealing a possible link between contributor success and motivation or resources. However, focus remains a significant factor in the quality of contributions, even once quantity is accounted for (see Table 4). We also note that quantity’s correlation with focus varies by medium studied, as shown in Table 1. The correlation is positive for patents and Q&A forums, but negative for research and Wikipedia articles. Individuals generating many patents or answers tend to focus their contributions more narrowly, but authors who write a greater number of research and Wikipedia articles tend to make broader contributions.


Figure 3: Mean quality as a function of entropy of contribution.



Table 3: Pearson correlations between focus, entropy, quantity, and quality.
Note: All correlations are significant at p < 0.001.
| Data set | ρ (entropy, focus) | ρ (log (quantity), entropy) | ρ (entropy, quality) |
| --- | --- | --- | --- |
| Research articles JSTOR | -0.606 | 0.082 | -0.218 |
| Research articles APS | -0.601 | 0.083 | -0.116 |
| Yahoo Answers | -0.878 | -0.081 | -0.121 |
| Baidu Knows | -0.918 | -0.034 | -0.108 |
| Naver KnowledgeIn | -0.946 | -0.175 | -0.223 |



Figure 4: Mean entropy as a function of quality of contribution.



Table 4: Regression models using quantity and focus as predictors of quality.
log (quantity)0.177***


Finally we examine whether individuals narrow or broaden their focus over time. For JSTOR research articles, patents, and Q&A forums, a majority of contributors narrow their focus over time (see Table 5). Wikipedia contributors and physicists, on the other hand, do not appear to specialize further. In addition to a change in focus, we also observe a slight change in quality. Across most data sets, contributors tend to improve in quality over time; exceptions include Baidu Knows, where the change in answer quality is not significant, and JSTOR, where there is a statistically significant decline in contribution quality. One might speculate that a researcher’s early success permits him or her to continue producing publications, but that the quality of those publications may fall due to factors such as moving from a primary contributor to a project management role.


Table 5: Change in focus from first half to second half of contributions.
| Data set | % who increased focus | av. change focus | av. change quality |
| --- | --- | --- | --- |
| Research articles JSTOR | 62.0% | 0.024 | -0.394 |
| Research articles APS | 44.0% | -0.012 | -0.157 |
| Wikipedia | 49.5% | not sig. | 0.066 |
| Yahoo Answers | 69.8% | 0.056 | 0.056 |
| Baidu Knows | 61.4% | 0.040 | not sig. |
| Naver KnowledgeIn | 68.8% | 0.023 | 0.034 |
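A first–half versus second–half comparison of this kind can be sketched as follows. The sketch assumes a similarity–weighted focus of the form Σ_ij s_ij p_i p_j with self–similarity 1; the function names, the handling of odd–length histories (the extra item goes to the second half), and the example history are all illustrative assumptions.

```python
from collections import Counter


def proportions(categories):
    """Share of contributions falling in each category."""
    counts = Counter(categories)
    n = len(categories)
    return {c: k / n for c, k in counts.items()}


def focus(p, s=None):
    """Similarity-weighted focus; unknown pairs count as unrelated."""
    s = s or {}
    return sum((1.0 if i == j else s.get((i, j), 0.0)) * pi * pj
               for i, pi in p.items() for j, pj in p.items())


def focus_change(categories_in_time_order, s=None):
    """Focus of the later half minus focus of the earlier half.

    Expects at least two contributions, listed oldest first.
    """
    half = len(categories_in_time_order) // 2
    first = categories_in_time_order[:half]
    second = categories_in_time_order[half:]
    return focus(proportions(second), s) - focus(proportions(first), s)


# Invented history: early work spread over four areas, later work in one.
history = ["a", "b", "c", "d", "a", "a", "a", "a"]
print(focus_change(history))
```

A positive value indicates a contributor who narrowed their focus over time; a negative value indicates one who broadened it.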





We have quantified the value of an individual’s focus in contributing knowledge through both traditional and online media and across a wide range of subjects. Consistently we observe a slight but significant correlation between an individual’s degree of focus and this individual’s quality of contribution. The relationship persists even when quantity of contributions is taken into account.

How should individuals invest their time? While our results do not demonstrate causality, the overall trend appears to favor those who do a few things and do them well. However, individuals who focus in a very narrow field tend to contribute work that is on average less well recognized than that of their slightly less focused peers.

This work immediately suggests several areas for future research. It would be useful to understand the benefit of narrowing one’s focus in the context of the specific domain and knowledge sharing medium, in addition to the quality and diversity of one’s prior efforts. One could also examine whether novel, groundbreaking contributions are made by more or less narrowly focused individuals, and whether editorial tasks, important in the context of online collaborative media such as Wikipedia, benefit from a breadth of expertise.

In addition, while these results have shed light on the value of focus in the context of the individual, they say nothing about focus in the context of a group. After all, several studies have demonstrated the value of interdisciplinary collaborations in the sciences, and we believe that large–scale online knowledge sharing systems such as those discussed are successful precisely because they bring together individuals of different backgrounds. This leaves open the question of whether collaborations between more individually focused — yet collectively diverse — individuals are more fruitful.



Materials and methods

Table 6 summarizes the data sets we used to study focus and contribution. For each data set we selected a threshold criterion for the minimum level of activity needed for an individual to be included.


Table 6: Description of data.
| Data set | Time span | No. individuals | Threshold no. contributions |
| --- | --- | --- | --- |
| JSTOR | 1668–2006 | 37,031 | 10 articles |
| APS | 1977–2006 | 22,351 | 10 articles |
| Patents | 1976–2006 | 604,113 | 10 patents |
| Wikipedia | 2001–2006 | 7,129 | 40 edits |
| Yahoo Answers | 08/05–03/09 | 5,256 | 40 answers |
| Baidu Knows | 12/07–05/08 | 65,854 | 40 answers |
| Naver KnowledgeIn | 12/08–02/09 | 5,918 | 40 answers |


Research articles. A snapshot of JSTOR data includes two million research articles with 6.6 million citations between them. JSTOR spans over a century of publications in the natural and social sciences, the arts, and humanities. For this data set, we needed to address name ambiguity. For example, there were 26,000 instances where a person with the last name of Smith authored an article and 728 unique combinations of initials appearing alongside “Smith”. Identifying two different individuals as being one and the same would tend to introduce data points with low focus and an inflated number of articles. Since both variables are related to quality, we sought to exclude such instances. We excluded authors meeting the condition in Equation 3, where FL is the number of first names or initials the author’s last name occurs with in the data set, and LF is the number of last names the author’s first name occurs with. We also collapsed matching names and initials if there was only one matching first name/initial pair and the last name occurred with fewer than 50 first names. This left us with 37,031 authors with 10 or more publications, for whom we were reasonably sure that they were uniquely identified.

Using latent Dirichlet allocation (Blei, et al., 2003), we generated 100 topics over the entire corpus of research articles. Each document was assigned a normalized score for each of the 100 topics, and the pairwise topic similarity matrix s was computed from cosines of vector values across documents. An author’s distribution across topics was computed by averaging the topic vectors of all of the articles they authored. For robustness, we repeated the analysis with 250 topics instead of 100, and found a quantitatively similar correlation between focus and quality, although focus scores were lower due to the finer granularity. The quality of an article is measured as the number of times the article is cited, divided by the number of times other articles in the same area and year are cited. Citations originate within the data set. By normalizing quality by area, we mitigate the possible biases introduced by some areas being better represented in the data set than others.
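The two steps that follow topic modeling, building a topic–by–topic similarity matrix and averaging an author’s article vectors, can be sketched in plain Python. The sketch takes the per–document topic scores as given (the topic model itself is not reproduced), and the function names and the three–topic toy data are assumptions for illustration.

```python
import math


def topic_similarity(doc_topics):
    """Cosine similarity between topics, each topic viewed as a vector
    of its scores across all documents.

    doc_topics is a list of per-document topic-score lists.
    """
    n_topics = len(doc_topics[0])
    cols = [[row[t] for row in doc_topics] for t in range(n_topics)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    sim = [[0.0] * n_topics for _ in range(n_topics)]
    for i in range(n_topics):
        for j in range(n_topics):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def author_distribution(article_topic_vectors):
    """Average the topic vectors of the articles an author wrote."""
    n = len(article_topic_vectors)
    return [sum(col) / n for col in zip(*article_topic_vectors)]


# Toy corpus: three documents scored on three topics (rows sum to 1).
docs = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0]]
sim = topic_similarity(docs)
author = author_distribution(docs[:2])   # an author of the first two documents
print(sim[0][1], author)
```

Topics that tend to appear in the same documents receive a similarity above zero, while topics that never co–occur receive zero.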

Our database of American Physical Society publications included Physical Review Letters, and Physical Review A–E journal articles. We excluded Reviews of Modern Physics as we were considering the impact of original research rather than review articles. The data set contained 396,134 articles published between 1893 and 2006, with 3,112,706 citations between them. For our purposes, we were limited to the 261,161 articles with PACS (Physics and Astronomy Classification Scheme) codes associated with articles published after 1977. The PACS hierarchy has five levels, and we performed our analysis at the level of the 76 main categories, such as 42 (Optics) and the 859 2nd level categories, e.g., 42.50 (Quantum Optics).

Patents. The patent data set contains all 5,529,055 patents filed between 1976 and 2006, in 468 top level categories. We construct a similarity matrix for the 468 categories, reflecting the frequency with which inventors in one category also file patents in another. There are 3,643,520 patents citing 2,382,334 others, for a total of 44,556,087 citations. We excluded inventors with Equation 4. This makes it unlikely that we would identify two separate individuals as being one. We measure an inventor’s impact according to a citation count normalized by the average number of citations for other patents in the same year and categories as those filed by the inventor.
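A similarity matrix of this kind can be derived from co–activity counts. The sketch below assumes the form s_ij = n_ij/n_j, with n_ij the number of contributors active in both categories and n_j taken to be the number of contributors active in category j; that reading of n_j, the function name, and the toy inventors are assumptions.

```python
from collections import defaultdict


def category_similarity(contributor_categories):
    """Return {(i, j): n_ij / n_j} for every pair of distinct categories
    that share at least one contributor.

    contributor_categories maps a contributor to the set of categories
    in which that contributor is active.
    """
    n = defaultdict(int)       # n[j]: contributors active in category j
    joint = defaultdict(int)   # joint[(i, j)]: contributors active in both
    for cats in contributor_categories.values():
        for i in cats:
            n[i] += 1
            for j in cats:
                if i != j:
                    joint[(i, j)] += 1
    return {(i, j): n_ij / n[j] for (i, j), n_ij in joint.items()}


# Invented inventors and categories for illustration.
inventors = {
    "ann": {"optics", "lasers"},
    "bob": {"optics"},
    "cy": {"optics", "lasers"},
    "di": {"textiles"},
}
s = category_similarity(inventors)
print(s)
```

Note that the resulting measure is not symmetric: every toy inventor working on lasers also works on optics, but only two of the three working on optics also work on lasers.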

Q&A forums. We obtained snapshots of activity on Q&A forums with uniquely identified users posting answers to questions in distinct categories. We perform our analysis at the subcategory level, which gives us enough resolution to differentiate the question topics, while supplying a sufficient number of observations in each subcategory. We use best answers as a proxy for answer quality. The best answer is selected by the user who posed the question. If this user does not select a best answer, it may be selected via a vote by other users. The quality metric we used was the γ score (Nam, et al., 2009), which compares the number of the user’s answers that were selected as best with the number expected by chance (Equation 5). The expected number of best answers is simply given by Equation 6, where a_k is the total number of users answering question k.
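The chance baseline can be illustrated with a short sketch. It assumes that, if the best answer were picked uniformly at random, a user answering question k would be chosen with probability 1/a_k, so that the expected number of best answers is the sum of 1/a_k over the questions the user answered. The function name and data layout are hypothetical, and the exact form in which the γ score combines the observed and expected counts is defined by Nam, et al. (2009) and is not reproduced here.

```python
def best_answer_counts(answers):
    """Return (observed, expected) numbers of best answers for one user.

    answers is a list of (was_best, n_answerers) pairs, one per question
    the user answered; n_answerers counts everyone who answered it.
    """
    observed = sum(1 for was_best, _ in answers if was_best)
    # Assumption: under random selection each answerer wins with
    # probability 1 / n_answerers.
    expected = sum(1.0 / n for _, n in answers)
    return observed, expected


# Invented record: three answers, two of them chosen as best.
user_answers = [(True, 2), (False, 4), (True, 4)]
observed, expected = best_answer_counts(user_answers)
print(observed, expected)
```

A user whose observed count exceeds the expected count is selected as best more often than chance would predict.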

Wikipedia. Our Wikipedia data set is a meta–history dump file of the English Wikipedia generated on 4 November 2006. The dump file has the entire revision history of about 1.5 million encyclopedia pages, of which we parsed 100,000, or about seven percent. In order to verify that our sample is unbiased with respect to topic distribution, we compare the category and subcategory distributions of our sample to those of a larger corpus of one million pages. The two distributions have a nearly perfect correlation (ρ > 0.96***).

Articles are a product of a varying number of revisions, from several to 10,000 for a single article. Revisions are contributed by either registered or anonymous users. Since anonymous users’ revision histories are non–traceable, we only consider registered users whose unique user names are associated with at least 40 revisions. We excluded Wikipedia administrators from our study because they may perform a primarily editorial role. In like manner, to better filter the noise of measuring the quality of words by the final version of articles, we only choose pages in which fewer than five percent of the revisions occurred in the 30 days prior to the data dump.

A Wikipedia contributor’s focus and entropy were calculated from the second–level categories of the pages they edit. Each Wikipedia article belongs to one or more categories. We truncated each hierarchical category to one of the roughly 500 second–level categories.

The quality of a contribution is measured in terms of w_new, the number of new words added by a user to Wikipedia articles, such that the words were not present in any previous revisions of those articles. We found a high correlation between the number of new words that survive five revisions and the number w_surv that survive to the last revision of the article (ρ > 0.97***), consistent with previous analyses of edit persistence (Panciera, et al., 2009). We therefore constructed a simple metric by taking the proportion of new words introduced by the user that are retained in the last version of a sufficiently frequently edited article: w_surv/w_new (Adler and de Alfaro, 2007).
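A much simplified version of this survival ratio is sketched below. It assumes whitespace tokenization and treats a word as surviving if it appears anywhere in the final revision, without tracking positions or authorship across intermediate edits; the function name and the example texts are hypothetical.

```python
def survival_ratio(previous_revisions, user_revision, final_revision):
    """Fraction of a user's newly introduced words present in the final text.

    previous_revisions is a list of earlier revision texts; the other two
    arguments are single texts. Returns None if the user added no new words.
    """
    seen = set()
    for text in previous_revisions:
        seen.update(text.split())

    # New words: those absent from every earlier revision.
    new_words = [w for w in user_revision.split() if w not in seen]
    if not new_words:
        return None

    final_words = set(final_revision.split())
    survived = [w for w in new_words if w in final_words]
    return len(survived) / len(new_words)


ratio = survival_ratio(
    ["the cat sat"],
    "the cat sat on a red mat",
    "the cat sat on a mat",
)
print(ratio)
```

Here the user introduces four new words, three of which remain in the final text.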


About the author

Lada A. Adamic is an assistant professor in the School of Information and the Center for the Study of Complex Systems at the University of Michigan.
E–mail: ladamic [at] umich [dot] edu

Xiao Wei is a masters student and research assistant, School of Information, University of Michigan.
E–mail: xiaowei [at] umich [dot] edu

Jiang Yang is a PhD candidate, School of Information, University of Michigan.
E–mail: yangjian [at] umich [dot] edu

Sean Gerrish is a PhD candidate, Department of Computer Science, Princeton University.
E–mail: sgerrish [at] cs [dot] princeton [dot] edu

Kevin K. Nam is a PhD candidate, School of Information, University of Michigan.
E–mail: ksnam [at] umich [dot] edu

Gavin S. Clarkson is an associate professor at the University of Houston Law Center and the Institute for Intellectual Property & Information Law.
E–mail: gclark [at] uh [dot] edu



We thank IBM for providing the patent data, and JSTOR, APS, and Katy Borner for providing the article citation data. We would also like to thank Michael McQuaid, Jure Leskovec, Scott Page and Eytan Adar for helpful comments. This research was supported by MURI award FA9550–08–1–0265 from the Air Force Office of Scientific Research and NSF award IIS 0855352.



Lada A. Adamic, Jun Zhang, Eytan Bakshy, and Mark S. Ackerman, 2008. “Knowledge sharing and Yahoo Answers: Everyone knows something,” WWW ’08: Proceedings of the 17th International Conference on World Wide Web, New York: ACM, pp. 665–674.

B. Thomas Adler and Luca de Alfaro, 2007. “A content–driven reputation system for the Wikipedia,” WWW ’07: Proceedings of the 16th International Conference on World Wide Web, New York: ACM, pp. 261–270.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan, 2003. “Latent Dirichlet allocation,” Journal of Machine Learning Research, volume 3, pp. 993–1,022.

Roger Guimerà, Brian Uzzi, Jarrett Spiro, and Luis A. Nunes Amaral, 2005. “Team assembly mechanisms determine collaboration network structure and team performance,” Science, volume 308, number 5722, pp. 697–702.

Benjamin F. Jones, Stefan Wuchty, and Brian Uzzi, 2008. “Multi–university research teams: Shifting impact, geography, and stratification in science,” Science, volume 322, number 5905, pp. 1,259–1,262.

J.S. Katz and Diana Hicks, 1997. “How much is a collaboration worth? A calibrated bibliometric model,” Scientometrics, volume 40, number 3, pp. 541–554.

Kevin Kyung Nam, Mark S. Ackerman, and Lada A. Adamic, 2009. “Questions in, knowledge in? A study of Naver’s question answering community,” CHI ’09: Proceedings of the 27th International Conference on Human Factors in Computing Systems, New York: ACM, pp. 779–788.

Scott E. Page, 2007. The difference: How the power of diversity creates better groups, firms, schools, and societies. Princeton, N.J.: Princeton University Press.

Katherine Panciera, Aaron Halfaker, and Loren Terveen, 2009. “Wikipedians are born, not made: A study of power editors on Wikipedia,” GROUP ’09: Proceedings of the ACM 2009 International Conference on Supporting Group Work, New York: ACM, pp. 51–60.

Ismael Rafols and Martin Meyer, 2010. “Diversity and network coherence as indicators of interdisciplinarity: Case studies in bionanoscience,” Scientometrics, volume 82, number 2, pp. 263–287.

Per O. Seglen, 1997. “Why the impact factor of journals should not be used for evaluating research,” British Medical Journal, volume 314, number 7079, p. 497.

Andy Stirling, 2007. “A general framework for analysing diversity in science, technology and society,” Interface: Journal of the Royal Society, volume 4, number 15, pp. 707–719, at, accessed 22 February 2010.

Michael J. Stringer, Marta Sales–Pardo, and Luís A. Nunes Amaral, 2008. “Effectiveness of journal ranking schemes as a tool for locating information,” PLoS ONE, volume 3, number 2, at, accessed 22 February 2010.

Philip E. Tetlock, 2005. Expert political judgment: How good is it? How can we know? Princeton, N.J.: Princeton University Press.

Jose M. Valderas, R. Alexander Bentley, Ralf Buckley, K. Brad Wray, Stefan Wuchty, Benjamin F. Jones, and Brian Uzzi, 2007. “Why do team–authored papers get cited more?” Science, volume 317, number 5844, pp. 1,496–1,498.

Stefan Wuchty, Benjamin F. Jones, and Brian Uzzi, 2007. “The increasing dominance of teams in production of knowledge,” Science, volume 316, number 5827, pp. 1,036–1,039.

Jiang Yang and Xiao Wei, 2009. “Seeking and offering expertise across categories: A sustainable mechanism works for Baidu Knows,” Proceedings of International AAAI Conference on Weblogs and Social Media, at ICWSM/09/paper/view/175, accessed 22 February 2010.


Editorial history

Paper received 9 February 2010; accepted 22 February 2010.

Commons License
“Individual focus and knowledge contribution” by Lada A. Adamic, Xiao Wei, Jiang Yang, Sean Gerrish, and Gavin S. Clarkson is licensed under a Creative Commons Attribution 3.0 United States License.

Individual focus and knowledge contribution
by Lada A. Adamic, Xiao Wei, Jiang Yang, Sean Gerrish, Kevin K. Nam, and Gavin S. Clarkson.
First Monday, Volume 15, Number 3 - 1 March 2010


© First Monday, 1995-2020. ISSN 1396-0466.