First Monday

An initial exploration of ethical research practices regarding automated data extraction from online social media user profiles by Sophia Alim

The popularity of social media, especially online social networks, has led to the availability of potentially rich sources of data, which researchers can extract via automated means. However, the process of automated extraction from user profiles raises a variety of ethical considerations and challenges for researchers. This paper examines these challenges, surveying researchers to gain information regarding their experiences of, and thoughts about, the challenges to ethical research practices associated with automated extraction. Results indicated that, in comparison with two or three years ago, researchers are more aware of ethical research practices and are implementing them in their studies. However, areas such as informed consent suffer from a lack of implementation in research studies. This is due to various factors, such as social media ‘Terms of Service’, challenges with large volumes of data, how far to take informed consent, and the definition of online informed consent. Researchers face a range of issues, from digital rights to the need for clear guidance. This paper discusses the findings of the survey questionnaire and explores how they affect researchers.


1. Introduction
2. Literature review
3. Methodology
4. Results
5. Discussion and conclusions



1. Introduction

The popularity of social media in recent years has resulted in a vast amount of online data consisting of people, interactions and items (boyd and Crawford, 2011). Examples of social media include:

Social media, especially social networking sites (SNS), contain user profiles consisting of personal data. Social media has provided researchers with potential access to rich data sources for various uses. Sociologists collect user data to study human behaviour online. Engineers study social media to gain understanding to design better networked systems. Marketers design strategies for viral marketing by extracting publicly available user profile information (Gjoka, et al., 2010).

The building blocks in Table 1 form the framework of social media. Table 1 demonstrates the various data sources within a user profile which are potentially available for researchers to extract and analyse.


Table 1: Building blocks of social media.
Note: Adapted from Kietzmann, et al. (2011).
Building block | Description
Identity | Users reveal their identities on social media via profiles by disclosing personal details and other information that presents users in certain ways.
Conversations | Users communicate with other users using a variety of methods. One of the main purposes of social media is to facilitate interactions between users.
Sharing | Users share, receive and exchange information between each other.
Presence | Users can find out how accessible other users are. This involves finding out the location of other users in the virtual or real world.
Relationships | Two or more users can build relationships with each other. This can lead to sharing information and interacting with each other.
Reputation | The extent to which users can identify whether other users can be trusted.


The extraction of profile data from social media occurs via non–automated or automated extraction methods. Non–automated methods include the use of interviews, questionnaires, manual scraping of data, having conversations, and joining forums and networks to listen in. Automated methods focus more on building software for extraction, e.g., Web crawlers. This paper focuses on automated methods of extracting user profile data.

Researchers use a range of automated data extraction techniques to acquire both public and private information from user profiles. Public information can be extracted via techniques such as a Web crawler. The Alim, et al. (2009) study examined the automated extraction of personal data from public MySpace user profiles using a Web crawler. The aim of the study was to build a Web crawler that would extract a user’s profile details and list of top friends, then each top friend’s profile details and list of their top friends. The extracted data was stored in a database. The process involved the following steps:

  1. The user starts the crawling process by inputting into the crawler the URL of the first MySpace profile from which they want to extract profile details and the list of top friends. This profile is known as the seed node. They also input the stopping criteria for the crawler, e.g., the first 100 top friends, or the level to be extracted (e.g., level 1 is just the top friends of the seed profile).
  2. The crawler extracts the HTML of the MySpace profile as one long string and stores it as an array. All HTML tags are then removed from the string, and the remaining text is split into tokens, which are stored in a vector.
  3. The crawler checks the stopping criteria and whether the profile has been extracted before. This is done through the use of breadth–first search (BFS), which is used to traverse MySpace. Breadth–first search is a graph–searching algorithm that uses a queuing system. The implementation of the algorithm is presented below.
    a) Add profile 1’s (seed node) profile details and top friends list to the front of the queue, ready to go into the database.
    b) Loop.
    c) Look at profile 1’s top friends and check to see if they already exist in the database. In the first iteration, the friends are profiles 2, 3 and 4.
    d) If the friends do not exist in the database, add their profile details and list of top friends to the back of the queue.
    e) Look at the next profile at the front of the queue (in this case profile 2).
    f) Repeat steps c to e.
  4. Breadth–first search will be carried out until the stopping criteria set by the user has been reached.
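Under the assumption of a simple in–memory friend graph, the queue–based traversal described above can be sketched as follows. This is a minimal illustration rather than the original crawler: the friend data is invented, and a count of stored profiles stands in for the user–supplied stopping criteria.

```python
from collections import deque

# Toy "top friends" graph standing in for live MySpace profile pages
# (hypothetical data; the real crawler fetched and parsed HTML).
TOP_FRIENDS = {
    1: [2, 3, 4],
    2: [1, 5],
    3: [6],
    4: [],
    5: [],
    6: [1],
}

def crawl(seed, max_profiles):
    """Breadth-first traversal from a seed profile, stopping once
    max_profiles profiles have been stored (the stopping criteria)."""
    stored = []                # stands in for the database
    visited = {seed}           # profiles already queued or extracted
    queue = deque([seed])
    while queue and len(stored) < max_profiles:
        profile = queue.popleft()
        stored.append(profile)  # store profile details + friend list
        for friend in TOP_FRIENDS.get(profile, []):
            if friend not in visited:  # skip already-seen profiles
                visited.add(friend)
                queue.append(friend)
    return stored

print(crawl(1, 5))  # e.g. [1, 2, 3, 4, 5]
```

The `visited` set plays the role of the database check in step 3, ensuring a profile is extracted only once even if it appears in several top friends lists.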

Past research studies (including Bonneau, et al., 2009; Caverlee and Webb, 2008; and Gundecha, et al., 2011) also used Web crawlers to extract publicly available profile data. Another method for extracting public profile information involves the use of passive fake profiles. Unlike the Alim, et al. (2009) study, where MySpace allowed non–members to extract from public MySpace profiles, some social media platforms, e.g., Facebook, require users to log in. To deal with this issue, researchers set up passive fake profiles to gain access to users’ public information. A passive fake profile does not send friend requests or contribute to activity on Facebook (Elovici, et al., 2013).

Private information can also be extracted from user profiles by inferring information from public data, obtaining data by agreement from users, or using socialbots or third–party applications (Elovici, et al., 2013). In regard to inferring a user’s personal details, analysing friends’ profiles can give clues to some of the user’s details. Becker and Chen (2009) developed a tool to investigate whether the personal details of a user could be inferred from their friends’ profiles. The methodology used a threat–model approach and the concept of frequency to infer the personal details of the user. The tool was installed by 93 participants and achieved a 60 percent accuracy rate.
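The frequency idea can be sketched briefly. This is inspired by, but does not reproduce, Becker and Chen’s tool: a hidden attribute is guessed as the most common value among a user’s friends, and the friend data below is invented.

```python
from collections import Counter

# Hypothetical attribute values observed on a user's friends' profiles.
friends_hometowns = ["Leeds", "Leeds", "York", "Leeds", "Manchester"]

def infer_attribute(friend_values):
    """Guess a hidden attribute as the most frequent value
    among the friends' publicly visible values."""
    value, _count = Counter(friend_values).most_common(1)[0]
    return value

print(infer_attribute(friends_hometowns))  # Leeds
```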

Dynamic fake profiles, known as socialbots, can be used to extract a user’s private information. Unlike a passive fake profile, socialbots send out friend requests. If users accept friend requests from a socialbot, the researcher gains the chance to analyse those users’ private information as well as their privacy settings. The installation of third–party applications on users’ profiles, explored in more detail later in this paper, is another method by which researchers can gain access to a user’s private information (Elovici, et al., 2013).

Fire, et al. (2012) explored how Facebook users can expose their personal details to third–party Facebook applications, and how their privacy settings may not be set in a way that keeps them secure. The researchers developed an application to protect users, consisting of three layers of protection. Layer one provides an easy method for users to set their profile privacy settings, using a single click to select the settings suitable for them. Layer two informs users of the number of installed applications on their profile that pose a threat to their privacy. Layer three analyses the user’s list of friends and identifies which friends are fake profiles, since fake profiles can pose a risk to the user’s security and privacy.

Automated extraction raises various ethical issues, from gaining informed consent from users, to anonymising extracted profile data. Studies have explored and discussed ethical issues associated with extraction (Henderson, et al., 2012; Jones, 2011; Menczer, 2008; Wilkinson and Thelwall, 2011). However, more research needs to be carried out into researchers’ thoughts, concerns and actual experiences with regard to ethical research practices. Previous research from Salmons and Woodfield (2013) surveyed members of the New Social Media New Social Science network about ethical concerns regarding the use of social media for research. Their research study, however, did not focus specifically on automated data extraction.

This paper involves using an online questionnaire to explore ethical research practices amongst researchers who have carried out automated data extraction of user profile data from social media. Ethical issues regarding automated data extraction from social media are discussed in the literature review. This is followed by a description of the methodology, followed by the questionnaire results. The paper ends with the discussion and conclusion section.



2. Literature review

A variety of ethical issues exist for a researcher when designing research experiments involving human participants. Ethical considerations include research definition, access to study participants, the people involved in the study, study objectives, data collection, data storage, data representation, potential harm or risk to the participants, study benefits, and presentation and distribution of findings (Markham and Buchanan, 2012). Whether or not a researcher thinks that online user profiles are classified as human participants, all these considerations apply to the automated extraction of profile data. The main ethical considerations with regard to extracting profile data via automated means are presented in the following sections.

2.1. Informed consent

Informed consent is a major cornerstone of research ethics and human subject research. Research institutions, such as Stanford University and the University of Texas, have policies and guidance on human subject research. Stanford University’s (2014) policy on its human research protection program (HRPP) aims to protect human participants in research. The policy contains guidance on areas such as informed consent, participant recruitment, risks to research participants, privacy, what is defined as human subject research, multi–site research, and addressing the concerns of a research participant. The aim of the HRPP is to ensure that all research by the university meets the required ethical criteria.

The Belmont report was created by the U.S. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research and released in 1979. The report summarises guidelines and ethical principles for research involving human participants, and identifies three core ethical principles: respect for persons, justice and beneficence. Respect for persons means not using people as a means to an end, treating people as autonomous agents, allowing people to make their own choices, and providing extra support for those with limited autonomy; it plays a big part in gaining informed consent from participants in human subject research. Beneficence focuses on minimising risks and promoting the benefits of the research. Justice centres on treating people fairly, e.g., ensuring that vulnerable research participants are not selected merely out of convenience (U.S. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1979).

The University of Texas Institutional Review Board Policies and Procedures Manual [2] contains sections on research utilizing surveys and Internet research, vulnerable populations, and research with investigational devices. The aim of the policy is to protect the welfare and rights of human subjects who choose to participate in social, behavioural or biomedical research studies. The University commits to the principles and guidelines stated in the Belmont report.

In the U.S., researchers, people involved with human subject research, and policy writers can undergo Collaborative Institutional Training Initiative (CITI) training on human subjects research. CITI includes modules on the basics of Institutional Review Board (IRB) regulation, informed consent, privacy and confidentiality, assessing risks to participants, and federal research regulations (CITI Program, 2014). The IRB is responsible for protecting the rights and welfare of research subjects in the United States involved in research containing human participants (Brown University, 2014).

In the last few years, automated data extraction techniques have changed with regard to profile data. Past studies (including Caverlee and Webb, 2008; Gundecha, et al., 2011; Mislove, et al., 2007; Viswanath, et al., 2009) used Web crawlers to extract a vast amount of publicly available profile data. The studies did not obtain informed consent from the users from whom they extracted profile data.

Changes in social media platforms’ privacy policies led to the deployment of third–party applications developed using the corresponding application programming interface (API) to extract user information. In the case of social media platforms, the API is a collection of open and public access methods which allow third parties to interact with the sites by retrieving user profile information (Doran, et al., 2012).

Unlike other methods of data extraction, such as Web crawlers, the user had to grant permission to the application before it could extract data from the user’s profile. Permission allowed third parties to access a user’s profile data in a way not normally possible, due to the user’s privacy settings, giving researchers a chance to access private information about a user and their network of friends. The application’s requests are presented to the user before they decide whether or not to grant permission. This is an ‘all or nothing’ approach, because the user cannot define which information is accessed by the application once permission is granted (Cheng, et al., 2013).

APIs have disadvantages such as data leakage (Cheng, et al., 2013). Social media platforms found it hard to enforce their privacy policies with third–party applications (Felt and Evans, 2008). When users granted permission to applications to extract profile data, their friends’ profile data could also be extracted so users are left with an ethical responsibility for their friends’ privacy. A concept by Oboler, et al. (2012), which emphasised ethical responsibility to users, involved introducing a code of ethics for information producers and consumers. The ethical code for information consumers highlighted issues associated with publishing information on social media such as revealing information about other users. The code for consumers explored areas such as information posted by others on social media and whether that information should be shared further.

In the last two years, social media platforms have changed privacy policies. This has had an impact on the type of automated extraction techniques used by researchers. In 2012, Facebook updated their privacy policy to prohibit the collection of user information via automated means, such as harvesting bots, robots, spiders or scrapers without permission from Facebook. If information from users was collected, consent from the user had to be obtained. Also, it had to be made clear that the researcher (and not Facebook) was collecting the information. A privacy policy explaining what information was collected, and how it would be used, had to be in place (Facebook, 2013).

A brief analysis of current ‘Terms of Service’ policy statements for a selection of social media platforms, presented in Table 2, found that Facebook and Twitter specifically mention automated data extraction techniques, such as crawling and the use of third–party applications. MySpace discusses third–party applications but does not discuss other extraction techniques. Google uses one privacy policy for all its products and does not mention extraction techniques specifically. However, Google does discuss privacy principles: one of these focuses on being a responsible holder of the data that users entrust to Google.


Table 2: ‘Terms of Service’ policy statements for a selection of social media platforms.
Facebook (policy year: 2013; reference: Facebook, 2013)

Collection of user content or information via automated means (e.g., robots, spiders, scrapers or harvesting bots) requires permission from Facebook.

Third–party applications implemented using a Facebook API must contain a privacy policy, and only data needed for the application must be requested.

The privacy policy informs the user what data will be used, as well as how the data will be used. The policy discusses how the data is going to be shared, displayed and transferred.

Twitter (policy year: 2012; reference: Twitter, 2012)

Crawling Twitter is allowed if carried out in accordance with the robots.txt file, which contains instructions for a crawler regarding what parts of the site can and cannot be crawled.

MySpace (policy year: 2013; reference: MySpace, 2013)

MySpace user content may not be copied, modified, downloaded, made available, communicated, translated, performed, sold or otherwise used unless stated in the MySpace Terms of Service or additional terms.

For third–party applications, MySpace is not responsible for the privacy and information practices of the application. MySpace does not control the treatment of the users’ data.

Google+ (policy year: 2014; reference: Google, 2014)

Google+ belongs to a range of Google services. The policy terms state that, to access the services, only the interface and instructions that Google provides can be used.

Google’s policy discusses privacy principles, such as being transparent about the collection of personal information, complying with privacy laws and giving users options to best suit their privacy needs.

Service content cannot be used unless permission is granted by the content owner.


In terms of extraction techniques, Giles, et al. (2010) emphasise that most researchers pay more attention to the robots.txt file than to the ‘Terms of Service’. The ‘Terms of Service’ of social media platforms help to explain to researchers which extraction techniques are acceptable, and from whom to request permission to extract user data. Many research studies have used crawlers to extract data from profiles, and in some cases this can violate the ‘Terms of Service’. Even though platforms cannot control third–party applications, platforms can set out rules that benefit the user. An example is Facebook, which requires applications to request consent from the user and to contain a privacy policy. Researchers can use the privacy policy as an ethical statement to explain to the user the details of the study and its implications for them and their network of friends.
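For researchers who do consult robots.txt, Python’s standard library can check whether a given crawler is permitted to fetch a page. The rules, user agent and URLs below are illustrative, not taken from any real platform’s robots.txt:

```python
from urllib import robotparser

# Parse a hypothetical robots.txt that disallows one path for all crawlers.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A crawler should consult can_fetch() before requesting each URL.
print(rp.can_fetch("ResearchBot", "https://example.com/profiles/123"))  # True
print(rp.can_fetch("ResearchBot", "https://example.com/private/data"))  # False
```

In practice, `RobotFileParser.set_url()` and `read()` would load the live robots.txt from the target site rather than a hand-written rule list.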

An ethical issue associated with gaining informed consent from profile users is that of age: an increasing number of minors are using social media. Informed consent for minors is challenging, because minors are viewed as unable to provide informed consent without their parents (Grimes, et al., 2009). The lack of identity validation also allows minors to pretend to be older.

Once profile data has been extracted, ethical considerations related to extraction include access to the extracted data, data aggregation, data anonymisation and data dissemination. The ‘Taste, Ties and Time’ experiment by Lewis, et al. (2008) highlighted these considerations. The experiment involved researchers downloading Facebook profiles of a cohort of college students manually to code and analyse the profiles over time. Even though the study used a non–automated extraction approach, lessons learnt from the implementation of the methodology can be applied to automated data extraction.

2.2. Dissemination of extracted data

One of the main issues associated with the Lewis, et al. (2008) research study was the dissemination of extracted profile data. The extracted data was coded and anonymised by removing the students’ names and replacing them with number–based identifiers. However, partial identification of the data subjects, i.e., the students, took place once the dataset of extracted data was released publicly. This was due to the public availability of a downloadable codebook which decoded the data. Those downloading the dataset were not required to submit an application to the researchers in charge of the study detailing the intended usage of the dataset (Lewis, et al., 2008). The codebook contained coded data about the college students’ attributes, such as gender, ethnicity, race, college major, home state and political views. With the use of the codebook and public comments regarding the research, the name of the college was identified, as well as the cohort of students. The identification took place through the unique majors stated in the codebook and the housing details, and did not require the full dataset to be downloaded. In terms of automated data extraction, not many datasets are disseminated publicly, due to the large volume of data and the need to protect the identities of the data subjects.

The Lewis, et al. (2008) study highlighted the issue of profile data access. When constructing the dataset, research assistants from the same college as the data subjects gathered Facebook profile data. The data subjects who were college students may have set their profile privacy settings to allow only other users from the same college as theirs to have the ability to view their profiles. Since the research assistants were from the same college as the students, they could view the students’ profiles, download the profile data and put it in a dataset that could be viewed and used by people outside the college (Zimmer, 2010).

2.3. Data aggregation

The partial re–identification of the ‘Taste, Ties and Time’ dataset, demonstrated the risks of aggregate data. Knowledge from multiple sources was used to partially re–identify college students from the dataset. Social media platforms have moved into new fields and provided new data sources. Examples have included Facebook Places which created a geolocation service for users which provided geolocation data. This is highly valuable to many researchers if the data is publicly available (McCarthy, 2010). The value of information can be maximised (Oboler, et al., 2012) by a primary key such as an e–mail address or social media account that can link applications with other personal data sources. Many users have multiple Web site accounts, and information that they disclose may differ in the context of the site and how much they trust the site (Korayem and Crandall, 2013).

2.4. Data anonymisation

Anonymising and sharing data requires time and ‘technical capacity to organize, store, and preserve data’ (Hartter, et al., 2013). For researchers, this means storing data on protected systems and guarding it against unauthorised access. Storing research data on a laptop, which can easily be misplaced or stolen, risks a data breach and harm to data subjects.

Anonymising data is a challenge, and a question often asked by researchers is whether data can be fully anonymised to protect the identity of data subjects. Authors such as Warden (2011), Narayanan and Shmatikov (2009) and Bonneau, et al. (2009) argued that anonymising a dataset was not enough to maintain the privacy of data subjects in the dataset. This is due to potential attacks that can occur with anonymised datasets. These attacks are detailed in research by Narayanan and Shmatikov (2009) and Zhou and Pei (2008).

Advice is available regarding working within the limits of data anonymisation. The advice has included informing data subjects about the risk of de–anonymisation, anonymisation via the removal of personal identifiable information, making it harder for attackers to reconstruct the dataset, thinking about the detail of the data distributed to researchers, and learning from experts who have distributed datasets and used techniques to anonymise data (Warden, 2011).
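A minimal sketch of the ‘removal of personal identifiable information’ step might look as follows. The field names and salt are hypothetical and, as the authors above argue, such pseudonymisation alone does not guarantee anonymity against re–identification attacks:

```python
import hashlib

# Hypothetical direct identifiers to strip from each extracted record.
DIRECT_IDENTIFIERS = {"name", "email", "profile_url"}
SALT = b"per-study-secret-salt"  # kept separate from the released dataset

def pseudonymise(record):
    """Drop direct identifiers and replace the name with a salted hash,
    so records can still be linked across the dataset without revealing
    the subject's identity directly."""
    pid = hashlib.sha256(SALT + record["name"].encode()).hexdigest()[:12]
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["subject_id"] = pid
    return cleaned

record = {"name": "Alice", "email": "a@example.com", "age": 21, "profile_url": "..."}
print(pseudonymise(record))  # e.g. {'age': 21, 'subject_id': '...'}
```

Even with identifiers removed, quasi–identifiers such as age and location can support the reconstruction attacks described by Narayanan and Shmatikov (2009), which is why the advice above stresses limiting the detail of distributed data.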

This section has highlighted how automated data extraction from user profiles presents a variety of ethical considerations for researchers. Researchers have to balance the rights of the users against the rights of the researchers.

The methodology, presented in the next section, investigates researchers’ experiences, thoughts and concerns regarding ethical research practices.



3. Methodology

An online questionnaire was distributed to researchers who had carried out studies involving automated extraction of user profile data from social media platforms in the last three years. The creation and distribution of the questionnaire was to address the following research questions:

RQ1: What are the most popular ethical considerations implemented in research studies involving automated extraction from user profiles?

RQ2: What are the reasons for the lack of implementation of less popular ethical considerations in research studies involving automated extraction from user profiles?

RQ3: What ethical challenges are faced by researchers carrying out research studies in this area?

RQ4: What are the issues and concerns that researchers have for the future with regard to ethical research practices?

3.1. Questionnaire design

The questionnaire consisted of two sections, including open and closed questions: six questions in Section 1 and eight questions in Section 2. Section 1 collected the demographic information of participants, such as gender, age, area of expertise, and job position. This information was used for statistical purposes only. The questions in Section 1 were all closed questions.

Section 2 concentrated on questions regarding the implementation of ethical research practices. The questions included categorising a list of ethical considerations according to whether they were implemented, specifying how decisions regarding research ethics were made, and discussing ethical challenges. Section 2 contained both open and closed questions. Open questions were placed in the questionnaire to allow participants to express honest opinions without influence from the researcher. Some of the questions in Section 2 were based on Salmons and Woodfield’s (2013) survey.

3.2. Questionnaire sampling

Snowball sampling was used for sample selection. Snowball sampling is a method in which initial respondents are selected, and additional respondents are then recruited through information passed on by the initial respondents (Zikmund, et al., 2012).

For this study, questionnaire participants were selected by searching Google Scholar for recent academic papers that centred on automated extraction of user profile data from social media platforms. The first authors of the papers were e–mailed the study details and the online questionnaire link. The authors were also encouraged to pass the questionnaire link onto researchers or academics they knew who would be interested in the questionnaire. The questionnaire link was also e–mailed to social network research groups at various universities, and the author of this paper also e–mailed the questionnaire to academic contacts. The questionnaire link was, in addition, placed on a social network analysis mailing list. The questionnaire was initially tested by research students and they highlighted issues involving questionnaire flow. Corrections were carried out and the questionnaire was distributed to the intended sample.



4. Results

The questionnaire was distributed to 400 researchers and was live for four months. Eighty–four responses were received; however, 20 of those responses were not valid because only Section 1 had been completed. These responses were removed, leaving 64 responses altogether. Although the survey response rate of 21 percent somewhat limits the study, the results provide a valuable initial insight into the implementation of ethical research practices through the eyes of researchers.

4.1. Profile of questionnaire respondents

Table 3 presents a profile of questionnaire respondents, comprising primarily researchers and academics. However, respondents with other job roles, such as research fellow, master’s student and software architect, also answered the questionnaire. The profile illustrates how far–reaching the practice of automated extraction of user profile data has become. Respondents were based in a variety of institutions around the world, including universities, colleges and companies.


Table 3: Background information of questionnaire respondents.
Note: N=64.
Gender
Female | 31 percent
Male | 69 percent
Job title
Ph.D. student | 42 percent
Academic | 34 percent
Post–doctoral researcher/Research assistant | 15 percent
Independent researcher | 5 percent
Other | 4 percent
Area of expertise
Computer science | 59 percent
Social science | 15 percent
Engineering/physical sciences | 7 percent
Other (please specify) | 5 percent
Arts and humanities | 4 percent
Biological sciences | 3 percent
Business and management | 3 percent
Location of institution
Europe | 36 percent
North America | 30 percent
Asia | 23 percent
Australia | 3 percent
South America | 3 percent
Africa | 0 percent
Did not specify | 5 percent


In terms of expertise, more than half the respondents were computer scientists, but the multidisciplinary nature of social networking research was also represented. Respondents originated from various fields, including social sciences, arts, geography and business. Respondents who had specified “other” were from fields such as information science, communication, software engineering and media studies.

4.2. Ethical considerations

Section 2 of the questionnaire explored the ethical research practices of the respondents. Respondents extracted user profile data from a variety of social media platforms for their research studies. Forty–five respondents (70 percent) specified the social media platforms from which they extracted data, and could specify more than one platform. Facebook was the most popular platform for extraction, at 67 percent; Twitter followed with 58 percent, and MySpace trailed with 20 percent. This contrasts with 2008, when studies such as Caverlee and Webb (2008) and Thelwall (2009) demonstrated the popularity of crawling MySpace for user profile data. Forty–five percent of respondents used other social media platforms for extraction; examples included Xing, Reddit, YouTube, Tumblr, Flickr, Orkut, Blogspot, Pokec, Instagram, and Foursquare.

Respondents were then asked to classify a list of ethical considerations according to whether they incorporated them into research studies associated with the automated extraction of user profile data. The results are presented in Table 4.


Table 4: Incorporation of ethical considerations into research studies.
Note: N=64. Due to rounding, some of the percentages in this table add up to 99 percent.
Name of ethical consideration | Yes (%) | No (%) | Partially (%) (some studies and not others)
Gaining informed consent from profile users for data extraction | 47 | 33 | 19
Maintaining privacy and confidentiality of extracted profile data | 88 | 3 | 8
Secure storage of extracted profile data | 78 | 6 | 14
Access to the extracted profile data, i.e., data sharing | 72 | 16 | 11
Anonymisation of extracted personally identifiable data (i.e., name, age, location, etc.) | 64 | 13 | 22
Anonymity of data subject, i.e., profile user | 69 | 13 | 17
Dissemination of dataset containing extracted profile data | 73 | 16 | 13
Risks to the data subject, i.e., profile user | 67 | 20 | 11
‘Terms of Service’ for the social media sites from which data was extracted | 69 | 17 | 13
Data subject, i.e., profile user unable to give informed consent because of being children or vulnerable | 50 | 39 | 9


The results indicated that three ethical considerations were incorporated into research studies by more than half of the respondents. These considerations were:

In contrast, ethical considerations which were not incorporated by many respondents were associated with the area of informed consent. Respondents who selected ‘No’ for any of the ethical considerations were asked to comment on the reasons why. Many of the comments related to the issues associated with informed consent. Comments for not considering informed consent included:

‘Only publicly visible data was extracted so we thought that, because the data was publicly available, no ethics applied.’

‘I did not gain informed consent because I only harvested publicly available data. That is, I only collected data that anyone on the internet could freely access. My ethics training has stated this to count as public information.’

‘We typically use public data, we violate the terms of service by crawling the sites, but do not gain any data that is not publicly available.’

These comments indicate agreement with Thelwall and Stuart’s (2006) view that publicly available user profiles are in the public domain, and that an invasion of privacy occurs if the data from the user profile is used in certain ways. The view of Ess and the Association of Internet Researchers (2002) goes a step further, stating that SNS, blogs and bulletin boards are electronic documents rather than human individuals. If research is carried out on electronic documents without informing their owners of the use, then, according to Eynon, et al. (2009), it is not human research. Eynon, et al. (2009) also argue that such research may thus avoid ethical procedures, and that the research study application can avoid going through an ethics committee.

Another debate associated with profile content is to whom profile data belongs. Several questionnaire respondents commented that the user profile content is the property of the user, as illustrated by the comment of one respondent who stated that, ‘content on a person’s Facebook page is their intellectual property and if the Facebook user consents to the study, that was all we (the researchers) and our IRB felt was necessary given the current laws and ethical guidelines’. The comment also brings forward the issue of the digital rights of a user, which is explored in the discussion section below.

The process of gaining informed consent from users was highlighted as a concern by respondents because of the large amount of data required for research studies and the impossibility of getting informed consent from all the users, as illustrated in the selected comments below.

‘Studies require data of millions of users from all over the world, it is not possible to get consent from so many people. Data is crawled abiding by the terms of use of the social networking sites. Also, user data is not made public without anonymisation.’

‘Gathering informed consent would be unfeasible with the scale that we collect our data on.’

‘It was not feasible due to the scale of data and not required due to the anonymisation procedure.’

The use of social media APIs to implement third–party applications has incorporated the concept of informed consent in one respect: profile users have to grant the application permission before their private profile data can be extracted. However, informed consent is not obtained from the users’ network of friends. How far should researchers go in obtaining informed consent? And which types of data require it?
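As a rough sketch of this permission-granting step: most platform APIs follow the generic OAuth 2.0 pattern, in which the application sends the user to an authorisation page whose requested ‘scopes’ spell out which categories of profile data will be accessed. The endpoint, client identifier and scope names below are invented for illustration and do not correspond to any particular platform.

```python
from urllib.parse import urlencode

def build_consent_url(auth_endpoint, client_id, redirect_uri, scopes):
    """Build the URL a user visits to grant (or refuse) the listed scopes.

    Each scope names a category of profile data; presenting them to the
    user is the limited form of informed consent that platform APIs provide.
    """
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",  # standard OAuth 2.0 authorisation-code flow
        "scope": " ".join(scopes),
    }
    return auth_endpoint + "?" + urlencode(params)

# Hypothetical values for illustration only.
url = build_consent_url(
    "https://example-sns.test/oauth/authorize",
    client_id="research-app-123",
    redirect_uri="https://research.example.org/callback",
    scopes=["public_profile", "user_posts"],
)
```

Note that the user grants these scopes for their own profile only; nothing in this handshake asks the user’s friends for consent, which is precisely the gap noted above.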

In terms of the anonymisation and distribution of extracted data, respondents noted that user data was generally not made public. If the user data was made public, this was after anonymising the data. However, several respondents emphasised that the sharing of anonymised data can put users at risk. The final ethical consideration focused on the issue of minors and informed consent. Respondents who did not consider the issue of minors commented that they were not able to detect whether a profile user was a minor. Despite social media platforms having age restrictions, there is no age validation process.

4.3. Decision–making regarding ethics

Respondents were asked about how decisions were made regarding research ethics.


Table 5: Respondents’ decision–making regarding research ethics.
Note: N=64.

How were decisions made | Percentage
Follow guidelines established by your institution | 58
Use your own values to determine good ethical practice | 17
Follow guidance given by research supervisor | 6
Follow guidelines established by your field of expertise | 5
Following guidance included in published research or research method books | 5


Table 5 illustrates that over half of the respondents made decisions regarding ethics using guidelines established by their institution. In the case of this research study, the institutions are mainly universities. Other decision methods included a combination of all the options presented in Table 5, a combination of institution guidelines, plus field guidance, and the guidelines of a project sponsor. One respondent commented that, when there were no guidelines, the respondent’s best interpretation of ethical guidelines or field practices was used instead.

Out of the 64 questionnaire respondents, 16 (25 percent) went through an ethical approval process for their research studies. This figure is very low. The primary reason that several respondents cited for not going through the ethical approval process was that they were accessing publicly available profile data. One respondent commented that their research study was exempt from the ethical approval process because the data was publicly available. Of the 16 respondents who did go through the ethical approval process, four were based at institutions in Europe, eight in the U.S., three in Asia and one in Australia. Studies which carried out large crawls of publicly available data, such as Caverlee and Webb (2008) and Viswanath, et al. (2009), did not cover the issue of ethics or the ethics approval process in their published papers.

With regard to the respondents who did go through an ethical approval process, 14 respondents (88 percent) commented on the ethical approval processes. Nine respondents (56 percent) stated that the process was fine. However, criticisms of the process by other respondents included the slow nature of the process and lack of information on ethical considerations, such as privacy and risk in the online world. Respondents suggested additions to the process of extraction, such as gaining approval from the profile user, encrypting the extracted data, deleting the users’ data after finishing the study, and putting researchers in the shoes of the profile user. How would researchers feel about data from their profiles being extracted?

Only one respondent answered the question in the questionnaire about proposing resources for the future in this area. The respondent suggested making more standard datasets available so the profile data is not extracted over and over again. This idea could be valuable for researchers in terms of having accessible data sources.

4.4. Ethical challenges

Respondents were asked to discuss the ethical challenges they faced when carrying out their research studies. The aim of the question was to gain an insight into how respondents felt about ethical research practices in this area. Twenty–one respondents (33 percent) answered this question. The responses were analysed and classified into the following areas:

The responses regarding data collection focused on the challenges of extracting profile data and the ‘public versus private’ debate. Users may set their profiles to private, but members of their network of friends may leak a user’s personal information through their own public profiles. Some users put personal data in their profiles without fully understanding the privacy settings.

Maintaining user privacy can be challenging. Respondents highlighted issues such as maintaining privacy when building datasets and publishing academic papers. Decisions have to be taken by researchers as to what types of data to include in datasets. Some researchers do not make datasets available because of privacy issues. Other researchers will not make a dataset public without anonymisation of the data. As the Lewis, et al. (2008) study illustrated, distributing anonymised data does not guarantee the privacy of data subjects, especially if the data can be re–identified through aggregate data sources. Only one respondent commented on keeping the data safe, i.e., confidentiality of the data, as being a challenge.
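The weakness of anonymisation alone can be illustrated with a short sketch (written for this discussion, not drawn from any of the cited studies): direct identifiers are replaced with salted hashes, but the quasi-identifiers that enable linkage against other data sources survive untouched.

```python
import hashlib
import secrets

# Salt kept secret and discarded after processing, so hashes cannot be
# recomputed from the published dataset.
SALT = secrets.token_hex(16)

def pseudonymise(record, direct_identifiers=("name", "email")):
    """Replace direct identifiers in a profile record with salted hashes."""
    out = dict(record)
    for field in direct_identifiers:
        if field in out:
            digest = hashlib.sha256((SALT + str(out[field])).encode()).hexdigest()
            out[field] = digest[:12]  # short pseudonym in place of the identifier
    return out

# Hypothetical profile record for illustration only.
profile = {"name": "Alice Example", "email": "alice@example.org",
           "age": 29, "city": "Bradford"}
safe = pseudonymise(profile)
# 'age' and 'city' pass through unchanged -- exactly the quasi-identifiers
# that permit re-identification against aggregate data sources.
```

This is why several respondents treated even anonymised distribution as a risk: the step above removes names, not the structural and demographic signals that linkage attacks exploit.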

Other ethical challenges stated by respondents covered a range of issues such as:

4.5. The future of ethics and social media platforms

The final question asked respondents for their views regarding issues for the future with regard to ethics. Thirteen respondents answered this question. The areas that the responses covered included:

Overall, the questionnaire responses have highlighted positive and negative aspects of ethical research practices. The next section will analyse the results in more detail.



5. Discussion and conclusions

The questionnaire results produced a variety of opinions regarding ethical research practices. The profile of respondents illustrated that it is not only social scientists who consider ethical research practices in automated data extraction.

The questionnaire responses have shown that informed consent is a grey area. This is because of the lack of an exact definition of online informed consent, and of agreement regarding the sorts of data to which it applies. This is validated by Salmons and Woodfield’s (2013) research study, which found that 72 percent of the 461 members surveyed stated that there was a gap in clarification when it came to deciding which profile data is acceptable to use without informed consent. Orton–Johnson (2010) also argues that there is little agreement on the meaning of privacy in different online settings.

The use of third–party applications by researchers to extract private profile data has gone some way towards addressing the presentation of online consent. However, issues surrounding third–party applications still exist, such as the privacy of a user’s network of friends and the lack of control over privacy by social media platforms. Researchers face challenges in wanting to extract enough data to produce valid results without violating the ‘Terms of Service’ of the social media platform. Despite the availability of third–party applications for extracting profile data, Elovici, et al. (2013) highlighted that some researchers will go to extreme lengths to extract data, using fake identities to collect users’ private data from SNS. Fake identities can be considered a security risk and a violation of the SNS’ ‘Terms of Service’.

Violating the ‘Terms of Service’ can lead to social media platforms taking action. In 2010, Pete Warden announced that he was going to release publicly available Facebook data of 210 million users, which he had extracted via a crawler, to researchers. The data was going to be released once it had been anonymised. More than 50 researchers requested the dataset but Facebook threatened Warden with legal action and requested that he destroy the data. The data was subsequently destroyed by Warden. Facebook took this course of action because Warden had violated the site’s ‘Terms of Service’ by not requesting permission from Facebook before extracting the data (Giles, et al., 2010). Clarity is needed in terms of the legal implications for researchers if the ‘Terms of Service’ are breached. Prior to 2012, a dichotomy existed in terms of privacy for social media platforms. Some platforms (e.g., MySpace) did not allow the profile data of users to be downloaded, scraped or extracted via automation but did allow public profiles containing personal identifiable data to be publicly available to external users who may not be members of MySpace.

From a profile user’s perspective, more research needs to be carried out regarding what social media users want and need in terms of ethics, as highlighted by Salmons and Woodfield (2013). As illustrated by several respondents in this study, education is needed for users in terms of who can access their profile data, the profile data that they choose to publish, the implications for the user, and how researchers extract profile data. Social media platforms give users the opportunity to decide which personal information is publicly available and with whom they share it. An example is Facebook, which gives users the option to set with whom they share their personal details. This places responsibility on the user’s network of friends to keep their personal data private.

Some users want digital rights, such as the right to privacy and the right to be forgotten. Users have to take some responsibility for achieving this, if this is what they require. Social media platforms run on user contributions and default privacy settings are normally set to public. The Burkell, et al. (2014) exploration of whether Facebook is a private or public space highlighted that user profiles are ‘structured with the view that everyone can see them, even if the explicitly intended audience is more limited’. Social media platforms are not naturally private places and users have to evaluate the level of privacy they want. Several questionnaire respondents raised the issue of intellectual property and rights of online content. They believe that it is the user who has the right to decide how they use the service, provided there are no violations of laws. This places responsibility with the user.

The questionnaire responses showed that other areas of research ethics, such as privacy, confidentiality and anonymity have been incorporated into studies by over half the researchers. As highlighted by the Lewis, et al. (2008) study, anonymisation does not guarantee the privacy of data subjects. This has a knock–on effect in terms of the production and distribution of datasets for researchers to use. Researchers are more reluctant to release their datasets due to privacy issues.

Ethical codes and guidelines exist for researchers to protect research subjects, such as profile users. The questionnaire results demonstrated that factors, such as publicly available profile user data and the definition of human research, contributed towards the research study not going through an ethical approval process. This is an area which would benefit from further development and understanding. Ethical codes, guidelines and processes need to be adapted because, as social media grows, technology and user expectations change.

Different ethical guidance can interpret areas of research ethics in various ways. This results in a difference of opinions regarding ethical considerations. An example is with the definition of public versus private online spaces. The Association of Internet Researchers ethics guidelines (Markham and Buchanan, 2012) do not have a strict distinction between public and private online spaces, but refer to the changing cultural and individual definitions of privacy. On the one hand, people may produce data in public spaces but have strong perceptions of privacy. On the other hand, people acknowledge that communication is public but the context it appears in implies restrictions.

By comparison, ESOMAR (2011) defines public and private spaces more clearly. Public social media is where entry to the social media site is without barriers, i.e., publicly available profile data. Private social media are sites where users or the Web sites do not want data to be accessible publicly. For researchers, the definition of public versus private data and the Terms of Service for platforms, as presented in Table 2, are factors to consider for the selection of extraction techniques.

Future issues for research into ethical practices in this area bring a mixture of new technologies focusing on users and the extraction techniques that researchers use. The questionnaire responses indicated a wide variety of ethical challenges for the future, ranging from intellectual property rights through to the use of the mobile Web. The rise of handheld devices and researchers’ use of the mobile Web for data collection can bring their own ethical issues, such as accidental data discovery. Jones (2011) highlighted an example where a mobile phone provider gives informed consent for images to be collected but the people in the images have not given their informed consent.

Overall, the initial findings of this paper have presented various experiences and thoughts with regard to ethical research practices involving extracting profile data. Researchers are becoming more aware of ethical considerations and incorporating them into research experiments, as illustrated by the questionnaire results. Compared to previous studies, ethical research practices are moving forwards. However, there remain grey areas, such as informed consent as well as the definition of public and private data, where more clarity and research is required. Ethical approval processes need to develop with the changing technology and requirements of the users. More research should be carried out to investigate users and their ethical needs, as well as where social media platforms stand with regard to ethics when it comes to research. This study has formed a foundation for further research to take place in such an important area.


About the author

Sophia Alim is an administrator at Barnardos, a charity helping vulnerable young people, based in the United Kingdom. She is also an independent researcher in the field of social research ethics involving automated data extraction from social media platforms. Sophia Alim has a Ph.D. from the University of Bradford in the area of online social networks and privacy. Her research focused on calculating the vulnerability of online social network profiles in terms of information–disclosing behaviour of a profile owner and the owner’s friends. In 2006, she was awarded a B.Sc. (Hons) in business information systems from the University of Salford in the United Kingdom. In 2007, she received her M.Sc. in computing from the University of Bradford. Her research interests include Web accessibility and social networking.
E–mail: sophiaalim66 [at] gmail [dot] com



The author would like to thank the questionnaire respondents for their responses and also several research students for their feedback regarding the initial design of the questionnaire.



1. U.S. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, “Ethical principles and guidelines for the protection of human subjects of research” (18 April 1979), at, accessed 19 May 2014.

2. “University of Texas at Austin Institutional Review Board Policies and procedures manual” (15 May 2014), at, accessed 22 May 2014.



Sophia Alim, Ruqayya Abdul–Rahman, Daniel Neagu and Mick Ridley, 2009. “Data retrieval from online social networking profiles for social engineering applications,” Proceedings of the Fourth International Conference for Internet Technology and Secured Transactions, pp. 207–211.

Justin Becker and Hao Chen, 2009. “Measuring privacy risk in online social networks,” Proceedings of W2SP 2009: Web 2.0 Security and Privacy, at, accessed 29 May 2014.

Joseph Bonneau, Jonathan Anderson, and George Danezis, 2009. “Prying data out of a social network,” ASONAM ’09: Proceedings of the 2009 International Conference on Advances in Social Network Analysis and Mining, pp. 249–254.
doi:, accessed 25 June 2014.

danah boyd and Kate Crawford, 2011. “Six provocations for big data,” at, accessed 1 March 2014.

Brown University, “What is IRB, and when is IRB needed?” at, accessed 1 April 2014.

Jacquelyn Burkell, Alexandre Fortier, Lorraine Yeung Cheryl Wong and Jennifer Lynn Simpson, 2014. “Facebook: Public space, or private space?” Information, Communication & Society, volume 17, number 8, pp. 974–985.
doi:, accessed 25 June 2014.

James Caverlee and Steve Webb, 2008. “A large–scale study of MySpace: Observations and implications for online social networks,” ICWSM ’08: Proceedings of the Second AAAI International Conference on Weblogs and Social Media, pp. 36–44.

Yuan Cheng, Jaehong Park and Ravi Sandhu, 2013. “Preserving user privacy from third–party applications in online social networks,” WWW ’13: Companion Proceedings of the 22nd International Conference on World Wide Web Companion, pp. 723–728.

CITI Program, Collaborative Institutional Training Initiative (CITI) at the University of Miami, 2014. “Human subjects research (HSR) series,” at, accessed 20 May 2014.

Derek Doran, Sean Curley and Swapna S. Gokhale, 2012. “How social network APIs have ended the age of privacy,” SEKE ’12: Proceedings of the Twenty–Fourth IEEE International Conference on Software Engineering and Knowledge Engineering, pp. 400–405.

Yuval Elovici, Michael Fire, Amir Herzberg and Haya Shulman, 2013. “Ethical considerations when employing fake identities in OSNs for research,” at, accessed 6 April 2014.

ESOMAR, 2011. “ESOMAR guideline for online research,” at, accessed 4 April 2014.

Charles Ess and Association of Internet Researchers, 2002. “Ethical decision–making and Internet research: Recommendations from the AOIR Ethics Working Committee,” at, accessed 6 April 2014.

Rebecca Eynon, Ralph Schroeder and Jenny Fry, 2009. “New techniques in online research: Challenges for research ethics,” 21st Century Society, volume 4, number 2, pp. 187–199.
doi:, accessed 25 June 2014.

Facebook, 2013. “Statement of rights and responsibilities” (15 November), at, accessed 16 January 2014.

Adrienne Felt and David Evans, 2008. “Privacy protection for social networking APIs,” W2SP ’08: Proceedings of the Workshop on Web 2.0 Security and Privacy, pp. 1–8; version at, accessed 25 June 2014.

Michael Fire, Dima Kagan, Aviad Elishar and Yuval Elovici, 2012. “Social privacy protector — Protecting users’ privacy in social networks,” SOTICS 2012: Second International Conference on Social Eco–Informatics, pp. 46–50.

C. Lee Giles, Yang Sun and Isaac G. Councill, 2010. “Measuring the Web crawler ethics,” WWW ’10: Proceedings of the 19th International Conference on World Wide Web, pp. 1,101–1,102.
doi:, accessed 25 June 2014.

Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, 2010. “Walking in Facebook: A case study of unbiased sampling of OSNs,” INFOCOM ’10: Proceedings of the 29th Conference on Information Communications, pp. 2,498–2,506.

Google, 2014. “Technologies and principles,” at, accessed 2 April 2014.

Justin M. Grimes, Kenneth R. Fleischman and Paul T. Jaeger, 2009. “Virtual guinea pigs: Ethical implications of human subjects research in virtual worlds,” International Journal of Internet Research Ethics, volume 2, number 1, pp. 38–56, and at, accessed 2 April 2014.

Pritam Gundecha, Geoffrey Barbier and Huan Liu, 2011. “Exploiting vulnerability to secure user privacy on a social networking site,” KDD ’11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 511–519.
doi:, accessed 25 June 2014.

Joel Hartter, Sadie J. Ryan, Catrina A. MacKenzie, John N. Parker and Carly A. Strasser, 2013. “Spatially explicit data: Stewardship and ethical challenges in science,” PLoS Biology, volume 11, number 9, at, accessed 4 April 2014.
doi:, accessed 25 June 2014.

Tristan Henderson, Luke Hutton and Sam McNeilly, 2012. “Ethics and online social network research — Developing best practices,” Proceedings of HCI 2012: The 26th BCS Conference on Human Computer Interaction, at, accessed 25 June 2014.

Chris Jones, 2011. “Ethical issues in online research,” at, accessed 13 April 2013.

Jan H. Kietzmann, Kristopher Hermkens, Ian P. McCarthy and Bruno S Silvestre, 2011. “Social media? Get serious! Understanding the functional building blocks of social media,” Business Horizons, volume 54, number 3, pp. 241–251.
doi:, accessed 25 June 2014.

Mohammed Korayem and David J. Crandall, 2013. “De–anonymizing users across heterogeneous social computing platforms,” ICWSM 2013: Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, pp. 689–692; version at, accessed 25 June 2014.

Kevin Lewis, Jason Kaufman, Marco Gonzalez, Andreas Wimmer and Nicholas Christakis, 2008. “Tastes, ties, and time: A new social network dataset using Facebook.com,” Social Networks, volume 30, number 4, pp. 330–342.
doi:, accessed 25 June 2014.

Annette Markham and Elizabeth Buchanan, 2012. “Ethical decision–making and Internet research: Recommendations from the AoIR Ethics Working Committee (version 2.0),” at, accessed 1 March 2014.

Caroline McCarthy, 2010. “Facebook granted geolocation patent,” CNET (6 October), at, accessed 5 March 2014.

Fernando Menczer, 2008. “Legal and ethical considerations in crawling/mining online social network data,” at, accessed 1 March 2014.

Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel and Bobby Bhattacharjee, 2007. “Measurement and analysis of online social networks,” IMC ’07: Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement, pp. 29–42, and at, accessed 25 June 2014.
doi:, accessed 25 June 2014.

MySpace, 2013. “MySpace privacy policy,” at, accessed 3 April 2014.

Arvind Narayanan and Vitaly Shmatikov, 2010. “Myths and fallacies of ‘Personally Identifiable Information’,” Communications of the ACM, volume 53, number 6, pp. 24–26.
doi:, accessed 25 June 2014.

Andre Oboler, Kristopher Welsh and Lito Cruz, 2012. “The danger of big data: Social media as computational social science,” First Monday, volume 17, number 7, at, accessed 1 March 2014.
doi:, accessed 25 June 2014.

Kate Orton–Johnson, 2010. “Ethics in online research; Evaluating the ESRC framework for research ethics categorisation of risk,” Sociological Research Online, volume 15, number 4, at, accessed 25 June 2014.
doi:, accessed 25 June 2014.

Janet Salmons and Kandy Woodfield, 2013. “Social media, social science & research ethics,” at, accessed 1 March 2014.

Stanford University, 2014. “The Human Research Protection Program (HRPP),” at, accessed 23 May 2014.

Mike Thelwall, 2009. “Homophily in MySpace,” Journal of the American Society for Information Science and Technology, volume 60, number 2, pp. 219–231.
doi:, accessed 25 June 2014.

Mike Thelwall and David Stuart, 2006. “Web crawling ethics revisited: Cost, privacy, and denial of service,” Journal of the American Society for Information Science and Technology, volume 57, number 13, pp. 1,771–1,779.
doi:, accessed 25 June 2014.

Twitter, 2012. “Terms of service,” at, accessed 2 April 2014.

U.S. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1979. “Ethical principles and guidelines for the protection of human subjects of research” (18 April), at, accessed 19 May 2014.

University of Texas, 2014. “University of Texas at Austin Institutional Review Board Policies and procedures manual” (15 May), at, accessed 22 May 2014.

Bimal Viswanath, Alan Mislove, Meeyoung Cha and Krishna P. Gummadi, 2009. “On the evolution of user interaction in Facebook,” WOSN ’09: Proceedings of the Second ACM Workshop on Online Social Networks, pp. 37–42.
doi:, accessed 25 June 2014.

Pete Warden, 2011. “Why you can’t really anonymize your data” (17 May), at, accessed 4 April 2014.

David Wilkinson and Mike Thelwall, 2011. “Researching personal information on the public Web: Methods and ethics,” Social Science Computer Review, volume 29, number 4, pp. 387–401.
doi:, accessed 25 June 2014.

Bin Zhou and Jian Pei, 2008. “Preserving privacy in social networks against neighborhood attacks,” ICDE ’08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pp. 506–515.
doi:, accessed 25 June 2014.

William G. Zikmund, Barry J. Babin and Mitch Griffin, 2012. Business research methods. Ninth edition. Mason, Ohio: South–Western Cengage Learning.

Michael Zimmer, 2010. “‘But the data is already public’: On the ethics of research in Facebook,” Ethics and Information Technology, volume 12, number 4, pp. 313–325.
doi:, accessed 25 June 2014.


Editorial history

Received 12 May 2014; revised 14 June 2014; accepted 15 June 2014.

Creative Commons License
This paper is licensed under a Creative Commons Attribution–NonCommercial–NoDerivatives 4.0 International License.

An initial exploration of ethical research practices regarding automated data extraction from online social media user profiles
by Sophia Alim.
First Monday, Volume 19, Number 7 - 7 July 2014