Using Mechanical Turk for experiments

In my upcoming CHI paper, “Presenting Diverse Political Opinions: How and How Much,” we used Amazon’s Mechanical Turk (AMT) to recruit subjects and to administer the study. I’ll talk a bit more about the research questions and results in a future post, but I’ve had enough questions about using Mechanical Turk that I think a blog post may be helpful.

In this study, Paul Resnick and I explored individuals’ preferences for diversity of political opinion in online news aggregators and evaluated whether some very basic presentation techniques might affect satisfaction with the range of opinions represented in a collection of articles.

To address these questions, we needed subjects who had known political preferences, lived in the United States, and had at least some very basic political knowledge. We also wanted to collect some demographic information about each subject. Each approved subject was then assigned to either a manipulation check group or the experimental group. Subjects in the manipulation check group viewed individual articles and indicated their agreement or disagreement with each; subjects in the experimental group viewed entire collections and answered questions about each collection. Subjects in the experimental group were also assigned to a particular treatment (how the list would appear to them). Once approved, subjects could view a list up to once per day.

Screening. To screen subjects, we used a Qualification test in AMT. When unqualified subjects viewed a task (HIT – human intelligence task, in mTurk parlance), they were informed that they needed to complete a qualification. The qualification test asked subjects two questions about their political preferences, three multiple choice questions about US politics, and a number of demographic questions. Responses were automatically downloaded and evaluated to complete screening and assignment.
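
A minimal sketch, in Python, of how downloaded qualification responses could be scored automatically. It is an illustration only, not the study's code: the question IDs, the answer key, and the passing threshold are all hypothetical, since the post does not state them.

```python
# Hypothetical answer key for the three multiple-choice politics questions.
ANSWER_KEY = {"politics_q1": "b", "politics_q2": "a", "politics_q3": "d"}

# Assumed passing threshold; the actual cutoff is not given in the post.
MIN_CORRECT = 2


def count_correct(responses, answer_key=ANSWER_KEY):
    """Count how many knowledge questions were answered correctly."""
    return sum(1 for q, answer in answer_key.items() if responses.get(q) == answer)


def passes_screening(responses, answer_key=ANSWER_KEY, min_correct=MIN_CORRECT):
    """Return True if the subject meets the (assumed) knowledge threshold."""
    return count_correct(responses, answer_key) >= min_correct
```

A subject who answers two of the three questions correctly would pass under this assumed threshold; a subject who skips the questions would not.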

To limit our subjects to US residents, we also used the automatic locale qualification.
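
For readers using today's tools, here is a sketch of what a US-only locale requirement looks like as a plain Python dictionary. It assumes the format that the current boto3 MTurk client accepts in the `QualificationRequirements` list of `create_hit`; the study predates that client, so this is not necessarily how the requirement was specified at the time. The function name is hypothetical.

```python
# System qualification type ID that MTurk uses for worker locale.
LOCALE_QUALIFICATION_ID = "00000000000000000071"


def us_locale_requirement():
    """Build a qualification requirement restricting a HIT to US workers."""
    return {
        "QualificationTypeId": LOCALE_QUALIFICATION_ID,
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    }
```

The resulting dictionary would be passed alongside any custom qualification requirements when creating the HIT.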

Assignment. We handled subject assignment in two ways. To distinguish between the treatment group and the manipulation check group, we created two additional qualifications that were automatically assigned; an approved subject would be granted only one of these qualifications, and could thus only complete the associated task type.
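
The assignment logic can be sketched as a small function that places each approved subject in exactly one group. This is an illustration under stated assumptions: the share of subjects sent to the manipulation check group, the treatment names, and the use of simple random assignment are all hypothetical, and granting the matching AMT qualification is left out.

```python
import random

GROUPS = ("manipulation_check", "experimental")

# Hypothetical treatment labels; the post does not name the treatments.
TREATMENTS = ("treatment_a", "treatment_b")


def assign_subject(rng, check_share=0.25):
    """Assign one approved subject to a single group.

    check_share is an assumed probability of landing in the manipulation
    check group. Only experimental subjects receive a treatment.
    """
    if rng.random() < check_share:
        return {"group": "manipulation_check", "treatment": None}
    return {"group": "experimental", "treatment": rng.choice(TREATMENTS)}
```

Passing in a seeded `random.Random` instance makes the assignments reproducible, which helps when auditing who was granted which qualification.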

Tasks (HITs). The task implementation was straightforward. We hosted tasks on our own server using the external question HIT type. When a subject loaded a task, AMT passed us the subject’s worker ID. We verified that the subject was qualified for the task and loaded the appropriate presentation for that subject. Each day, we uploaded one task of each type, with many assignments; the number of assignments is the number of Turkers who can complete each task.
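
A sketch of the server-side check described above, in Python. It assumes the query parameters that AMT appends to an external question URL (`workerId`, `assignmentId`, `hitId`) and the placeholder assignment ID that AMT sends while a worker is only previewing a task. The function names and the set of qualified workers are hypothetical; this is not the study's code.

```python
from urllib.parse import urlparse, parse_qs

# Value AMT sends as assignmentId while a worker is previewing the HIT.
PREVIEW_ASSIGNMENT_ID = "ASSIGNMENT_ID_NOT_AVAILABLE"


def parse_task_request(request_url):
    """Extract the AMT identifiers from the URL used to load the task."""
    params = parse_qs(urlparse(request_url).query)

    def first(name):
        values = params.get(name)
        return values[0] if values else None

    assignment_id = first("assignmentId")
    return {
        "worker_id": first("workerId"),
        "assignment_id": assignment_id,
        "hit_id": first("hitId"),
        "is_preview": assignment_id in (None, PREVIEW_ASSIGNMENT_ID),
    }


def may_start_task(request, qualified_workers):
    """Allow the task only for accepted assignments from qualified workers."""
    return not request["is_preview"] and request["worker_id"] in qualified_workers
```

Checking the worker ID on the server is a second line of defense: AMT already enforces the qualification, but the server still needs the ID to pick the right presentation.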

Because we needed real-time access to the manipulation check data, the responses to this task were stored in our own database after a subject submitted the form; the subject could then return to AMT. This was not necessary for the experimental data, and so the responses were sent directly to AMT for later retrieval.

Quality control. Careless clicking or hurrying through the task is a potential problem on mTurk. Using multiple raters does not work when asking subjects about their opinions. Kittur, Chi, and Suh recommend asking Turkers verifiable questions as a way to deal with the problem [1]. We did not, however, ask verifiable questions about any of the articles or the list, because that might have changed how Turkers read the list and responded to our other questions. Instead, we randomly repeated a demographic question from the qualification test. Five subjects changed their answer substantially (e.g., aging more than one year or in reverse, or shifting on either of the political spectrum questions by 2 points or more). Though there are many possible explanations for these shifts – such as shared accounts within a household, careless clicking, easily shifting political opinions, deliberate deception, or lack of effort – none of these explanations is desirable in a study subject, and so these subjects were excluded. We also examined how long each subject took to complete tasks (looking for implausibly fast responses); this did not lead to the exclusion of any additional subjects or responses.
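
The consistency rules given as examples above can be written as a short check. This sketch encodes only those two examples (age and the political spectrum questions); the function names are hypothetical and the handling of other demographic questions is not covered.

```python
def age_is_inconsistent(original_age, repeated_age):
    """Flag a subject who aged more than one year, or got younger."""
    delta = repeated_age - original_age
    return delta < 0 or delta > 1


def spectrum_is_inconsistent(original_score, repeated_score, min_shift=2):
    """Flag a shift of min_shift points or more on a political spectrum question."""
    return abs(repeated_score - original_score) >= min_shift
```

A subject whose reported age goes from 34 to 35 is consistent with a birthday between sessions, while 34 to 36 or 34 to 33 would be flagged; a one-point move on a spectrum question is tolerated, a two-point move is not.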

Some reflection. We had to pay Turkers a bit more than we expected (~$12/hr), and we recruited fewer subjects than we anticipated. The unpaid qualification task may have been a bit of a barrier, especially because potential subjects could only complete one of our paid tasks per day (and only one was listed at a time). We might instead have implemented the qualification as a paid task, but that might have meant paying subjects who would never return to complete an actual task.

Further resources

1. Kittur, A., Chi, E. H., and Suh, B. (2009). “Crowdsourcing User Studies With Mechanical Turk,” Proc. CHI 2009: 453-456. (ACM | PARC)
2. Mason, W. and Watts, D. J. (2009). “Financial incentives and the ‘performance of crowds,’” SIGKDD Workshop on Human Computation: 77-85. (ACM | Yahoo)

This study is part of the BALANCE project and was funded by NSF award #IIS-0916099.