Like many other areas of the social sciences, experimental philosophy is nowadays heavily fueled by Amazon Mechanical Turk. (The Experimental Philosophy Blog certainly helped with that too!) So I thought it might be helpful to consider the state of Amazon Mechanical Turk, for both experimental philosophers and critics of experimental philosophy. (By the way, the Experimental Turk blog / website remains an invaluable resource!)
In my view, the best academic paper that gives a thorough overview of Amazon Mechanical Turk’s strengths and weaknesses is Gabriele Paolacci and Jesse Chandler’s “Inside the Turk: Understanding Mechanical Turk as a Participant Pool” (2014). My main takeaways from the paper are:
- In general, MTurk data quality is at least as good as university lab data quality.
- However, there is serious concern about MTurk participants’ non-naivety, so data quality from well-known experimental paradigms (such as the Cognitive Reflection Test, the ultimatum game, and, in my view, trolley problems) is relatively poor.
- MTurk participants are much more demographically representative than university lab participants.
- There is not much point in using attention checks, given the implausible theoretical assumptions they rest on (such as constancy of attention throughout all study tasks) and given participant non-naivety.
There is also a somewhat recent PBS profile of Amazon Mechanical Turk workers that makes similar points, but in much more accessible terms.
One tidbit that’s new to me in the PBS profile is this:
> Early results by the team suggests another potentially interesting finding. Turkers seem more likely to provide false negatives – failing to observe a phenomenon that exists — than false positives — falsely observing something that doesn’t exist. (An example of a false positive would be a study that shows a relationship between vaccines and autism that doesn’t really exist. A test that fails to show the effectiveness of a successful drug would be a false negative.)
In other words, if anything, we should be a little more skeptical of null results from an MTurk sample (e.g. a negative replication) than of positive results from an MTurk sample. Still, it’s helpful to remember that, with the non-naivety caveat in mind, the overall data quality of an MTurk sample is pretty good.
On the point about non-naivety, in addition to trying not to use experimental paradigms that are well known to Turkers, I exclude repeat workers across a series of studies on the same topic using Unique Turker. And sometimes I monitor Reddit and other discussion boards to make sure there are no inappropriate discussions. Though, again, it’s worth stressing that, as the PBS profile mentions, most Turkers take great pride in their work, and the community self-polices as well: “No disclosure or discussion of attention memory checks. No discussion of survey content, period. That can affect the results.”
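If you'd rather not rely on a third-party tool, the same exclusion can be done by hand on the batch-result CSVs that MTurk lets you download, which include a WorkerId column. Here is a minimal sketch (the toy CSV strings stand in for real downloaded files):

```python
import csv
import io

def worker_ids(csv_text):
    """Collect the set of WorkerIds appearing in an MTurk batch-result CSV."""
    return {row["WorkerId"] for row in csv.DictReader(io.StringIO(csv_text))}

# Toy batch results standing in for two studies on the same topic.
study1 = "WorkerId,Answer\nA1,yes\nA2,no\nA3,yes\n"
study2 = "WorkerId,Answer\nA2,no\nA4,yes\n"

# Workers who already took study 1 and should be screened out of study 2.
repeats = worker_ids(study2) & worker_ids(study1)
print(sorted(repeats))  # → ['A2']
```

In practice you would read the real files with `open()` and either filter these workers out of the analysis or, better, use the list to assign a disqualifying MTurk qualification before the follow-up study runs.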
Any other thoughts, experiences, and tools you’ve gathered from using Amazon Mechanical Turk in your research?
[x-posted at The Experimental Philosophy Blog; please comment there!]