Turker consensus in Phase III.
With minimal training, an anonymous, untrained workforce recruited through a public crowdsourcing platform can rapidly and correctly categorize fundus photos of diabetic patients as normal or abnormal. The Turkers in this study graded these images with a high sensitivity, which is ideal for a screening instrument. Critically, no false negative consensus results were generated in any of our batches, indicating that no cases of DR were missed. When asked to categorize photos by degree of retinopathy, Turkers improved with changes in the Turk interface, particularly with increasing prior approval rating needed to perform the HITs. The number of graders required to reach a reliable “consensus grade” was consistent across categories, and 10 grading instances per image was established as sufficient for valid results.
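The screening metrics discussed above can be written out explicitly. The following is a minimal sketch, not the study's analysis code: the function name and the example labels are hypothetical, and it assumes each image has a binary consensus call and a binary reference grade (True = abnormal).

```python
def sensitivity_specificity(consensus, reference):
    """Return (sensitivity, specificity) of consensus calls against reference grades.

    Both arguments are equal-length sequences of booleans, True meaning abnormal.
    """
    if len(consensus) != len(reference):
        raise ValueError("consensus and reference must have the same length")
    tp = fn = tn = fp = 0
    for predicted, actual in zip(consensus, reference):
        if actual and predicted:
            tp += 1   # abnormal image flagged as abnormal
        elif actual:
            fn += 1   # abnormal image missed (false negative)
        elif predicted:
            fp += 1   # normal image flagged as abnormal
        else:
            tn += 1   # normal image passed as normal
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one abnormal and one normal reference image")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical example: no false negatives, one false positive.
reference = [True, True, True, False, False]
consensus = [True, True, True, True, False]
sens, spec = sensitivity_specificity(consensus, reference)
```

With no false negative consensus results, the sensitivity term is 1.0 regardless of how many normal images are over-called, which is the property that matters most for a screening instrument.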
Images were interpreted for a total cost of US $1.10 per eye. While abnormal images would currently still require further evaluation to identify patients in need of a live examination, this cost is near the limit suggested by some investigators for feasible public health screening in developing economies. Indeed, the reimbursement per image, which corresponded to an effective hourly wage of US $6.60 to $14.31 in our study, could possibly be reduced, since a reasonable target hourly wage for AMT workers is the federal minimum wage of US $7.25 per hour or even lower [ ]. Additionally, posting larger batches of images might allow for lower per-image reimbursement, since Turkers would likely work faster as they become more skilled at the task, allowing them to maintain the same effective wage. While there may not be a direct relationship between quality responses and high wages [ ], there may be a link between high wages and rapid completion of image grading, so it may not be wise to dramatically reduce reimbursement.
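The relationship between per-image reimbursement, grading speed, and effective hourly wage is simple arithmetic, sketched below. This is an illustration only: the function names, the reward, and the timings are hypothetical placeholders, and only the US $7.25 minimum wage figure is taken from the text.

```python
FEDERAL_MINIMUM_WAGE = 7.25  # US dollars per hour, as cited in the text


def effective_hourly_wage(reward_per_hit, seconds_per_hit):
    """Hourly earnings implied by a per-HIT reward and the time one HIT takes."""
    if seconds_per_hit <= 0:
        raise ValueError("seconds_per_hit must be positive")
    return reward_per_hit * 3600.0 / seconds_per_hit


def reward_for_target_wage(target_hourly_wage, seconds_per_hit):
    """Per-HIT reward that yields the target hourly wage at a given pace."""
    if seconds_per_hit <= 0:
        raise ValueError("seconds_per_hit must be positive")
    return target_hourly_wage * seconds_per_hit / 3600.0


# Hypothetical pace: a US $0.10 reward graded in 30 seconds.
wage_at_30s = effective_hourly_wage(0.10, 30)

# If practice cuts grading time to 20 seconds, a smaller reward
# preserves the same effective wage.
reward_at_20s = reward_for_target_wage(wage_at_30s, 20)

# Reward needed to meet the federal minimum wage at the 30-second pace.
minimum_reward_at_30s = reward_for_target_wage(FEDERAL_MINIMUM_WAGE, 30)
```

Under these assumed numbers, a faster grader earns the same hourly wage at roughly two-thirds of the original reward, which is the mechanism by which larger batches might lower the per-image cost.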
A post hoc analysis of individual Turkers’ scores revealed inconsistent use of the four grading categories by one Turker in the first batch of Phase III (). Several issues are brought to light by considering this specific batch. First, inconsistent use of the categories was a rare occurrence, suggesting that Turkers are generally conscientious workers. This was also evident from comments made by Turkers as they completed HITs, which included thoughtful suggestions for improvement to the interface and concern over HITs that were felt to be ambiguous. Second, using the mean of crowdsourced responses may generate outputs that are sensitive to outliers. For this reason, using the mode to calculate consensus is generally preferable, though some images may not have a single mode ( ), in which case the higher score of any “tie” would be used clinically. Third, and more broadly, AMT may be susceptible to Turker accounts that attempt to take advantage of the system by rapidly completing HITs with random responses, whether submitted by live individuals or by automated programs or “bots” [ ]. Moving forward, it may be necessary to analyze raw Turker scores for such phenomena and perhaps develop methods to exclude systematically unreliable scores.
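The consensus rule described above (mode, with ties resolved toward the higher score) and a simple screen for unreliable graders can be sketched as follows. This is an assumed illustration rather than the authors' code: the integer coding of the four categories (0 = normal through 3 = most severe), the function names, and the example data are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def mode_consensus(grades):
    """Most frequent grade; ties are resolved toward the higher grade."""
    if not grades:
        raise ValueError("at least one grade is required")
    counts = Counter(grades)
    top = max(counts.values())
    return max(grade for grade, count in counts.items() if count == top)


def agreement_by_worker(responses):
    """Fraction of each worker's grades that match the per-image mode consensus.

    responses is a list of (worker_id, image_id, grade) tuples.
    """
    by_image = defaultdict(list)
    for _, image, grade in responses:
        by_image[image].append(grade)
    consensus = {image: mode_consensus(g) for image, g in by_image.items()}
    matches, totals = defaultdict(int), defaultdict(int)
    for worker, image, grade in responses:
        totals[worker] += 1
        matches[worker] += int(grade == consensus[image])
    return {worker: matches[worker] / totals[worker] for worker in totals}


# Ten hypothetical grades for one normal image, two of them outliers.
grades = [0, 0, 0, 0, 0, 0, 0, 0, 3, 3]
mean_grade = mean(grades)            # pulled upward by the two outliers
mode_grade = mode_consensus(grades)  # unaffected by the outliers

# A 5-5 tie resolves to the higher grade.
tied_grade = mode_consensus([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
```

In this example the mean would round to a grade of 1 and flag a normal image, while the mode stays at 0. Workers with a persistently low value from `agreement_by_worker` would be candidates for closer review, although agreement with consensus is only one possible reliability measure.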
Since AMT is a reputation-based economy, Requesters can reject or block Turkers who are not performing appropriately. Both actions negatively impact the Turkers’ reputation, which in turn affects their ability to perform HITs in the future, so there is a strong incentive to perform tasks accurately and honestly. This is likely why increasing the required prior HIT approval rating to 99% had the most dramatic impact on consensus accuracy. Adding the “Photo Moderation Master” qualification did not improve worker consensus, possibly because the criteria Amazon uses to grant this qualification are not relevant to our task. Additionally, since only a fraction of Turkers have the qualification, requiring it reduces the available workforce, which can increase the time required to complete batches. Especially when factoring in the additional Amazon commission, use of the Master qualification may not be necessary or cost-effective for these types of tasks in the future.
The current study was limited to a small set of hand-picked, high-quality, mydriatic fundus photos chosen to illustrate the key findings in diabetic retinopathy. Screening populations might have a subset of low-quality or uninterpretable images and would also be expected to have far more images of normal fundi. Identifying pathology in such sets would require extra vigilance on the part of Turkers to detect mild disease within large groups of normal photos. Larger datasets with more subtle pathology need to be tested with this methodology. Additionally, analyzing whether iterative improvements to the interface lead to better results is confounded by the possibility that Turkers may have previously been exposed to the task and may be improving in their grading through practice. Such a practice effect is unlikely, however, because the Turkers receive no feedback on their grading, so they do not know whether they have graded images correctly. Moreover, while it is not currently feasible to “block” Turkers who have previously completed any of our HITs, it is possible to view their grading history within the project. Surprisingly, throughout all batches posted, most of our HITs were completed by Turkers otherwise naïve to our project. In the final batch of 190 HITs posted for this project, after approximately 3000 HITs using the same images had been posted, 170 (89.5%) were completed by Turkers who had never done any of our HITs before, and 20 HITs were done by three individual Turkers who had each graded only four images previously (data not shown). In future larger batches, adjusting for individual graders’ improvement over time could become necessary.
While further refinement of the methodology is still needed to resolve the degree of retinopathy, the current model could possibly be used as a way to reduce the burden on skilled graders by eliminating most of the normal images in very large screenings and passing only those abnormal images on for further characterization. While the individuals who complete HITs on AMT are truly anonymous, they do have unique numerical identifiers and can be tracked across HITs and batches of HITs. Therefore, an intriguing possibility using a crowdsourcing interface could include developing a cadre of specific Turkers who demonstrate a track record of reliable grading. These graders might be capable of a higher level of categorization than the general pool of Turkers and could be recruited for more complex grading tasks. Additionally, it is likely that automated computer algorithms will also play a role in the analysis of fundus images for DR and other conditions in the future. This raises the possibility of an even more robust interaction between artificial intelligence and human intelligence. Images could be graded in a tiered fashion by one system, and then those graded ambiguously could be routed to the other for further validation.
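The tiered workflow proposed above can be sketched as a small routing rule. This is a hypothetical illustration, not a validated protocol: the label strings, the destination names, and the 0.8 agreement threshold are assumptions chosen for the example.

```python
from collections import Counter


def route_image(labels, min_agreement=0.8):
    """Decide where an image goes after first-tier grading.

    labels is a list of "normal" / "abnormal" calls from first-tier graders.
    min_agreement (assumed value) is the share of graders that must agree
    for the first-tier result to be trusted; keep it above 0.5 so that
    evenly split votes are always sent for further review.
    """
    if not labels:
        raise ValueError("at least one label is required")
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    if agreement < min_agreement:
        return "second_tier_review"   # ambiguous: send to the other system
    if label == "abnormal":
        return "skilled_grader"       # confident abnormal: characterize further
    return "screened_normal"          # confident normal: removed from the queue


clear_normal = route_image(["normal"] * 10)
clear_abnormal = route_image(["abnormal"] * 9 + ["normal"])
ambiguous = route_image(["normal"] * 6 + ["abnormal"] * 4)
```

The same rule applies whichever system grades first: a crowd could pass ambiguous images to an automated algorithm, or an algorithm could pass its low-confidence cases to the crowd.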
An unanticipated benefit of such a crowdsourcing program is that it might raise awareness of diabetes and diabetic retinopathy. Since our interface allowed Turkers to leave feedback for us to refine the instrument, we were able to capture comments such as, “I have learn about diabetes little bit [sic]”, “I really liked seeing the pics of the eye, very interesting”, and “This HIT was very good and a nice break from all of the bubbling surveys. Thank you!”, suggesting an interest in the subject matter beyond other HITs and beyond what we had expected at the outset. This finding is consistent with what has been termed “Virtual Citizen Science” in fields outside of biomedical research.
Crowdsourcing represents a novel and inexpensive means to rapidly identify diabetic retinopathy. Further refinements of the technique are required, as is external validation with larger image sets. Additionally, multiple medico-legal and ethical issues would need to be addressed prior to clinical use of this technology, but there may be a role for crowdsourcing medical imaging data in large public health screenings and other settings in the future.
This research was supported by the J. Arch McNamara, MD Memorial Fund for Retina Research and Education at Wills Eye Hospital. The sponsor had no role in the design, conduct, or decision to publish the research, nor did they have any role in the review or approval of the manuscript.
This manuscript was presented at the American Society of Retina Specialists Annual Meeting, August 2014.
Conflicts of Interest
- Beck RW. The burgeoning public health impact of diabetes: the role of the ophthalmologist. Arch Ophthalmol 2011 Feb;129(2):225-229.
- Javitt JC, Canner JK, Frank RG, Steinwachs DM, Sommer A. Detecting and treating retinopathy in patients with type I diabetes mellitus. A health policy model. Ophthalmology 1990 Apr;97(4):483-94; discussion 494.
- Javitt JC, Aiello LP. Cost-effectiveness of detecting and treating diabetic retinopathy. Ann Intern Med 1996 Jan 1;124(1 Pt 2):164-169.
- Jones S, Edwards RT. Diabetic retinopathy screening: a systematic review of the economic evidence. Diabet Med 2010 Mar;27(3):249-256.
- Schoenfeld ER, Greene JM, Wu SY, Leske MC. Patterns of adherence to diabetes vision care guidelines: baseline findings from the Diabetic Retinopathy Awareness Program. Ophthalmology 2001 Mar;108(3):563-571.
- Paz SH, Varma R, Klein R, Wu J, Azen SP, Los Angeles Latino Eye Study Group. Noncompliance with vision care guidelines in Latinos with type 2 diabetes mellitus: the Los Angeles Latino Eye Study. Ophthalmology 2006 Aug;113(8):1372-1377.
- Lee PP, Feldman ZW, Ostermann J, Brown DS, Sloan FA. Longitudinal rates of annual eye examinations of persons with diabetes and chronic eye diseases. Ophthalmology 2003 Oct;110(10):1952-1959.
- Fong DS, Sharza M, Chen W, Paschal JF, Ariyasu RG, Lee PP. Vision loss among diabetics in a group model Health Maintenance Organization (HMO). Am J Ophthalmol 2002 Feb;133(2):236-241.
- Sharp PF, Olson J, Strachan F, Hipwell J, Ludbrook A, O’Donnell M, et al. The value of digital imaging in diabetic retinopathy. Health Technol Assess 2003;7(30):1-119.
- Scanlon PH. The English national screening programme for sight-threatening diabetic retinopathy. J Med Screen 2008;15(1):1-4.
- Ng M, Nathoo N, Rudnisky CJ, Tennant MT. Improving access to eye care: teleophthalmology in Alberta, Canada. J Diabetes Sci Technol 2009 Mar;3(2):289-296.
- Teng T, Lefley M, Claremont D. Progress towards automated diabetic ocular screening: a review of image analysis and intelligent systems for diabetic retinopathy. Med Biol Eng Comput 2002 Jan;40(1):2-13.
- Abràmoff MD, Folk JC, Han DP, Walker JD, Williams DF, Russell SR, et al. Automated analysis of retinal images for detection of referable diabetic retinopathy. JAMA Ophthalmol 2013 Mar;131(3):351-357.
- Trucco E, Ruggeri A, Karnowski T, Giancardo L, Chaum E, Hubschman JP, et al. Validating retinal fundus image analysis algorithms: issues and a proposal. Invest Ophthalmol Vis Sci 2013 May;54(5):3546-3559.
- Brabham DC. Crowdsourcing. The MIT Press Essential Knowledge Series. Cambridge, MA: MIT Press; 2013.
- Brabham DC, Ribisl KM, Kirchner TR, Bernhardt JM. Crowdsourcing applications for public health. Am J Prev Med 2014 Feb;46(2):179-187.
- Luengo-Oroz MA, Arranz A, Frean J. Crowdsourcing malaria parasite quantification: an online game for analyzing images of infected thick blood smears. J Med Internet Res 2012;14(6):e167.
- Mitry D, Peto T, Hayat S, Morgan JE, Khaw KT, Foster PJ. Crowdsourcing as a novel technique for retinal fundus photography classification: analysis of images in the EPIC Norfolk cohort on behalf of the UK Biobank Eye and Vision Consortium. PLoS One 2013;8(8):e71154.
- Li HK, Horton M, Bursell SE, Cavallerano J, Zimmer-Galler I, Tennant M, American Telemedicine Association Diabetic Retinopathy Telehealth Practice Recommendations Working Group, et al. Telehealth practice recommendations for diabetic retinopathy, second edition. Telemed J E Health 2011 Dec;17(10):814-837.
- Amazon Mechanical Turk. URL: https://www.mturk.com/mturk/welcome [accessed 2014-08-25]
- Ilakkuvan V, Tacelosky M, Ivey KC, Pearson JL, Cantrell J, Vallone DM, et al. Cameras for Public Health Surveillance: A Methods Protocol for Crowdsourced Annotation of Point-of-Sale Photographs. JMIR Res Protoc 2014;3(2):e22.
- Turner AM, Kirchhoff K, Capurro D. Using crowdsourcing technology for testing multilingual public health promotion materials. J Med Internet Res 2012;14(3):e79.
- Henshaw EJ. Too sick, not sick enough? Effects of treatment type and timing on depression stigma. J Nerv Ment Dis 2014 Apr;202(4):292-299.
- Halkjelsvik T. Do disgusting and fearful anti-smoking advertisements increase or decrease support for tobacco control policies? Int J Drug Policy 2014 Jul;25(4):744-747.
- Carter RR, DiFeo A, Bogie K, Zhang GQ, Sun J. Crowdsourcing awareness: exploration of the ovarian cancer knowledge gap through Amazon Mechanical Turk. PLoS One 2014;9(1):e85508.
- Swan M. Crowdsourced health research studies: an important emerging complement to clinical trials in the public health research ecosystem. J Med Internet Res 2012;14(2):e46.
- Good BM, Loguercio S, Griffith OL, Nanis M, Wu C, Su AI. The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction. JMIR Serious Games 2014 Jul 29;2(2):e7.
- Dasgupta N, Freifeld C, Brownstein JS, Menone CM, Surratt HL, Poppish L, et al. Crowdsourcing black market prices for prescription opioids. J Med Internet Res 2013;15(8):e178.
- Martin D, Hanrahan BV, O’Neill J. Being a turker. In: Proceedings of the 17th ACM conference on Computer supported cooperative work social computing. 2014 Presented at: 17th ACM conference on Computer supported cooperative work social computing; February 15-19, 2014; Baltimore, Maryland, USA.
- Ipeirotis PG. CeDER Working Papers-10-01. 2010. Demographics of mechanical turk. URL: http://hdl.handle.net/2451/29585 [accessed 2014-09-26]
- Shotliff K, Duncan G. Diabetic retinopathy: summary of grading and management criteria. Pract Diab Int 2006 Nov;23(9):418-420.
- Quenouille MH. Notes on bias in estimation. Biometrika 1956;43(3-4):353-360.
- Amazon Mechanical Turk Frequently Asked Questions. URL: https://requester.mturk.com/help/faq [accessed 2014-08-25]
- Friedman DS, Ali F, Kourgialis N. Diabetic retinopathy in the developing world: how to approach identifying and treating underserved populations. Am J Ophthalmol 2011 Feb;151(2):192-4.e1.
- Horton JJ, Chilton LB. The labor economics of paid crowdsourcing. In: Proceedings of the 11th ACM conference on Electronic commerce. 2010 Presented at: 11th ACM conference on Electronic commerce; June 7-11, 2010; Cambridge, Massachusetts, USA.
- Mason W, Watts DJ. Financial incentives and the “performance of crowds”. In: Proceedings of the ACM SIGKDD Workshop on Human Computation. 2009 Presented at: 2009 ACM SIGKDD Workshop on Human Computation; June 28-July 1, 2009; Paris, France.
- Reed J, Raddick MJ, Lardner A. An exploratory factor analysis of motivations for participating in Zooniverse, a collection of virtual citizen science projects. In: 46th Hawaii International Conference on System Sciences (HICSS). 2013 Presented at: 46th Hawaii International Conference on System Sciences (HICSS); January 7-10, 2013; Wailea, Maui, HI.
Edited by G Eysenbach; submitted 25.08.14; peer-reviewed by D Brabham, G Leontidis; comments to author 10.09.14; revised version received 15.09.14; accepted 16.09.14; published 30.10.14
©Christopher J Brady, Andrea C Villanti, Jennifer L Pearson, Thomas R Kirchner, Omesh P Gupta, Chirag P Shah. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 30.10.2014.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.