Quality assessment in crowdsourced classification tasks

Qiong Bu, Elena Simperl, Adriane Chapman, Eddy Maddalena
School of Electronics and Computer Science, University of Southampton, Southampton, UK

Abstract

Purpose

Ensuring quality is one of the most significant challenges in microtask crowdsourcing. Aggregating the data collected from the crowd is an important step in inferring the correct answer, but existing studies are largely limited to single-step tasks. This study examines multiple-step classification tasks and how aggregation behaves in such cases, which makes it useful for assessing classification quality.

Design/methodology/approach

The authors present a model that captures the workflow, questions and answers of both single- and multiple-question classification tasks. They propose an approach adapted from the classic aggregation approach so that tasks consisting of several multiple-choice questions can be handled in general, rather than only a specific domain or a particular hierarchical classification. They evaluate the approach on three representative tasks from existing citizen science projects for which an expert-created gold standard is available.
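
The article describes the model only in prose; the sketch below (Python, with hypothetical names such as Step, Workflow and aggregate_per_step) illustrates one way a multiple-question workflow and the crowd's answers could be represented, using independent per-step majority voting as the classic baseline that an adapted approach would build on. It is an assumption-laden illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Step:
    """One multiple-choice question in a classification workflow."""
    question: str
    options: List[str]
    # Hypothetical routing: which option leads to which next step (None = stop).
    next_step: Dict[str, Optional[str]] = field(default_factory=dict)

@dataclass
class Workflow:
    """A single- or multiple-question classification task."""
    steps: Dict[str, Step]
    first_step: str

def aggregate_per_step(answers: List[Dict[str, str]]) -> Dict[str, str]:
    """Classic baseline: independent majority vote on each step.

    `answers` holds one dict per worker, mapping step id -> chosen option;
    workers may omit steps they never reached in the workflow.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for worker_answers in answers:
        for step_id, option in worker_answers.items():
            votes[step_id][option] += 1
    return {step_id: counts.most_common(1)[0][0]
            for step_id, counts in votes.items()}

# Hypothetical two-step task loosely inspired by galaxy classification.
workflow = Workflow(
    steps={
        "shape": Step("Is the object smooth or featured?",
                      ["smooth", "featured"],
                      {"smooth": "roundness", "featured": None}),
        "roundness": Step("How round is it?",
                          ["completely", "in between", "cigar-shaped"]),
    },
    first_step="shape",
)
crowd = [
    {"shape": "smooth", "roundness": "completely"},
    {"shape": "smooth", "roundness": "in between"},
    {"shape": "featured"},
    {"shape": "smooth", "roundness": "completely"},
]
print(aggregate_per_step(crowd))  # {'shape': 'smooth', 'roundness': 'completely'}
```

The routing information is what distinguishes a multi-step workflow from a flat set of independent questions: later questions are only seen by workers who chose particular options earlier, which is one reason a per-step baseline alone may be insufficient and why the workflow structure is worth modelling explicitly.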

Findings

The results show that the approach provides significant improvements in overall classification accuracy. The analysis also demonstrates that, for the same task, all algorithms achieve higher accuracy on volunteer-generated data sets than on paid-generated ones. Furthermore, the authors observed interesting patterns in the relationship between the performance of the different algorithms and workflow-specific factors, including the number of steps and the number of options available at each step.

Originality/value

Due to the nature of crowdsourcing, aggregating the collected data is an essential process for understanding the quality of crowdsourcing results. Various inference algorithms have been studied for simple microtasks consisting of a single question with two or more answer options. However, because classification tasks typically contain many questions, the proposed method can be applied to a much wider range of tasks, covering both single- and multiple-question classification.
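
For readers unfamiliar with the single-question setting the article builds on: a classic inference algorithm for such microtasks is the expectation-maximisation model of Dawid and Skene (1979), which jointly estimates item labels and a per-worker confusion matrix. The following is a minimal, generic sketch of that idea in Python/NumPy, with hypothetical function and variable names; it is offered for context only and is not the method evaluated in the article.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal Dawid-Skene EM for single-question aggregation.

    `labels[i]` is a list of (worker_id, answer) pairs for item i,
    with answers in 0..n_classes-1. Returns per-item class posteriors.
    """
    n_items = len(labels)
    n_workers = 1 + max(w for item in labels for w, _ in item)

    # Initialise item posteriors with per-item vote fractions.
    post = np.zeros((n_items, n_classes))
    for i, item in enumerate(labels):
        for _, a in item:
            post[i, a] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and per-worker confusion matrices.
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for i, item in enumerate(labels):
            for w, a in item:
                conf[w, :, a] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item posteriors from the workers' answers.
        log_post = np.tile(np.log(prior), (n_items, 1))
        for i, item in enumerate(labels):
            for w, a in item:
                log_post[i] += np.log(conf[w, :, a])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post

# Three workers label four items with classes {0, 1}; worker 2 is noisier.
data = [
    [(0, 1), (1, 1), (2, 0)],
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 0), (1, 0), (2, 1)],
]
print(dawid_skene(data, n_classes=2).argmax(axis=1))  # [1 0 1 0]
```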

Keywords: Classification, Aggregation, Task-oriented crowdsourcing, Human computation, Quality assessment


Publication history

Received: 22 June 2019
Revised: 16 August 2019
Accepted: 20 August 2019
Published: 17 October 2019
Issue date: December 2019

Copyright

© The author(s)

Rights and permissions

Qiong Bu, Elena Simperl, Adriane Chapman and Eddy Maddalena. Published in International Journal of Crowd Science. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
