Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

Zhi-Xin Qi; Hong-Zhi Wang; An-Jie Wang

doi:10.1007/s11390-021-1344-6

Journal of Computer Science and Technology 2021, 36(4): 806-821 https://doi.org/10.1007/s11390-021-1344-6

Regular Paper | Published: 05 July 2021

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China

Keywords:

classification, clustering, data quality, data cleaning, model selection

Cite this article:

Qi Z-X, Wang H-Z, Wang A-J. Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation. Journal of Computer Science and Technology, 2021, 36(4): 806-821. https://doi.org/10.1007/s11390-021-1344-6

Abstract

Data quality issues have attracted widespread attention because dirty data degrades data mining and machine learning results. The relationship between data quality and result accuracy can guide both the selection of an appropriate model for data of a given quality and the decision of how much of the data to clean. However, little research has explored this relationship. Motivated by this, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification and clustering models. The experimental results show that the impact of dirty data depends on the error type, the error rate, and the data size. Based on these findings, we suggest that users apply our two proposed metrics, sensibility and data quality inflection point, to model selection and data cleaning.

Electronic supplementary material

File: jcst-36-4-806-Highlights.pdf (161.4 KB)

About this article

Publication history

Received: 31 January 2021

Accepted: 27 June 2021

Published: 05 July 2021

Issue date: July 2021
