Intelligent and Converged Networks 2022, 3(4): 325-339 https://doi.org/10.23919/ICN.2022.0026

Open Access | Issue | Published: 30 December 2022

IoT data cleaning techniques: A survey

Xiaoou Ding¹, Hongzhi Wang¹, Genglong Li², Haoxuan Li¹, Yingze Li¹, Yida Liu¹

¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

² School of Mechatronics Engineering, Harbin Institute of Technology, Harbin 150001, China

Keywords:

Internet of Things (IoT), data quality, data cleaning, error detection, data repairing

Cite this article:

Ding X, Wang H, Li G, et al. IoT data cleaning techniques: A survey. Intelligent and Converged Networks, 2022, 3(4): 325-339. https://doi.org/10.23919/ICN.2022.0026


Abstract

Data cleaning is considered an effective approach to improving data quality, allowing practitioners and researchers to focus on downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time series characteristics: error detection and data repairing. For error detection, it categorizes quantitative methods, which detect single-point errors, continuous errors, and errors in multidimensional time series, and qualitative methods, which detect rule-violating errors. For data repairing, it describes statistics-based, rule-based, and human-involved techniques in detail. We review the strengths and limitations of current data cleaning techniques in IoT data applications and conclude with an outlook on the future of IoT data cleaning.



About this article


Publication history

Received: 16 September 2022

Revised: 18 November 2022

Accepted: 17 December 2022

Published: 30 December 2022

Issue date: December 2022


Acknowledgment

This work was supported by the National Key Research and Development Program of China (No. 2021YFB3300502), the National Natural Science Foundation of China (NSFC) (Nos. 62202126 and 62232005), and the Heilongjiang Postdoctoral Financial Assistance (No. LBH-Z21137).

Rights and permissions

This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/