Volume 3, Issue 4

IoT data cleaning techniques: A survey

Xiaoou Ding¹, Hongzhi Wang¹, Genglong Li², Haoxuan Li¹, Yingze Li¹, Yida Liu¹
¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
² School of Mechatronics Engineering, Harbin Institute of Technology, Harbin 150001, China

Abstract

Data cleaning is an effective approach to improving data quality, allowing practitioners and researchers to focus on downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time series characteristics: error data detection and data repairing. For error data detection, it gives a categorized overview of quantitative methods, which detect single-point errors, continuous errors, and multidimensional time series data errors, and of qualitative methods, which detect rule-violating errors. It also provides a detailed description of error data repairing techniques, covering statistics-based repairing, rule-based repairing, and human-involved repairing. We review the strengths and limitations of current data cleaning techniques in IoT data applications and conclude with an outlook on the future of IoT data cleaning.

Keywords: Internet of Things (IoT), data quality, data cleaning, error detection, data repairing

Publication history

Received: 16 September 2022
Revised: 18 November 2022
Accepted: 17 December 2022
Published: 30 December 2022
Issue date: December 2022

Copyright

© All articles included in the journal are copyrighted to the ITU and TUP.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2021YFB3300502), the National Natural Science Foundation of China (NSFC) (Nos. 62202126 and 62232005), and the Heilongjiang Postdoctoral Financial Assistance (No. LBH-Z21137).

Rights and permissions

This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/
