Data cleaning is regarded as an effective approach to improving data quality, allowing practitioners and researchers to focus on downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time series characteristics: error data detection and data repairing. For error data detection, it gives a categorized overview of quantitative methods, which detect single-point errors, continuous errors, and multidimensional time series data errors, and of qualitative methods, which detect rule-violating errors. It also describes error data repairing techniques in detail, covering statistics-based repairing, rule-based repairing, and human-involved repairing. We review the strengths and limitations of current data cleaning techniques in IoT data applications and conclude with an outlook on the future of IoT data cleaning.
This work was supported by the National Key Research and Development Program of China (No. 2021YFB3300502), National Natural Science Foundation of China (NSFC) (Nos. 62202126 and 62232005), and Heilongjiang Postdoctoral Financial Assistance (No. LBH-Z21137).
This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/