Volume 7, Issue 1

IoTDQ: An Industrial IoT Data Analysis Library for Apache IoTDB

Pengyu Chen 1, Wendi He 1, Wenxuan Ma 1, Xiangdong Huang 1, Chen Wang 2
1 School of Software, Tsinghua University, Beijing 100084, China
2 National Engineering Research Center for Big Data Software (NERCBDS), Tsinghua University, Beijing 100084, China

Abstract

Demand for time series data analysis is growing across industry. Apache IoTDB is a time series database designed for the Internet of Things (IoT) with enhanced storage and I/O performance. Because it supports User-Defined Functions (UDFs), computation on time series can be executed directly inside Apache IoTDB. To satisfy most of the common requirements of industrial time series analysis, we create IoTDQ, a UDF library on Apache IoTDB. The library integrates stream computation functions for data quality analysis, data profiling, anomaly detection, data repairing, and related tasks. IoTDQ enables users to conduct a wide range of analyses, such as monitoring, error diagnosis, and equipment reliability analysis, and it provides a framework for examining IoT time series with data quality problems. Experiments show that IoTDQ performs on par with mainstream alternatives while reducing I/O consumption for Apache IoTDB users.
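
As a rough illustration of the kind of anomaly detection computation the abstract mentions, the minimal Python sketch below flags outliers by their distance from the median, measured in scaled median absolute deviations (MAD). It is not IoTDQ's code or API: the function name `mad_outliers`, the threshold `k`, and the sample readings are all assumptions made for this example.

```python
from statistics import median


def mad_outliers(points, k=3.0):
    """Return the (timestamp, value) pairs whose value lies more than
    k scaled median absolute deviations away from the median.

    Hypothetical helper for illustration only; not part of IoTDQ.
    """
    values = [v for _, v in points]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the data are roughly normally distributed.
    scale = 1.4826 * mad
    return [(t, v) for t, v in points if abs(v - med) > k * scale]


# Made-up sensor readings as (timestamp, value) pairs.
series = [(1, 10.0), (2, 10.2), (3, 9.9), (4, 55.0), (5, 10.1), (6, 9.8)]
print(mad_outliers(series))
```

With these sample readings, only the point at timestamp 4 is flagged. A median-based rule is used here because a single extreme value barely shifts the median, whereas it would inflate a mean and standard deviation computed over the same window.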

Keywords: data quality, industrial big data, data mining and analytics

Publication history

Received: 30 August 2022
Revised: 27 March 2023
Accepted: 15 May 2023
Published: 25 December 2023
Issue date: March 2024

Copyright

© The author(s) 2023.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
