
A Classifier Using Online Bagging Ensemble Method for Big Data Stream Learning

Yanxia Lv, Sancheng Peng, Ying Yuan, Cong Wang, Pengfei Yin, Jiemin Liu, and Cuirong Wang
School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China.
Laboratory of Language Engineering and Computing, and School of Cyber Security, Guangdong University of Foreign Studies, Guangzhou 510006, China.
School of Information Science and Engineering, Central South University, Changsha 410083, China.

Abstract

By combining multiple weak learners, ensemble learning can achieve better generalization performance than a single learner when classifying big data streams with concept drift. In this paper, we present an efficient classifier, EoBag, which uses the online bagging ensemble method for big data stream learning. The classifier introduces an efficient online resampling mechanism for the training instances and adopts a robust coding method based on error-correcting output codes, so as to reduce the correlations between the base classifiers and increase the diversity of the ensemble. A dynamic updating model based on classification performance is used to avoid unnecessary updating operations and improve the efficiency of learning. We also implement a parallel version of EoBag, which runs faster than the serial version while achieving almost the same classification performance. Finally, we compare the classification performance and resource usage of EoBag with those of other state-of-the-art algorithms on both artificial and real-world data sets. The results show that the proposed algorithm achieves better accuracy and more feasible resource usage for big data stream classification.
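
The abstract describes the mechanism only at a high level; the sketch below illustrates the classical Oza-Russell online bagging core that such a classifier builds on, in which each base learner trains on every incoming instance k ~ Poisson(1) times to approximate bootstrap resampling over a one-pass stream. This is a minimal sketch under that assumption: the class and parameter names are illustrative, and EoBag's ECOC-based coding and performance-triggered updating are not reproduced here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

class OnlineBagging:
    # Illustrative Oza-Russell-style online bagging; not the authors'
    # exact EoBag procedure (names and defaults are assumptions).
    def __init__(self, n_learners=10, seed=0):
        self.learners = [SGDClassifier() for _ in range(n_learners)]
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y, classes):
        # Online analogue of bootstrap resampling: each base learner
        # sees the incoming instance k ~ Poisson(1) times.
        for learner in self.learners:
            for _ in range(self.rng.poisson(1.0)):
                learner.partial_fit(x.reshape(1, -1), [y], classes=classes)

    def predict(self, x):
        # Majority vote over the base learners trained so far.
        votes = [m.predict(x.reshape(1, -1))[0]
                 for m in self.learners if hasattr(m, "classes_")]
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]

# Toy usage on a synthetic stream (illustrative only).
X, y = make_classification(n_samples=500, random_state=0)
clf = OnlineBagging(n_learners=5)
for xi, yi in zip(X, y):
    clf.partial_fit(xi, yi, classes=np.unique(y))
print(clf.predict(X[0]))

In the paper's setting, the plain majority vote would be replaced by ECOC-based decoding, and the ensemble would be updated only when classification performance degrades, per the dynamic updating model the abstract describes.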

Keywords: classification, big data stream, online bagging, ensemble learning, concept drift


Publication history

Received: 07 July 2018
Accepted: 01 September 2018
Published: 07 March 2019
Issue date: August 2019

Copyright

© The author(s) 2019

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Nos. 61702089, 61876205, and 61501102), the Science and Technology Plan Project of Guangzhou (No. 201804010433), and the Bidding Project of Laboratory of Language Engineering and Computing (No. LEC2017ZBKT001).
