
A Classifier Using Online Bagging Ensemble Method for Big Data Stream Learning

Yanxia Lv, Sancheng Peng, Ying Yuan, Cong Wang, Pengfei Yin, Jiemin Liu, and Cuirong Wang
School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China.
Laboratory of Language Engineering and Computing, and School of Cyber Security, Guangdong University of Foreign Studies, Guangzhou 510006, China.
School of Information Science and Engineering, Central South University, Changsha 410083, China.

Abstract

By combining multiple weak learners, ensemble learning can achieve better generalization performance than a single learner when classifying big data streams with concept drift. In this paper, we present an efficient classifier, EoBag, which uses the online bagging ensemble method for big data stream learning. The classifier introduces an efficient online resampling mechanism for the training instances and adopts a robust coding method based on error-correcting output codes, so as to reduce the correlations between the base classifiers and increase the diversity of the ensemble. A dynamic updating model based on classification performance is used to avoid unnecessary updating operations and improve the efficiency of learning. We also implement a parallel version of EoBag, which runs faster than the serial version while achieving almost the same classification performance. Finally, we compare the classification performance and resource usage of EoBag with those of other state-of-the-art algorithms on both artificial and real-world data sets. The results show that the proposed algorithm achieves better accuracy and more feasible resource usage for big data stream classification.
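
The abstract describes the mechanism only at a high level; the sketch below illustrates the classical Oza-Russell online bagging core that such a classifier builds on, in which each base learner trains on every incoming instance k ~ Poisson(1) times to approximate bootstrap resampling over a one-pass stream. This is a minimal sketch under that assumption: the class and parameter names are illustrative, and EoBag's ECOC-based coding and performance-triggered updating are not reproduced here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

class OnlineBagging:
    # Illustrative Oza-Russell-style online bagging; not the authors'
    # exact EoBag procedure (names and defaults are assumptions).
    def __init__(self, n_learners=10, seed=0):
        self.learners = [SGDClassifier() for _ in range(n_learners)]
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y, classes):
        # Online analogue of bootstrap resampling: each base learner
        # sees the incoming instance k ~ Poisson(1) times.
        for learner in self.learners:
            for _ in range(self.rng.poisson(1.0)):
                learner.partial_fit(x.reshape(1, -1), [y], classes=classes)

    def predict(self, x):
        # Majority vote over the base learners trained so far.
        votes = [m.predict(x.reshape(1, -1))[0]
                 for m in self.learners if hasattr(m, "classes_")]
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]

# Toy usage on a synthetic stream (illustrative only).
X, y = make_classification(n_samples=500, random_state=0)
clf = OnlineBagging(n_learners=5)
for xi, yi in zip(X, y):
    clf.partial_fit(xi, yi, classes=np.unique(y))
print(clf.predict(X[0]))

In the paper's setting, the plain majority vote would be replaced by ECOC-based decoding, and the ensemble would be updated only when classification performance degrades, per the dynamic updating model the abstract describes.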

Keywords: classification, big data stream, online bagging, ensemble learning, concept drift


Publication history

Received: 07 July 2018
Accepted: 01 September 2018
Published: 07 March 2019
Issue date: August 2019

Copyright

© The author(s) 2019

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Nos. 61702089, 61876205, and 61501102), the Science and Technology Plan Project of Guangzhou (No. 201804010433), and the Bidding Project of Laboratory of Language Engineering and Computing (No. LEC2017ZBKT001).
