Volume 21, Issue 4




Taiga: Performance Optimization of the C4.5 Decision Tree Construction Algorithm

Yi Yang, Wenguang Chen
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Yinzhou 315100, China.

Abstract

Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to it. RainForest is a scalable framework for implementing decision tree construction algorithms. It comprises several algorithms, the best of which is a hybrid of a traditional recursive implementation and an iterative implementation that uses more memory but performs fewer write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated criterion for switching between the two algorithms, we obtain a performance gain even when all statistical information fits in memory. Evaluations show that our method achieves an average speedup of 2.8 times over the traditional recursive implementation.

Keywords: machine learning, performance optimization, C4.5, RainForest, decision trees
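The hybrid strategy described in the abstract can be illustrated with a minimal sketch: build a decision tree either recursively (depth-first) or iteratively (breadth-first, queue-based), choosing between the two with a memory-style criterion. Everything here is illustrative: the function names, the entropy-based split, and the `budget` heuristic are assumptions for exposition, not the paper's actual algorithm or the RainForest implementation.

```python
import math
from collections import Counter, deque

def entropy(labels):
    # Shannon entropy of a class-label list.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attr(rows, labels, attrs):
    # Pick the attribute whose partition minimizes weighted entropy.
    best = None
    for a in attrs:
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        w = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
        if best is None or w < best[1]:
            best = (a, w)
    return best[0]

def build_recursive(rows, labels, attrs):
    # Classic depth-first construction (the "traditional" strategy).
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    a = best_attr(rows, labels, attrs)
    rest = tuple(x for x in attrs if x != a)
    node = {"attr": a, "children": {}}
    for v in set(r[a] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        node["children"][v] = build_recursive(
            [r for r, _ in sub], [y for _, y in sub], rest)
    return node

def build_breadth_first(rows, labels, attrs):
    # Iterative, queue-based construction: processes nodes level by level,
    # holding more state in memory but touching each node once.
    root = {}
    q = deque([(root, "root", rows, labels, attrs)])
    while q:
        parent, key, rws, lbs, ats = q.popleft()
        if len(set(lbs)) == 1 or not ats:
            parent[key] = Counter(lbs).most_common(1)[0][0]
            continue
        a = best_attr(rws, lbs, ats)
        rest = tuple(x for x in ats if x != a)
        node = {"attr": a, "children": {}}
        parent[key] = node
        for v in set(r[a] for r in rws):
            sub = [(r, y) for r, y in zip(rws, lbs) if r[a] == v]
            q.append((node["children"], v,
                      [r for r, _ in sub], [y for _, y in sub], rest))
    return root["root"]

def build_hybrid(rows, labels, attrs, budget):
    # Hypothetical switching criterion: use the memory-hungry iterative
    # strategy only when the estimated statistics size fits the budget.
    if len(rows) * len(attrs) <= budget:
        return build_breadth_first(rows, labels, attrs)
    return build_recursive(rows, labels, attrs)

def classify(tree, row):
    # Walk the tree until a leaf label is reached.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]])
    return tree
```

Both builders yield equivalent trees; the point of the hybrid is that the switch happens per subtree based on available memory, which this sketch collapses into a single top-level check for brevity.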


Publication history

Received: 04 June 2015
Accepted: 11 July 2015
Published: 11 August 2016
Issue date: August 2016

Copyright

© The author(s) 2016
