AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
PDF (1.3 MB)
Collect
Submit Manuscript AI Chat Paper
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Open Access

Taiga: Performance Optimization of the C4.5 Decision Tree Construction Algorithm

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Yinzhou 315100, China.
Show Author Information

Abstract

Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.

References

[1]
Han J. and Kamber M., Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann, 2001.
[2]
Safavian S. R. and Landgrebe D., A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man and Cybernetics, vol. 21, no. 3, pp. 660–674, 1991.
[3]
Cheng J., Fayyad U. M., Irani K. B., and Qian Z., Improved decision trees: A generalized version of ID3, in Proc. Fifth Int. Conf. Machine Learning, 1988, pp. 100-107
[4]
Quinlan J. R., Learning efficient classification procedures and their application to chess end games, in Machine Learning. Springer, 1983, pp. 463-482
[5]
Quinlan J. R., Induction of decision trees, Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[6]
Quinlan J. R., C4.5: Programs for Machine Learning, vol. 1. Morgan Kaufmann, 1993.
[7]
Quinlan J. R., Data mining tools see5 and C5.0, http://www.rulequest.com/see5-info.html, 2004.
[8]
Breiman L., Friedman J., Stone C. J., and Olshen R. A., Classification and Regression Trees. CRC Press, 1984.
[9]
Kass G. V., An exploratory technique for investigating large quantities of categorical data, Applied Statistics, vol. 29, no. 2, pp. 119–127, 1980.
[10]
Loh W.-Y. and Vanichsetakul N., Tree-structured classification via generalized discriminant analysis, Journal of the American Statistical Association, vol. 83, no. 403, pp. 715–725, 1988.
[11]
Mehta M., Rissanen J., and Agrawal R., Mdl-based decision tree pruning, KDD, vol. 21, pp. 216–221, 1995.
[12]
Gehrke J., Ramakrishnan R., and Ganti V., Rainforest-a framework for fast decision tree construction of large datasets, VLDB, vol. 98, pp. 416–427, 1998.
[13]
Gehrke J., Ganti V., Ramakrishnan R., and Loh W.-Y., Boat–optimistic decision tree construction, ACM SIGMOD Record, vol. 28, pp. 169–180, 1999.
[14]
Jin R. and Agrawal G., Communication and memory efficient parallel decision tree construction, SDM, pp. 119–129, 2003.
[15]
Ruggieri S., Efficient C4.5, Knowledge and Data Engineering, IEEE Transactions on, vol. 14, no. 2, pp. 438–444, 2002.
[16]
Joshi M. V., Karypis G., and Kumar V., Scalparc: A new scalable and efficient parallel classification algorithm for mining large datasets, in Parallel Processing Symposium, 1998. IPPS/SPDP 1998. Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing 1998, 1998, pp. 573-579
[17]
Quinlan J. R., Bagging, boosting, and C4.5, AAAI/IAAI, vol. 1, pp. 725–730, 1996.
[18]
Agrawal R., Ghosh S., Imielinski T., Iyer B., and Swami A., An interval classifier for database mining applications, in Proc. of the VLDB Conference, 1992, pp. 560-573.
[19]
Agrawal R., Imielinski T., and Swami A., Database mining: A performance perspective, Knowledge and Data Engineering, IEEE Transactions on, vol. 5, no. 6, pp. 914–925, 1993.
[20]
Shafer J., Agrawal R., and Mehta M., Sprint: A scalable parallel classifier for data mining, in Proc. 1996 Int. Conf. Very Large Data Bases, 1996, pp. 544-555.
[21]
Mehta M., Agrawal R., and Rissanen J., Sliq: A fast scalable classifier for data mining, in Advances in Database Technology EDBT’96, 1996, pp. 18-32
Tsinghua Science and Technology
Pages 415-425
Cite this article:
Yang Y, Chen W. Taiga: Performance Optimization of the C4.5 Decision Tree Construction Algorithm. Tsinghua Science and Technology, 2016, 21(4): 415-425. https://doi.org/10.1109/TST.2016.7536719

619

Views

16

Downloads

16

Crossref

N/A

Web of Science

19

Scopus

4

CSCD

Altmetrics

Received: 04 June 2015
Accepted: 11 July 2015
Published: 11 August 2016
© The author(s) 2016
Return