Volume 21, Issue 4




Taiga: Performance Optimization of the C4.5 Decision Tree Construction Algorithm

Yi Yang, Wenguang Chen
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Yinzhou 315100, China.

Abstract

Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to it. RainForest is a scalable framework for implementing decision tree construction algorithms. It comprises several algorithms, the best of which is a hybrid of a traditional recursive implementation and an iterative implementation that uses more memory but performs fewer write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated criterion for switching between the two algorithms, we obtain a performance gain even when all statistical information fits in memory. Evaluations show that our method achieves an average speedup of 2.8 times over the traditional recursive implementation.

Keywords: machine learning, performance optimization, C4.5, RainForest, decision trees
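The hybrid strategy described in the abstract can be illustrated with a minimal sketch: build a decision tree either recursively (depth-first) or iteratively (breadth-first, queue-based), choosing between the two with a memory-style criterion. Everything here is illustrative: the function names, the entropy-based split, and the `budget` heuristic are assumptions for exposition, not the paper's actual algorithm or the RainForest implementation.

```python
import math
from collections import Counter, deque

def entropy(labels):
    # Shannon entropy of a class-label list.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attr(rows, labels, attrs):
    # Pick the attribute whose partition minimizes weighted entropy.
    best = None
    for a in attrs:
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        w = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
        if best is None or w < best[1]:
            best = (a, w)
    return best[0]

def build_recursive(rows, labels, attrs):
    # Classic depth-first construction (the "traditional" strategy).
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    a = best_attr(rows, labels, attrs)
    rest = tuple(x for x in attrs if x != a)
    node = {"attr": a, "children": {}}
    for v in set(r[a] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        node["children"][v] = build_recursive(
            [r for r, _ in sub], [y for _, y in sub], rest)
    return node

def build_breadth_first(rows, labels, attrs):
    # Iterative, queue-based construction: processes nodes level by level,
    # holding more state in memory but touching each node once.
    root = {}
    q = deque([(root, "root", rows, labels, attrs)])
    while q:
        parent, key, rws, lbs, ats = q.popleft()
        if len(set(lbs)) == 1 or not ats:
            parent[key] = Counter(lbs).most_common(1)[0][0]
            continue
        a = best_attr(rws, lbs, ats)
        rest = tuple(x for x in ats if x != a)
        node = {"attr": a, "children": {}}
        parent[key] = node
        for v in set(r[a] for r in rws):
            sub = [(r, y) for r, y in zip(rws, lbs) if r[a] == v]
            q.append((node["children"], v,
                      [r for r, _ in sub], [y for _, y in sub], rest))
    return root["root"]

def build_hybrid(rows, labels, attrs, budget):
    # Hypothetical switching criterion: use the memory-hungry iterative
    # strategy only when the estimated statistics size fits the budget.
    if len(rows) * len(attrs) <= budget:
        return build_breadth_first(rows, labels, attrs)
    return build_recursive(rows, labels, attrs)

def classify(tree, row):
    # Walk the tree until a leaf label is reached.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]])
    return tree
```

Both builders yield equivalent trees; the point of the hybrid is that the switch happens per subtree based on available memory, which this sketch collapses into a single top-level check for brevity.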


Publication history

Received: 04 June 2015
Accepted: 11 July 2015
Published: 11 August 2016
Issue date: August 2016

Copyright

© The author(s) 2016
