
Novel and Efficient Randomized Algorithms for Feature Selection

Zigeng Wang, Xia Xiao, and Sanguthevar Rajasekaran
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.

Abstract

Feature selection is a crucial problem in efficient machine learning, and it also contributes greatly to the explainability of machine-driven decisions. Methods such as decision trees and the Least Absolute Shrinkage and Selection Operator (LASSO) can select features during training. However, these embedded approaches apply only to a small subset of machine learning models. Wrapper-based methods can select features independently of the underlying machine learning model, but they often suffer from a high computational cost. To enhance their efficiency, many randomized algorithms have been designed. In this paper, we propose automatic breadth searching and attention searching adjustment approaches to further speed up randomized wrapper-based feature selection. We provide a theoretical analysis of the computational complexity of our algorithms and explain their generic parallelizability. We conduct experiments on both synthetic and real datasets with different machine learning base models. The results show that, compared with existing approaches, our proposed techniques can locate a more meaningful set of features with high efficiency.

Keywords: feature selection, randomized algorithms, efficient selection
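
To make the distinction the abstract draws concrete, the following minimal Python sketch contrasts embedded selection (LASSO zeroing out coefficients during training) with a generic randomized wrapper search that treats the base model as a black box. It assumes scikit-learn and NumPy are available; the randomized_wrapper helper and all parameter values are hypothetical illustrations of the general wrapper idea, not the breadth searching or attention searching adjustment algorithms proposed in this paper.

```python
# Hypothetical sketch: embedded (LASSO) vs. randomized wrapper selection.
# This illustrates the general idea only, not the paper's algorithms.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

# Embedded selection: LASSO drives uninformative coefficients to zero
# while training, so the surviving features are the selected ones.
lasso = Lasso(alpha=0.05).fit(X, y)
embedded = np.flatnonzero(lasso.coef_)

# Wrapper selection: score random feature subsets with the base model
# and keep the best. The model is a black box, which makes wrappers
# model-agnostic but costly (each trial is a full cross-validated fit).
def randomized_wrapper(X, y, model, n_trials=50, subset_size=8):
    best_subset, best_score = None, -np.inf
    for _ in range(n_trials):
        subset = rng.choice(X.shape[1], size=subset_size, replace=False)
        score = cross_val_score(model, X[:, subset], y, cv=3).mean()
        if score > best_score:
            best_subset, best_score = np.sort(subset), score
    return best_subset, best_score

wrapped, score = randomized_wrapper(X, y, LogisticRegression(max_iter=500))
print("embedded (LASSO):", embedded)
print("randomized wrapper:", wrapped, f"(CV accuracy {score:.3f})")
```

Each wrapper trial retrains and re-evaluates the base model, which is exactly the cost that randomized search strategies aim to reduce; the trials above are also mutually independent, hinting at the generic parallelizability the abstract mentions.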


Publication history

Received: 15 April 2020
Revised: 06 May 2020
Accepted: 07 May 2020
Published: 16 July 2020
Issue date: September 2020

Copyright

© The author(s) 2020

Acknowledgements

This work was supported in part by the National Science Foundation (NSF) under Grant Nos. 1447711, 1743418, and 1843025.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
