
Feature Selection with Graph Mining Technology

Thosini Bamunu Mudiyanselage and Yanqing Zhang
Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.

Abstract

Many real-world applications involve high-dimensional data that existing algorithms cannot handle efficiently. Feature selection is a critical data-preprocessing step, and its poor scalability degrades both the efficiency and the performance of big-data applications. In this research, we developed a new algorithm that reduces the dimensionality of a problem using graph-based analysis while retaining the physical meaning of the original high-dimensional feature space. Most existing feature-selection methods rest on the strong assumption that features are independent of one another; however, if a feature-selection algorithm ignores the interdependencies of the feature space, the selected data fail to correctly represent the original data. Our method addresses this challenge by examining the dependencies between features and selecting the optimal feature set with respect to the original data structure. Another important property of the proposed method is that it works even in the absence of class labels, a more difficult problem that many feature-selection algorithms fail to address; in that setting, they resort to wrapper techniques that require a learning algorithm to select features. Our experimental results indicate that the proposed simple ranking method performs better than other methods, independent of the particular learning algorithm used.

Keywords: feature selection, graph mining, network embedding, big data analysis, high-dimensional data
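
The paper's algorithmic details are not reproduced on this page, but the abstract's core idea, unsupervised and dependency-aware feature ranking over a graph, can be sketched in Python. The sketch below is an illustrative assumption, not the authors' method: the function name graph_feature_ranking, the correlation threshold, and the use of PageRank centrality are hypothetical stand-ins for whatever graph construction and ranking the paper actually uses.

# Minimal sketch (assumed, not the paper's algorithm) of unsupervised,
# graph-based feature ranking: build a feature-dependency graph from
# pairwise correlations, then rank features by PageRank centrality.
import numpy as np
import networkx as nx

def graph_feature_ranking(X, corr_threshold=0.3, top_k=10):
    """Rank the features of an unlabeled data matrix X (samples x features).

    Edges connect feature pairs whose absolute Pearson correlation exceeds
    corr_threshold; PageRank then scores each feature's importance within
    that dependency structure. Both the threshold and the centrality
    measure are illustrative choices.
    """
    n_features = X.shape[1]
    # Feature-feature absolute correlation matrix.
    corr = np.abs(np.corrcoef(X, rowvar=False))

    G = nx.Graph()
    G.add_nodes_from(range(n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if corr[i, j] > corr_threshold:
                G.add_edge(i, j, weight=corr[i, j])

    # Centrality in the dependency graph is the unsupervised score;
    # no class labels are used at any point.
    scores = nx.pagerank(G, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Example: select the 5 most central features of a random 200 x 50 matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
print(graph_feature_ranking(X, corr_threshold=0.15, top_k=5))

Because the ranking is computed purely from the feature-feature graph, no downstream classifier is involved in scoring, which mirrors the filter-style, learning-algorithm-independent behavior the abstract claims.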


Publication history

Received: 20 June 2018
Accepted: 02 August 2018
Published: 14 May 2019
Issue date: June 2019

Copyright

© The author(s) 2019
