Volume 20, Issue 5


A Feature Selection Method for Prediction Essential Protein

Jiancheng Zhong, Jianxin Wang, Wei Peng, Zhen Zhang, Min Li
School of Information Science and Engineering, Central South University, Changsha 410083, China.
College of Polytechnic, Hunan Normal University, Changsha 410083, China.
Computer Center, Kunming University of Science and Technology, Kunming 650093, China.

Abstract

Essential proteins are vital to the survival of a cell. Various features, both biological and topological, are related to protein essentiality, and many computational methods have been developed to identify essential proteins from these features. However, designing an effective method that selects suitable features and integrates them to predict essential proteins remains a major challenge. In this work, we first collect 26 features and use SVM-RFE to select a subset of them as the feature space for predicting essential proteins. We then remove features that share the same biological meaning as other features in that space, as judged by their Pearson Correlation Coefficients (PCC). The experiments are carried out on S. cerevisiae data, and six features are determined to be the best subset. To assess the prediction performance of our method, we further test it with several machine learning methods, such as SVM, Naive Bayes, Bayes Network, and NBTree, supplying each with different numbers of features. The results show that the methods using the six selected features outperform those using other features, which confirms the effectiveness of our feature selection method for essential protein prediction.
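
The PCC-based redundancy step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (`feat_a`, `feat_b`, `feat_c`), the toy values, the 0.8 threshold, and the function names are all assumptions made for illustration, and the ranking is simply assumed to come from an earlier SVM-RFE pass, which is not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_redundant(ranked_names, columns, threshold=0.8):
    """Walk features from best to worst rank and keep a feature only if its
    |PCC| with every already-kept feature is below the threshold.
    The threshold value is a hypothetical choice, not taken from the paper."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: one list of values per feature (hypothetical names and numbers).
columns = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly a rescaled copy of feat_a
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
ranked = ["feat_a", "feat_b", "feat_c"]  # assumed output of an SVM-RFE ranking

selected = drop_redundant(ranked, columns)
print(selected)
```

In this toy run `feat_b` is dropped because it is almost perfectly correlated with the higher-ranked `feat_a`, while `feat_c` is kept. Processing features in rank order means that, of two strongly correlated features, the one ranked higher by the selector survives.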

Keywords: machine learning, feature selection, essential protein, Protein-Protein Interaction (PPI), centrality algorithm

Publication history

Received: 21 May 2015
Accepted: 06 August 2015
Published: 13 October 2015
Issue date: October 2015

Copyright

© The author(s) 2015

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61232001, 61502166, 61502214, 61379108, and 61370024) and Scientific Research Fund of Hunan Provincial Education Department (Nos. 15CY007 and 10A076).
