Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data

Jiancheng Zhong; Zuohang Qu; Ying Zhong; Chao Tang; Yi Pan

doi:10.26599/BDMA.2022.9020019

Big Data Mining and Analytics 2023, 6(2): 185-200 https://doi.org/10.26599/BDMA.2022.9020019

Open Access | Published: 26 January 2023

Jiancheng Zhong¹, Zuohang Qu¹, Ying Zhong¹, Chao Tang¹, Yi Pan²

¹ College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China

² Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Shenzhen, Guangzhou 518055, China

Keywords:

essential proteins, Protein-Protein Interaction (PPI) network, continuous and discrete similarity coefficient

Cite this article:

Zhong J, Qu Z, Zhong Y, et al. Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data. Big Data Mining and Analytics, 2023, 6(2): 185-200. https://doi.org/10.26599/BDMA.2022.9020019

Abstract

Essential proteins play a vital role in biological processes, and combining gene expression profiles with Protein-Protein Interaction (PPI) networks can improve their identification. However, gene expression data are noisy and prone to significant fluctuations, which interferes with their use in topological networks. In this work, we discretized gene expression data and used the discrete similarity of gene expression profiles to reduce the effect of noise fluctuations. We then proposed the Pearson Jaccard coefficient (PJC), which combines the continuous and discrete similarities of gene expression data. Building on graph theory, we fused the new similarity coefficient with existing network topology prediction algorithms at each protein node to identify essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated PJC on the Krogan, Gavin, and DIP yeast PPI datasets and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with node-based network topology centrality methods and centrality methods that fuse biological information, PJC significantly improved the prediction of essential proteins when applied to DC, IC, eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared PJC with other methods using the NF-PIN algorithm, which predicts essential proteins by constructing active PPI networks from dynamic gene expression. The experimental results showed that PJC offers clear advantages in predicting essential proteins.
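The idea of pairing a continuous similarity with a discrete one can be illustrated with a minimal sketch. The abstract does not give the exact PJC formula, so everything specific here is an assumption made for illustration: expression profiles are binarized against their own mean, negative Pearson correlations are clipped to zero, the two similarities are combined by a product, and the fusion with topology is shown as a simple weighted degree. The function names (`pearson`, `discretize`, `jaccard`, `pjc_sketch`, `weighted_degree`) and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(vx * vy)


def discretize(profile):
    """Binarize a profile: 1 where expression exceeds the profile mean.

    Assumption: the mean is used as the threshold purely for illustration.
    """
    mean = sum(profile) / len(profile)
    return [1 if value > mean else 0 for value in profile]


def jaccard(a, b):
    """Jaccard similarity of two binary vectors (shared 1s over any 1s)."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return both / either if either else 0.0


def pjc_sketch(x, y):
    """Combine continuous and discrete similarity.

    Assumption: a product of the clipped Pearson correlation and the
    Jaccard similarity; the published coefficient may differ.
    """
    continuous = max(pearson(x, y), 0.0)
    discrete = jaccard(discretize(x), discretize(y))
    return continuous * discrete


def weighted_degree(edges, expression):
    """Score each protein by the sum of similarity weights on its PPI edges."""
    scores = {}
    for u, v in edges:
        w = pjc_sketch(expression[u], expression[v])
        scores[u] = scores.get(u, 0.0) + w
        scores[v] = scores.get(v, 0.0) + w
    return scores


# Toy data: three proteins, four time points, fully connected network.
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
edges = [("A", "B"), ("B", "C"), ("A", "C")]
scores = weighted_degree(edges, expression)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network, proteins A and B are co-expressed and reinforce each other's scores, while the anti-correlated protein C receives no weight from either neighbor. Ranking proteins by such a score and counting known essential proteins among the top candidates is the kind of comparison that a top analysis performs.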

About this article

Publication history

Received: 27 June 2022

Accepted: 13 August 2022

Published: 26 January 2023

Issue date: June 2023

Acknowledgements

This work was supported by the Shenzhen KQTD Project (No. KQTD20200820113106007), the China Scholarship Council (No. 201906725017), the Collaborative Education Project of Industry-University Cooperation of the Chinese Ministry of Education (No. 201902098015), the Teaching Reform Project of Hunan Normal University (No. 82), and the National Undergraduate Training Program for Innovation (No. 202110542004).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).