Volume 27, Issue 5

Protein Residue Contact Prediction Based on Deep Learning and Massive Statistical Features from Multi-Sequence Alignment

Huiling Zhang, Min Hao, Hao Wu, Hing-Fung Ting, Yihong Tang, Wenhui Xi, Yanjie Wei
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
University of Chinese Academy of Sciences, Beijing 100049, China
College of Electronic and Information Engineering, Southwest University, Chongqing 400715, China
School of Software Engineering, University of Science and Technology of China, Hefei 230051, China
Department of Computer Science, The University of Hong Kong, Hong Kong 999077, China
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China

†Huiling Zhang, Min Hao, and Hao Wu contributed equally to this work.

Abstract

Sequence-based protein tertiary structure prediction is of fundamental importance because the function of a protein ultimately depends on its 3D structure. An accurate residue-residue contact map is an essential element of current ab initio protocols for 3D structure prediction. Recently, the combination of deep learning and direct coupling techniques has significantly improved the performance of residue contact prediction. However, many current Deep-Learning (DL)-based prediction methods are time-consuming, mainly because they rely on multiple categories of input data and on third-party programs. In this research, we transformed the complex biological problem into a purely computational problem through statistics and artificial intelligence. We accordingly proposed a feature extraction method that obtains various categories of statistical information from the multi-sequence alignment alone, and then trained a DL model for residue-residue contact prediction on this massive statistical information. The proposed method is robust across different test sets, shows high reliability of the model confidence score, attains high computational efficiency, and achieves prediction precision comparable with that of DL methods that rely on multi-source inputs.
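
As a rough illustration of what "statistical information from only the multi-sequence alignment" can mean, the sketch below computes one classic covariation statistic, the mutual information between pairs of alignment columns. This is a minimal sketch under stated assumptions and not the feature set proposed in the article: the abstract does not list the features used, and the function names (`column_frequencies`, `mutual_information`, `mi_matrix`) and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_frequencies(msa, col):
    """Relative frequency of each symbol in one alignment column."""
    counts = Counter(seq[col] for seq in msa)
    total = len(msa)
    return {sym: n / total for sym, n in counts.items()}


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    total = len(msa)
    fi = column_frequencies(msa, i)
    fj = column_frequencies(msa, j)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / total
        mi += p_ab * math.log2(p_ab / (fi[a] * fj[b]))
    return mi


def mi_matrix(msa):
    """Symmetric L x L matrix of pairwise column mutual information."""
    length = len(msa[0])
    mat = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1, length):
            mat[i][j] = mat[j][i] = mutual_information(msa, i, j)
    return mat


# Toy alignment (hypothetical): columns 0 and 2 vary together,
# columns 1 and 3 are fully conserved.
toy_msa = ["ACDA", "ACDA", "GCEA", "GCEA"]
features = mi_matrix(toy_msa)
```

In this toy alignment, columns 0 and 2 always change together, so their mutual information is high, while any pair involving a fully conserved column scores zero. A matrix of this kind, with one value per residue pair, is the sort of 2D input a contact prediction network could be trained on; a practical pipeline would presumably also need sequence weighting and gap handling, which this sketch omits.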

Keywords:

multi-sequence alignment, residue-residue contact prediction, feature extraction, statistical information, Deep Learning (DL), high computational efficiency
Received: 20 July 2021; Revised: 17 August 2021; Accepted: 20 August 2021; Published: 17 March 2022; Issue date: October 2022
Copyright

© The author(s) 2022.

Acknowledgements

This work was partly supported by the Strategic Priority CAS Project (No. XDB38050100), the National Key Research and Development Program of China (No. 2018YFB0204403), the National Natural Science Foundation of China (No. U1813203), the Shenzhen Basic Research Fund (Nos. RCYX2020071411473419, JCYJ20200109114818703, and JSGG20201102163800001), CAS Key Lab (No. 2011DP173015), Hong Kong Research Grant Council (No. GRF-17208019), and the Outstanding Youth Innovation Fund (Doctoral Students) of CAS-SIAT (No. Y9G054).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).