Volume 19, Issue 6

Hierarchically Clustered HMM for Protein Sequence Motif Extraction with Variable Length

Cody Hudson, Bernard Chen, Dongsheng Che
Department of Computer Science, University of Central Arkansas, Conway, AR 72034, USA.
Department of Computer Science, East Stroudsburg University, East Stroudsburg, PA 18301, USA.

Abstract

Protein sequence motif extraction is an important field of bioinformatics because of its relevance to structural analysis. Two major problems are related to this field: (1) searching for motifs within a single protein family; and (2) assuming a fixed window size for the motif search. This work proposes the Hierarchically Clustered Hidden Markov Model (HC-HMM) approach, which represents the behavior and structure of proteins as Hidden Markov Model (HMM) chains and hierarchically clusters the chains by minimizing the distance between the structure and behavior of two given chains. It is well known that HMMs can be used for clustering; however, methods for clustering HMMs themselves are rarely studied. In this paper, we develop a hierarchical-clustering-based algorithm for HMMs to discover protein sequence motifs that transcend family boundaries, with no assumption on the length of the motif. We examine the effectiveness of this approach for motif extraction on 2593 proteins that share no more than 25% sequence identity. The approach generates many motifs, three of which are analyzed and visualized with their tertiary structures. We believe the proposed method provides a unique protein sequence motif extraction strategy. Related data mining fields that use Hidden Markov Models may also benefit from this approach of clustering the HMMs themselves.
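The idea of clustering models rather than raw sequences can be illustrated with a minimal sketch. It is not the HC-HMM algorithm itself: the abstract does not specify the distance measure, the merge rule, or the stopping criterion, so the parameter-difference distance, the size-weighted averaging, and the `threshold` below are all assumptions, and the three toy models are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy representation: each HMM is a dict holding a
# row-stochastic transition matrix ("trans") and emission matrix ("emit").
# Nothing here is taken from the paper; it only illustrates the idea of
# agglomerative clustering applied to the models themselves.

def matrix_distance(a, b):
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def hmm_distance(m1, m2):
    """Assumed distance: transition difference plus emission difference."""
    return (matrix_distance(m1["trans"], m2["trans"])
            + matrix_distance(m1["emit"], m2["emit"]))

def weighted_average(a, b, wa, wb):
    """Size-weighted average of two matrices (rows still sum to 1)."""
    total = wa + wb
    return [[(x * wa + y * wb) / total for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def merge(c1, c2):
    """Merge two clusters into one averaged representative model."""
    return {
        "trans": weighted_average(c1["trans"], c2["trans"], c1["size"], c2["size"]),
        "emit": weighted_average(c1["emit"], c2["emit"], c1["size"], c2["size"]),
        "size": c1["size"] + c2["size"],
        "members": c1["members"] + c2["members"],
    }

def cluster_hmms(models, threshold):
    """Repeatedly merge the closest pair until no pair is within threshold."""
    clusters = [dict(m, size=1, members=[name]) for name, m in models.items()]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: hmm_distance(clusters[p[0]], clusters[p[1]]))
        if hmm_distance(clusters[i], clusters[j]) > threshold:
            break
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]

# Invented toy models: A and B are similar, C behaves very differently.
toy_models = {
    "A": {"trans": [[0.9, 0.1], [0.2, 0.8]],
          "emit": [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]},
    "B": {"trans": [[0.8, 0.2], [0.2, 0.8]],
          "emit": [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]},
    "C": {"trans": [[0.1, 0.9], [0.9, 0.1]],
          "emit": [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]},
}
result = cluster_hmms(toy_models, threshold=1.0)
print(result)
```

With these toy values, models A and B fall within the threshold and merge, while C remains its own cluster. A real application would need a distance that accounts for differing state counts and state ordering, which this sketch ignores.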

Keywords: Hidden Markov Model, hierarchical clustering, sequential motif, bioinformatics

Publication history

Received: 23 June 2014
Accepted: 30 June 2014
Published: 20 November 2014
Issue date: December 2014

Copyright

The Author(s)
