Journal Home > Volume 34 , Issue 1

Although the protein sequence-structure gap continues to enlarge due to the development of high-throughput sequencing tools, the protein structure universe tends to be complete without proteins with novel structural folds deposited in the protein data bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues as the structural “keywords” that can effectively distinguish between major protein folds. We firstly apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large scale of high-quality, non-homologous protein structures available in PDB. We analyze the impacts of clustering cut-offs on the performance of the fragment libraries. Then, the Frag-K fragments are employed as structural features to classify protein structures in major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with ~400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy.

File
jcst-34-1-3-Highlights.pdf (721.9 KB)
Publication history
Copyright

Publication history

Received: 13 July 2018
Revised: 04 December 2018
Published: 18 January 2019
Issue date: January 2019

Copyright

©2019 Springer Science + Business Media, LLC & Science Press, China
Return