Open Access

Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning

Department of Computing Sciences, College at Brockport, State University of New York, Brockport, NY 14422, USA.
Department of Computer Science and Technology at Jiangnan University, Wuxi 214122, China.
School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China.

Abstract

Data-driven machine learning, especially deep learning, is becoming an important tool for handling big data problems in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. A similar conversion occurs in Genomic Signal Processing (GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is called an encoding scheme. The choice of encoding scheme can greatly affect the performance of GSP applications and machine learning models. This paper collects, analyzes, discusses, and summarizes the existing encoding schemes for genome sequences, particularly in GSP and in other genome analysis applications, to provide a comprehensive reference for genomic data representation and feature learning in machine learning.
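
To make the notion of an encoding scheme concrete, the following is a minimal Python sketch of two simple conversions from a DNA string to numbers: a binary indicator (one-hot) encoding and a single-value integer mapping. It is not taken from the paper. The function names `indicator_encode` and `integer_encode` are hypothetical, and the integer values assigned to A, C, G, and T are an arbitrary assumption chosen only for illustration.

```python
# Illustrative sketch only: two simple ways to turn a DNA string into numbers.
# Function names and the integer mapping are assumptions, not the paper's method.

NUCLEOTIDES = "ACGT"


def indicator_encode(seq):
    """Return four binary indicator sequences, one per nucleotide.

    Position i of the sequence for base B is 1 if seq[i] == B, else 0.
    """
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in NUCLEOTIDES}


def integer_encode(seq, mapping=None):
    """Map each nucleotide to a single number.

    The default mapping is arbitrary; a different assignment produces a
    different numerical signal for the same sequence.
    """
    if mapping is None:
        mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [mapping[ch] for ch in seq.upper()]


if __name__ == "__main__":
    example = "ACGTA"
    print(indicator_encode(example))
    print(integer_encode(example))
```

The two functions produce different numerical views of the same sequence, which is why the choice of scheme can matter to a downstream signal-processing step or learning model.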

Big Data Mining and Analytics
Pages 191-210
Cite this article:
Yu N, Li Z, Yu Z. Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning. Big Data Mining and Analytics, 2018, 1(3): 191-210. https://doi.org/10.26599/BDMA.2018.9020018


Received: 21 January 2018
Accepted: 24 January 2018
Published: 24 May 2018
© The author(s) 2018