Volume 18, Issue 5

De Novo Assembly Methods for Next Generation Sequencing Data

Yiming He, Zhen Zhang, Xiaoqing Peng, Fangxiang Wu, Jianxin Wang
School of Information Science and Engineering, Central South University, Changsha 410083, China
Morehouse School of Medicine, Atlanta, GA 30310, USA
Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada

Abstract

Recent breakthroughs in next-generation sequencing technologies, such as those of Roche 454, Illumina/Solexa, and ABI SOLiD, have dramatically reduced the cost of producing short reads from the genomes of new species. The huge volume of reads, together with short read lengths, high coverage, and sequencing errors, poses a great challenge to de novo genome assembly. Paired-end information, however, offers a new way to address these problems. In this paper, we review and compare some current assembly tools, including Newbler, CAP3, Velvet, SOAPdenovo, ALLPATHS, ABySS, IDBA, PE-Assembler, and Telescoper. In general, we compare the seed-extension and graph-based methods that use the overlap/layout/consensus approach and the de Bruijn graph approach for assembly. At the end of the paper, we summarize these methods and discuss future directions for genome assembly.

Keywords: next-generation sequencing, genome assembly, overlap/layout/consensus, de Bruijn graph
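
As a rough illustration of the two graph-based ideas named in the abstract, the following minimal sketch builds de Bruijn graph edges from the k-mers of a few reads and computes a suffix-prefix overlap of the kind an overlap/layout/consensus assembler relies on. The function names, the toy reads, and the choice k = 3 are assumptions made for illustration only; they are not taken from the paper or from any of the tools listed, which handle errors, reverse complements, and paired-end data far more carefully.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read; each k-mer contributes one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def longest_overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


# Toy reads (hypothetical): the second read starts where the first one ends.
reads = ["ACGTC", "GTCAA"]
graph = de_bruijn_edges(reads, k=3)
overlap = longest_overlap(reads[0], reads[1])
```

In this toy case the shared k-mer GTC appears as a repeated edge from GT to TC in the de Bruijn graph, while the overlap-based view records it as a three-base suffix-prefix overlap between the two reads.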

Publication history

Received: 09 August 2013
Revised: 24 August 2013
Accepted: 25 August 2013
Published: 03 October 2013
Issue date: October 2013

Copyright

© The author(s) 2013

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Nos. 61232001, 61128006, and 61073036).
