Volume 28, Issue 4

DeepFilter: A Deep Learning Based Variant Filter for VarDict

Hao Zhang (1), Zekun Yin (1,2), Yanjie Wei (3), Bertil Schmidt (4), Weiguo Liu (1,2)
(1) School of Software, Shandong University, Jinan 250100, China
(2) Shenzhen Research Institute of Shandong University, Shenzhen 518057, China
(3) Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
(4) Institute for Computer Science, Johannes Gutenberg University, Mainz 55128, Germany

Abstract

With the development of sequencing technologies, somatic mutation analysis has become an important component of cancer research and treatment. VarDict is a commonly used somatic variant caller for this task. Although the heuristic-based VarDict algorithm exhibits high sensitivity and versatility, it may report more false positive variants than other callers, which limits its clinical practicality. To address this problem, we propose DeepFilter, a deep learning-based filter for VarDict that effectively removes the false positive variants detected by VarDict. Our approach trains two separate models, one for insertion-deletion mutations (InDels) and one for single nucleotide variants (SNVs). Experiments show that DeepFilter can filter out at least 98.5% of false positive variants while retaining 93.5% of true positive variants for both InDels and SNVs in the commonly used tumor-normal paired mode. Source code and pre-trained models are available at https://github.com/LeiHaoa/DeepFilter.

Keywords: deep learning, variant filter, somatic variant
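
The two rates quoted in the abstract (share of false positives filtered out, share of true positives retained) can be illustrated with a minimal Python sketch. It is not taken from the DeepFilter code base: the `Call` record, its `is_true` and `kept` fields, and the allele-length rule used to separate SNVs from InDels are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record for one candidate variant (not DeepFilter's format)."""
    ref: str       # reference allele
    alt: str       # alternate allele
    is_true: bool  # assumed truth label: True for a real somatic variant
    kept: bool     # assumed filter decision: True if the call passed the filter


def variant_type(call):
    """Classify by allele length: 1 base vs 1 base is an SNV, unequal lengths an InDel."""
    if len(call.ref) == 1 and len(call.alt) == 1:
        return "SNV"
    if len(call.ref) != len(call.alt):
        return "InDel"
    # Equal-length multi-base substitutions are outside the scope of this sketch.
    raise ValueError(f"unsupported variant: {call.ref}>{call.alt}")


def filter_rates(calls):
    """Return (fraction of false positives removed, fraction of true positives kept)."""
    false_pos = [c for c in calls if not c.is_true]
    true_pos = [c for c in calls if c.is_true]
    nan = float("nan")
    fp_removed = sum(not c.kept for c in false_pos) / len(false_pos) if false_pos else nan
    tp_kept = sum(c.kept for c in true_pos) / len(true_pos) if true_pos else nan
    return fp_removed, tp_kept


def rates_by_type(calls):
    """Compute the two rates separately for SNVs and InDels."""
    groups = {"SNV": [], "InDel": []}
    for c in calls:
        groups[variant_type(c)].append(c)
    return {name: filter_rates(group) for name, group in groups.items()}


if __name__ == "__main__":
    toy_calls = [
        Call("A", "T", is_true=True, kept=True),
        Call("A", "G", is_true=False, kept=False),
        Call("AT", "A", is_true=True, kept=True),
        Call("C", "CA", is_true=False, kept=False),
    ]
    for name, (fp_removed, tp_kept) in rates_by_type(toy_calls).items():
        print(f"{name}: FP removed {fp_removed:.1%}, TP kept {tp_kept:.1%}")
```

Reporting the rates per variant type mirrors the abstract's statement that the two models are trained separately for InDels and SNVs; the toy data here is invented and has no relation to the paper's results.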


Publication history

Received: 26 February 2022
Revised: 27 April 2022
Accepted: 22 August 2022
Published: 06 January 2023
Issue date: August 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (NSFC) (Nos. 62102231 and 61972231); the Shenzhen Basic Research Fund (No. JCYJ20180507182818013); the Key Project of Joint Fund of Shandong Province (No. ZR2019LZH007); the Shandong Provincial Natural Science Foundation (No. ZR2021QF089); the PPP project from CSC and DAAD; and the Engineering Research Center of Digital Media Technology, Ministry of Education, China.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
