Volume 3, Issue 3

Inference Attacks on Genomic Data Based on Probabilistic Graphical Models

Zaobo He, Junxiu Zhou
Department of Computer Science and Software Engineering, Miami University, Oxford, OH 45011, USA.
Department of Computer Science, Northern Kentucky University, Highland Heights, KY 41099, USA.

Abstract

The rapid progress and plummeting cost of human-genome sequencing have made large amounts of personal biomedical information available, raising one of today's most important concerns: genomic data privacy. Because an individual's biomedical data are highly correlated with those of their relatives, the increasing availability of genomes and personal traits online (whether leaked unwittingly or released intentionally to genetic service platforms) threatens kin-genomic data privacy. We propose new inference attacks, based on probabilistic graphical models and belief propagation, that predict unknown Single Nucleotide Polymorphisms (SNPs) and human traits of individuals in a familial genomic dataset. With this method, an adversary can predict the unobserved genomes or traits of targeted individuals in a family genomic dataset in which some individuals' genomes and traits are observed, relying on SNP-trait associations from Genome-Wide Association Studies (GWAS), Mendel's Laws, and statistical relations between SNPs. Existing genome inference methods have relatively high computational complexity when the input comprises tens of millions of SNPs and human traits. We then propose an approach for publishing genomic data with a differential privacy guarantee. An approximate distribution of the input genomic dataset is first found using Bayesian networks, and a noisy distribution is then obtained by injecting noise into it. Finally, a synthetic genomic dataset is sampled from the noisy distribution, and we prove that any query on the synthetic dataset satisfies the differential privacy guarantee.
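
To make the kin-inference idea concrete, the following is a minimal sketch, not the authors' implementation, of how Mendel's Laws let an adversary infer an unobserved relative's genotype at a single SNP. It assumes a hypothetical mother-father-child trio, genotypes coded as the number of minor alleles (0, 1, or 2), and a Hardy-Weinberg prior for the parents with an illustrative minor allele frequency `maf`. The function names are invented for this example. The posterior marginals are computed by brute-force enumeration, which yields the same values that sum-product belief propagation would return on this small tree-structured factor graph.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of minor alleles carried at one SNP


def hardy_weinberg_prior(maf):
    """Prior over genotypes 0/1/2 for a parent (assumption: Hardy-Weinberg equilibrium)."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def transmit_prob(genotype):
    """Probability that a parent with this genotype passes on the minor allele."""
    return genotype / 2.0


def mendel_factor(child, mother, father):
    """P(child | mother, father) under Mendel's law of segregation."""
    pm, pf = transmit_prob(mother), transmit_prob(father)
    return {
        0: (1 - pm) * (1 - pf),
        1: pm * (1 - pf) + (1 - pm) * pf,
        2: pm * pf,
    }[child]


def infer_trio(observed, maf):
    """Posterior genotype marginals for each trio member.

    `observed` maps 'mother', 'father', or 'child' to a known genotype;
    members that are not listed are treated as unobserved and inferred.
    """
    names = ("mother", "father", "child")
    prior = hardy_weinberg_prior(maf)
    marginals = {n: [0.0, 0.0, 0.0] for n in names}
    total = 0.0
    for m, f, c in product(GENOTYPES, repeat=3):
        assignment = {"mother": m, "father": f, "child": c}
        # Skip joint assignments that contradict the observed genotypes.
        if any(assignment[k] != v for k, v in observed.items()):
            continue
        weight = prior[m] * prior[f] * mendel_factor(c, m, f)
        total += weight
        for n in names:
            marginals[n][assignment[n]] += weight
    if total == 0.0:
        raise ValueError("observations are inconsistent with Mendelian inheritance")
    return {n: [w / total for w in marginals[n]] for n in names}


# The adversary observes the child and the mother and targets the father.
posterior = infer_trio({"child": 2, "mother": 1}, maf=0.3)
print(posterior["father"])
```

With these illustrative inputs the father's posterior is approximately (0, 0.7, 0.3): a child carrying two minor alleles rules out a father with none. A realistic attack along the lines the abstract describes would also add factors for SNP-trait associations and for statistical relations between SNPs, and would rely on message passing rather than enumeration to scale across many SNPs and family members.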

Keywords: belief propagation, factor graph, Single Nucleotide Polymorphism (SNP)-trait association, data sanitization

Publication history

Received: 10 May 2020
Accepted: 24 June 2020
Published: 16 July 2020
Issue date: September 2020

Copyright

© The author(s) 2020

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
