Volume 27, Issue 1


Metabolite-Disease Association Prediction Algorithm Combining DeepWalk and Random Forest

Jiaojiao Tie, Xiujuan Lei, Yi Pan
School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA

Abstract

Identifying associations between metabolites and diseases helps clarify disease pathogenesis, which is of great significance for diagnosis and treatment. However, traditional biometric methods are time-consuming and expensive. Accordingly, we propose a new metabolite-disease association prediction algorithm based on DeepWalk and random forest (DWRF), which consists of the following key steps. First, the semantic similarity and information entropy similarity of diseases are integrated into the final disease similarity. Similarly, the molecular fingerprint similarity and information entropy similarity of metabolites are integrated into the final metabolite similarity. Then, DeepWalk is used to extract metabolite features from the network of metabolite-gene associations. Finally, a random forest algorithm is employed to infer metabolite-disease associations. Experimental results show that DWRF performs well in terms of the area under the curve (AUC) value in leave-one-out cross-validation and five-fold cross-validation. Case studies also indicate that DWRF performs reliably in metabolite-disease association prediction.

Keywords: random forest, DeepWalk, metabolite-disease associations, molecular fingerprint similarity of metabolites
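
The DeepWalk step described in the abstract can be illustrated with a minimal sketch of its first stage: generating truncated random walks over a metabolite-gene association network. This is not the authors' implementation. The node names (M1, G1, ...), the toy edge list, and the parameters walks_per_node and walk_length are hypothetical, since the abstract does not state the actual data or settings. In DeepWalk, the resulting walks are treated as sentences and passed to a skip-gram model to learn node embeddings; that training step is omitted here.

```python
import random
from collections import defaultdict


def build_graph(edges):
    """Build undirected adjacency lists from (metabolite, gene) pairs."""
    adj = defaultdict(list)
    for metabolite, gene in edges:
        adj[metabolite].append(gene)
        adj[gene].append(metabolite)
    return adj


def deepwalk_corpus(adj, walks_per_node=10, walk_length=8, seed=0):
    """Generate truncated uniform random walks starting from every node.

    Every node in `adj` must have at least one neighbor, which holds
    when the graph is built from an edge list as above.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    corpus = []
    for _ in range(walks_per_node):
        # Start nodes are visited in a fresh random order on each pass.
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                # Step to a uniformly chosen neighbor of the current node.
                walk.append(rng.choice(adj[walk[-1]]))
            corpus.append(walk)
    return corpus


# Hypothetical toy metabolite-gene associations (illustration only).
edges = [("M1", "G1"), ("M1", "G2"), ("M2", "G2"), ("M3", "G3"), ("M2", "G3")]
adj = build_graph(edges)
walks = deepwalk_corpus(adj, walks_per_node=5, walk_length=6)
print(len(walks), "walks; first walk:", walks[0])
```

Under these assumptions, the embedding learned for each metabolite would serve as its feature vector, which could then be combined with the similarity information and passed to a random forest classifier to score candidate metabolite-disease pairs.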


Publication history

Received: 21 December 2020
Accepted: 13 January 2021
Published: 17 August 2021
Issue date: February 2022

Copyright

© The author(s) 2022

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
