Volume 9, Issue 2

Tumor type classification and candidate cancer-specific biomarkers discovery via semi-supervised learning

Peng Chen [1], Zhenlei Li [1], Zhaolin Hong [1], Haoran Zheng [1,2,3], Rong Zeng [4,5]

[1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
[2] Anhui Key Laboratory of Software Engineering in Computing and Communication, University of Science and Technology of China, Hefei 230026, China
[3] Department of Systems Biology, University of Science and Technology of China, Hefei 230026, China
[4] CAS Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
[5] School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China

Abstract

Identifying cancer-related differentially expressed genes provides significant information for diagnosing tumors, predicting prognoses, and selecting effective treatments. Recently, deep learning methods have been used to perform gene differential expression analysis on microarray-based high-throughput gene profiling data and have achieved good results. In this study, we propose MSSL, a new robust semi-supervised learning model based on multiple datasets, to perform tumor type classification and candidate cancer-specific biomarker discovery across multiple tumor types and multiple datasets. MSSL addresses three long-standing obstacles: (1) a single existing dataset does not contain enough data to fully exploit the advantages of deep learning; (2) many datasets from different research institutions cannot be used effectively because of inconsistent internal variances and low quality; (3) deep learning methods are of limited effectiveness for relatively uncommon cancers. We applied MSSL to pan-cancer normalized level 3 RNA-seq data from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) and achieved a final classification accuracy of 97.6%, a substantial improvement over previous approaches. Finally, we ranked the importance of genes for each cancer type based on the classification results and validated that the top-ranked genes were biologically meaningful for the corresponding tumors, and that some of them had already been used as biomarkers, which demonstrates the efficacy of our method.

Keywords: Deep learning, Tumor type classification, Cancer-specific biomarkers, MSSL


Publication history

Received: 22 March 2023
Accepted: 26 April 2023
Published: 30 April 2023
Issue date: April 2023

Copyright

© The Author(s) 2023

Acknowledgements

This work was supported by the National Key Technologies R&D Program (2017YFA0505502) and the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) (XDB38000000).

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
