Method | Open Access

Tumor type classification and candidate cancer-specific biomarkers discovery via semi-supervised learning

Peng Chen1, Zhenlei Li1, Zhaolin Hong1, Haoran Zheng1,2,3, Rong Zeng4,5
1 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
2 Anhui Key Laboratory of Software Engineering in Computing and Communication, University of Science and Technology of China, Hefei 230026, China
3 Department of Systems Biology, University of Science and Technology of China, Hefei 230026, China
4 CAS Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
5 School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China

Abstract

Identifying cancer-related differentially expressed genes provides significant information for diagnosing tumors, predicting prognoses, and selecting effective treatments. Recently, deep learning methods have been used to perform differential gene expression analysis on microarray-based high-throughput gene profiling data and have achieved good results. In this study, we propose MSSL, a robust multiple-datasets-based semi-supervised learning model that performs tumor type classification and candidate cancer-specific biomarker discovery across multiple tumor types and multiple datasets. MSSL addresses three long-standing obstacles: (1) a single existing dataset does not contain enough data to fully exploit the advantages of deep learning; (2) a large number of datasets from different research institutions cannot be used effectively because of inconsistent internal variances and low quality; (3) deep learning methods have limited effectiveness on relatively uncommon cancers. We applied MSSL to pan-cancer normalized level 3 RNA-seq data from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) and achieved a final classification accuracy of 97.6%, a significant improvement over previous approaches. Finally, we ranked the importance of genes for each cancer type based on the classification results and validated that the top-ranked genes are biologically meaningful for the corresponding tumors; some of them have already been used as biomarkers, which demonstrates the efficacy of our method.

Biophysics Reports
Pages 57-66
Cite this article:
Chen P, Li Z, Hong Z, et al. Tumor type classification and candidate cancer-specific biomarkers discovery via semi-supervised learning. Biophysics Reports, 2023, 9(2): 57-66. https://doi.org/10.52601/bpr.2023.230005

Received: 22 March 2023
Accepted: 26 April 2023
Published: 30 April 2023
© The Author(s) 2023

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
