Scholar - SciOpen

The gut microbiota is increasingly recognized as a promising non-invasive source of biomarkers for Colorectal Cancer (CRC) detection. In this study, we develop a multi-layer stacking ensemble learning framework that integrates multiple machine learning models to improve the accuracy of CRC classification from gut microbiota profiles. The framework is trained on publicly available microbiome datasets comprising 1129 samples from diverse geographical regions (Europe, America, and Asia) and independently evaluated on an external validation cohort collected at Peking Union Medical College Hospital (PUMCH) in China. In our experiments, extracting gut microbiota features at the genus and species levels achieves the best performance. The framework integrates multiple base classifiers, including Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), Random Forests (RF), and Support Vector Machines (SVM), whose outputs are combined through a weighting layer to optimize the final classification. Analysis of feature importance in the trained model reveals several microbial populations previously reported to be associated with CRC, such as Gemella morbillorum and Fusobacterium nucleatum. These findings support the microbiological interpretability of the proposed framework. Experimental results show that the ensemble model achieves an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of 77.04% when validated on a real-world clinical dataset from PUMCH, surpassing existing microbiome-based CRC classification approaches.
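The abstract does not say how the weighting layer combines the base classifiers, so the following is only a minimal sketch under a stated assumption: each base model outputs a CRC probability per sample, and the weighting layer is a normalized weighted average. The function names, the weights, and the toy probabilities are all hypothetical and are not taken from the paper. The sketch also shows how ROC AUC can be computed as the fraction of (CRC, control) pairs that the model ranks correctly.

```python
from typing import Dict, List, Sequence


def weighted_ensemble(probs: Dict[str, Sequence[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Combine per-model CRC probabilities with a normalized weighted average.

    Assumption: a plain weighted average stands in for the paper's weighting layer.
    """
    total = sum(weights[name] for name in probs)
    n_samples = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs) / total
        for i in range(n_samples)
    ]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration: 1 = CRC, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
base_probs = {
    "lightgbm": [0.9, 0.6, 0.4, 0.5, 0.2, 0.1],
    "xgboost": [0.8, 0.7, 0.3, 0.4, 0.3, 0.2],
    "rf": [0.7, 0.5, 0.6, 0.6, 0.4, 0.3],
    "svm": [0.6, 0.4, 0.5, 0.7, 0.2, 0.1],
}
# Hypothetical weights; a real weighting layer would learn these from held-out data.
weights = {"lightgbm": 0.4, "xgboost": 0.3, "rf": 0.2, "svm": 0.1}

ensemble_scores = weighted_ensemble(base_probs, weights)
ensemble_auc = roc_auc(labels, ensemble_scores)
print(f"ensemble ROC AUC on toy data: {ensemble_auc:.3f}")
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for a small illustration; a rank-based implementation would be preferable on a cohort-sized dataset.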

Regular Paper Issue

Context-Aware Semantic Type Identification for Relational Attributes

Yue Ding, Yu-He Guo, Wei Lu, Hai-Xiang Li, Mei-Hui Zhang, Hui Li, An-Qun Pan, Xiao-Yong Du

Journal of Computer Science and Technology 2023, 38(4): 927-946

Published: 06 December 2023

Abstract

Identifying semantic types for attributes in relations, known as attribute semantic type (AST) identification, plays an important role in many data analysis tasks, such as data cleaning, schema matching, and keyword search in databases. However, because prevalent information systems (a.k.a. information islands) lack unified naming standards, AST identification remains an open problem. To tackle this problem, we propose a context-aware method for identifying the ASTs of relations. We transform AST identification into a multi-class classification problem and propose a schema context aware (SCA) model that learns a representation from a collection of relations together with their attribute values and schema context. Based on the learned representation, we predict the AST of a given attribute from an underlying relation, mapping the prediction to one of the labeled ASTs. To improve AST identification, especially when the semantic type of an attribute is not included in the labeled ASTs, we then introduce knowledge base embeddings (a.k.a. KBVec) to enhance the above representation and construct a knowledge-base-enhanced schema context aware model (SCA-KB) that is more stable and robust. Extensive experiments on real datasets demonstrate that our context-aware method outperforms the state-of-the-art approaches by a large margin, up to 6.14% and 25.17% in terms of macro average $F_{1}$ score, and up to 0.28% and 9.56% in terms of weighted $F_{1}$ score over high-quality and low-quality datasets respectively.
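The gap between the macro average and weighted $F_{1}$ improvements reported above is easier to read with the two metrics side by side. The sketch below uses the standard definitions: macro $F_{1}$ is the unweighted mean of per-class $F_{1}$ scores, while weighted $F_{1}$ weights each class by its number of true instances. The semantic type labels and predictions are invented for illustration and do not come from the paper's datasets.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def per_class_f1(y_true: Sequence[Hashable],
                 y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """F1 for each class that appears in the true or predicted labels."""
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true: Sequence[Hashable],
                          y_pred: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)  # number of true instances per class
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)
    return macro, weighted


# Hypothetical imbalanced example: a frequent type ("city") and a rare one ("isbn").
y_true = ["city"] * 8 + ["isbn"] * 2
y_pred = ["city"] * 8 + ["city", "isbn"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Because macro $F_{1}$ gives rare types the same weight as frequent ones, a method that mainly improves rare or previously unlabeled types can move macro $F_{1}$ much more than weighted $F_{1}$.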

Regular Paper Issue

Efficient Model Store and Reuse in an OLML Database System

Jian-Wei Cui, Wei Lu, Xin Zhao, Xiao-Yong Du

Journal of Computer Science and Technology 2021, 36(4): 792-805

Published: 05 July 2021

Abstract

Deep learning has shown significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in a new training task to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, so that a group of trained models with high reuse potential for the training task can be selected efficiently. Then, the selected models are trained iteratively in a way that encourages diversity among them, so that a better training effect can be achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse significantly improves the training effect compared with models trained from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
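The selection step described above can be illustrated with a small sketch. The abstract states only that reuse potential is estimated from domain relatedness and model quality; it does not give the formula. The product used below is therefore an assumption, as are the `StoredModel` type, the field names, and the example scores. This is not AdaReuse itself, only a sketch of ranking stored models by a combined score and keeping the top k.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredModel:
    """Hypothetical record for a trained model kept in the model store."""
    name: str
    domain_relatedness: float  # in [0, 1]: similarity of the source task to the new task
    quality: float  # in [0, 1]: e.g. validation accuracy on the source task


def reuse_potential(model: StoredModel) -> float:
    # Assumed scoring rule (not from the paper): a model must be both
    # related to the new task and of good quality to score highly.
    return model.domain_relatedness * model.quality


def select_models(store: List[StoredModel], k: int) -> List[StoredModel]:
    """Return the k stored models with the highest estimated reuse potential."""
    return sorted(store, key=reuse_potential, reverse=True)[:k]


# Made-up model store for illustration.
store = [
    StoredModel("sentiment_news", domain_relatedness=0.9, quality=0.8),
    StoredModel("sentiment_reviews", domain_relatedness=0.6, quality=0.95),
    StoredModel("topic_legal", domain_relatedness=0.2, quality=0.99),
    StoredModel("sentiment_tweets", domain_relatedness=0.8, quality=0.5),
]

selected = select_models(store, k=2)
print([m.name for m in selected])
```

With a product score, a high-quality model from an unrelated domain (such as `topic_legal` here) ranks low, which matches the intuition that both factors matter; a weighted sum or a learned estimator would be equally plausible alternatives.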

Open Access Issue

Deep Sequential Model for Anchor Recommendation on Live Streaming Platforms

Shuai Zhang, Hongyan Liu, Jun He, Sanpu Han, Xiaoyong Du

Big Data Mining and Analytics 2021, 4(3): 173-182

Published: 12 May 2021

Abstract

Live streaming has grown rapidly in recent years, attracting more and more participants. Because the number of online anchors is large, it is difficult for viewers to find the anchors they are interested in. Therefore, a personalized recommendation system is important for live streaming platforms. On these platforms, the preferences of both viewers and anchors change dynamically over time. How to capture changes in a user's preferences has been studied extensively in the literature, but how to model the preference changes of both viewers and anchors, and how to learn their representations based on preference matching, are less studied. Taking these issues into consideration, we propose a deep sequential model for live streaming recommendation. We develop a component in the model, named the multi-head related-unit, to capture the preference matching between anchor and viewer and to extract related features for their representations. To evaluate the performance of the proposed model, we conduct experiments on real datasets, and the results show that it outperforms state-of-the-art recommendation models.
