Scholar - SciOpen

The prediction of linear B cell epitopes is crucial for understanding the mechanisms of B cell immunity, accelerating the screening of B cell epitopes, and expediting the development of related drugs. Most current prediction methods combine features such as amino acid composition and k-mers with machine learning models. However, these methods usually ignore hidden information in linear B cell epitopes, such as the positional information of amino acids in the sequences and the physicochemical properties of amino acids, resulting in poor prediction performance. To address this limitation, we develop CGABepi, a deep learning framework based on amino acid and physicochemical feature encoding. CGABepi employs convolutional neural networks to capture local amino acid associations and a bidirectional gated recurrent unit (BiGRU) to capture contextual relationships in sequences. To verify the superiority of the CGABepi architecture, we conduct extensive fair comparative experiments. We train CGABepi on the datasets of two existing methods (epitope1D and NetBCE), and in both cases it performs significantly better than the original method. The ablation study confirms the importance of each module in CGABepi, demonstrating that the CGABepi architecture is well suited for predicting linear B cell epitopes. Additionally, we compare the results on four independent test sets, and CGABepi achieves the best results on all of them. Finally, we successfully predict two epitope datasets for SARS-CoV-1 and SARS-CoV-2 using CGABepi. Notably, of the 10 SARS-CoV-1 epitopes, 7 are screened with ultra-high confidence, with predicted scores exceeding 99.9%. These multifaceted results demonstrate that CGABepi is currently the state-of-the-art method for linear B cell epitope prediction.

Open Access Issue

Benchmarking Analysis of scHi-C Data Imputation Methods

Xuyan Du, Xinrui Ji, Li Tang, Min Li

Big Data Mining and Analytics 2026, 9(2): 500-518

Published: 09 February 2026

Abstract

Downloads：157

Single-cell Hi-C (scHi-C) technology is widely used to measure the three-dimensional genome structures of individual cells and to investigate cell-to-cell heterogeneity of multi-scale chromatin structures and cellular functions. It facilitates the identification of rare cell types and enhances the understanding of disease mechanisms. However, the sparsity of scHi-C data poses significant challenges for downstream analyses, such as cell clustering. Several scHi-C imputation methods have been proposed in recent years, including statistics-based and deep learning-based methods. To the best of our knowledge, however, these methods have not been comprehensively evaluated and analyzed in previous studies. In this paper, seven state-of-the-art imputation methods are assessed and compared in terms of various metrics on nine simulated datasets and one real dataset. Specifically, the performance of these methods in data recovery and cell clustering is evaluated. Experimental results show that deep learning-based methods achieve better performance than statistics-based methods, but no single method performs best in all cases. Finally, we provide method recommendations for different scenarios.

Open Access Research Article Online First

SAGAN: A Subgraph-Aware Graph Attention Network for Drug Repositioning

Xiangmao Meng, Xinqiang Wen, Xinliang Sun, Ju Xiang, Yahui Long, Xuan Lin, Min Li

Tsinghua Science and Technology

Published: 06 February 2026

Abstract

Downloads：233

Drug repositioning has been widely applied to explore new therapeutic applications for existing drugs, significantly reducing the transition time from laboratory research to clinical application. However, most existing models rely on static training over complex and large-scale networks, lacking detailed analysis and specificity for individual drug-disease pairs. To address this limitation, we propose the subgraph-aware graph attention network (SAGAN), which constructs subgraphs centered on target drug-disease pairs by extracting their surrounding interactions. Within each extracted subgraph, SAGAN employs an attention mechanism to focus on critical interactions among directly associated nodes, while simultaneously capturing cross-level relational patterns. A hierarchical pooling technique is then applied to aggregate the nodes and edges within the subgraph into more compact representations. Additionally, SAGAN integrates neighborhood features and interaction information from drug and disease similarity networks to enhance the expressive power of subgraph features. Finally, the method predicts drug-disease associations as a graph classification task. On three widely used benchmark datasets, SAGAN demonstrates outstanding performance, achieving an average area under the receiver operating characteristic curve (AUROC) of 0.9648 and an average area under the precision-recall curve (AUPR) of 0.9678, showcasing its robustness in handling sparse and imbalanced data. Furthermore, case studies validate the practical utility of SAGAN in predicting potentially effective drugs for Alzheimer’s disease and breast cancer.
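The idea of building a subgraph around a target drug-disease pair can be illustrated with a small sketch. This is not SAGAN's implementation: the hop-based neighborhood rule, the function names `k_hop_nodes` and `extract_subgraph`, and the toy network are all assumptions made for illustration.

```python
from collections import deque


def k_hop_nodes(adjacency, sources, hops):
    """Return every node reachable from any source within `hops` edges (BFS)."""
    seen = set(sources)
    frontier = deque((s, 0) for s in sources)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def extract_subgraph(adjacency, drug, disease, hops=1):
    """Induced subgraph on the nodes surrounding a drug-disease pair."""
    nodes = k_hop_nodes(adjacency, [drug, disease], hops)
    return {
        n: sorted(nb for nb in adjacency.get(n, ()) if nb in nodes)
        for n in sorted(nodes)
    }


# Hypothetical toy network (undirected, stored as adjacency lists).
network = {
    "drugA": ["protein1"],
    "protein1": ["drugA", "diseaseX", "protein2"],
    "diseaseX": ["protein1"],
    "protein2": ["protein1", "drugB"],
    "drugB": ["protein2"],
}

print(extract_subgraph(network, "drugA", "diseaseX", hops=1))
```

With `hops=1` the subgraph keeps only the pair and its direct neighbors; raising `hops` widens the context at the cost of a larger graph to classify.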

Open Access Research Article Issue

Group Collaborative Unsupervised Deep Metric Learning for Feature Embedding

Zeqian Chen, Shichao Kan, Min Li

Tsinghua Science and Technology 2026, 31(4): 2092-2103

Published: 24 December 2025

Abstract

Downloads：78

Learning a compact feature embedding is crucial for effective image representation. Current feature embedding methods, including both supervised and unsupervised approaches, rely on deep metric learning techniques that aim to pull positive samples of the same class closer and push negative samples from different classes farther apart. However, supervised metric learning methods may exhibit bias towards the ground truth labels, leading to overfitting on the training set. On the other hand, unsupervised metric learning methods could suffer from degraded performance due to the long-tailed distribution of the clusters. To address these challenges, we propose a group collaborative unsupervised deep metric learning method for feature embedding. Specifically, we train the deep feature embedding model based on the teacher-student framework. The student network produces the final compact embedding, while the teacher network generates pseudo-labels for group collaborative learning and knowledge distillation. Both networks share a similar network structure, and the parameters of the teacher network are updated using the momentum-based moving average of the parameters of the student network. Experimental results on benchmark image retrieval datasets demonstrate the effectiveness and efficiency of the proposed method, achieving an improvement in Recall@1 of up to 1.8%.
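The momentum-based moving average used to update the teacher from the student can be sketched as follows. This is a minimal illustration, assuming parameters are stored as plain lists of floats keyed by name; the function name `ema_update` and the momentum values are assumptions, not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return updated teacher parameters.

    Each teacher value becomes momentum * teacher + (1 - momentum) * student,
    so the teacher follows the student slowly and smooths out its noise.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Hypothetical two-parameter "network".
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}

# One update step: the first teacher weight moves a tenth of the way
# toward the student weight; the second already matches and stays put.
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

A momentum close to 1 keeps the teacher stable, which matters when the teacher is the source of pseudo-labels.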

Open Access Issue

GoM-ICD: Automatic ICD Coding with Gap Schemes and Mixture of Experts

Yifan Wu, Weiyan Qiu, Min Zeng, Xi Chen, Min Li, Hongtao Zhu

Big Data Mining and Analytics 2025, 8(6): 1211-1224

Published: 19 September 2025

Abstract

Downloads：323

Assigning standardized International Classification of Diseases (ICD) codes to Electronic Medical Records (EMR) is crucial for enhancing the efficiency and accuracy of medical coding processes. However, existing methods face challenges in effectively capturing and integrating specialized medical knowledge from complex textual data. In this study, we propose GoM-ICD, an advanced automatic ICD coding framework that integrates multiple gap schemes with a Mixture of Experts (MoE) architecture. GoM-ICD is designed to address the extreme multilabel text classification problem in ICD coding. It segments and reassembles text to facilitate seamless information exchange across different chunks, employing various segmentation methods derived from different gap schemes. The model-level MoE then consolidates the predictions of these methods to enhance prediction performance. Specifically, the segmented text is input to a Pretrained Language Model (PLM) to extract textual features. Next, a Bidirectional Long Short-Term Memory network (BiLSTM) is employed to capture long-term contextual information from the textual features. Finally, a text-level MoE, followed by a label-level MoE, enables precise attention matching between text and labels, thereby improving the fidelity of the coding process. The three levels of MoE leverage the collective insights of diverse expert models, effectively processing multi-dimensional text features and unifying model-level insights from various gap schemes. Extensive experimental results demonstrate that GoM-ICD achieves state-of-the-art performance in automatic ICD coding tasks, reaching micro-F1 scores of 0.617, 0.620, and 0.613 on the MIMIC-III full, MIMIC-III clean, and MIMIC-IV ICD-10 datasets, respectively. The source code can be obtained from https://github.com/CSUBioGroup/GoM-ICD.
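For readers unfamiliar with the reported metric, micro-F1 in a multilabel setting pools true positives, false positives, and false negatives over all documents and labels before computing F1. The sketch below shows the standard definition only; the function name `micro_f1` and the example code sets are illustrative and are not drawn from the paper's data.

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 for multilabel predictions.

    Both arguments are sequences of label sets, one set per document.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("need one prediction per document")
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # codes predicted and correct
        fp += len(pred - truth)   # codes predicted but wrong
        fn += len(truth - pred)   # codes missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Illustrative label sets for two records.
truth = [{"I10", "E11.9"}, {"J18.9"}]
predicted = [{"I10"}, {"J18.9", "I10"}]

# tp = 2, fp = 1, fn = 1, so micro-F1 = 4 / 6.
print(round(micro_f1(truth, predicted), 3))  # prints 0.667
```

Because counts are pooled, frequent codes dominate micro-F1, which is why it is usually reported alongside macro-averaged scores on long-tailed label spaces.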

Open Access Original Paper Just Accepted

Incorporating multi-scale module kernel for disease-gene identification in biological networks

Ju Xiang, Kaixin Zeng, Shengkai Chen, Xiangmao Meng, Ruiqing Zheng, Ying Zheng, Yahui Long, Min Li

Tsinghua Science and Technology

Available online: 05 August 2025

Abstract

Downloads：68

Biomedical data mining plays a crucial role in studying diseases, with disease-gene identification being one of the most prominent areas of research in this field. Many biomolecule networks are known to have multi-scale module structures, which may be helpful for studying complex diseases, but the mining and utilization of multi-scale module structures remains an open issue. Therefore, we present a novel hybrid network-based method (HyMSMK) for disease-gene identification that incorporates a multi-scale module kernel in biomolecule networks. We first apply exponential sampling to construct a multi-scale module profile containing local to global structural information, where modules at different scales are extracted from a comprehensive interactome by multi-scale modularity optimization. Then, the multi-scale module profile is preprocessed by the relative information content and used to generate a multi-scale module kernel, which is further preprocessed by kernel sparsification. We design multiple schemes for incorporating the multi-scale module kernel to discover potential disease-related genes. We investigate the performance of these schemes through experimental evaluations, show the positive effect of kernel sparsification on reducing space and time requirements, and confirm the superior performance of our method compared to other state-of-the-art network-based baselines. The study demonstrates the utility of the multi-scale module kernel in discovering disease genes, which could provide insights for research on related problems.

Open Access Issue

A Flexible Data-Driven Framework for Correcting Coarsely Annotated scRNA-seq Data

Ruiqing Zheng, Yongxin He, Jiawen Huang, Shichao Kan, Hui Wang, Edwin Wang, Min Li

Big Data Mining and Analytics 2025, 8(5): 997-1010

Published: 14 July 2025

Abstract

Downloads：156

Cells are the fundamental units of life and exhibit significant diversity in structure, behavior, and function, known as cell heterogeneity. The advent and development of single-cell RNA sequencing (scRNA-seq) technology have provided a crucial data foundation for studying cellular heterogeneity. Currently, most computational methods based on scRNA-seq involve a sequential process of clustering followed by annotation. However, these clustering-based methods are sensitive to the selection of genes and clustering parameters, resulting in inaccuracies in cell annotation. To address this issue, we develop a flexible data-driven cell correction framework based on partially annotated scRNA-seq data. This framework employs a neighborhood purity strategy and global selection strategies to select the anchor cells. Then, it optimizes a prediction neural network model using a classification loss with a contrastive regularization term to correct the labels of the remaining cells. The validity of this correction framework is demonstrated through various assessments on real scRNA-seq datasets. Based on the corrected labels of scRNA-seq data, we further assess the latest unsupervised clustering methods, thereby establishing a more objective benchmark to compare their performance.
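One plausible reading of a neighborhood purity strategy is sketched below: a cell is a good anchor candidate when most of its nearest neighbors carry the same label. The abstract does not define the measure, so the Euclidean k-nearest-neighbor rule, the function name `neighborhood_purity`, the threshold, and the toy coordinates are all assumptions.

```python
import math


def neighborhood_purity(points, labels, k=2):
    """For each cell, the fraction of its k nearest neighbors sharing its label."""
    purities = []
    for i, point in enumerate(points):
        # Rank all other cells by Euclidean distance and keep the closest k.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        agree = sum(labels[j] == labels[i] for j in neighbors)
        purities.append(agree / len(neighbors) if neighbors else 0.0)
    return purities


# Hypothetical 2-D embedding: two tight groups plus one cell labeled "B"
# that actually sits nearest to the "T" group.
cells = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (2, 2)]
labels = ["T", "T", "T", "B", "B", "B", "B"]

purity = neighborhood_purity(cells, labels, k=2)
anchors = [i for i, p in enumerate(purity) if p >= 0.9]
print(purity)   # prints [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(anchors)  # prints [0, 1, 2, 3, 4, 5]
```

The last cell has purity 0.0 and is left out of the anchor set, so its label would be among those a downstream classifier is asked to correct.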

Open Access Issue

Large Language Model for Medical Images: A Survey of Taxonomy, Systematic Review, and Future Trends

Peng Wang, Wenpeng Lu, Chunlin Lu, Ruoxi Zhou, Min Li, Libo Qin

Big Data Mining and Analytics 2025, 8(2): 496-517

Published: 28 January 2025

Abstract

Downloads：582

The advent of Large Language Models (LLMs) has sparked considerable interest in the medical image domain, as they can generalize to multiple tasks and offer outstanding performance. While LLMs achieve promising results, there is currently no comprehensive summary of LLMs for medical images, making it challenging for researchers to understand the progress within this domain. To fill this gap, we make the first attempt to present a comprehensive survey of LLMs for medical images. In addition, to better summarize current progress, we introduce a novel x-stage tuning paradigm, comprising zero-stage tuning, one-stage tuning, and multi-stage tuning, which offers a unified perspective on LLMs for medical images. Finally, we discuss challenges and future directions in this domain, aiming to spur more breakthroughs. We hope this work can pave the way for the broad application of LLMs in medical images and provide a valuable resource for this domain.

Open Access Issue

Multiplex Networks and Pan-Cancer Multiomics-Based Driver Gene Identification Using Graph Neural Networks

Xingyi Li, Junming Li, Jun Hao, Xingyu Liao, Min Li, Xuequn Shang

Big Data Mining and Analytics 2024, 7(4): 1262-1272

Published: 04 December 2024

Abstract

Downloads：231

Identifying cancer driver genes has paramount significance in elucidating the intricate mechanisms underlying cancer development, progression, and therapeutic interventions. Abundant omics data and interactome networks provided by numerous extensive databases enable the application of graph deep learning techniques that incorporate network structures into the deep learning framework. However, most existing models primarily focus on a single network, inevitably neglecting the incompleteness and noise of interactions. Moreover, class imbalance among samples in driver gene identification hampers the performance of models. To address these issues, we propose a novel deep learning framework, MMGN, which integrates multiplex networks and pan-cancer multiomics data using graph neural networks combined with negative sample inference to discover cancer driver genes. MMGN not only enhances gene feature learning based on mutual information and a consensus regularizer, but also achieves balanced classes of positive and negative samples for model training. The reliability of MMGN has been verified by the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). We believe MMGN has the potential to provide new prospects in precision oncology and may find broader applications in predicting biomarkers for other intricate diseases. Implementations of MMGN can be found at https://github.com/xingyili/MMGN.
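As a reminder of what AUROC measures, it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auroc` and the example labels and scores are illustrative, not results from the paper.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: sequence of 0/1 class labels; scores: predicted scores.
    Quadratic in the number of samples, which is fine for small examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Illustrative labels (1 = driver gene) and model scores.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.1]

# Three of the four positive-negative pairs are ranked correctly.
print(auroc(y_true, y_score))  # prints 0.75
```

AUROC is insensitive to class ratio, so on imbalanced problems such as driver gene identification it is usually read together with the precision-recall curve.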

Open Access Issue

SGCL-LncLoc: An Interpretable Deep Learning Model for Improving lncRNA Subcellular Localization Prediction with Supervised Graph Contrastive Learning

Min Li, Baoying Zhao, Yiming Li, Pingjian Ding, Rui Yin, Shichao Kan, Min Zeng

Big Data Mining and Analytics 2024, 7(3): 765-780

Published: 28 August 2024

Abstract

Downloads：649

Understanding the subcellular localization of long non-coding RNAs (lncRNAs) is crucial for unraveling their functional mechanisms. While previous computational methods have made progress in predicting lncRNA subcellular localization, most of them ignore sequence order information because they rely on k-mer frequency features to encode lncRNA sequences. In this study, we develop SGCL-LncLoc, a novel interpretable deep learning model based on supervised graph contrastive learning. SGCL-LncLoc transforms lncRNA sequences into de Bruijn graphs and uses the Word2Vec technique to learn the node representations of the graph. Then, SGCL-LncLoc applies graph convolutional networks to learn a comprehensive graph representation. Additionally, we propose a computational method to map the attention weights of the graph nodes to the weights of nucleotides in the lncRNA sequence, allowing SGCL-LncLoc to serve as an interpretable deep learning model. Furthermore, SGCL-LncLoc employs a supervised contrastive learning strategy, which leverages the relationships between different samples and label information, guiding the model to enhance representation learning for lncRNAs. Extensive experimental results demonstrate that SGCL-LncLoc outperforms both deep learning baseline models and existing predictors, showing its capability for accurate lncRNA subcellular localization prediction. Finally, we conduct a motif analysis, revealing that SGCL-LncLoc successfully captures known motifs associated with lncRNA subcellular localization. The SGCL-LncLoc web server is available at http://csuligroup.com:8000/SGCL-LncLoc. The source code can be obtained from https://github.com/CSUBioGroup/SGCL-LncLoc.
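To see why a de Bruijn graph retains the order information that k-mer frequencies discard, consider the small sketch below. It assumes one common construction in which nodes are k-mers and a directed edge links each k-mer to the next overlapping one in the sequence; the function name `de_bruijn_edges`, the choice of k, and the toy sequence are illustrative, and the paper's exact construction may differ.

```python
from collections import Counter


def de_bruijn_edges(sequence, k=3):
    """Count directed edges between consecutive overlapping k-mers."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Pair each k-mer with its successor; repeated transitions raise the count.
    return Counter(zip(kmers, kmers[1:]))


# Toy nucleotide sequence; its 3-mers are ACG, CGA, GAC, ACG, CGT.
edges = de_bruijn_edges("ACGACGT", k=3)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

A k-mer frequency vector for this sequence records only that ACG occurs twice, while the edge list also records that ACG is followed once by CGA and once by CGT.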
