Open Access Issue
GoM-ICD: Automatic ICD Coding with Gap Schemes and Mixture of Experts
Big Data Mining and Analytics 2025, 8(6): 1211-1224
Published: 19 September 2025

Assigning standardized International Classification of Diseases (ICD) codes to Electronic Medical Records (EMR) is crucial for enhancing the efficiency and accuracy of medical coding processes. However, existing methods face challenges in effectively capturing and integrating specialized medical knowledge from complex textual data. In this study, we propose GoM-ICD, an automatic ICD coding framework that integrates multiple gap schemes with a Mixture of Experts (MoE) architecture. GoM-ICD is designed to address the extreme multilabel text classification problem posed by ICD coding. It segments and reassembles text to facilitate information exchange across different chunks, employing various segmentation methods derived from different gap schemes. A model-level MoE then consolidates the predictions of these methods to enhance prediction performance. Specifically, the segmented text is input to a Pretrained Language Model (PLM) to extract textual features. Next, a Bidirectional Long Short-Term Memory network (BiLSTM) is employed to capture long-term contextual information from the textual features. Finally, a text-level MoE, followed by a label-level MoE, enables precise attention matching between text and labels, thereby improving the fidelity of the coding process. The three levels of MoE leverage the collective insights of diverse expert models, effectively processing multi-dimensional text features and unifying model-level insights from various gap schemes. Extensive experimental results demonstrate that GoM-ICD achieves state-of-the-art performance in automatic ICD coding tasks, reaching micro-F1 scores of 0.617, 0.620, and 0.613 on the MIMIC-III full, MIMIC-III clean, and MIMIC-IV ICD-10 datasets, respectively. The source code can be obtained from https://github.com/CSUBioGroup/GoM-ICD.
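
The abstract does not spell out how the model-level MoE weights the predictions from the different gap schemes. As a rough illustration only, the sketch below uses generic softmax gating over per-expert label probabilities. The function names (`softmax`, `combine_experts`) and the gate scores are hypothetical and are not taken from the GoM-ICD source code.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_probs, gate_scores):
    """Weighted average of per-expert label probabilities.

    expert_probs: one list of label probabilities per expert
                  (here, one expert per hypothetical gap scheme).
    gate_scores:  one raw score per expert; assumed to come from
                  some gating network that is not modeled here.
    """
    if len(expert_probs) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    n_labels = len(expert_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, expert_probs))
        for j in range(n_labels)
    ]
```

With equal gate scores the combination reduces to a plain average of the experts; a larger score for one expert shifts the combined prediction toward that expert's output.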

Open Access Issue
SGCL-LncLoc: An Interpretable Deep Learning Model for Improving lncRNA Subcellular Localization Prediction with Supervised Graph Contrastive Learning
Big Data Mining and Analytics 2024, 7(3): 765-780
Published: 28 August 2024

Understanding the subcellular localization of long non-coding RNAs (lncRNAs) is crucial for unraveling their functional mechanisms. While previous computational methods have made progress in predicting lncRNA subcellular localization, most of them ignore sequence order information because they rely on k-mer frequency features to encode lncRNA sequences. In this study, we develop SGCL-LncLoc, a novel interpretable deep learning model based on supervised graph contrastive learning. SGCL-LncLoc transforms lncRNA sequences into de Bruijn graphs and uses the Word2Vec technique to learn the node representations of the graph. Then, SGCL-LncLoc applies graph convolutional networks to learn the comprehensive graph representation. Additionally, we propose a computational method to map the attention weights of the graph nodes to the weights of nucleotides in the lncRNA sequence, allowing SGCL-LncLoc to serve as an interpretable deep learning model. Furthermore, SGCL-LncLoc employs a supervised contrastive learning strategy, which leverages the relationships between different samples and label information, guiding the model to enhance representation learning for lncRNAs. Extensive experimental results demonstrate that SGCL-LncLoc outperforms both deep learning baseline models and existing predictors, showing its capability for accurate lncRNA subcellular localization prediction. Furthermore, we conduct a motif analysis, revealing that SGCL-LncLoc successfully captures known motifs associated with lncRNA subcellular localization. The SGCL-LncLoc web server is available at http://csuligroup.com:8000/SGCL-LncLoc. The source code can be obtained from https://github.com/CSUBioGroup/SGCL-LncLoc.
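
The de Bruijn graph step can be illustrated with a minimal sketch. It assumes one common convention, in which each distinct k-mer is a node and an edge links every pair of k-mers that are adjacent in the sequence (overlapping by k-1 nucleotides). The abstract does not state which convention or which k SGCL-LncLoc uses, so the function name `de_bruijn_graph` and the choice of k below are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Build a k-mer de Bruijn graph from a nucleotide sequence.

    Returns (nodes, edges): nodes is a sorted list of distinct k-mers,
    edges maps (left_kmer, right_kmer) to how often that transition
    occurs. Unlike a k-mer frequency vector, the edges keep track of
    which k-mer follows which.
    """
    if k < 1 or len(sequence) < k:
        raise ValueError("k must be between 1 and the sequence length")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    edges = defaultdict(int)
    for left, right in zip(kmers, kmers[1:]):
        edges[(left, right)] += 1  # consecutive k-mers overlap by k-1 bases
    return sorted(set(kmers)), dict(edges)


# Toy sequence, not real lncRNA data.
nodes, edges = de_bruijn_graph("AUGAUG", 3)
print(nodes)
print(edges)
```

Because the edges record transitions between consecutive k-mers, two sequences with identical k-mer counts but different orderings can yield different graphs, which is the order information that frequency-only encodings discard.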

Open Access Issue
NetEPD: A Network-Based Essential Protein Discovery Platform
Tsinghua Science and Technology 2020, 25(4): 542-552
Published: 13 January 2020

Proteins drive virtually all cellular-level processes. The proteins that are critical to cell proliferation and survival are defined as essential. These essential proteins are implicated in key metabolic and regulatory networks, and are important in the context of rational drug design efforts. The computational identification of essential proteins benefits from the proliferation of publicly available protein interaction datasets. Scientists have developed several algorithms that use these interaction datasets to predict essential proteins. However, a comprehensive web platform that facilitates the analysis and prediction of essential proteins is missing. In this study, we design, implement, and release NetEPD: a network-based essential protein discovery platform. This resource integrates data on Protein-Protein Interaction (PPI) networks, gene expression, subcellular localization, and a native set of essential proteins. It also computes a variety of node centrality measures, evaluates the predictions of essential proteins, and visualizes PPI networks. The platform implements four activities: the collection of datasets, the computation of centrality measures, evaluation, and visualization. The results produced by NetEPD are visualized on its website, sent to a user-provided email address, and available for download in a parsable format. This platform is freely available at http://bioinformatics.csu.edu.cn/netepd.
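
The abstract does not list which centrality measures NetEPD computes. As an illustration of the general idea, the sketch below computes degree centrality, one of the simplest node centrality measures, on a toy PPI network. The function name `degree_centrality` and the protein identifiers are hypothetical and do not come from NetEPD.

```python
def degree_centrality(edges):
    """Degree centrality for an undirected network.

    edges: iterable of (protein_a, protein_b) interaction pairs.
    Returns a dict mapping each protein to the fraction of the other
    proteins in the network that it interacts with.
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Toy star-shaped network with made-up protein names.
toy_ppi = [("P1", "P2"), ("P1", "P3"), ("P1", "P4")]
scores = degree_centrality(toy_ppi)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores[ranked[0]])
```

Ranking proteins by such a score and comparing the top of the ranking against a known set of essential proteins is one way a centrality-based prediction could be evaluated.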
