Scholar - SciOpen

The identification of latent DNA-binding domains presents both significant scientific value and analytical complexity, given the extensive diversity within biological compound datasets. To address this challenge, our research introduces the weighted deep forest (WeighDF), a novel computational framework integrating hybrid feature representation with adaptive multi-granularity scanning analysis. This approach dynamically weights features across scanning windows using learnable attenuation coefficients, which amplifies key sequence patterns and suppresses background noise. For comprehensive prediction of diverse DNA-binding patterns, we further develop the decision learning predictive algorithm for binding sites (DecLPABS), an ensemble architecture combining WeighDF’s adaptive scanning with meta-learner integration strategies. This dual-phase system demonstrates superior versatility in handling both categorical classification and continuous regression problems. Empirical validation across heterogeneous datasets reveals DecLPABS’s enhanced predictive capability, achieving 0.8979 accuracy through optimized feature-space partitioning.

Open Access Issue

A Comparative Study of Sequence Clustering Algorithms

Zhen Ju, Huiling Zhang, Jingjing Zhang, Wenhui Xi, Dian Huang, Shengzhong Feng, Jintao Meng, Yanjie Wei

Big Data Mining and Analytics 2025, 8(5): 1011-1022

Published: 14 July 2025

Abstract

Sequence clustering software is essential in bioinformatics. However, selecting an appropriate tool can be challenging because the available tools differ in their algorithms and target applications. This paper analyzes and evaluates eight representative software tools (algorithms) in terms of precision, sensitivity, speed, scale of running time, and memory consumption. Furthermore, it examines the effects of sequence count, sequence length, identity, thread count, and GPU use on these aspects. Sequence length and identity significantly impact clustering efficiency (speed and memory consumption), with fluctuation amplitudes exceeding an order of magnitude and non-monotonic effects observed. The evaluation results are analyzed and summarized in tables for users’ reference.
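As a rough illustration of how precision and sensitivity can be scored for a clustering result, the sketch below uses pairwise counting against a reference grouping. The abstract does not say how the paper defines these metrics, so the pairwise definition, the function name, and the toy labels are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_precision_sensitivity(predicted, reference):
    """Score a clustering against a reference using pairs of sequences.

    Both arguments map a sequence id to a cluster label (hypothetical data).
    A pair is a true positive if both clusterings put it in the same cluster.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1  # grouped together, but the reference separates them
        elif same_ref:
            fn += 1  # separated, but the reference groups them together
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


# Toy example: five sequences, two predicted clusters, two reference families.
predicted = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 1}
reference = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B"}
precision, sensitivity = pairwise_precision_sensitivity(predicted, reference)
print(precision, sensitivity)
```

Pairwise counting is quadratic in the number of sequences, so it suits small benchmark sets rather than large-scale runs.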

Open Access Issue

Exploring Pathogenic Mutation in Allosteric Proteins: The Prediction and Beyond

Huiling Zhang, Zhen Ju, Jingjing Zhang, Xijian Li, Hanyang Xiao, Xiaochuan Chen, Yuetong Li, Xinran Wang, Yanjie Wei

Tsinghua Science and Technology 2025, 30(5): 2284-2299

Published: 29 April 2025

Abstract

In the post-genomic era, a central challenge for disease genomes is the identification of the biological effects of specific somatic variants on allosteric proteins and the phenotypes they influence during the initiation and progression of diseases. Here, we analyze more than 38539 mutations observed in 90 human genes with 740 allosteric protein chains. We find that existing allosteric protein mutations are associated with many diseases, but the clinical significance of most mutations in allosteric proteins remains unclear. Next, we develop an ensemble-learning-based model for pathogenic mutation prediction of allosteric proteins based on the intrinsic characteristics of proteins and the prediction results from existing methods. When tested on the benchmark allosteric protein dataset, the proposed method achieves an AUC of 0.868 and an AUPR of 0.894. Furthermore, we explore the performance of existing methods in predicting the pathogenicity of mutations at allosteric sites and identify potentially significant pathogenic mutations at allosteric sites using the proposed method. In summary, these findings illuminate the significance of allosteric mutations in disease processes and contribute a valuable tool for the identification of pathogenic mutations as well as previously unknown disease-causing allosteric-protein-encoded genes.
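For readers unfamiliar with the AUC figure quoted above, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one. The labels and scores are invented for illustration and have nothing to do with the paper's data; the function name is also an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for pathogenic, 0 for benign (hypothetical toy data).
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

This pairwise form is quadratic in the sample size; library implementations typically use a sort-based method that gives the same value.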

Open Access Issue

Distributed Heterogeneous Spiking Neural Network Simulator Using Sunway Accelerators

Xuelei Li, Zhichao Wang, Yi Pan, Jintao Meng, Shengzhong Feng, Yanjie Wei

Big Data Mining and Analytics 2024, 7(4): 1301-1320

Published: 04 December 2024

Abstract

Spiking Neural Network (SNN) simulation is very important for studying brain function and validating neuroscience hypotheses, and it can also be used in artificial intelligence. Recently, GPU-based simulators have been developed to support the real-time simulation of SNNs. However, the performance and scale of these simulators are severely limited by the random memory access pattern and the global communication between devices. Therefore, we propose an efficient distributed heterogeneous SNN simulator based on the Sunway accelerators (including SW26010 and SW26010pro), named SWsnn, which supports accurate simulation with a small time step (1/16 ms), random delay sizes for synapses, and larger-scale network computing. Compared with existing GPUs, the Local Dynamic Memory (LDM) (similar to cache) in Sunway is much bigger (4 MB or 16 MB in each core group). To improve the simulation performance, we redesign the network data storage structure and the synaptic plasticity flow so that most random accesses occur in LDM. SWsnn hides Message Passing Interface (MPI)-related operations to reduce communication costs by separating the general SNN workflow. In addition, SWsnn relies on parallel Compute Processing Elements (CPEs) rather than the serial Manage Processing Element (MPE) to control the communication buffers, using Register-Level Communication (RLC) and Direct Memory Access (DMA). SWsnn is further optimized using vectorization and DMA hiding techniques. Experimental results show that SWsnn runs 1.4–2.2 times faster than the state-of-the-art GPU-based SNN simulator GPU-enhanced Neuronal Networks (GeNN), and supports much larger-scale real-time simulation.
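To give a sense of what a 1/16 ms time step means in practice, here is a minimal sketch of a single leaky integrate-and-fire neuron advanced with forward Euler. It is not SWsnn's neuron model or code: the neuron equation, the parameter values, and the function name are assumptions chosen only to illustrate fixed-step SNN simulation.

```python
DT_MS = 1.0 / 16.0  # simulation time step in milliseconds, as in the abstract


def simulate_lif(input_current, steps, tau_ms=10.0, v_rest=0.0,
                 v_thresh=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron (hypothetical parameters).

    Returns the spike times in milliseconds, recorded at the start of the
    step in which the membrane potential crosses the threshold.
    """
    v = v_rest
    spike_times = []
    for step in range(steps):
        # Forward Euler update of dv/dt = (-(v - v_rest) + I) / tau
        v += DT_MS * (-(v - v_rest) + input_current) / tau_ms
        if v >= v_thresh:
            spike_times.append(step * DT_MS)
            v = v_reset  # reset after the spike
    return spike_times


# 1600 steps of 1/16 ms cover 100 ms of simulated time.
spike_times = simulate_lif(input_current=1.5, steps=1600)
quiet = simulate_lif(input_current=0.5, steps=1600)
print(len(spike_times), spike_times[:3])
```

A drive below the threshold (0.5 here) never produces a spike, while a drive above it yields a regular spike train; a full simulator repeats this update for every neuron and also delivers delayed synaptic events each step.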

Open Access Issue

Autism Spectrum Disorder Classification with Interpretability in Children Based on Structural MRI Features Extracted Using Contrastive Variational Autoencoder

Ruimin Ma, Ruitao Xie, Yanlin Wang, Jintao Meng, Yanjie Wei, Yunpeng Cai, Wenhui Xi, Yi Pan

Big Data Mining and Analytics 2024, 7(3): 781-793

Published: 28 August 2024

Abstract

Autism Spectrum Disorder (ASD) is a highly disabling mental disease that significantly impairs patients’ social interaction ability, making early screening and intervention critical. With the development of machine learning and neuroimaging technology, extensive research has been conducted on machine classification of ASD based on structural Magnetic Resonance Imaging (s-MRI). However, most studies use datasets in which participants are older than 5 years, and they lack interpretability. In this paper, we propose a machine learning method for ASD classification in children aged 0.92 to 4.83 years, based on s-MRI features extracted using a Contrastive Variational AutoEncoder (CVAE). Seventy-eight s-MRIs, collected from Shenzhen Children’s Hospital, are used for training the CVAE, which consists of both an ASD-specific feature channel and a common-shared feature channel. The ASD participants represented by ASD-specific features can be easily discriminated from Typical Control (TC) participants represented by the common-shared features. Because predictive accuracy can degrade when the data size is extremely small, a transfer learning strategy is proposed as a potential solution. Finally, we conduct neuroanatomical interpretation based on the correlation between s-MRI features extracted from the CVAE and the surface area of different cortical regions, which discloses potential biomarkers that could help target treatments of ASD in the future.

Open Access Issue

DeepFilter: A Deep Learning Based Variant Filter for VarDict

Hao Zhang, Zekun Yin, Yanjie Wei, Bertil Schmidt, Weiguo Liu

Tsinghua Science and Technology 2023, 28(4): 665-672

Published: 06 January 2023

Abstract

With the development of sequencing technologies, somatic mutation analysis has become an important component in cancer research and treatment. VarDict is a commonly used somatic variant caller for this task. Although the heuristic-based VarDict algorithm exhibits high sensitivity and versatility, it may detect more false positive variants than other callers, limiting its clinical practicality. To address this problem, we propose DeepFilter, a deep-learning-based filter for VarDict, which can effectively filter out the false positive variants detected by VarDict. Our approach trains two models, one for insertion-deletion mutations (InDels) and one for single nucleotide variants (SNVs). Experiments show that DeepFilter can filter at least 98.5% of false positive variants and retain 93.5% of true positive variants for InDels and SNVs in the commonly used tumor-normal paired mode. Source code and pre-trained models are available at https://github.com/LeiHaoa/DeepFilter.
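The effect of the two reported rates on call-set precision can be worked through with simple arithmetic. The sketch below takes the 98.5% and 93.5% figures from the abstract as defaults; the raw counts of true and false positive calls are hypothetical, and the function name is an assumption.

```python
def precision_after_filter(true_pos, false_pos,
                           tp_retained=0.935, fp_removed=0.985):
    """Precision of a call set after filtering.

    tp_retained and fp_removed default to the rates quoted in the abstract;
    true_pos and false_pos are hypothetical counts of raw caller output.
    """
    kept_tp = true_pos * tp_retained
    kept_fp = false_pos * (1.0 - fp_removed)
    return kept_tp / (kept_tp + kept_fp)


# Hypothetical raw output: 1000 true variants buried in 4000 false positives.
raw_precision = 1000 / (1000 + 4000)
filtered_precision = precision_after_filter(1000, 4000)
print(raw_precision, round(filtered_precision, 4))
```

Under these assumed counts, about 935 true calls and 60 false calls remain, so precision rises from 0.2 to roughly 0.94 at the cost of 6.5% of the true positives. Since the abstract states "at least 98.5%", the false positive removal rate acts as a lower bound here.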

Open Access Issue

Identification of Key Genes as Potential Drug Targets for Gastric Cancer

Md. Tofazzal Hossain, Md. Selim Reza, Yin Peng, Shengzhong Feng, Yanjie Wei

Tsinghua Science and Technology 2023, 28(4): 649-664

Published: 06 January 2023

Abstract

Gastric cancer (GC) is one of the most common cancers and ranks third in cancer mortality worldwide. The goal of this study was to identify potential hub genes, highlighting their functions, signaling pathways, and candidate drugs for the treatment of GC patients. We used publicly available next generation sequencing (NGS) data to identify differentially expressed (DE) genes. The top DE genes were mapped to the STRING database to construct the protein-protein interaction (PPI) network, and the top hub genes were selected for further analysis. We found a total of 1555 DE genes, with 870 upregulated and 685 downregulated genes in GC. We selected the top 400 (200 upregulated and 200 downregulated) genes to construct a PPI network and extracted the top 15 hub genes. The Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of the 15 hub genes revealed some important functions and signaling pathways that were significantly associated with GC. The survival analysis of the hub genes showed that lower expression of the three hub genes CDH2, COL4A1, and COL5A2 was associated with better survival of GC patients. These three genes might be candidate biomarkers for the diagnosis and treatment of GC. We then considered the 3 key proteins (genomic biomarkers) (COL4A1, CDH2, and COL5A2) as the drug target proteins (receptors), performed docking analysis with 102 meta-drug agents, and found Everolimus, Docetaxel, Lanreotide, Venetoclax, Temsirolimus, and Nilotinib to be the 6 top-ranked candidate drugs with respect to our proposed target proteins. Therefore, the proposed drugs might play a vital role in the treatment of GC patients.

Open Access Issue

Protein Residue Contact Prediction Based on Deep Learning and Massive Statistical Features from Multi-Sequence Alignment

Huiling Zhang, Min Hao, Hao Wu, Hing-Fung Ting, Yihong Tang, Wenhui Xi, Yanjie Wei

Tsinghua Science and Technology 2022, 27(5): 843-854

Published: 17 March 2022

Abstract

Sequence-based protein tertiary structure prediction is of fundamental importance because the function of a protein ultimately depends on its 3D structure. An accurate residue-residue contact map is one of the essential elements of current ab initio protocols for 3D structure prediction. Recently, with the combination of deep learning and direct coupling techniques, residue contact prediction has made significant progress. However, a considerable number of current Deep-Learning (DL)-based prediction methods are time-consuming, mainly because they rely on different categories of data types and third-party programs. In this research, we transformed the complex biological problem into a pure computational problem through statistics and artificial intelligence. We accordingly proposed a feature extraction method to obtain various categories of statistical information from only the multi-sequence alignment, followed by training a DL model for residue-residue contact prediction based on the massive statistical information. The proposed method is robust across different test sets, shows high reliability in its model confidence score, and achieves high computational efficiency with prediction precision comparable to that of DL methods that rely on multi-source inputs.
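As one example of a statistic that can be computed from a multi-sequence alignment alone, the sketch below measures mutual information between two alignment columns, a common covariation signal for residue contacts. The abstract does not list the paper's actual features, so mutual information is used here as an assumed stand-in, and the toy alignment and function names are hypothetical.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 2 co-vary, column 3 varies independently.
msa = ["ACDE", "ACDF", "GCKE", "GCKF"]
coupled = mutual_information(msa, 0, 2)
independent = mutual_information(msa, 0, 3)
print(coupled, independent)
```

Columns that change together across sequences score high, while independent or fully conserved columns score zero. Real alignments would also need sequence weighting and gap handling, which this sketch leaves out.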