Open Access Issue
KeyEE: Enhancing Low-Resource Generative Event Extraction with Auxiliary Keyword Sub-Prompt
Big Data Mining and Analytics 2024, 7 (2): 547-560
Published: 22 April 2024

Event Extraction (EE) is a key task in information extraction, which requires high-quality annotated data that are often costly to obtain. Traditional classification-based methods struggle in low-resource scenarios due to the lack of label semantics and fine-grained annotations. While recent approaches have endeavored to address EE through a more data-efficient generative process, they often overlook event keywords, which are vital for EE. To tackle these challenges, we introduce KeyEE, a multi-prompt learning strategy that improves low-resource event extraction through Event Keywords Extraction (EKE). We suggest employing an auxiliary EKE sub-prompt and concurrently training both EE and EKE with a shared pre-trained language model. With the auxiliary sub-prompt, KeyEE learns event keyword knowledge implicitly, thereby reducing the dependence on annotated data. Furthermore, we investigate and analyze various EKE sub-prompt strategies to encourage further research in this area. Our experiments on the benchmark datasets ACE2005 and ERE show that KeyEE achieves significant improvement in low-resource settings and sets new state-of-the-art results.

Regular Paper Issue
Collaborative Matrix Factorization with Soft Regularization for Drug-Target Interaction Prediction
Journal of Computer Science and Technology 2021, 36 (2): 310-322
Published: 05 March 2021

Identifying potential drug-target interactions (DTI) is critical in drug discovery. Drug-target interaction prediction methods based on collaborative filtering have demonstrated attractive prediction performance. However, many of these models cannot accurately express the relationship between similarity features and DTI features. To represent this correlation rationally, we propose a novel matrix factorization method called collaborative matrix factorization with soft regularization (SRCMF). SRCMF improves the prediction performance by combining the drug and the target similarity information with matrix factorization. In contrast to general collaborative matrix factorization, the fundamental idea of SRCMF is to make the similarity features and the potential features of DTI approximate, not identical. Specifically, SRCMF obtains low-rank feature representations of drug similarity and target similarity, and then uses a soft regularization term to constrain the approximation between drug (target) similarity features and drug (target) potential features of DTI. To comprehensively evaluate the prediction performance of SRCMF, we conduct cross-validation experiments under three different settings. In terms of the area under the precision-recall curve (AUPR), SRCMF achieves better prediction results than six state-of-the-art methods. Moreover, under different noise levels of similarity data, the prediction performance of SRCMF is much better than that of collaborative matrix factorization. In conclusion, SRCMF is robust, leading to improved performance in drug-target interaction prediction.

Open Access Issue
Clinical Big Data and Deep Learning: Applications, Challenges, and Future Outlooks
Big Data Mining and Analytics 2019, 2 (4): 288-305
Published: 05 August 2019

The explosion of digital healthcare data has led to a surge of data-driven medical research based on machine learning. In recent years, as a powerful technique for big data, deep learning has gained a central position in machine learning circles for its great advantages in feature representation and pattern recognition. This article presents a comprehensive overview of studies that employ deep learning methods to deal with clinical data. First, based on an analysis of the characteristics of clinical data, various types of clinical data (e.g., medical images, clinical notes, lab results, vital signs, and demographic information) are discussed, and details of some public clinical datasets are provided. Second, a brief review of common deep learning models and their characteristics is conducted. Then, considering the wide range of clinical research and the diversity of data types, several deep learning applications for clinical data are illustrated: auxiliary diagnosis, prognosis, early warning, and other tasks. Although there are challenges involved in applying deep learning techniques to clinical data, it is still worthwhile to look forward to a promising future for deep learning applications in clinical big data in the direction of precision medicine.

Open Access Issue
LSTM Based Reserve Prediction for Bank Outlets
Tsinghua Science and Technology 2019, 24 (1): 77-85
Published: 08 November 2018

Reserve allocation is a significant problem faced by commercial banking businesses every day. To satisfy the cash requirements of customers and abate the vault cash pressure, commercial banks need to appropriately allocate reserves for each bank outlet. Excessive reserves would impact the revenue of bank outlets. Low reserves cannot guarantee the successful operation of bank outlets. Considering that the reserve requirement is affected by the past cash balance, we treat the reserve allocation problem as a time series prediction problem, and the Long Short-Term Memory (LSTM) network is adapted to solve it. In addition, the proposed LSTM prediction model regards date property, which can affect the cash balance, as a primary factor. The experimental results show that our method outperforms some existing traditional methods.
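The time-series framing described in this abstract can be illustrated with a small data-preparation sketch. It is not the authors' pipeline: the function name `make_windows`, the seven-day window, and the two date properties (weekday index and weekend flag) are assumptions chosen for illustration. Each sample pairs a window of past cash balances and the date properties of the target day with the balance to be predicted, which is the input shape a sequence model such as an LSTM would typically consume.

```python
from datetime import date, timedelta


def make_windows(balances, start_day, window=7):
    """Turn a daily cash-balance series into (features, target) pairs.

    Each sample holds the previous `window` balances plus two date
    properties of the target day: weekday index (Monday = 0) and a
    weekend flag. Both date properties are illustrative assumptions.
    """
    samples = []
    for t in range(window, len(balances)):
        target_day = start_day + timedelta(days=t)
        history = list(balances[t - window:t])
        weekday = target_day.weekday()
        date_features = [weekday, int(weekday >= 5)]
        samples.append((history + date_features, balances[t]))
    return samples


# Hypothetical daily balances starting on Monday, 1 January 2018.
balances = [100.0, 120.0, 90.0, 110.0, 130.0, 80.0, 70.0, 105.0, 125.0, 95.0]
samples = make_windows(balances, date(2018, 1, 1), window=7)
```

With ten days of data and a seven-day window, this yields three training samples, each with nine input features.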

FPC: A New Approach to Firewall Policies Compression
Tsinghua Science and Technology 2019, 24 (1): 65-76
Published: 08 November 2018

Firewalls are crucial elements that enhance network security by examining the field values of every packet and deciding whether to accept or discard a packet according to the firewall policies. With the development of networks, the number of rules in firewalls has rapidly increased, consequently degrading network performance. In addition, because most real-life firewalls have been plagued with policy conflicts, malicious traffic can be allowed or legitimate traffic can be blocked. Moreover, because of the complexity of the firewall policies, it is very important to reduce the number of rules in a firewall while keeping the rule semantics unchanged and the target firewall rules conflict-free. In this study, we make three major contributions. First, we present a new approach in which a geometric model, the multidimensional rectilinear polygon, is constructed for the firewall rules compression problem. Second, we propose a new scheme, Firewall Policies Compression (FPC), to compress the multidimensional firewall rules based on this geometric model. Third, we conduct extensive experiments to evaluate the performance of the proposed method. The experimental results demonstrate that the FPC method outperforms the existing approaches in terms of compression ratio and efficiency while maintaining conflict-free firewall rules.
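The goal of reducing the rule count while keeping the semantics unchanged can be shown on a deliberately simplified case. The sketch below is not the FPC scheme and does not use the multidimensional rectilinear polygon model: it assumes a single field (for example, a destination port) and pairwise disjoint, conflict-free rules, and merges adjacent ranges that share an action. The names `merge_rules` and `decide` are hypothetical.

```python
def merge_rules(rules):
    """Merge adjacent single-field rules that share the same action.

    `rules` is a list of (lo, hi, action) tuples with inclusive,
    pairwise disjoint ranges (an assumption of this sketch).
    """
    merged = []
    for lo, hi, action in sorted(rules):
        if merged and merged[-1][2] == action and merged[-1][1] + 1 == lo:
            # Extend the previous range instead of adding a new rule.
            merged[-1] = (merged[-1][0], hi, action)
        else:
            merged.append((lo, hi, action))
    return merged


def decide(rules, value):
    """Return the action of the rule covering `value`, or None if no rule matches."""
    for lo, hi, action in rules:
        if lo <= value <= hi:
            return action
    return None


original = [(0, 79, "deny"), (80, 80, "accept"), (81, 443, "accept"), (444, 65535, "deny")]
compressed = merge_rules(original)
```

Here four rules compress to three, and `decide` returns the same action for every port under both rule sets, which is the semantics-preserving property the abstract requires.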

Open Access Issue
A Hybrid Algorithm Based on Binary Chemical Reaction Optimization and Tabu Search for Feature Selection of High-Dimensional Biomedical Data
Tsinghua Science and Technology 2018, 23 (6): 733-743
Published: 15 October 2018

In recent years, there have been rapid developments in various bioinformatics technologies, which have led to the accumulation of a large amount of biomedical data. The biomedical data can be analyzed to enhance assessment of at-risk patients and improve disease diagnosis, treatment, and prevention. However, these datasets usually have many features, which contain much irrelevant or redundant information. Feature selection is a solution that involves finding the optimal subset, which is known to be an NP-hard problem because of the large search space. Considering this, a new feature selection approach based on the Binary Chemical Reaction Optimization (BCRO) algorithm and the k-Nearest Neighbors (KNN) classifier is presented in this paper. Tabu search is integrated with the CRO framework to enhance local search capacity. KNN is adopted to evaluate the quality of each selected candidate subset. The results of an experiment conducted on nine standard medical datasets demonstrate that the proposed approach outperforms other state-of-the-art methods.

Open Access Issue
Applications of Deep Learning to MRI Images: A Survey
Big Data Mining and Analytics 2018, 1 (1): 1-18
Published: 25 January 2018

Deep learning provides exciting solutions in many fields, such as image analysis, natural language processing, and expert systems, and is seen as a key method for various future applications. Owing to its non-invasive nature and good soft-tissue contrast, Magnetic Resonance Imaging (MRI) has been attracting increasing attention in recent years. With the development of deep learning, many innovative deep learning methods have been proposed to improve MRI image processing and analysis performance. The purpose of this article is to provide a comprehensive overview of deep learning-based MRI image processing and analysis. First, a brief introduction of deep learning and imaging modalities of MRI images is given. Then, common deep learning architectures are introduced. Next, deep learning applications of MRI images, such as image detection, image registration, image segmentation, and image classification, are discussed. Subsequently, the advantages and weaknesses of several common tools are discussed, and several deep learning tools in the applications of MRI images are presented. Finally, an objective assessment of deep learning in MRI applications is presented, and future developments and trends with regard to deep learning for MRI images are addressed.

Open Access Issue
Framework to Identify Protein Complexes Based on Similarity Preclustering
Tsinghua Science and Technology 2017, 22 (1): 42-51
Published: 26 January 2017

Proteins interact with each other to form protein complexes, and cell functionality depends on both protein interactions and these complexes. Based on the assumption that protein complexes are highly connected and correspond to the dense regions in Protein-protein Interaction Networks (PINs), many methods have been proposed to identify the dense regions in PINs. Because protein complexes may be formed by proteins with similar properties, such as topological and functional properties, in this paper, we propose a protein complex identification framework (KCluster). In KCluster, a PIN is divided into K subnetworks using a K-means algorithm, and each subnetwork comprises proteins of similar degrees. We adopt a strategy based on the expected number of common neighbors to detect the protein complexes in each subnetwork. Moreover, we identify the protein complexes spanning two subnetworks by combining closely linked protein complexes from different subnetworks. Finally, we refine the predicted protein complexes using protein subcellular localization information. We apply KCluster and nine existing methods to identify protein complexes from a highly reliable yeast PIN. The results show that KCluster achieves higher Sn and Sp values and f-measures than the other nine methods. Furthermore, the number of perfect matches predicted by KCluster is significantly higher than that of the other nine methods.
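The preclustering step, in which a PIN is split into K groups of proteins with similar degrees, can be sketched as follows. This is a minimal illustration and not the KCluster implementation: it assumes node degree is the only clustering feature, uses a plain one-dimensional Lloyd's algorithm with a deterministic initialisation, and leaves out the later stages (common-neighbor detection, merging across subnetworks, and localization-based refinement). All function names are hypothetical.

```python
def degrees(edges):
    """Count the degree of every node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns a cluster index per value."""
    distinct = sorted(set(values))
    # Deterministic initialisation: spread centres over the sorted distinct values.
    centres = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centres[c])) for x in values]
        new_centres = []
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous centre.
            new_centres.append(sum(members) / len(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def partition_by_degree(edges, k):
    """Group nodes into at most k sets of similar degree."""
    deg = degrees(edges)
    nodes = sorted(deg)
    labels = kmeans_1d([deg[n] for n in nodes], k)
    groups = {}
    for node, lab in zip(nodes, labels):
        groups.setdefault(lab, set()).add(node)
    return list(groups.values())


# Toy network: hub "h" linked to a, b, c, d, plus one extra edge a-b.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "b")]
groups = partition_by_degree(edges, k=2)
```

On this toy network the hub ends up in its own group and the four low-degree nodes form the other.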

Open Access Issue
Computational Approaches for Prioritizing Candidate Disease Genes Based on PPI Networks
Tsinghua Science and Technology 2015, 20 (5): 500-512
Published: 13 October 2015

With the continuing development and improvement of genome-wide techniques, a great number of candidate genes have been discovered. How to identify the most likely disease genes among a large number of candidates has become a fundamental challenge in human health. A common view is that genes related to a specific or similar disease tend to reside in the same neighbourhood of biomolecular networks. Recently, based on such observations, many methods have been developed to tackle this challenge. In this review, we first introduce the concept of disease genes, their properties, and available data for identifying them. Then we review the recent computational approaches for prioritizing candidate disease genes based on Protein-Protein Interaction (PPI) networks and investigate their advantages and disadvantages. Furthermore, some pieces of existing software and network resources are summarized. Finally, we discuss key issues in prioritizing candidate disease genes and point out some future research directions.

Open Access Issue
Genome-Wide Interaction-Based Association of Human Diseases — A Survey
Tsinghua Science and Technology 2014, 19 (6): 596-616
Published: 20 November 2014

Genome-Wide Association Studies (GWASs) aim to identify genetic variants that are associated with disease by assaying and analyzing hundreds of thousands of Single Nucleotide Polymorphisms (SNPs). Although traditional single-locus statistical approaches have been standardized and have led to many interesting findings, a substantial number of recent GWASs indicate that for most disorders, the individual SNPs explain only a small fraction of the genetic causes. Consequently, exploring multi-SNP interactions in the hope of discovering more significant associations has attracted increasing attention. Due to the huge search space for complicated multi-locus interactions, many fast and effective methods have recently been proposed for detecting disease-associated epistatic interactions using GWAS data. In this paper, we provide a critical review and comparison of eight popular methods, i.e., BOOST, TEAM, epiForest, EDCF, SNPHarvester, epiMODE, MECPM, and MIC, which are used for detecting gene-gene interactions among genetic loci. In view of the assumption model on the data and the searching strategies, we divide the methods into seven categories. Moreover, the evaluation methodologies, including detection power, disease models for simulation, resources of real GWAS data, and the control of the false discovery rate, are elaborated as references for new approach developers. At the end of the paper, we summarize the methods and discuss the future directions in genome-wide association studies for detecting epistatic interactions.
