Scholar - SciOpen

Open Access Issue

A Survey on Data Asset Value Change Estimation and Appreciation with Data Governance

Xiaoou Ding, Genglong Li, Yafeng Tang, Chen Liang, Tianren Yu, Muyun Zhou, Yida Liu, Zekai Qian, Zixuan Song, Hongzhi Wang

Big Data Mining and Analytics 2025, 8(6): 1369-1387

Published: 19 September 2025

Abstract

PDF (1.2 MB) Collect Collected

Downloads：220

In the digital economy, data assets have come to be regarded as the new oil, underscoring their critical role in modern business models and decision-making processes. In response, the Chinese government has prioritized the formalization and management of data assets, introducing policies aimed at enhancing their value. Given the unique nature of data assets, characterized by the potential for both depreciation and appreciation, precise methods for assessing value changes and realizing the appreciation of data assets are urgently needed. Effective data governance techniques, including data cleaning, acquisition, and integration, are essential for maximizing the economic potential of data assets. Against this backdrop, this survey explores two key issues from a data governance perspective: the enhancement of data asset value and the quantification of its changes. It is structured around two primary dimensions: first, by examining data assets’ inherent properties and quality indicators, and second, by utilizing an “on-demand evaluation” approach that assesses value of data assets in response to the performance of downstream machine learning models. By advancing understanding of these issues, this study seeks to optimize strategies for maximizing the economic impact of data assets through refined data governance practices.

Regular Paper Issue

Fine-Tuning Channel-Pruned Deep Model via Knowledge Distillation

Chong Zhang, Hong-Zhi Wang, Hong-Wei Liu, Yi-Lin Chen

Journal of Computer Science and Technology 2024, 39(6): 1238-1247

Published: 16 January 2025

Abstract Collect Collected

Deep convolutional neural networks with high performance are hard to be deployed in many real world applications, since the computing resources of edge devices such as smart phones or embedded GPU are limited. To alleviate this hardware limitation, the compression of deep neural networks from the model side becomes important. As one of the most popular methods in the spotlight, channel pruning of the deep convolutional model can effectively remove redundant convolutional channels from the CNN (convolutional neural network) without affecting the network’s performance remarkably. Existing methods focus on pruning design, evaluating the importance of different convolutional filters in the CNN model. A fast and effective fine-tuning method to restore accuracy is urgently needed. In this paper, we propose a fine-tuning method KDFT (Knowledge Distillation Based Fine-Tuning), which improves the accuracy of fine-tuned models with almost negligible training overhead by introducing knowledge distillation. Extensive experimental results on benchmark datasets with representative CNN models show that up to 4.86% accuracy improvement and 79% time saving can be obtained.

Open Access Issue

IoT data cleaning techniques: A survey

Xiaoou Ding, Hongzhi Wang, Genglong Li, Haoxuan Li, Yingze Li, Yida Liu

Intelligent and Converged Networks 2022, 3(4): 325-339

Published: 30 December 2022

Abstract

PDF (2 MB) Collect Collected

Downloads：476

Data cleaning is considered as an effective approach of improving data quality in order to help practitioners and researchers be devoted to downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time series characteristics, including error data detection and data repairing. In respect to error data detection techniques, it categorizes an overview of quantitative data error detection methods for detecting single-point errors, continuous errors, and multidimensional time series data errors and qualitative data error detection methods for detecting rule-violating errors. Besides, it provides a detailed description of error data repairing techniques, involving statistics-based repairing, rule-based repairing, and human-involved repairing. We review the strengths and the limitations of the current data cleaning techniques under IoT data applications and conclude with an outlook on the future of IoT data cleaning.

Regular Paper Issue

Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

Zhi-Xin Qi, Hong-Zhi Wang, An-Jie Wang

Journal of Computer Science and Technology 2021, 36(4): 806-821

Published: 05 July 2021

Abstract Collect Collected

Data quality issues have attracted widespread attentions due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied on the selection of the appropriate model with the consideration of data quality and the determination of the data share to clean. However, rare research has focused on exploring such relationship. Motivated by this, this paper conducts an experimental comparison for the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that dirty-data impacts are related to the error type, the error rate, and the data size. Based on the findings, we suggest users leverage our proposed metrics, sensibility and data quality inflection point, for model selection and data cleaning.

Open Access Issue

Effective Density-Based Clustering Algorithms for Incomplete Data

Zhonghao Xue, Hongzhi Wang

Big Data Mining and Analytics 2021, 4(3): 183-194

Published: 12 May 2021

Abstract

PDF (5.5 MB) Collect Collected

Downloads：143

Density-based clustering is an important category among clustering algorithms. In real applications, many datasets suffer from incompleteness. Traditional imputation technologies or other techniques for handling missing values are not suitable for density-based clustering and decrease clustering result quality. To avoid these problems, we develop a novel density-based clustering approach for incomplete data based on Bayesian theory, which conducts imputation and clustering concurrently and makes use of intermediate clustering results. To avoid the impact of low-density areas inside non-convex clusters, we introduce a local imputation clustering algorithm, which aims to impute points to high-density local areas. The performances of the proposed algorithms are evaluated using ten synthetic datasets and five real-world datasets with induced missing values. The experimental results show the effectiveness of the proposed algorithms.

Open Access Issue

Mining Conditional Functional Dependency Rules on Big Data

Mingda Li, Hongzhi Wang, Jianzhong Li

Big Data Mining and Analytics 2020, 3(1): 68-84

Published: 19 December 2019

Abstract

PDF (964.3 KB) Collect Collected

Downloads：135

Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply on large and low-quality datasets. To handle the volume issue of big data, we develop the sampling algorithms to obtain a small representative training set. We design the fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.

Open Access Issue

Trajectory Big Data Processing Based on Frequent Activity

Amina Belhassena, Hongzhi Wang

Tsinghua Science and Technology 2019, 24(3): 317-332

Published: 24 January 2019

Abstract

PDF (15.3 MB) Collect Collected

Downloads：50

With the rapid development and wide use of Global Positioning System in technology tools, such as smart phones and touch pads, many people share their personal experience through their trajectories while visiting places of interest. Therefore, trajectory query processing has emerged in recent years to help users find their best trajectories. However, with the huge amount of trajectory points and text descriptions, such as the activities practiced by users at these points, organizing these data in the index becomes tedious. Therefore, the parallel method becomes indispensable. In this paper, we have investigated the problem of distributed trajectory query processing based on the distance and frequent activities. The query is specified by start and final points in the trajectory, the distance threshold, and a set of frequent activities involved in the point of interest of the trajectory. As a result, the query returns the shortest trajectory including the most frequent activities with high support and high confidence. To simplify the query processing, we have implemented the Distributed Mining Trajectory R-Tree index (DMTR-Tree). For this method, we initially managed the large trajectory dataset in distributed R-Tree indexes. Then, for each index, we applied the frequent itemset Apriori algorithm for each point to select the frequent activity set. For the faster computation of the above algorithms, we utilized the cluster computing framework of Apache Spark with MapReduce as the programing model. The experimental results show that the DMTR-Tree index and the query-processing algorithm are efficient and can achieve the scalability.

Regular Paper Issue

O2iJoin: An Efficient Index-Based Algorithm for Overlap Interval Join

Ji-Zhou Luo, Sheng-Fei Shi, Guang Yang, Hong-Zhi Wang, Jian-Zhong Li

Journal of Computer Science and Technology 2018, 33(5): 1023-1038

Published: 12 September 2018

Abstract Collect Collected

Time intervals are often associated with tuples to represent their valid time in temporal relations, where overlap join is crucial for various kinds of queries. Many existing overlap join algorithms use indices based on tree structures such as quad-tree, B⁺-tree and interval tree. These algorithms usually have high CPU cost since deep path traversals are unavoidable, which makes them not so competitive as data-partition or plane-sweep based algorithms. This paper proposes an efficient overlap join algorithm based on a new two-layer flat index named as Overlap Interval Inverted Index (i.e., O2i Index). It uses an array to record the end points of intervals and approximates the nesting structures of intervals via two functions in the first layer, and the second layer uses inverted lists to trace all intervals satisfying the approximated nesting structures. With the help of the new index, the join algorithm only visits the must-be-scanned lists and skips all others. Analyses and experiments on both real and synthetic datasets show that the proposed algorithm is as competitive as the state-of-the-art algorithms.

Open Access Issue

Truth Discovery on Inconsistent Relational Data

Jizhou Sun, Jianzhong Li, Hong Gao, Hongzhi Wang

Tsinghua Science and Technology 2018, 23(3): 288-302

Published: 02 July 2018

Abstract

PDF (2 MB) Collect Collected

Downloads：43

In this era of big data, data are often collected from multiple sources that have different reliabilities, and there is inevitable conflict with respect to the various information obtained when it relates to the the same object. One important task is to identify the most trustworthy value out of all the conflicting claims, and this is known as truth discovery. Existing truth discovery methods simultaneously identify the most trustworthy information and source reliability degrees and are based on the idea that more reliable sources often provide more trustworthy information, and vice versa. However, there are often semantic constrains defined upon relational database, which can be violated by a single data source. To remove violations, an important task is to repair data to satisfy the constrains, and this is known as data cleaning. The two problems above may coexist, but considering them together can provide some benefits, and to the authors knowledge, this has not yet been the focus of any research. In this paper, therefore, a schema-decomposing based method is proposed to simultaneously discover the truth and to clean the data, with the aim of improving accuracy. Experimental results using real world data sets of notebooks and mobile phones, as well as simulated data sets, demonstrate the effectiveness and efficiency of our proposed method.

Open Access Issue

A Generic Data Analytics System for Manufacturing Production

Hao Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao

Big Data Mining and Analytics 2018, 1(2): 160-171

Published: 12 April 2018

Abstract

PDF (663 KB) Collect Collected

Downloads：138

The increase in the amount of manufacturing information available means that big data can be collected and, with appropriate deep analysis, could be of great value to manufacturers. However, most small manufacturers cannot afford the overhead of a professional data analytics team. To address this problem, in this paper a generic data analytics system, Generic Manufacturing Data Analytics system (GMDA), is proposed. This system can perform most manufacturing data analytics tasks and users can easily carry out data analysis even if they have no prior knowledge or experience of data analytics. To establish such a system, we designed an abstract language, GMDL, to describe the manufacturing data analytics tasks. Aimed at factory data analytics, several algorithms were selected, tuned, optimized, and finally integrated into the system. Some noteworthy techniques were developed in GMDA such as proper algorithm selection strategy and an optimal parameter determination algorithm. Case studies show the practicability and reliability of the system.