Sort:
Regular Paper Issue
Minimum Epsilon-Kernel Computation for Large-Scale Data Processing
Journal of Computer Science and Technology 2022, 37 (6): 1398-1411
Published: 30 November 2022

Kernel is a kind of data summary which is elaborately extracted from a large dataset. Given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio. It is widely used in geometric optimization, clustering, and approximate query processing, etc., for scaling them up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation that asks for a kernel of the smallest size for large-scale data processing. For the open problem presented by Wang et al. that whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. Then, we discuss the update of MK under insertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets. Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.

Regular Paper Issue
GAM: A GPU-Accelerated Algorithm for MaxRS Queries in Road Networks
Journal of Computer Science and Technology 2022, 37 (5): 1005-1025
Published: 30 September 2022

In smart phones, vehicles and wearable devices, GPS sensors are ubiquitous and collect a lot of valuable spatial data from the real world. Given a set of weighted points and a rectangle r in the space, a maximizing range sum (MaxRS) query is to find the position of r, so as to maximize the total weight of the points covered by r (i.e., the range sum). It has a wide spectrum of applications in spatial crowdsourcing, facility location and traffic monitoring. Most of the existing research focuses on the Euclidean space; however, in real life, the user's moving route is constrained by the road network, and the existing MaxRS query algorithms in the road network are inefficient. In this paper, we propose a novel GPU-accelerated algorithm, namely, GAM, to tackle MaxRS queries in road networks in two phases efficiently. In phase 1, we partition the entire road network into many small cells by a grid and theoretically prove the correctness of parallel query results by grid shifting, and then we propose an effective multi-grained pruning technique, by which the majority of cells can be pruned without further checking. In phase 2, we design a GPU-friendly storage structure, cell-based road network (CRN), and a two-level parallel framework to compute the final result in the remaining cells. Finally, we conduct extensive experiments on two real-world road networks, and the experimental results demonstrate that GAM is on average one order faster than state-of-the-art competitors, and the maximum speedup can achieve about 55 times.

Regular Paper Issue
Interval Estimation for Aggregate Queries on Incomplete Data
Journal of Computer Science and Technology 2019, 34 (6): 1203-1216
Published: 22 November 2019

Incomplete data has been a longstanding issue in the database community, and the subject is yet poorly handled by both theories and practices. One common way to cope with missing values is to complete their imputation (filling in) as a preprocessing step before analyses. Unfortunately, not a single imputation method could impute all missing values correctly in all cases. Users could hardly trust the query result on such complete data without any confidence guarantee. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than to impute the missing values. An interval estimation, composed of the upper and the lower bound of aggregate query results among all possible interpretations of missing values, is presented to the end users. The ground-truth aggregate result is guaranteed to be among the interval. We believe that decision support applications could benefit significantly from the estimation, since they can tolerate inexact answers, as long as there are clearly defined semantics and guarantees associated with the results. Our main techniques are parameter-free and do not assume prior knowledge about the distribution and missingness mechanisms. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable to better assess the results of aggregate queries on incomplete data.

Open Access Issue
An Efficient EH-WSN Energy Management Mechanism
Tsinghua Science and Technology 2018, 23 (4): 406-418
Published: 16 August 2018
Downloads:24

An Energy-Harvesting Wireless Sensor Network (EH-WSN) depends on harvesting energy from theenvironment to prolong network lifetime. Subjected to limited energy in complex environments, an EH-WSN encounters difficulty when applied to real environments as the network efficiency is reduced. Existing EH-WSN studies are usually conducted in assumed conditions in which nodes are synchronized and the energy profile is knowable or calculable. In real environments, nodes may lose their synchronization due to lack of energy. Furthermore, energy harvesting is significantly affected by multiple factors, whereas the ideal hypothesis is difficult to achieve in reality. In this paper, we introduce a general Intermittent Energy-Aware (IEA) EH-WSN platform. For the first time, we adopted a double-stage capacitor structure to ensure node synchronization in situations without energy harvesting, and we used an integrator to achieve ultra-low power measurement. With regard to hardware and software, we provided an optimized energy management mechanism for intermittent functioning. This paper describes the overall design of the IEA platform, and elaborates the energy management mechanism from the aspects of energy management, energy measurement, and energy prediction. In addition, we achieved node synchronization in different time and energy environments, measured the energy in reality, and proposed the light weight energy calculation method based on measured solar energy. In real environments, experiments are performed to verify the high performance of IEA in terms of validity and reliability. The IEA platform is shown to have ultra-low power consumption and high accuracy for energy measurement and prediction.

Regular Paper Issue
CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing
Journal of Computer Science and Technology 2018, 33 (2): 366-379
Published: 23 March 2018

Recently there is an increasing need for interactive human-driven analysis on large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before a long wait of the final accurate query result, has drawn significant research attention. However, the direct processing of OLA on duplicate data will lead to incorrect query answers, since sampling from duplicate records leads to an over representation of the duplicate data in the sample. This violates the prerequisite of uniform distributions in most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA retrieves block-level samples continuously from the dataset, and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After cleaning the sample, an unbiased estimator is provided to address the error bias that is introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.

total 5