Perspective Issue
Graph Accelerators—A Case for Sparse Data Processing
Journal of Computer Science and Technology 2024, 39 (2): 243-244
Published: 30 March 2024
Abstract Collect
Open Access Issue
Building a High-Performance Graph Storage on Top of Tree-Structured Key-Value Stores
Big Data Mining and Analytics 2024, 7 (1): 156-170
Published: 25 December 2023
Abstract PDF (2.7 MB) Collect

Graph databases have gained widespread adoption in various industries and have been utilized in a range of applications, including financial risk assessment, commodity recommendation, and data lineage tracking. While the principles and design of these databases have been the subject of some investigation, there remains a lack of comprehensive examination of aspects such as storage layout, query language, and deployment. The present study focuses on the design and implementation of graph storage layout, with a particular emphasis on tree-structured key-value stores. We also examine different design choices in the graph storage layer and present our findings through the development of TuGraph, a highly efficient single-machine graph database that significantly outperforms well-known Graph DataBase Management System (GDBMS). Additionally, TuGraph demonstrates superior performance in the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB) interactive benchmark.

Open Access Issue
LotusSQL: SQL Engine for High-Performance Big Data Systems
Big Data Mining and Analytics 2021, 4 (4): 252-265
Published: 26 August 2021
Abstract PDF (2.6 MB) Collect

In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL). SparkSQL provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from the inefficiency of Spark resulting from Java virtual machine and the unnecessary data serialization and deserialization. Adopting native languages such as C++ could help to avoid such bottlenecks. Benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging efforts. In this work, we present LotusSQL, an engine to provide SQL support for dataset abstraction on a native backend Lotus. We employ a convenient SQL processing framework to deal with frontend jobs. Advanced query optimization technologies are added to improve the quality of execution plans. Above the storage design and user interface of the compute engine, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× in certain queries and outperforms Spark SQL in a standard query benchmark by more than 2× on average.

Open Access Issue
AIPerf: Automated Machine Learning as an AI-HPC Benchmark
Big Data Mining and Analytics 2021, 4 (3): 208-220
Published: 12 May 2021
Abstract PDF (10.3 MB) Collect

The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect the AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios, but also is auto-adaptively scalable to various scales of machines. We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), which is measured in an analytical and systematic approach, as a major metric to quantify the AI performance. We perform evaluations on various systems to ensure the benchmark’s stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and single metric, AIPerf can easily scale on and rank AI-HPC, providing a powerful benchmark suite for the coming supercomputing era.

Open Access Issue
MetaOJ: A Massive Distributed Online Judge System
Tsinghua Science and Technology 2021, 26 (4): 548-557
Published: 04 January 2021
Abstract PDF (1.3 MB) Collect

Online Judge (OJ) systems are a basic and important component of computer education. Here, we present MetaOJ, an OJ system that can be used for holding massive programming tests online. MetaOJ is designed to create a distributed, fault-tolerant, and easy-to-scale OJ system from an existing ordinary OJ system by adding several interfaces into it and creating multiple instances of it. Our case on modifying the TUOJ system shows that the modification adds no more than 3% lines of code and the performance loss on a single OJ instance is no more than 12%. We also introduce mechanisms to integrate the system with cloud infrastructure to automate the deployment process. MetaOJ provides a solution for those OJ systems that are designed for a specific programming contest and are now facing performance bottlenecks.

Open Access Issue
Helmholtz Solving and Performance Optimization in Global/Regional Assimilation and Prediction System
Tsinghua Science and Technology 2021, 26 (3): 335-346
Published: 12 October 2020
Abstract PDF (13.4 MB) Collect

Despite efficient parallelism in the solution of physical parameterization in the Global/Regional Assimilation and Prediction System (GRAPES), the Helmholtz equation in the dynamic core, with the increase of resolution, can hardly achieve sufficient parallelism in the solving process due to a large amount of communication and irregular access. In this paper, optimizing the Helmholtz equation solution for better performance and higher efficiency has been an urgent task. An optimization scheme for the parallel solution of the Helmholtz equation is proposed in this paper. Specifically, the geometrical multigrid optimization strategy is designed by taking advantage of the data anisotropy of grid points near the pole and the isotropy of those near memory equator in the Helmholtz equation, and the Incomplete LU (ILU) decomposition preconditioner is adopted to speed up the convergence of the improved Generalized Conjugate Residual (GCR), which effectively reduces the number of iterations and the computation time. The overall solving performance of the Helmholtz equation is improved by thread-level parallelism, vectorization, and reuse of data in the cache. The experimental results show that the proposed optimization scheme can effectively eliminate the bottleneck of the Helmholtz equation as regards the solving speed. Considering the test results on a 10-node two-way server, the solution of the Helmholtz equation, compared with the original serial version, is accelerated by 100×, with one-third of iterations reduced.

Regular Paper Issue
Journal of Computer Science and Technology 2020, 35 (2): 379-381
Published: 27 March 2020
Abstract Collect
Open Access Issue
Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer
Tsinghua Science and Technology 2020, 25 (1): 56-67
Published: 22 July 2019
Abstract PDF (15.6 MB) Collect

A Weighted Essentially Non-Oscillatory scheme (WENO) is a solution to hyperbolic conservation laws, suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system’s architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO on Sunway TaihuLight. We analyzed characteristics of kernel functions, and proposed an appropriate heterogeneous parallel model. We also figured out the best division strategy for computing tasks, and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on one single node, and additional 58× speedup on 64 nodes, with nearly linear scalability.

Open Access Issue
Auxo: A Temporal Graph Management System
Big Data Mining and Analytics 2019, 2 (1): 58-71
Published: 19 November 2018
Abstract PDF (752.8 KB) Collect

As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph. We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the abovementioned goals. First, by carefully choosing the time split policy, Auxo achieves linear complexity in both space usage and query time. Second, graph splitting further improves the worst-case query time, and reduces the performance variance introduced by splitting operations. Third, Auxo optimizes the data layout inside chunks, thereby significantly improving the performance of traverse-based graph queries. Experimental evaluation showed that Auxo achieved 2.9× to 12.1× improvement for global queries, and 1.7× to 2.7× improvement for local queries, as compared with state-of-the-art open-source solutions.

Regular Paper Issue
Journal of Computer Science and Technology 2018, 33 (3): 531-532
Published: 11 May 2018
Abstract Collect
Total 12