Scholar - SciOpen

Online Judge (OJ) systems are a basic and important component of computer education. Here, we present MetaOJ, an OJ system that can be used for holding massive programming tests online. MetaOJ is designed to create a distributed, fault-tolerant, and easy-to-scale OJ system from an existing ordinary OJ system by adding several interfaces into it and creating multiple instances of it. Our case on modifying the TUOJ system shows that the modification adds no more than 3% lines of code and the performance loss on a single OJ instance is no more than 12%. We also introduce mechanisms to integrate the system with cloud infrastructure to automate the deployment process. MetaOJ provides a solution for those OJ systems that are designed for a specific programming contest and are now facing performance bottlenecks.

Perspective Issue

Graph Accelerators—A Case for Sparse Data Processing

Wen-Guang Chen

Journal of Computer Science and Technology 2024, 39(2): 243-244

Published: 30 March 2024

Abstract Collect Collected

Open Access Issue

Building a High-Performance Graph Storage on Top of Tree-Structured Key-Value Stores

Heng Lin, Zhiyong Wang, Shipeng Qi, Xiaowei Zhu, Chuntao Hong, Wenguang Chen, Yingwei Luo

Big Data Mining and Analytics 2024, 7(1): 156-170

Published: 25 December 2023

Abstract

PDF (2.7 MB) Collect Collected

Downloads：534

Graph databases have gained widespread adoption in various industries and have been utilized in a range of applications, including financial risk assessment, commodity recommendation, and data lineage tracking. While the principles and design of these databases have been the subject of some investigation, there remains a lack of comprehensive examination of aspects such as storage layout, query language, and deployment. The present study focuses on the design and implementation of graph storage layout, with a particular emphasis on tree-structured key-value stores. We also examine different design choices in the graph storage layer and present our findings through the development of TuGraph, a highly efficient single-machine graph database that significantly outperforms well-known Graph DataBase Management System (GDBMS). Additionally, TuGraph demonstrates superior performance in the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB) interactive benchmark.

Open Access Issue

LotusSQL: SQL Engine for High-Performance Big Data Systems

Xiaohan Li, Bowen Yu, Guanyu Feng, Haojie Wang, Wenguang Chen

Big Data Mining and Analytics 2021, 4(4): 252-265

Published: 26 August 2021

Abstract

PDF (2.6 MB) Collect Collected

Downloads：960

In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL). SparkSQL provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from the inefficiency of Spark resulting from Java virtual machine and the unnecessary data serialization and deserialization. Adopting native languages such as C++ could help to avoid such bottlenecks. Benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging efforts. In this work, we present LotusSQL, an engine to provide SQL support for dataset abstraction on a native backend Lotus. We employ a convenient SQL processing framework to deal with frontend jobs. Advanced query optimization technologies are added to improve the quality of execution plans. Above the storage design and user interface of the compute engine, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to $9 \times$ in certain queries and outperforms Spark SQL in a standard query benchmark by more than $2 \times$ on average.

Open Access Issue

AIPerf: Automated Machine Learning as an AI-HPC Benchmark

Zhixiang Ren, Yongheng Liu, Tianhui Shi, Lei Xie, Yue Zhou, Jidong Zhai, Youhui Zhang, Yunquan Zhang, Wenguang Chen

Big Data Mining and Analytics 2021, 4(3): 208-220

Published: 12 May 2021

Abstract

PDF (10.3 MB) Collect Collected

Downloads：237

The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect the AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios, but also is auto-adaptively scalable to various scales of machines. We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), which is measured in an analytical and systematic approach, as a major metric to quantify the AI performance. We perform evaluations on various systems to ensure the benchmark’s stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and single metric, AIPerf can easily scale on and rank AI-HPC, providing a powerful benchmark suite for the coming supercomputing era.

Open Access Issue

Helmholtz Solving and Performance Optimization in Global/Regional Assimilation and Prediction System

Jianqiang Huang, Wei Xue, Haodong Bian, Wenxin Yan, Xiaoying Wang, Wenguang Chen

Tsinghua Science and Technology 2021, 26(3): 335-346

Published: 12 October 2020

Abstract

PDF (13.4 MB) Collect Collected

Downloads：120

Despite efficient parallelism in the solution of physical parameterization in the Global/Regional Assimilation and Prediction System (GRAPES), the Helmholtz equation in the dynamic core, with the increase of resolution, can hardly achieve sufficient parallelism in the solving process due to a large amount of communication and irregular access. In this paper, optimizing the Helmholtz equation solution for better performance and higher efficiency has been an urgent task. An optimization scheme for the parallel solution of the Helmholtz equation is proposed in this paper. Specifically, the geometrical multigrid optimization strategy is designed by taking advantage of the data anisotropy of grid points near the pole and the isotropy of those near memory equator in the Helmholtz equation, and the Incomplete LU (ILU) decomposition preconditioner is adopted to speed up the convergence of the improved Generalized Conjugate Residual (GCR), which effectively reduces the number of iterations and the computation time. The overall solving performance of the Helmholtz equation is improved by thread-level parallelism, vectorization, and reuse of data in the cache. The experimental results show that the proposed optimization scheme can effectively eliminate the bottleneck of the Helmholtz equation as regards the solving speed. Considering the test results on a 10-node two-way server, the solution of the Helmholtz equation, compared with the original serial version, is accelerated by $100 \times$ , with one-third of iterations reduced.

Regular Paper Issue

Preface

Wen-Guang Chen, Ying-Wei Luo, Guang-Yu Sun

Journal of Computer Science and Technology 2020, 35(2): 379-381

Published: 27 March 2020

Abstract Collect Collected

Open Access Issue

Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer

Jianqiang Huang, Wentao Han, Xiaoying Wang, Wenguang Chen

Tsinghua Science and Technology 2020, 25(1): 56-67

Published: 22 July 2019

Abstract

PDF (15.6 MB) Collect Collected

Downloads：91

A Weighted Essentially Non-Oscillatory scheme (WENO) is a solution to hyperbolic conservation laws, suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system’s architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO on Sunway TaihuLight. We analyzed characteristics of kernel functions, and proposed an appropriate heterogeneous parallel model. We also figured out the best division strategy for computing tasks, and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on one single node, and additional 58× speedup on 64 nodes, with nearly linear scalability.

Open Access Issue

Auxo: A Temporal Graph Management System

Wentao Han, Kaiwei Li, Shimin Chen, Wenguang Chen

Big Data Mining and Analytics 2019, 2(1): 58-71

Published: 19 November 2018

Abstract

PDF (752.8 KB) Collect Collected

Downloads：74

As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph. We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the abovementioned goals. First, by carefully choosing the time split policy, Auxo achieves linear complexity in both space usage and query time. Second, graph splitting further improves the worst-case query time, and reduces the performance variance introduced by splitting operations. Third, Auxo optimizes the data layout inside chunks, thereby significantly improving the performance of traverse-based graph queries. Experimental evaluation showed that Auxo achieved $2.9 \times$ to $12.1 \times$ improvement for global queries, and $1.7 \times$ to $2.7 \times$ improvement for local queries, as compared with state-of-the-art open-source solutions.

Regular Paper Issue

Preface

Wen-Guang Chen, Xue-Ming Si

Journal of Computer Science and Technology 2018, 33(3): 531-532

Published: 11 May 2018

Abstract Collect Collected

Total 13