Papers & Presenters

Supercomputing Frontiers Asia (SCFA)

 

Big Data

 

HHVSF: A Framework to Accelerate Drug-based High-throughput Virtual Screening on High-Performance Computers (#8)

by Pin Chen, Xin Yan, Jiahui Li, Yunfei Du, Jun Xu (Sun Yat-sen University, China)

Abstract: The High-performance High-throughput Virtual Screening Framework (HHVSF) has been developed to accelerate High-Throughput Virtual Screening (HTVS) on high-performance computers. Task management and data management are the two core components of HHVSF. Fine-grained computing resources are configured to support serial or threaded applications. Each task retrieves its input file from the database through a preemptive algorithm, and failed tasks can be identified and corrected. The NoSQL database MongoDB is used as the data repository engine, and data is staged between the RAMDISK on each computing node and the database. Data analysis is carried out after the computing process, and the results are stored in the database. Among the most popular molecular docking and molecular structure similarity packages, Autodock_vina (ADV) and WEGA were chosen for the experiments. Results show that when ADV was used for molecular docking, 10 million molecules were screened and analyzed in 22.31 h with 16,000 cores, with throughput reaching up to 1,324 molecules per second and averaging 145 molecules per second during the steady-running phase. For WEGA, 958 million conformations were screened and analyzed in 34.12 min with 4,000 cores, with throughput reaching up to 9,448 molecules per second and averaging 6,430 molecules per second.
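
The abstract does not show HHVSF's internal interfaces; purely as a minimal illustrative sketch (not the authors' code), a preemptive claim of a pending task from MongoDB might look like the following, assuming a hypothetical tasks collection with status, input, and worker fields.

```python
# Illustrative sketch only: atomically claim one pending screening task from MongoDB
# so that no two workers receive the same input file (hypothetical schema).
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://db-host:27017")   # assumed database host
tasks = client["hhvsf"]["tasks"]                  # hypothetical collection

def claim_next_task(worker_id: str):
    """Atomically mark one 'pending' task as 'running' and return its document."""
    return tasks.find_one_and_update(
        {"status": "pending"},
        {"$set": {"status": "running", "worker": worker_id}},
        return_document=ReturnDocument.AFTER,
    )

def requeue_failed_tasks():
    """Failed tasks can later be found and reset so that they are retried."""
    tasks.update_many({"status": "failed"}, {"$set": {"status": "pending"}})
```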

 

HBasechainDB — A Scalable Blockchain Framework on Hadoop Ecosystem (#27)

by Manuj Subhankar Sahoo, Pallav Kumar Baruah, Adarsh Saraf (Sri Sathya Sai Institute of Higher Learning, India)

Abstract: Since the introduction of Bitcoin, blockchain has made its way into numerous applications and been adopted by various communities. A number of implementations exist today that provide platforms to carry on business with ease. However, the scalability of blockchain remains an issue. Moreover, none of these frameworks can claim to handle Big Data or to support analytics, which is an important and integral facet of the current world of business. We propose HBasechainDB, a scalable blockchain-based, tamper-proof Big Data store for distributed computing. HBasechainDB adds the blockchain characteristics of immutability and decentralization to the HBase database in the Hadoop ecosystem. Linear scaling is achieved by pushing computation to the data nodes. Because it is built on the Hadoop ecosystem, HBasechainDB inherits efficient Big Data processing, and it makes adopting blockchain very easy for organizations whose business logic already lives in the Hadoop ecosystem. HBasechainDB can be used as a tamper-proof, decentralized, distributed Big Data store.
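
HBasechainDB's actual data model is not given in the abstract; the following is a minimal, hypothetical sketch of the immutability-by-hash-chaining idea only, with each block carrying the hash of its predecessor, independent of any HBase specifics.

```python
# Minimal sketch of hash-chained blocks (hypothetical structure, not HBasechainDB's schema).
import hashlib
import json

def block_hash(block: dict) -> str:
    """Deterministic SHA-256 over the block contents."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> dict:
    """Link a new block to the previous one; altering any earlier block breaks the chain."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"height": len(chain), "prev_hash": prev, "transactions": transactions}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block

chain = []
append_block(chain, [{"from": "a", "to": "b", "amount": 1}])
append_block(chain, [{"from": "b", "to": "c", "amount": 2}])
assert chain[1]["prev_hash"] == chain[0]["hash"]   # tamper evidence via chaining
```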

 

DETOUR: A Large-Scale Non-Blocking Optical Data Center Fabric (#28)

by Jinzhen Bao, Dezun Dong, Baokang Zhao (National University of Defense Technology, China)

Abstract: Optical data center networks (DCNs) are attracting growing interest due to their technical strengths compared to traditional electrical switching networks, notably the elimination of potential hotspots caused by over-subscription. However, evolving traffic with high fan-out and diverse patterns poses new challenges to optical DCNs. Prior solutions either struggle to support high fan-out communication at large scale or suffer from limited connectivity and low performance.

In this paper, we propose DETOUR, a large-scale non-blocking optical switching data center fabric. DETOUR is composed of optical circuit switches (OCSes) connected in a 2D-Torus topology. It supports up to 729 racks and 69K+ ports, with each OCS carrying 96 wavelengths. DETOUR utilizes a broadcast-and-select mechanism and enables signals to be optically forwarded along any dimension. Moreover, it achieves non-blocking operation by recursively adjusting conflicting links between the diagonally forwarding OCSes. Our extensive evaluation results show that DETOUR delivers performance comparable to a non-blocking optical switching fabric. It achieves up to 2.14× higher throughput and reduces flow completion times (FCT) by 34% and energy consumption by 21% compared with state-of-the-art designs.
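
The headline figures can be read as follows, assuming one OCS per rack and a 27 × 27 torus (the per-dimension size is not stated explicitly in the abstract):

```latex
27 \times 27 = 729 \ \text{OCSes (one per rack)}, \qquad
729 \ \text{OCSes} \times 96 \ \text{wavelengths} = 69{,}984 \approx 69\text{K+ ports}.
```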

 

Querying Large Scientific Data Sets with Adaptable IO System ADIOS (#36)

by Junmin Gu, Scott Klasky, Norbert Podhorszki, Ji Qiang, Kesheng Wu (Lawrence Berkeley National Laboratory, USA)

Abstract: When working with a large dataset, only a relatively small fraction of the data records are of interest in each analysis operation. For example, while examining a billion-particle dataset from an accelerator model, scientists might focus on a few thousand of the fastest particles, or on the particle farthest from the beam center. In general, this type of selective data access is challenging because the selected data records could be anywhere in the dataset and require a significant amount of time to locate and retrieve. In this paper, we report our experience of addressing this data access challenge with the Adaptable IO System, ADIOS. More specifically, we design a query interface for ADIOS that allows arbitrary combinations of range conditions on known variables, implement a number of different mechanisms for resolving these selection conditions, and devise strategies to reduce the time needed to retrieve the scattered data records. In many cases, the query mechanism can retrieve the selected data records orders of magnitude faster than the brute-force approach.

Our work relies heavily on the in situ data processing feature of ADIOS to allow user functions to be executed in the data transport pipeline. This feature allows us to build indexes for efficient query processing, and to perform other intricate analyses while the data is in memory.
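
The ADIOS query API itself is not spelled out in the abstract; purely as an illustration of the kind of selective access described (and not the ADIOS interface), a per-block min/max index of the sort that can be built in situ lets a reader skip blocks that cannot satisfy a combined range condition.

```python
# Illustrative sketch (not the ADIOS API): per-block min/max metadata is used to skip
# blocks that cannot contain records matching a combined range condition.
import numpy as np

def build_index(blocks):
    """Per-block min/max summaries, of the kind that could be built in situ."""
    return [{v: (a.min(), a.max()) for v, a in blk.items()} for blk in blocks]

def select(blocks, index, conditions):
    """blocks: list of {variable: 1-D array}; conditions: {variable: (lo, hi)} ANDed."""
    hits = []
    for blk, meta in zip(blocks, index):
        # Pruning step: a block whose min/max range cannot overlap the query is skipped.
        if any(meta[v][0] > hi or meta[v][1] < lo for v, (lo, hi) in conditions.items()):
            continue
        mask = np.ones(len(next(iter(blk.values()))), dtype=bool)
        for v, (lo, hi) in conditions.items():
            mask &= (blk[v] >= lo) & (blk[v] <= hi)
        hits.append({v: a[mask] for v, a in blk.items()})
    return hits

rng = np.random.default_rng(0)
blocks = [{"energy": rng.uniform(0, 2e6, 1000), "radius": rng.uniform(0, 1, 1000)}
          for _ in range(8)]
index = build_index(blocks)
selected = select(blocks, index, {"energy": (1e6, np.inf), "radius": (-np.inf, 0.5)})
```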

 

On the Performance of Spark on HPC Systems: Towards a Complete Picture (#45)

by Orcun Yildiz, Shadi Ibrahim (INRIA, France)

Abstract: Big Data analytics frameworks (e.g., Apache Hadoop and Apache Spark) have been increasingly used by many companies and research labs to facilitate large-scale data analysis. However, with the growing needs of users and the growing size of data, commodity-based infrastructure will strain under the heavy weight of Big Data. On the other hand, HPC systems offer a rich set of opportunities for Big Data processing. As first steps toward Big Data processing on HPC systems, several research efforts have been devoted to understanding the performance of Big Data applications on these systems, yet HPC-specific performance considerations have not been fully investigated. In this work, we conduct an experimental campaign to provide a clearer understanding of the performance of Spark, the de facto in-memory data processing framework, on HPC systems. We ran Spark with representative Big Data workloads on the Grid'5000 testbed to evaluate how latency, contention, and file-system configuration can influence application performance. We discuss the implications of our findings and draw attention to new ways (e.g., burst buffers) to improve the performance of Spark on HPC systems.

 

Experiences of Converging Big Data Analytics Frameworks with High Performance Computing Systems (#48)

by Peng Cheng, Yutong Lu, Yunfei Du, Zhiguang Chen (NUDT, China)

Abstract: With the rapid development of big data analytics frameworks, many existing high performance computing (HPC) facilities are evolving new capabilities to support big data analytics workloads. However, due to the different workload characteristics and optimization objectives of the system architectures, migrating data-intensive applications to HPC systems that are geared toward traditional compute-intensive applications presents a new challenge. In this paper, we address the critical question of how to accelerate complex applications that contain both data-intensive and compute-intensive workloads on the Tianhe-2 system by deploying an in-memory file system as data access middleware. We characterize the impact of the storage architecture on data-intensive MapReduce workloads when using Lustre as the underlying file system. Based on our characterization and findings about the performance behaviors, we propose a shared map-output shuffle strategy and a file-metadata cache layer to alleviate the metadata bottleneck. The evaluation of these optimization techniques shows up to 17% performance benefit for data-intensive workloads.
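
The paper's middleware is not detailed in the abstract; as a minimal illustrative sketch of the general idea behind a file-metadata cache layer (absorbing repeated metadata lookups before they reach the Lustre metadata server), one might memoize stat calls as follows.

```python
# Minimal sketch of a file-metadata cache layer (illustrative, not the paper's middleware).
import os
from functools import lru_cache

@lru_cache(maxsize=65536)
def cached_stat(path: str) -> os.stat_result:
    """Repeated stat() calls for the same path hit the in-memory cache instead of
    the parallel file system's metadata server, the bottleneck the paper targets."""
    return os.stat(path)
```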

 

GPU/FPGA

 

MACC: An OpenACC Transpiler for Automatic Multi-GPU Use (#29)

by Kazuaki Matsumura, Mitsuhisa Sato, Taisuke Boku, Artur Podobas, Satoshi Matsuoka (Tokyo Institute of Technology, Japan)

Abstract: Graphics Processing Units (GPUs) perform the majority of computations in state-of-the-art supercomputers. Programming these GPUs is often assisted by a programming model such as (amongst others) the directive-driven OpenACC. Unfortunately, OpenACC and other similar models are incapable of automatically targeting and distributing work across several GPUs, which decreases productivity and forces needless manual labour upon programmers. We propose a method that enables OpenACC applications to target multiple GPUs. Workload distribution, data transfer, and inter-GPU communication (including modern GPU-to-GPU links) are handled automatically and transparently by our compiler, with no user intervention and no changes to the program code. Our method leverages existing OpenMP and OpenACC backends, ensuring easy integration into existing HPC infrastructure. Empirically, we quantify the performance gains and losses of our data coherence method compared to similar approaches and also show that our approach can compete with the performance of hand-written MPI code.

 

Architecture of an FPGA-Based Heterogeneous System for Code-Search Problems (#31)

by Yuki Hiradate, Hasitha Waidyasooriya, Masanori Hariyama, Masaaki Harada (Tohoku University, Japan)

Abstract: Code-search problems refer to searching for a particular bit pattern that satisfies given constraints. Obtaining such codes is very important in fields such as data encoding, error correction, and cryptography. Unfortunately, the search time increases exponentially with the number of bits in the code, and finding large codes typically requires many months of computation. On the other hand, the search method consists mostly of 1-bit computations, so reconfigurable hardware such as FPGAs (field-programmable gate arrays) can be used to obtain a massive degree of parallelism. In this paper, we propose a heterogeneous system with a CPU and an FPGA to speed up code-search problems. According to our evaluation, we obtain an over 86-fold speed-up compared to a typical CPU-based implementation for the extremal doubly even self-dual code search problem of length 128.
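
The exact search algorithm is not given in the abstract; as a toy illustration of why the workload reduces to 1-bit operations that map well onto FPGA logic (not the authors' method), checking the doubly even and self-orthogonality conditions on candidate codewords needs only AND, weight counting, and parity.

```python
# Toy sketch of the 1-bit nature of code-search constraints (not the paper's algorithm).

def popcount(x: int) -> int:
    return bin(x).count("1")

def doubly_even(word: int) -> bool:
    """A codeword is doubly even if its Hamming weight is a multiple of 4."""
    return popcount(word) % 4 == 0

def orthogonal(a: int, b: int) -> bool:
    """Binary inner product over GF(2): parity of the bitwise AND of the two words."""
    return popcount(a & b) % 2 == 0

# Two 8-bit example rows; real searches use length-128 words and enormous candidate sets.
rows = [0b11110000, 0b00111100]
ok = all(doubly_even(r) for r in rows) and all(
    orthogonal(a, b) for i, a in enumerate(rows) for b in rows[i:]
)
```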

 

Acceleration of Wind Simulation Using Locally Mesh-Refined Lattice Boltzmann Method on GPU-Rich Supercomputers (#37)

by Naoyuki Onodera, Yasuhiro Idomura (Japan Atomic Energy Agency, Japan)

Abstract: Real-time simulation of the environmental dynamics of radioactive substances is very important from the viewpoint of nuclear security. Since airflows in large cities are turbulent, with Reynolds numbers of several million, large-scale CFD simulations are needed. We developed a CFD code based on the adaptive mesh-refined Lattice Boltzmann Method (AMR-LBM). The AMR method places fine grids only in the regions where they are needed, so that high-resolution analysis is realized while still covering the global simulation area. The code is developed on the GPU-rich supercomputer TSUBAME3.0 at Tokyo Tech, and the GPU kernel functions are tuned to achieve high performance on the Pascal GPU architecture. The code is validated against a wind tunnel experiment released by the National Institute of Advanced Industrial Science and Technology in Japan. Thanks to the AMR method, the total number of grid points is reduced to less than 10% of that of a uniformly fine grid system. Weak-scaling performance from 1 node to 36 nodes is examined. The GPUs (NVIDIA Tesla P100) achieved more than 10 times higher per-node performance than the CPUs (Broadwell).

 

Performance Tools

 

TINS: A Task-Based Dynamic Helper Core Strategy for In Situ Analytics (#17)

by Estelle Dirand, Laurent Colombet, Bruno Raffin (INRIA, France)

Abstract: The in situ paradigm proposes to co-locate simulation and analytics on the same compute node to analyze data while it is still resident in the compute node's memory, hence reducing the need for post-processing methods. A standard approach that has proved efficient for sharing resources on each node consists in running the analytics processes on a set of dedicated cores, called helper cores, to isolate them from the simulation processes. Simulation and analytics thus run concurrently with limited interference. In this paper, we show that performance can be improved through a dynamic helper core strategy. We rely on a work-stealing scheduler to implement TINS, a task-based in situ framework with on-demand analytics isolation. The helper cores are dedicated to analytics only when analytics tasks are available; otherwise, they join the other cores in processing simulation tasks. TINS relies on the Intel® TBB library. In experiments on up to 14,336 cores, a set of representative analytics parallelized with TBB is coupled with the hybrid MPI+TBB ExaStamp molecular dynamics code. TINS shows up to 40% performance improvement over various other approaches, including the standard helper-core strategy.
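
TINS itself is built on TBB's work-stealing scheduler; the following is only a language-neutral sketch of the dynamic helper-core policy described above (not the TBB-based implementation): a helper worker prefers analytics tasks and falls back to simulation tasks when none are queued.

```python
# Conceptual sketch of the dynamic helper-core policy (not the TBB-based implementation).
from queue import Empty, Queue

analytics_q: Queue = Queue()
simulation_q: Queue = Queue()

def helper_core_loop(stop):
    """Helper cores run analytics when tasks are available; otherwise they help the simulation."""
    while not stop():
        try:
            task = analytics_q.get_nowait()       # analytics has priority on helper cores
        except Empty:
            try:
                task = simulation_q.get_nowait()  # no analytics pending: join simulation work
            except Empty:
                continue
        task()
```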

 

Machine Learning Predictions for Underestimation of Job Runtime on HPC System (#30)

by Jian Guo, Akihiro Nomura, Ryan Barton, Haoyu Zhang, Satoshi Matsuoka (Tokyo Institute of Technology, Japan)

Abstract: In modern high-performance computing (HPC) systems, users are usually asked to estimate the runtime of a job for system scheduling when they submit it. In general, an underestimation of the job runtime will cause the HPC system to terminate the job before its completion. If users could be notified that their jobs may not finish before the allocated time expires, they could take action, such as killing the job and resubmitting it after parameter adjustment, to save time and cost; the productivity of HPC systems could also be vastly improved. In this paper, we propose a data-driven approach, one that actively observes, analyzes, and logs jobs, for predicting underestimation of job runtime on HPC systems. Using data produced by TSUBAME 2.5, a supercomputer deployed at the Tokyo Institute of Technology, we apply machine learning algorithms to recognize patterns indicating whether a job's runtime has been underestimated. Our experimental results show that our approach predicts runtime underestimation with 80% precision, 70% recall, and a 74% F1-score on the entirety of the given dataset. Finally, we split the entire job dataset into subsets categorized by scientific application name; the best per-subset precision, recall, and F1-score for runtime-underestimation prediction reached 90%, 95%, and 92%, respectively.
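
The paper's actual features and models are not listed in the abstract; as a hedged sketch of the general setup only (binary classification of whether a job's requested time will be exceeded), with synthetic data standing in for hypothetical job features:

```python
# Illustrative sketch of the classification setup (synthetic data and hypothetical
# features such as requested walltime, core count, and past-run statistics;
# not the paper's feature set or model).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 4))                      # placeholder job features
y = (rng.random(5000) < 0.3).astype(int)       # 1 = job runtime was underestimated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="binary"
)
```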

 

A Power Management Framework with Simple DSL for Automatic Power-Performance Optimization on Power-Constrained HPC Systems (#41)

by Yasutaka Wada, Yuan He, Thang Cao, Masaaki Kondo (Meisei University, Japan)

Abstract: Power limitation is one of the most crucial and unavoidable issues in designing exascale HPC systems; it is necessary to optimize the power-performance of user applications while keeping the power consumption of the HPC system below a given power budget. This kind of power-performance optimization requires sufficient information and a good understanding of both the system specifications (what kinds of hardware resources are included in the system, which components can be used as "power-knobs", how to control each power-knob, and so on) and the user applications (which parts of the application are CPU-intensive, memory-intensive, and so on). Because this situation imposes considerable effort and cost on both the users and the administrators of power-constrained HPC systems, a simple framework that automates the power-performance optimization process, together with a simple user interface to it, is highly desirable. To address these concerns, we propose and implement a versatile framework to help carry out power management and performance optimization on power-constrained HPC systems, along with a simple DSL that serves as the interface to the framework. We believe this is key to effectively utilizing HPC systems under a limited power budget.

 

Scalable Data Management of the Uintah Simulation Framework for Next-Generation Engineering Problems with Radiation (#43)

by Sidharth Kumar, Alan Humphrey, Will Usher, Steve Petruzza, Brad Peterson, John A. Schmidt, Derek Harris, Ben Isaac, Jeremy Thornock, Todd Harman, Valerio Pascucci, Martin Berzins (University of Utah, USA)

Abstract: The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on the data management problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used for the modeling of a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes, we faced both parallel I/O and visualization challenges; for example, the default file-per-process I/O approach of Uintah did not scale on Mira. In this paper, we present a simple-to-implement, restructuring-based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes: the goal is to distribute the dataset such that each process holds a larger chunk of data, which is then written to a file independently. This approach finds a middle ground between the two most common parallel I/O schemes, file-per-process I/O and shared-file I/O, in terms of both the total number of generated files and the extent of communication involved during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and to make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, such as discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira.
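
The abstract describes, but does not show, the restructuring step; the following hedged mpi4py sketch illustrates the middle ground between file-per-process and shared-file I/O, with one writer per group of G ranks aggregating its group's data before writing an independent file (the group size, file names, and data layout are assumptions, not Uintah's implementation).

```python
# Illustrative sketch of restructuring-based aggregation (not Uintah's implementation):
# each group of G ranks gathers its data to one writer, which writes a single file,
# so N ranks produce N / G files instead of N.
from mpi4py import MPI
import numpy as np

G = 64                                            # assumed aggregation group size
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
group = comm.Split(rank // G, rank)               # one sub-communicator per group

local = np.full(1024, rank, dtype=np.float64)     # this rank's patch data (placeholder)
gathered = None
if group.Get_rank() == 0:
    gathered = np.empty(group.Get_size() * local.size, dtype=np.float64)
group.Gather(local, gathered, root=0)             # restructure: larger chunk per writer

if group.Get_rank() == 0:
    np.save(f"restructured_{rank // G:05d}.npy", gathered)   # one file per group
```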

 

Linear Algebra

 

High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner (#32)

by Susumu Yamada, Toshiyuki Imamura, Masahiko Machida (Japan Atomic Energy Agency, Japan)

Abstract: The exact diagonalization method is a high-accuracy numerical approach for solving the Hubbard model of a system of electrons with strong correlation. The method solves for the eigenvalues and eigenvectors of the Hamiltonian matrix derived from the Hubbard model. Since the Hamiltonian is a huge sparse symmetric matrix, it was expected that the LOBPCG method with an appropriate preconditioner could solve the problem in a short time, and this turned out to be the case: the LOBPCG method with a suitable preconditioner succeeded in solving for the ground state (the smallest eigenvalue and its corresponding eigenvector) of the Hamiltonian. In order to solve for multiple eigenvalues of the Hamiltonian in a short time, we use a preconditioner based on the Neumann expansion, which uses the approximate eigenvalues and eigenvectors given by the LOBPCG iteration. We apply a communication-avoiding strategy, developed by considering the physical properties of the Hubbard model, to the preconditioner. Our numerical experiments on two parallel computers show that the LOBPCG method coupled with the Neumann preconditioner and the communication-avoiding strategy improves convergence and achieves excellent scalability when solving for multiple eigenvalues.
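
The abstract does not reproduce the preconditioner itself; in its generic textbook form (the paper's scaling and shift choices may differ), a Neumann-expansion preconditioner approximates the inverse of a matrix A by truncating the geometric operator series

```latex
A^{-1} = \bigl(I - (I - A)\bigr)^{-1}
       = \sum_{k=0}^{\infty} (I - A)^{k}
 \;\approx\; \sum_{k=0}^{m} (I - A)^{k},
\qquad \text{valid when } \rho(I - A) < 1,
```

so that applying the preconditioner reduces to repeated sparse matrix-vector products, which is what makes a communication-avoiding implementation attractive.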

 

Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code (#34)

by Yasuhiro Idomura, Takuya Ina, Akie Mayumi, Susumu Yamada, Toshiyuki Imamura (Japan Atomic Energy Agency, Japan)

Abstract: A preconditioned Chebyshev basis communication-avoiding conjugate gradient method (P-CBCG) is applied to the pressure Poisson equation in the multiphase thermal-hydraulic CFD code JUPITER, and its computational performance and convergence properties are compared against a preconditioned conjugate gradient (P-CG) method and a preconditioned communication-avoiding conjugate gradient (P-CACG) method on Oakforest-PACS, which consists of 8,208 KNLs. The P-CBCG method reduces the number of collective communications while maintaining the robustness of the convergence properties. Compared with the P-CACG method, the improved robustness enables communication-avoiding steps that are an order of magnitude larger. It is shown that the P-CBCG method is 1.38× and 1.17× faster than the P-CG and P-CACG methods, respectively, at 2,000 processors.
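
The abstract does not state the basis construction; in the usual form of a Chebyshev-basis s-step method (the paper's parameters and preconditioning details may differ), the s basis vectors are generated from the current residual r by the shifted-and-scaled Chebyshev recurrence

```latex
Z = \frac{2A - (\lambda_{\max}+\lambda_{\min})\,I}{\lambda_{\max}-\lambda_{\min}},
\qquad
s_0 = r,\quad s_1 = Z r,\quad s_{j+1} = 2 Z s_j - s_{j-1}\;\; (j \ge 1),
```

which keeps the basis much better conditioned than the monomial basis {r, Ar, A^2 r, ...} used in a plain communication-avoiding CG, consistent with the improved robustness reported above.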

 

Optimization of Hierarchical Matrix Computation on GPU (#35)

by Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota (Kyushu University, Japan)

Abstract: The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of H-matrices is more complex than that of dense and sparse matrices; thus, accelerating H-matrix computation is required. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times against OpenMP implementations on various processors (Broadwell-EP, Skylake-SP, and Knights Landing). The results show that, although the HMVM computation consists of many small GEMV operations, merging these into a single GPU kernel was the most effective implementation. Moreover, the performance of the batched BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
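
As a hedged illustration of why HMVM decomposes into many small GEMV operations (a generic low-rank block structure, not the paper's matrices or GPU kernels):

```python
# Illustrative sketch: H-matrix-vector multiply as many small GEMVs (generic structure,
# not the paper's kernels). Each admissible block is stored as a low-rank factor pair
# (U, V), so y[rows] += U @ (V @ x[cols]) costs two skinny GEMVs instead of one dense one.
import numpy as np

def hmvm(dense_blocks, lowrank_blocks, x, n):
    y = np.zeros(n)
    for rows, cols, D in dense_blocks:            # small dense (inadmissible) blocks
        y[rows] += D @ x[cols]
    for rows, cols, U, V in lowrank_blocks:       # low-rank (admissible) blocks
        y[rows] += U @ (V @ x[cols])              # two small GEMVs per block
    return y

# Tiny example: one dense block and one rank-2 block acting on an 8-element vector.
rng = np.random.default_rng(0)
x = rng.random(8)
dense = [(slice(0, 4), slice(0, 4), rng.random((4, 4)))]
lowrank = [(slice(4, 8), slice(0, 8), rng.random((4, 2)), rng.random((2, 8)))]
y = hmvm(dense, lowrank, x, 8)
```

Merging these many small per-block GEMVs into one GPU kernel, or handing them to a batched BLAS, is the design space the paper evaluates.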

©Copyright SupercomputingAsia 2018. All rights reserved.