
Papers Presenters

Supercomputing Frontiers Asia (SCFA) 2019

MH-QEMU: Memory-State-Aware Fault Injection Platform (001)

HIDEYUKI JITSUMOTO, Yuya KOBAYASHI, Akihiro Nomura, Satoshi Matsuoka (Global Scientific Information and Computing Center, Tokyo Institute of Technology)

Abstract: As we move towards higher-density, larger-scale, and lower-power computing hardware, new types of failures are being experienced with increasing frequency. Hardware designed for the post-Moore generation is also bringing about novel resiliency challenges.

In order to improve the efficiency of resiliency methods, fault injection plays an important role in understanding how errors affect the OS and applications. Memory-state-aware fault injection, in particular, can be used to investigate the memory-related faults caused by using current and future hardware under extreme conditions and to assess the cost/benefit trade-off of resiliency methods.

We introduce MH-QEMU, a memory-state-aware fault injection platform implemented by extending a virtual machine to intercept memory accesses.

In addition, MH-QEMU provides its users with the placement of each injected fault, described by both its physical address and its virtual address.

Therefore, MH-QEMU users are able to collect memory access information and define an injection condition from the collected information.

MH-QEMU incurs a 3.4 times overhead, and we demonstrate how row-hammer faults can be injected using MH-QEMU to analyze the resiliency of the NPB CG algorithm.
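The idea of defining an injection condition from collected memory-access information can be illustrated with a small sketch. This is not MH-QEMU's interface: the class `RowHammerInjector`, the callback `on_access`, the row size and the threshold are all hypothetical, and the "double-sided" condition (both neighbouring rows accessed repeatedly) is an assumed simplification of row-hammer behaviour.

```python
from collections import defaultdict

ROW_SIZE = 8192        # bytes per DRAM row (assumed value for illustration)
HAMMER_THRESHOLD = 3   # accesses per neighbouring row before a flip (tiny, for the demo)


class RowHammerInjector:
    """Flips one bit in a victim row once both neighbouring rows are accessed often enough."""

    def __init__(self, memory: bytearray, victim_row: int, bit: int = 0):
        self.memory = memory
        self.victim_row = victim_row
        self.bit = bit
        self.counts = defaultdict(int)  # per-row access counters
        self.injected_at = None         # physical address of the injected fault, if any

    def on_access(self, phys_addr: int) -> None:
        """Hypothetical hook, called once for every intercepted memory access."""
        self.counts[phys_addr // ROW_SIZE] += 1
        above = self.counts[self.victim_row - 1]
        below = self.counts[self.victim_row + 1]
        # Injection condition: both neighbours have been "hammered" enough.
        if self.injected_at is None and min(above, below) >= HAMMER_THRESHOLD:
            target = self.victim_row * ROW_SIZE
            self.memory[target] ^= 1 << self.bit  # flip a single bit
            self.injected_at = target


# Usage: alternate accesses to rows 0 and 2; the victim is row 1.
memory = bytearray(4 * ROW_SIZE)
injector = RowHammerInjector(memory, victim_row=1)
for _ in range(HAMMER_THRESHOLD):
    injector.on_access(0)
    injector.on_access(2 * ROW_SIZE)
```

Recording `injected_at` mirrors the abstract's point that the user is told where the fault was placed; a real platform would also report the corresponding virtual address.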

Performance Evaluation and Analysis of Linear Algebra Kernels in the prototype Tianhe-3 Cluster (003)

Xin You, Hailong Yang, Zhongzhi Luan, Yi Liu, Depei Qian (Beihang University)

Abstract: As supercomputing systems enter the exascale era, power consumption becomes a major concern in system design. Among the novel techniques for reducing power consumption, the ARM architecture is gaining popularity in the HPC community due to its low power footprint and high energy efficiency. As one of the initiatives for addressing the exascale challenges in China, the Tianhe-3 supercomputer has adopted a technology roadmap based on the many-core ARM architecture with home-built phytium-2000+ and matrix-2000+ processors. In this paper, we evaluate several linear algebra kernels, such as matrix-matrix multiplication, matrix-vector multiplication and triangular solver, with both sparse and dense datasets. These linear algebra kernels are good performance indicators for the prototype Tianhe-3 cluster. A comprehensive analysis is performed using the roofline model to identify directions for performance optimization from both hardware and software perspectives. In addition, we compare the performance of phytium-2000+ and matrix-2000+ with the widely used KNL processor. We believe this paper provides valuable experience and insights, as work in progress towards exascale, for the HPC community.
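For readers unfamiliar with the roofline model mentioned above, the following minimal sketch shows its basic form: attainable performance is the smaller of the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity. The function names and all numbers are illustrative assumptions; they are not measurements of phytium-2000+, matrix-2000+ or KNL.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Made-up machine: 500 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BW = 500.0, 100.0
sparse_like = roofline_gflops(PEAK, BW, 0.25)  # low intensity: memory-bound
dense_like = roofline_gflops(PEAK, BW, 32.0)   # high intensity: compute-bound
```

Kernels whose intensity lies below the ridge point can only be improved by reducing memory traffic or raising bandwidth, which is why the model helps separate hardware and software optimization directions.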

Analysis of the CPU Cooling Temperature Effect on the Performance and Power Consumption for the HPC Operational Decision Support (038)

Jorji Nonaka, Fumiyoshi Shoji, Motohiko Matsuda, Hiroya Matsuba, Toshiyuki Tsukamoto (RIKEN Center for Computational Science)

Abstract: Modern leading-edge supercomputers are highly complex systems designed to provide maximum computational performance with minimum outage time. To operate such large systems, most HPC facilities possess auxiliary subsystems, such as power and cooling, and software suites for monitoring and management, in order to enable stable and reliable long-term operation. Energy efficiency has become a critical element when considering the operation of such HPC facilities, and warm water cooling has received greater attention in recent years as a highly efficient alternative to traditional cooling systems. Since processing speed remains the main focus of any HPC system, it becomes important to understand the effect of CPU cooling temperature on performance and power consumption in order to support HPC operational planning and decision making. In this paper, we present some evaluation results using a commodity CPU with temperature-controlled water circulation equipment to reproduce the closed water cooling loop used in the K computer system. This study is not conclusive, since there are other factors and variables to consider in actual HPC operation, but we believe that the knowledge obtained from the results can be useful for supporting some HPC operational decisions. These initial results have also encouraged us to continue working on this topic, with a focus on the upcoming supercomputer system as well.

NVMe-based BeeGFS as a next-generation scratch filesystem for High Performance Computing and Artificial Intelligence / Machine Learning workloads (035)

Jacob Anders, Greg Lehman, Igor Zupanovic, Rene Tyhouse, Garry Swan, Joseph Antony (CSIRO)

Abstract: With the ever-growing performance of HPC systems, it is essential to ensure that the storage subsystems keep up with the growth. However, what is the most important metric: capacity, bandwidth, or perhaps I/O operations per second? Historically, a filesystem which provided excellent numbers in one area was unlikely to excel in others. This often led to situations where a filesystem tuned for streaming I/O and capable of sustaining tens of gigabytes per second delivered only a fraction of the expected performance due to saturation with random small I/O. This situation is only becoming worse as workloads such as AI/ML and computational genomics gain popularity and importance.

At CSIRO we realised that a focused platform investment requires a new approach to storage that doesn’t force a choice between high bandwidth, high capacity and high IOPS. We took up the challenge of building a next-generation filesystem that can simultaneously excel in all these areas. In order to do so, we decided to combine the BeeGFS filesystem with NVMe storage.

Presently, the CSIRO BeeGFS cluster provides more than 2 PB of NVMe storage and is able to sustain high throughput while retaining the capability of supporting millions of I/O operations per second, so there is no longer a need to sacrifice bandwidth for IOPS or vice versa. BeeGFS runs on a range of popular operating systems and standard kernels, simplifying deployment and management of storage.

In this talk, we explain the motivation behind CSIRO's investment in BeeGFS and cover the high-level design and implementation details. We highlight the challenges that need to be overcome in order to efficiently harness the available NVMe bandwidth. Finally, we present the results of preliminary performance benchmarks.

“Into the DEEP”: A Modular Supercomputing Architecture towards Exascale (002)

Dr. Herbert Cornelius, Axel Auweter (Megware Computer GmbH)

Abstract: With the increasing complexity and performance requirements of modern HPC and HPDA workloads, it becomes clear that a “one size fits all” system architecture strategy is limited with respect to delivered performance and energy efficiency, especially considering a sustainable path forward towards Exascale computing. As an alternative to a traditional monolithic general-purpose or purely dedicated system architecture approach, we will discuss the concept of a new Modular Supercomputing Architecture (MSA) and its benefits. The MSA integrates compute modules with different performance characteristics into a single heterogeneous system via a federated network. MSA brings substantial benefits for heterogeneous applications and workflows: each part can be run on an exactly matching system, improving time to solution and energy use. This approach is well suited for customers running heterogeneous application and workload mixes. It also offers valuable flexibility to compute providers, allowing the set of modules and their respective sizes to be tailored to actual usage and specific application characteristics. We will discuss the software and hardware components of different modules that fit specific application characteristics in the context of a network federation acting as a single system, thus providing more flexibility and scalable performance for heterogeneous HPC/HPDA workloads and workflows. Our focus will be on potential technologies, system architectures and their integration aspects, such as advanced cooling, energy metering, power distribution, monitoring and management. The MSA is scalable and can be implemented at large scale, thus leading a path to efficient Exascale computing. The MSA concept is being explored by the ongoing DEEP-EST (Dynamical Exascale Entry Platform – Extreme Scale Technologies) European Union HORIZON 2020 project.

Tuning Alya for Energy Efficiency with READEX (028)

Venkatesh Kannan, Guillaume Houzeaux, Ricard Borrell, Myles Doyle (Irish Centre for High-End Computing (ICHEC))

Abstract: High performance computing (HPC) is a major driving force for research and innovation in many scientific and industrial domains. The applications in these areas are highly complex, and demand high performance and efficient execution. The energy requirements of near-future extreme-scale and exascale systems, which are a significant part of the TCO, are a major cause for concern. Therefore, it is crucial to improve the energy efficiency of the applications that run on these systems.

A significant source of improvement is that applications commonly exhibit dynamic resource requirements. This may stem from the different regions of the application that are executed or from changes in the workload at runtime. Consequently, such dynamism in an application presents an opportunity to tailor the utilisation of resources in the HPC system to the requirements of the application at runtime.

READEX is an EU Horizon 2020 FET-HPC project whose objective is to exploit the dynamism found in HPC applications at runtime to achieve efficient computation on Exascale systems. Alya is a high performance computational mechanics application that is present in the Unified European Application Benchmark Suite and the PRACE Accelerator Benchmark Suite.

In this paper, the application dynamism present in Alya is investigated and exploited by the tool suite developed in READEX. We report on the potential energy savings and the effects on the application runtime, and we observe a 5-10% reduction in the energy consumed by the application.

Secure Data Reservoir: 70 Gbps fully encrypted data transfer facility (025)

Junichiro Shitami, Goki Honjo, Kei Hiraki, Mary Inaba (The University of Tokyo)

Abstract: High-speed data movement between research institutes and universities is always the most important function of a network for HPC research. Our research group has been working to efficiently utilize the high-speed network infrastructures of the day. We started the Data-Reservoir project on a 1 Gbps Tokyo-US internet link [1], and then achieved 10 Gbps [2] and 100 Gbps [4]. Actual HPC applications normally transfer data from storage device to storage device, which is much harder than “memory to memory” data transfer. We developed a Web-based storage-device-to-storage-device data access system called USADAfox [3] to realize 6.5 Gbps web data transfer from Japan to Europe.

Target applications of long-distance data movement have also changed gradually. Early applications mainly involved instrumentation data from physics or astronomy, as well as graphics and video. Our recent applications have shifted to bio-science and medical science [5]. Here, fully encrypted data transfer is always required to satisfy research ethics. However, fully encrypted high-speed data transfer is much more difficult than plain data transfer between storage devices. The main difficulties lie in (1) the computational overhead of encryption and decryption, (2) high-speed data movement between the storage devices, the encryption-decryption engine and the network interfaces, and (3) the stability of TCP on 100 Gbps long fat pipe networks. For (1), we tested several different approaches and chose to use the encryption accelerator (AES-NI) attached to each core of the main processor. For (2), we selected software-based RAID 0 across 8 M.2 PCIe NVMe devices, because hardware RAID 0 in a similar configuration required more power and showed instability. For (3), we use 8 TCP streams stabilized by a pacing technique [2].

We performed very long-distance 100 Gbps data-movement experiments from the US (Dallas, TX) to Singapore via Tokyo. The RTT of this network is 443 ms (we used a folded path between the US and Singapore using 2 VLANs). The results of the data-movement experiments show a stable 79.4 Gbps (peak) and 70 Gbps (average) for more than 5 minutes. Importantly, this level of performance is attained by a pair of ordinary 1U servers and can be applied to many practical set-ups.
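To give a sense of why TCP stability is hard on such a path, the bandwidth-delay product (the amount of data that must be in flight to keep the link full) can be worked out from the figures quoted above: 100 Gbps, a 443 ms RTT and 8 streams. The sketch below is only arithmetic; the even split across streams is an assumption, and the function name is illustrative.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    bits_in_flight = bandwidth_bps * rtt_ms // 1000
    return bits_in_flight // 8


LINK_BPS = 100_000_000_000  # 100 Gbps link, from the abstract
RTT_MS = 443                # round-trip time, from the abstract
STREAMS = 8                 # number of TCP streams, from the abstract

total_window = bdp_bytes(LINK_BPS, RTT_MS)   # bytes in flight to fill the link
per_stream_window = total_window // STREAMS  # assuming an even split across streams

print(f"total: {total_window / 1e9:.2f} GB, per stream: {per_stream_window / 1e6:.0f} MB")
```

Under these assumptions roughly 5.5 GB must be unacknowledged at any moment, or about 692 MB per stream, so a single loss event can stall a very large window.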

[1] K. Hiraki, M. Inaba, J. Tamatsukuri, R. Kurusu, Y. Ikuta, H. Koga and A. Zinzaki, “Data Reservoir: utilization of multi-gigabit backbone network for data-intensive research”, Proc. SC ’02, Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pp. 1-9, ACM/IEEE, Baltimore, USA, 2002.

[2] Takeshi Yoshino, Yutaka Sugawara, Katsushi Inagami, Junji Tamatsukuri, Mary Inaba, Kei Hiraki, “Performance optimization of TCP/IP over 10 gigabit ethernet by precise instrumentation”, Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2008, November 15-21, 2008, Austin, Texas, USA.

[3] Yoshiki Iguchi and Naoki Tanida and Kenichi Koizumi and Mary Inaba and Kei Hiraki,


[4] K. Koizumi, G. Honjo, J. Shitami, J. Tamatsukuri, H. Tezuka, M. Mary and K. Hiraki, “Extensions for Single TCP Stream on Long Fat-pipe Networks”, Poster, TNC16 Network Conference, Prague, Czech Republic, June 2016.

[5] Nitta N. et al., “Intelligent Image-Activated Cell Sorting” CELL 175(1) 266-276 2018.

SuperCloud – The evolution of HPC to Software Defined Computing (034)

Jacob Anders, Garry Swan, Kenneth Ban (CSIRO)

Abstract: Batch-queue, bare-metal High Performance Computing systems have been the centrepiece of Scientific Computing for decades, providing researchers with a highly efficient computing platform. With the advent of Cloud Computing, some predicted that classic HPC would start losing market share. However, this has not been seen so far for the majority of workloads, as Cloud systems rely heavily on virtualisation, which imposes performance overheads and other challenges.

Ironically, the biggest asset of traditional HPC, being heavily optimised for the task of running batch jobs, can also be seen as its biggest shortcoming: it often lacks the flexibility to support other workloads. Running interactive, persistent or isolated applications, as well as heterogeneous hardware/software stacks, often proves difficult. This has a negative impact on a variety of workloads, including key areas of importance and growth such as computational genomics, artificial intelligence / machine learning and cybersecurity research.

To address these challenges, the CSIRO High Performance Computing team developed SuperCloud, a bare-metal OpenStack system with an InfiniBand Software Defined Networking interconnect. The system can provision cloud instances directly on the hardware, with no need for virtualisation, achieving a level of performance previously only seen on classic HPC systems. It is capable of doing so while enabling users to run any operating system and workload required, provisioning resources in software defined networks which can be isolated, private, shared or publicly accessible.

SuperCloud is a meta-system that Infrastructure-as-Code tools can build upon. It allows users to programmatically request compute, networking, storage and software resources through a single, unified set of APIs. This is a key capability which has the potential to consolidate today’s isolated islands of software defined networking (SDN), software defined storage (SDS) and software-as-a-service (SaaS) and enable HPC teams to move to a completely new paradigm of building systems: Software Defined Computing. In this new paradigm, HPC cluster management software, job schedulers and high-performance parallel file systems move up the stack and become cloud native applications. At the same time, HPC and HPC storage become much more heterogeneous, flexible and dynamic, and can not only consistently deliver top performance, but also quickly adapt to the ever-changing needs of the scientific community, addressing the lack of flexibility we are facing today.


PHINEAS: an Embedded Heterogeneous Parallel Platform (037)

Nikhil Khatri, Nithin Bodanapu, Sudarshan TSB (Department of Computer Science and Engineering, PES University – Bangalore)

Abstract: With machine learning being applied to increasingly varied domains, the computational needs of researchers have increased proportionately. Hobbyists, researchers and universities are turning to building their own cluster computers to meet their high performance compute needs. These clusters are typically highly efficient, low-cost, ARM-based platforms consisting of between 4 and 8 nodes. In this paper, we present PHINEAS: Parallel Heterogeneous INdigenous Embedded ARM System, a parallel compute platform which allows for distributed computation using MPI and OpenMP and which further leverages the on-board GPU to perform general purpose compute tasks.

We describe the hardware components of the cluster, the software stack installed on each node and a host of common benchmark algorithms and their results. The results show that the cluster meets the stringent latency requirements of embedded systems. We further describe how the on-board GPU’s OpenGL ES 2.0 programming model can be used to implement tasks such as image convolution and neural network inference which are common in intelligent embedded systems. Parallelisation of compute tasks across multiple GPUs is discussed as a method to combine the advantages of distributed and heterogeneous computing.

Practical Resource Usage Prediction Method for Large Memory Jobs in HPC clusters (039)

Bill McMillan, Xiuqiao Li, Nan Qi, Yuanyuan He (IBM Cognitive Systems)

Abstract: Users of high performance computing (HPC) clusters often find it difficult to specify accurate resource estimates when running their applications as batch jobs. Prediction is a common way to alleviate this difficulty, using historical job records of previous runs to estimate resource usage for incoming jobs. Most existing resource prediction methods build a single model that considers all of the jobs in a cluster. However, people in production usage tend to focus only on the resource usage of jobs with certain patterns, e.g. jobs with large memory consumption. This paper proposes a practical resource prediction method for large memory jobs. The proposed method first predicts whether a job is likely to use a large amount of memory, and then predicts the final memory usage using a model trained only on historical large memory jobs. Using several real-world job traces collected from large production clusters at IBM Spectrum LSF customer sites, the evaluation results show that the average prediction error can be reduced by up to 40 percent for nearly 90 percent of large memory jobs. Meanwhile, the model training cost can be reduced by over 30 percent for the evaluated job traces.
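The two-stage structure described above (first decide whether a job is "large memory", then estimate its usage with a model trained only on large jobs) can be sketched as follows. The abstract does not say which models or features the authors use, so this sketch substitutes simple per-key averages; the class name, the job key format and the 64 GB threshold are all hypothetical.

```python
from collections import defaultdict

LARGE_GB = 64.0  # assumed threshold for a "large memory" job


class TwoStageMemoryPredictor:
    """Stage 1 classifies a job as large or not; stage 2 predicts memory for large jobs only."""

    def fit(self, history):
        """history: iterable of (job_key, peak_memory_gb) pairs from past runs."""
        runs = defaultdict(list)
        for key, mem in history:
            runs[key].append(mem)
        self.large_fraction = {}  # stage 1: share of past runs that were large
        self.large_mean = {}      # stage 2: mean usage over large runs only
        for key, mems in runs.items():
            large = [m for m in mems if m >= LARGE_GB]
            self.large_fraction[key] = len(large) / len(mems)
            if large:
                self.large_mean[key] = sum(large) / len(large)
        return self

    def predict(self, key):
        """Return predicted peak memory in GB, or None if the job is not expected to be large."""
        if self.large_fraction.get(key, 0.0) < 0.5:
            return None
        return self.large_mean[key]


# Usage with made-up job records keyed by "user/application".
history = [
    ("alice/cfd", 120.0), ("alice/cfd", 136.0), ("alice/cfd", 8.0),
    ("bob/post", 2.0), ("bob/post", 70.0), ("bob/post", 3.0),
]
model = TwoStageMemoryPredictor().fit(history)
```

Because the second stage never sees small jobs, the occasional 8 GB run does not drag down the estimate for a job that usually needs well over 100 GB.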

A Crystal/Clear Pipeline for Applied Image Processing (021)

Christopher Watkins, Nicholas Rosa, Thomas Carroll, David Ratcliffe, Marko Ristic, Christopher Russell, Rongxin Li, Vincent Fazio, Janet Newman (Commonwealth Scientific and Industrial Research Organisation (CSIRO))

Abstract: Many long-standing image processing problems in applied science domains are finding solutions through the application of deep learning approaches to image processing. Here we present one such compute-intensive application: the case of classifying images of protein crystallisation droplets. The Collaborative Crystallisation Centre in Melbourne, Australia is a medium-throughput service facility that produces between five and twenty thousand images per day. This submission outlines a reliable and robust machine learning pipeline that autonomously classifies these images using CSIRO’s high-performance computing facilities. Our pipeline achieves improved accuracies over existing implementations and delivers these results in real time. The pipeline has been designed to classify both nascent images in real time and a back catalogue of 60 million legacy images. We discuss the specific tools and techniques used to construct and parallelise the pipeline, as well as the methodologies for testing and validating externally developed classification models. Finally, we recommend some steps for future refinement of the pipeline in production.

A Cache-based Data Movement Infrastructure for On-demand Scientific Cloud Computing (016)

David Abramson, Jake Carroll, Chao Jin, Michael Mallon, Zane van Iperen, Hoang Nguyen (The University of Queensland)

Abstract: As cloud computing has become the de facto standard for big data processing, there is interest in using a multi-cloud environment that combines public cloud resources with private on-premise infrastructure. However, with decentralized infrastructure, a uniform storage solution is required to provide data movement between different clouds to assist on-demand computing. This paper presents a solution based on our earlier work, the MeDiCI (Metropolitan Data Caching Infrastructure) architecture. Specifically, we extend MeDiCI to simplify the movement of data between different clouds and a centralized storage site. It uses a hierarchical caching system and supports the most popular infrastructure-as-a-service (IaaS) interfaces, including Amazon AWS and OpenStack. As a result, our system allows existing parallel data-intensive applications to be offloaded into IaaS clouds directly. The solution is illustrated using a large bio-informatics application, a Genome Wide Association Study (GWAS), with Amazon AWS, HUAWEI Cloud, and a private centralized storage system. The system is evaluated on Amazon AWS and the Australian national cloud.

Accelerate and Operationalize AI Deployments Using AI-Optimized Infrastructure (023)

Yael Shani (IBM)

Abstract: Rapid advancements in AI technologies are disrupting healthcare faster than most industry experts anticipated, changing the playing field for discovery, diagnosis and treatments. While AI holds a great promise to derive beneficial medical insights and advance precision medicine, the associated IT requirements for speed, flexibility, scalability, and advanced data management disciplines are still beyond the current capacity of many healthcare and life sciences organizations.

In this session we will discuss how to rapidly deploy an optimized and supported platform for AI workloads with high performance software-defined infrastructure that will allow you to simplify deployment, administration and management and accelerate time to results with simplified AI models that you can easily build, train and interact with.

Join our session to learn how to enhance data scientists' productivity and solve the big data challenge, so that data preparation and transformation do not become a bottleneck. We will cover underlying storage solutions that maximize the efficiency of the overall data pipeline and optimize the performance of your machine learning and deep learning workloads.

Speaker: Dave Taylor, Executive Architect, Cloud Storage and Software Defined Storage Solutions, IBM

Sidra Medicine with IBM Spectrum Scale – Enhancing Patient Care with Personalized Medicine for Qatari population, EMEA & Wider World (015)

Yael Shani (IBM STSM, Spectrum Scale Development)

Abstract: The world is moving towards precision medicine, where medicine is tailored to each patient's unique genotype. To identify indicators of major diseases and to accelerate the development of precision medicine and personalized treatments, Sidra Medicine (a groundbreaking hospital, research and education institution focusing on the health and well-being of children and women in Qatar and the wider world) has built a national high-performance computing platform. At Sidra Medicine, biomedical scientists are using a high-performance data analytics platform, based on a unified compute and storage environment from IBM, to accelerate cutting-edge research and bring the future of personalized medicine closer. In this talk, the Sidra and IBM team will present the composable architecture approach they took to create a flexible, scalable, software-defined infrastructure based on IBM Systems and IBM Storage. The talk will take a deep dive into the back-end storage subsystems and how a composable infrastructure approach based on a scale-out file system (IBM Spectrum Scale) enabled them to customize deployments for varying functional and performance needs. The talk will share the architecture recipe which enabled Sidra to complete 1,652,653 genomics pipeline jobs, helping it drive innovative genomic and clinical research and bring the future of personalized care closer to reality.

Speakers: Tushar Pathare – Big Data, GPFS, CSXF Lead – SIDRA Research Department
Sandeep Patil – IBM STSM, Spectrum Scale Development