Durham Research Online

Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver.

Charrier, D.E. and Hazelwood, B. and Tutlyaeva, E. and Bader, M. and Dumbser, M. and Kudryavtsev, A. and Moskovsky, A. and Weinzierl, T. (2019) 'Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver.', International Journal of High Performance Computing Applications, 33 (5), pp. 973-986.

Abstract

We study the performance behaviour of a seismic simulation using the ExaHyPE engine with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with ADER-DG. It is parallelized using tasks, and it is cache efficient. AMR plus ADER-DG yields a task graph which is highly dynamic in nature and comprises both arithmetically expensive tasks and tasks which challenge the memory’s latency. The expensive tasks and thus the whole code benefit from AVX vectorization, although we suffer from memory access bursts. A frequency reduction of the chip improves the code’s energy-to-solution. Yet, it does not mitigate burst effects. The bursts’ latency penalty becomes worse once we add Intel Optane technology, increase the core count significantly or make individual, computationally heavy tasks fall out of close caches. Thread overbooking to hide these latency penalties becomes counterproductive with noninclusive caches as it destroys the cache and vectorization character. In cases where memory-intense and computationally expensive tasks overlap, ExaHyPE’s cache-oblivious implementation nevertheless can exploit deep, noninclusive, heterogeneous memory effectively, as main memory misses arise infrequently and slow down only a few cores. We thus propose that upcoming supercomputing simulation codes with dynamic, inhomogeneous task graphs are actively supported by thread runtimes in intermixing tasks of different compute character, and we propose that future hardware actively allows codes to downclock the cores running particular task types.

Item Type: Article
Full text: (AM) Accepted Manuscript
Download PDF
(997Kb)
Full text: (VoR) Version of Record
Available under License - Creative Commons Attribution Non-commercial.
Download PDF (Advance online version)
(737Kb)
Full text: (VoR) Version of Record
Available under License - Creative Commons Attribution Non-commercial.
Download PDF
(737Kb)
Status: Peer-reviewed
Publisher Web site: https://doi.org/10.1177/1094342019842645
Publisher statement: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (http://www.creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Date accepted: 11 March 2019
Date deposited: 15 March 2019
Date of first online publication: 15 April 2019
Date first made open access: 28 April 2019
