Assessing the power of tasks with data dependencies in OmpSs to optimize particle-in-cell simulation

Despite the strong promises of the task-based paradigm, as adopted in parallel programming models such as OpenMP and OmpSs, its effective advantages are far from being well understood when applied to the non-trivial programs that comprise real-world HPC applications.

In the context of the EPEEC project, a recent collaboration between teams at INESC-ID and the Barcelona Supercomputing Center (BSC) has contributed to a better assessment of the advantages and limitations of tasks with data dependencies when used to parallelize the important class of particle-mesh applications. The case study was a plasma physics kinetic simulation based on the electromagnetic particle-in-cell (EM-PIC) method. This method is widely used to model many relevant plasma physics scenarios, ranging from high-intensity laser-plasma interaction to astrophysical shocks.

Several task-based implementations of ZPIC, a bare-bones version of the OSIRIS EM-PIC code, have been developed with the OmpSs-2 programming model. The versions exploit the task-based paradigm to different extents, ranging from its most basic features to advanced ones such as data dependencies. The suite of parallel implementations is available as open source code at EPEEC’s GitHub repository: https://github.com/epeec/zpic-epeec.
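To make the idea of tasks with data dependencies concrete, here is a minimal sketch in C using OmpSs-2 style task annotations. It is not taken from ZPIC: the functions deposit, update_field and run_step, the chunk sizes, and the arithmetic are all hypothetical stand-ins for a two-stage "deposit current, then update field" step. The array-section notation (start;length) in the dependency clauses is assumed to follow the OmpSs-2 convention.

```c
#include <assert.h>

#define NCHUNKS 4   /* hypothetical number of domain chunks */
#define CHUNK   8   /* hypothetical number of cells per chunk */

/* Hypothetical stage 1: write a "current" value into every cell of a chunk. */
static void deposit(double *cur, int n, int chunk_id) {
    assert(n > 0);
    for (int i = 0; i < n; i++)
        cur[i] = (double)(chunk_id + 1);
}

/* Hypothetical stage 2: compute a chunk's field from that chunk's current. */
static void update_field(double *fld, const double *cur, int n) {
    for (int i = 0; i < n; i++)
        fld[i] = 0.5 * cur[i];
}

/* Runs both stages on every chunk and returns the sum of all field cells. */
double run_step(void) {
    double current[NCHUNKS][CHUNK];
    double field[NCHUNKS][CHUNK];

    for (int c = 0; c < NCHUNKS; c++) {
        double *cur = current[c];
        double *fld = field[c];

        /* Producer task: declares that it writes this chunk's current. */
        #pragma oss task out(cur[0;CHUNK])
        deposit(cur, CHUNK, c);

        /* Consumer task: reads the same current, so the runtime orders it
           after the producer; tasks of other chunks are independent. */
        #pragma oss task in(cur[0;CHUNK]) out(fld[0;CHUNK])
        update_field(fld, cur, CHUNK);
    }

    /* Wait for all tasks before reading results from the stack arrays. */
    #pragma oss taskwait

    double sum = 0.0;
    for (int c = 0; c < NCHUNKS; c++)
        for (int i = 0; i < CHUNK; i++)
            sum += field[c][i];
    return sum;
}
```

With these hypothetical values, run_step() returns 40.0. A compiler without OmpSs-2 support ignores the annotations and runs the loop serially with the same result. Under an OmpSs-2 toolchain, the only ordering imposed is the one implied by the declared reads and writes, which is the sense in which an implementation can synchronize through data dependencies alone.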

These implementations were experimentally evaluated with realistic simulation workloads (namely, Laser Wakefield Accelerator and Collision of Plasma Clouds) on a shared-memory multicore system. The experiments were performed on a computational node with two Intel Xeon Platinum 8160 CPUs @ 2.10 GHz, each with 24 physical cores (48 cores in total), and 96 GB of RAM, running SUSE Linux. The results show that a fully asynchronous implementation (i.e., one that uses only data dependencies for synchronization) achieves near-perfect scaling on 48 cores, despite the unbalanced conditions. This result was accomplished while retaining the code simplicity of task-based programming.