1 Department of Applied Mathematics and Computer Science, Technical University of Denmark
2 Scientific Computing, Department of Applied Mathematics and Computer Science, Technical University of Denmark
3 Department of Informatics and Mathematical Modeling, Technical University of Denmark
4 Center for Energy Resources Engineering, Center, Technical University of Denmark
With application to large-scale water wave simulations
The main objective of the present study has been to investigate parallel numerical algorithms that run efficiently and scalably on modern many-core heterogeneous hardware. To obtain good efficiency and scalability on modern multi- and many-core architectures, algorithms and data structures must be designed to utilize the underlying parallel architecture. The architectural changes in hardware design within the last decade, from single-core to multi- and many-core architectures, require software developers to identify and properly implement methods that both exploit concurrency and maintain numerical efficiency. Graphics processing units (GPUs) have proven to be very effective for computing the solution of scientific problems described by partial differential equations (PDEs). GPUs have today become standard devices in portable computers, desktops, and supercomputers, which makes parallel software design widely applicable, but also a challenge for scientific software developers at all levels. We have developed a generic C++ library for fast prototyping of large-scale PDE solvers based on flexible-order finite difference approximations on structured regular grids. The library is designed with a high-abstraction interface to improve developer productivity, and it is based on modern template-based design concepts as described in Glimberg, Engsig-Karup, Nielsen & Dammann (2013). The library utilizes heterogeneous CPU/GPU environments to maximize computational throughput by favoring data locality and low-storage algorithms, which become increasingly important as the number of concurrent cores per processor increases. As a proof of concept, we demonstrate the advantages of the library by assembling a generic nonlinear free surface water wave solver based on unified potential flow theory, for fast simulation of large-scale phenomena such as long-distance wave propagation over varying depths or within large coastal regions.
Such simulations are valuable within maritime engineering because of the adjustable properties that follow from the flexible-order implementation. We extend the novel work on an efficient and robust iterative parallel solution strategy proposed by Engsig-Karup, Madsen & Glimberg (2011) for the bottleneck problem of solving a σ-transformed Laplace problem in three dimensions at every time integration step. A geometric multigrid preconditioned defect correction scheme is used to attain high-order accurate solutions with fast convergence and scalable work effort. To minimize data storage and enhance performance, the numerical method is based on matrix-free finite difference approximations, implemented to run efficiently on many-core GPUs. In addition, single-precision calculations are found to be attractive for reducing data transfers and enhancing performance in both pure single-precision and mixed-precision calculations, without compromising robustness. A structured multi-block approach is presented that decomposes the problem into several subdomains, supporting flexible block structures to match the physical domain. For data communication across processor nodes, messages are sent using MPI to repeatedly update boundary information between adjacent coupled subdomains. The impact of the proposed hybrid CUDA-MPI strategy on convergence and performance scalability is presented. A survey of the convergence and performance properties of the preconditioned defect correction method is carried out with special focus on large-scale multi-GPU simulations. Results indicate that only a limited number of multigrid restrictions is required, and that this number is strongly coupled to the wave resolution. These results are encouraging for heterogeneous multi-GPU systems, as they reduce the communication overhead significantly and avoid both global coarse grid corrections and inefficient processor utilization at the coarsest levels.
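To illustrate the idea behind a preconditioned defect correction scheme with matrix-free operators, the following is a minimal sketch on a one-dimensional model problem, -u'' = f with homogeneous Dirichlet boundaries. It is not the library's implementation: the function names (applyHighOrder, solveLowOrder, defectCorrection) are hypothetical, the high-order operator is a fourth-order five-point stencil with ghost values obtained by odd reflection, and the low-order preconditioner is solved exactly with a tridiagonal (Thomas) solve, which stands in for the geometric multigrid cycle described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Matrix-free application of a fourth-order stencil for -u'' on the interior
// nodes of a uniform grid. Boundary values are zero (index -1 and n) and ghost
// values come from odd reflection (assumption made for this sketch only).
Vec applyHighOrder(const Vec& u, double h) {
    const int n = static_cast<int>(u.size());
    auto at = [&](int i) -> double {
        if (i == -1 || i == n) return 0.0;   // Dirichlet boundary nodes
        if (i == -2) return -u[0];           // left ghost node
        if (i == n + 1) return -u[n - 1];    // right ghost node
        return u[i];
    };
    Vec r(n);
    for (int i = 0; i < n; ++i) {
        r[i] = (at(i - 2) - 16.0 * at(i - 1) + 30.0 * at(i)
                - 16.0 * at(i + 1) + at(i + 2)) / (12.0 * h * h);
    }
    return r;
}

// Preconditioner: exact solve of the second-order system
// tridiag(-1, 2, -1) / h^2 * e = r using the Thomas algorithm.
Vec solveLowOrder(const Vec& r, double h) {
    const std::size_t n = r.size();
    assert(n > 0);
    const double off = -1.0 / (h * h);
    const double diag = 2.0 / (h * h);
    Vec c(n), d(n);
    c[0] = off / diag;
    d[0] = r[0] / diag;
    for (std::size_t i = 1; i < n; ++i) {
        const double m = diag - off * c[i - 1];
        c[i] = off / m;
        d[i] = (r[i] - off * d[i - 1]) / m;
    }
    Vec e(n);
    e[n - 1] = d[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        e[i] = d[i] - c[i] * e[i + 1];
    }
    return e;
}

// Defect correction: u <- u + M^{-1} (b - A u), starting from u = 0,
// with A the high-order operator and M the low-order one.
Vec defectCorrection(const Vec& b, double h, int iterations) {
    Vec u(b.size(), 0.0);
    for (int k = 0; k < iterations; ++k) {
        const Vec Au = applyHighOrder(u, h);
        Vec defect(b.size());
        for (std::size_t i = 0; i < b.size(); ++i) defect[i] = b[i] - Au[i];
        const Vec e = solveLowOrder(defect, h);
        for (std::size_t i = 0; i < b.size(); ++i) u[i] += e[i];
    }
    return u;
}
```

With f = π² sin(πx), a single iteration from a zero initial guess returns the second-order solution, while a few more iterations approach the fourth-order solution without ever assembling or storing the high-order matrix. In this model setting the iteration contracts by a factor below 1/3 per step, because the high-order operator equals the low-order one plus a small positive correction.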
We find that spatial domain decomposition scales well for large problem sizes, but for problems of limited size, the maximum attainable speedup is reached with a low number of processors, since adding more processors leads to an unfavorable communication-to-computation ratio. To circumvent this, we have considered a recently proposed parallel-in-time algorithm referred to as Parareal, in an attempt to introduce algorithmic concurrency in the time discretization. Parareal may be perceived as a two-level multigrid method in time, where the numerical solution is first advanced sequentially via coarse integration and then updated simultaneously on multiple GPUs in a predictor-corrector fashion. A parameter study is performed to establish proper choices for maximizing speedup and parallel efficiency. The Parareal algorithm is found to be sensitive to a number of numerical and physical parameters, making practical speedup a matter of parameter tuning. Results are presented to confirm that it is possible to attain reasonable speedups, independently of the spatial problem size. To improve the application range, curvilinear grid transformations are introduced to allow representation of complex boundary geometries. The curvilinear transformations increase the complexity of the implementation of the model equations. A number of free surface water wave cases have been demonstrated with boundary-fitted geometries, where the combination of a flexible geometry representation and a fast numerical solver can be a valuable engineering tool for large-scale simulation of real maritime scenarios. The present study touches on some of the many possibilities that modern heterogeneous computing can bring if careful and parallel-aware design decisions are made. Though several free surface examples are outlined, we have yet to demonstrate results from a real large-scale engineering case.
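The predictor-corrector structure of Parareal can be sketched for a scalar test equation, dy/dt = λy. This is a minimal serial illustration under stated assumptions, not the multi-GPU implementation of the study: the names euler and parareal are hypothetical, both propagators are explicit Euler integrators (one step for the coarse propagator, many substeps for the fine one), and the loop over fine propagations, which is the part that would run concurrently on separate devices, is executed sequentially here.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// A propagator advances the state y from time t0 to time t1.
using Propagator = std::function<double(double y, double t0, double t1)>;

// Explicit Euler for dy/dt = lambda * y using `steps` substeps.
double euler(double y, double t0, double t1, int steps, double lambda) {
    assert(steps > 0);
    const double dt = (t1 - t0) / steps;
    for (int i = 0; i < steps; ++i) y += dt * lambda * y;
    return y;
}

// Parareal over N time intervals on [0, T] with K correction iterations:
//   U[n+1] (new) = G(U[n] new) + F(U[n] old) - G(U[n] old)
// Returns the solution at the N + 1 interval boundaries.
std::vector<double> parareal(double y0, double T, int N, int K,
                             const Propagator& G, const Propagator& F) {
    assert(N > 0 && K >= 0);
    const double dT = T / N;
    std::vector<double> U(N + 1), coarseOld(N), fine(N);
    U[0] = y0;

    // Predictor: sequential coarse sweep provides the initial guess.
    for (int n = 0; n < N; ++n) {
        coarseOld[n] = G(U[n], n * dT, (n + 1) * dT);
        U[n + 1] = coarseOld[n];
    }

    for (int k = 0; k < K; ++k) {
        // Fine propagation: each interval is independent of the others,
        // so this loop is the concurrent part of the algorithm.
        for (int n = 0; n < N; ++n) {
            fine[n] = F(U[n], n * dT, (n + 1) * dT);
        }
        // Corrector: sequential coarse sweep with the fine-coarse defect.
        for (int n = 0; n < N; ++n) {
            const double coarseNew = G(U[n], n * dT, (n + 1) * dT);
            U[n + 1] = coarseNew + fine[n] - coarseOld[n];
            coarseOld[n] = coarseNew;
        }
    }
    return U;
}
```

After k iterations the first k interval boundaries match the sequential fine solution, so K = N reproduces the fine result up to rounding but yields no speedup. In practice K must be much smaller than N, which is why the choice of coarse propagator and interval count determines the attainable speedup.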