Schriftenreihe aus dem Institut für Strömungsmechanik
Herausgeber
J. Fröhlich, R. Mailach
Institut für Strömungsmechanik
Technische Universität Dresden
D-01062 Dresden
Band 31
TUDpress 2020
Immo Huismann
Computational fluid dynamics
on wildly heterogeneous systems
Bibliografische Information der Deutschen Nationalbibliothek
Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der
Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind
im Internet über http://dnb.d-nb.de abrufbar.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche
Nationalbibliografie; detailed bibliographic data are available in the
Internet at http://dnb.d-nb.de.
ISBN 978-3-95908-424-6
© 2020 TUDpress
bei Thelem Universitätsverlag
und Buchhandlung GmbH & Co. KG
Dresden
http://www.tudpress.de
Alle Rechte vorbehalten. | All rights reserved.
Gesetzt vom Autor. | Typeset by the author.
Printed in Germany.
Die vorliegende Arbeit wurde am 27. November 2018 an der Fakultät Maschinenwesen
der Technischen Universität Dresden als Dissertation eingereicht und am 29. Januar 2020
erfolgreich verteidigt.
This work was submitted as a PhD thesis to the Faculty of Mechanical Science and
Engineering of TU Dresden on 27 November 2018 and successfully defended on
29 January 2020.
Gutachter | Reviewers
1. Prof. Dr.-Ing. habil. Jochen Fröhlich
2. Prof. Spencer J. Sherwin
Technische Universität Dresden
Faculty of Mechanical Science and Engineering
Computational fluid dynamics
on
wildly heterogeneous systems
Dissertation
in order to obtain the degree
Doktor-Ingenieur (Dr.-Ing.)
by
Immo Huismann
born on 1st October 1988 in Hamburg
Referees: Prof. Dr.-Ing. habil. Jochen Fröhlich
Technische Universität Dresden
Prof. Spencer J. Sherwin
Imperial College London
Date of submission: 27th November 2018
Date of defence: 29th January 2020
Abstract
In the last decade, high-order methods have gained increased attention. These combine
the convergence properties of spectral methods with the geometrical flexibility of low-order
methods. However, the time step is restrictive, necessitating the implicit treatment of diffu-
sion terms in addition to the pressure. Therefore, efficient solution of elliptic equations is of
central importance for fast flow solvers. As the operators scale with O(p · nDOF), where nDOF
is the number of degrees of freedom and p the polynomial degree, the runtime of the best
available multigrid algorithms scales with O(p · nDOF) as well. This super-linear scaling lim-
its the applicability of high-order methods to mid-range polynomial orders and constitutes
a major road block on the way to faster flow solvers.
This work reduces the super-linear scaling of elliptic solvers to a linear one. First, the static
condensation method improves the condition of the system, then the associated operator
is cast into matrix-free tensor-product form and factorized to linear complexity. The low
increase in the condition and the linear runtime of the operator lead to linearly scaling solvers
when increasing the polynomial degree, albeit with low robustness against the number of
elements. A p-multigrid with overlapping Schwarz smoothers regains the robustness, but
requires inverse operators on the subdomains and in the condensed case these are neither
linearly scaling nor matrix-free. Embedding the condensed system into the full one leads to
a matrix-free operator and factorization thereof to a linearly scaling inverse. In combination
with the previously gained operator a multigrid method with a constant runtime per degree
of freedom results, regardless of whether the polynomial degree or the number of elements
is increased.
Computing on heterogeneous hardware is investigated as a means to attain higher perfor-
mance and future-proof the algorithms. A two-level parallelization extends the traditional
hybrid programming model by using a coarse-grain layer implementing domain decomposi-
tion and a hardware-specific fine-grain parallelization. Thereafter, load balancing is inves-
tigated for a preconditioned conjugate gradient solver, and functional performance models
are adapted to account for the communication barriers in the algorithm. With the new
model, runtime prediction and measurement agree within an error margin of about 5%.
The devised methods are combined into a flow solver which attains the same throughput
when computing with p = 16 as with p = 8, preserving the linear scaling. Furthermore, the
multigrid method reduces the cost of implicit treatment of the pressure to the one for explicit
treatment of the convection terms. Lastly, benchmarks confirm that the solver outperforms
established high-order codes.
Kurzzusammenfassung
In den letzten Jahrzehnten lagen Methoden höherer Ordnung im Fokus der Forschung. Diese
kombinieren die Konvergenzeigenschaften von Spektralmethoden mit der geometrischen
Flexibilität von Methoden niedriger Ordnung. Dabei entsteht eine restriktive Zeitschritt-
begrenzung, die die implizite Behandlung von Diffusionstermen zusätzlich zu der des Druckes
erfordert. Aus diesem Grund ist die effiziente Lösung elliptischer Gleichungen von zentralem
Interesse für schnelle Strömungslöser. Da die Operatoren mit O(p · nDOF) skalieren,
wobei nDOF die Anzahl der Freiheitsgrade und p der Polynomgrad ist, skaliert die Laufzeit
der besten derzeit verfügbaren Mehrgitterlöser ebenso mit O(p · nDOF). Diese super-lineare
Skalierung beschränkt die Anwendbarkeit von Methoden höherer Ordnung auf mittlere
Polynomgrade und stellt eine große Hürde auf dem Weg zu schnelleren Strömungslösern dar.
Diese Arbeit senkt die super-lineare Skalierung elliptischer Löser auf eine lineare. Zuerst
verbessert die statische Kondensation die Kondition des Systems. Der dazu benötigte
Operator wird in Matrix-freier Tensorproduktform dargestellt und auf lineare Komplexität
faktorisiert. Die Kombination aus langsam wachsender Kondition und linearer Operator-
laufzeit erzeugt Löser, die linear skalieren, wenn der Polynomgrad angehoben wird, aller-
dings nicht, wenn die Anzahl der Elemente erhöht wird. Eine p-Mehrgittermethode mit
überlappendem Schwarz-Glätter stellt die Robustheit gegenüber der Anzahl der Elemente
her, benötigt allerdings den inversen Operator auf Teilgebieten, und im kondensierten Fall
sind diese weder linear skalierend noch Matrix-frei. Eine Einbettung des kondensierten
Systems in das volle System liefert einen Matrix-freien Operator, der anschließend auf
lineare Komplexität faktorisiert wird. In Kombination mit dem Operator resultiert eine
Mehrgittermethode mit konstanter Laufzeit pro Freiheitsgrad, egal ob der Polynomgrad
oder die Anzahl der Elemente gesteigert wird.
Heterogenes Rechnen wird zur Steigerung der Rechenleistung und zur Zukunftssicherung
der Algorithmen untersucht. Eine Zweischicht-Parallelisierung erweitert die traditionelle
hybride Parallelisierung, wobei eine grobe Schicht die Gebietszerlegung implementiert und
die feine Hardware-spezifisch ist. Daraufhin wird die Lastverteilung auf solchen Systemen
anhand einer präkonditionierten konjugierten Gradientenmethode untersucht und ein funk-
tionales Leistungsmodell adaptiert, um mit den auftretenden Kommunikationsbarrieren
umzugehen. Mit dem neuen Modell liegen Laufzeitvoraussage und -messung nahe beiein-
ander, mit einem Fehler von 5 %.
Die entwickelten Methoden werden zu einem Strömungslöser kombiniert, der den gleichen
Durchsatz bei Rechnungen mit p = 16 und p = 8 erreicht, also die lineare Skalierung
beibehält. Des Weiteren reduziert der Mehrgitterlöser die Rechenzeit zur impliziten Be-
handlung des Druckes auf die der expliziten für die Konvektion. Zu guter Letzt zeigen
Benchmarks, dass der Löser eine höhere Performanz erreicht als etablierte Codes.
Acknowledgements
First and foremost, I want to thank those who made this work possible: first, Prof. Jochen
Fröhlich, who not only supported this work but also endeared fluid mechanics to me in the
first place and allowed it to flourish at the chair, from curvilinear beginnings to Cartesian
endings; and, second, Dr. Jörg Stiller, who was always an encouraging supervisor and
contributed immensely with stimulating discussions as well as in-depth knowledge.
Thereafter, I would like to thank those who accompanied me over the years: My friends
and family who were always supportive in the long roller coaster of ups and downs that
culminated in this thesis. But directly after, I want to thank the whole Chair of Fluid
Mechanics, without whose help, support, and enticing discussions this endeavour would
not have been even half as productive and not even half as much fun. Furthermore, the
contribution of the productive environment of the Orchestration path of the Center for
Advancing Electronics Dresden requires explicit mentioning.
Then, I would like to thank those who took it upon themselves to help this work along the
last miles: Prof. Spencer Sherwin, who agreed to be co-referee of the thesis, and Prof. Jeronimo
Castrillon, who offered to be on the commission.
But let us not forget those who drew me into mechanics in the first place and, over multiple
detours, to fluid mechanics: Prof. Balke and Prof. Ulbricht, who not only gave joyful lectures
ranging from statics through strength of materials to fracture mechanics, but always accompanied
these with applications, such as the fracture mechanics of a pressurized bratwurst.
Contents
Acknowledgements v
List of symbols xi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 High-order discretization methods . . . . . . . . . . . . . . . . . . . . 3
1.2.2 High-performance computing . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Goal and structure of the dissertation . . . . . . . . . . . . . . . . . . . . . . 7
2 The spectral-element method 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 A spectral-element method for the Helmholtz equation . . . . . . . . . . . 9
2.2.1 Strong and weak form of the Helmholtz equation . . . . . . . . . . 9
2.2.2 Finite element approach . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Convergence properties . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Tensor-product elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Tensor-product matrices . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Tensor-product bases . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Tensor-product operators . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Fast diagonalization method . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Performance of basic Helmholtz solvers . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Considered preconditioners and solvers . . . . . . . . . . . . . . . . . 19
2.4.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Performance optimization for tensor-product operators 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Basic approach for interpolation operator . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Baseline operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Runtime tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Compiling information for the compiler . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Enhancing the interpolation operator . . . . . . . . . . . . . . . . . . 29
3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Extension to Helmholtz solver . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Required operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Operator runtimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Performance gains for solvers . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Fast Static Condensation – Achieving a linear operation count 41
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Static condensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Principal idea of static condensation . . . . . . . . . . . . . . . . . . 42
4.2.2 Static condensation in three dimensions . . . . . . . . . . . . . . . . . 43
4.3 Factorization of the statically condensed Helmholtz operator . . . . . . . . 46
4.3.1 Tensor-product decomposition of the operator . . . . . . . . . . . . . 47
4.3.2 Sum-factorization of the operator . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Product-factorization of the operator . . . . . . . . . . . . . . . . . . 52
4.3.4 Extensions to variable diffusivity . . . . . . . . . . . . . . . . . . . . 54
4.3.5 Runtime comparison of operators . . . . . . . . . . . . . . . . . . . . 56
4.4 Efficiency of pCG solvers employing fast static condensation . . . . . . . . . 60
4.4.1 Element-local preconditioning strategies . . . . . . . . . . . . . . . . 60
4.4.2 Considered solvers and test conditions . . . . . . . . . . . . . . . . . 61
4.4.3 Solver runtimes for homogeneous grids . . . . . . . . . . . . . . . . . 62
4.4.4 Solver runtimes for inhomogeneous grids . . . . . . . . . . . . . . . . 66
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Scaling to the stars – Linearly scaling spectral-element multigrid 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Linearly scaling additive Schwarz methods . . . . . . . . . . . . . . . . . . 70
5.2.1 Additive Schwarz methods . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Embedding the condensed system into the full system . . . . . . . . . 72
5.2.3 Tailoring fast diagonalization for static condensation . . . . . . . . . 73
5.2.4 Implementation of boundary conditions . . . . . . . . . . . . . . . . . 76
5.2.5 Extension to element-centered block smoothers . . . . . . . . . . . . 77
5.3 Multigrid solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Multigrid algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 Complexity of the resulting algorithms . . . . . . . . . . . . . . . . . 80
5.4 Runtime tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.1 Runtimes for the star inverse . . . . . . . . . . . . . . . . . . . . . . 81
5.4.2 Solver runtimes for homogeneous meshes . . . . . . . . . . . . . . . . 84
5.4.3 Solver runtimes for anisotropic meshes . . . . . . . . . . . . . . . . . 86
5.4.4 Solver runtimes for stretched meshes . . . . . . . . . . . . . . . . . . 87
5.4.5 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Computing on wildly heterogeneous systems 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Programming wildly heterogeneous systems . . . . . . . . . . . . . . . . . . 96
6.2.1 Model problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Two-level parallelization of the model problem . . . . . . . . . . . . . 97
6.2.3 Performance gains for homogeneous systems . . . . . . . . . . . . . . 100
6.3 Load balancing model for wildly heterogeneous systems . . . . . . . . . . . . 103
6.3.1 Single-step load balancing for heterogeneous systems . . . . . . . . . 103
6.3.2 Problem analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.3 Multi-step load balancing for heterogeneous systems . . . . . . . . . . 107
6.3.4 Performance with new load balancing model . . . . . . . . . . . . . . 109
6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Specht FS – A flow solver computing on heterogeneous hardware 113
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Flow solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.1 Incompressible fluid flow . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.2 Spectral Element Cartesian HeTerogeneous Flow Solver . . . . . . . . 114
7.2.3 Pressure correction scheme in Specht FS . . . . . . . . . . . . . . . 115
7.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.1 Test regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.2 Taylor-Green vortex in a periodic domain . . . . . . . . . . . . . . 117
7.3.3 Taylor-Green vortex with Dirichlet boundary conditions . . . . 119
7.3.4 Turbulent plane channel flow . . . . . . . . . . . . . . . . . . . . . . 119
7.4 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4.1 Turbulent plane channel flow . . . . . . . . . . . . . . . . . . . . . . 122
7.4.2 Turbulent Taylor-Green vortex benchmark . . . . . . . . . . . . . 124
7.4.3 Parallelization study . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8 Conclusions and outlook 131
A Further results for wildly heterogeneous systems 135
Bibliography 137
List of symbols
Abbreviations
AVX2 Advanced Vector Extension 2 (instruction set)
BDF Backward Differencing Formula
BLAS Basic Linear Algebra Subroutines (software)
CFD Computational Fluid Dynamics
CFL Courant-Friedrichs-Lewy
CG Conjugate Gradient Method
CMOS Complementary Metal–Oxide–Semiconductor
CPU Central Processing Unit
DG Discontinuous Spectral-Element Method
DGEMM Double precision General Matrix Matrix Multiplication from BLAS
DOF Degrees Of Freedom
FEM Finite Element Method
FMA Fused Multiply Add (instruction)
FPGA Field-Programmable Gate Array
GEMM General Matrix Matrix Multiplication from BLAS
GLL Gauß-Lobatto-Legendre
GPU Graphics Processing Unit
HPC High-Performance Computer
ipCG Inexact Preconditioned Conjugate Gradient Method
LES Large-Eddy Simulation
MKL Intel Math Kernel Library (software)
MPI Message Passing Interface (software)
PDE Partial Differential Equation
pCG Preconditioned Conjugate Gradient Method
RAM Random Access Memory
RHS Right-Hand Side
SEM Continuous Spectral-Element Method
SIMD Single Instruction Multiple Data (instruction)
SVV Spectral Vanishing Viscosity
Greek Symbols
α Expansion factor
αki Coefficient for BDFk time derivative for time level n− i
βki Extrapolation coefficient for order k and time level n− i
δij Component ij of Kronecker delta
∆u Correction for u in the condensed system
∆ui Correction for u on Schwarz subdomain Ωi in the condensed system
ε Mean dissipation rate
εSVV Viscosity parameter for the SVV model
κ Condition number
Λ Diagonal matrix of eigenvalues
λ Helmholtz parameter λ ≥ 0
νpost Number of post-smoothing steps
νpre Number of pre-smoothing steps
Ω Computational Domain
Ωe Domain of element e
Ωi Schwarz domain i
ΩS Standard element ΩS = [−1, 1]
ϕ Pressure potential
φi i-th basis function
ρ Convergence rate of multigrid method
ξ coordinates in one-dimensional standard element
Latin Symbols
AR Aspect ratio
ARmax Maximum aspect ratio in a mesh
C Constant
CCFL CFL number
D Standard element differentiation matrix
De Diagonal matrix comprising the eigenvalues in three dimensions
De Differentiation matrix for element Ωe
d Number of dimensions
di,e Geometry coefficient for direction i in element Ωe
Ek Mean kinetic energy
e Element index
f Body force
f Right-hand side for the Helmholtz equation
F Discrete right-hand side
H Static condensed Helmholtz operator
H Helmholtz operator
H1 Sobolev (norm)
He Helmholtz operator for element Ωe
Hi Helmholtz operator for Schwarz subdomain Ωi in the condensed system
h Element width in one dimension
hi Element width in direction i
I Set of face indices I = {e, w, n, s, t, b}
I Identity matrix
i Integer
J l Grid transfer operation from level l − 1 to l
j Integer
k Integer
L Standard element stiffness matrix
L Number of levels for the multigrid method minus one
L2 Euclidean (norm)
Le Stiffness matrix for element Ωe
l Integer
M Standard element mass matrix
Me Mass matrix for element Ωe
M−1 Diagonal matrix containing the inverse multiplicity for the data points
m Integer
n10 Number of iterations to reduce the residual by ten orders
nDOF Number of degrees of freedom
ne Number of elements
nI Number of inner points nI = p − 1
np Number of points np = p + 1
nS Number of points in a Schwarz subdomain Ωi
nt Number of time steps
nv Number of vertices in a mesh
O() Landau symbol
P (Pseudo)-Pressure
p Polynomial degree
pSVV Polynomial degree for the SVV model
pW Polynomial degree of the weight function
Q Scatter operation mapping global to local data
Re Reynolds number
Reτ Wall Reynolds number
r Residual vector
S Transformation matrix
t Time variable
tIter Time per iteration
u Solution variable for the Helmholtz equation
u Vector containing the (current) solution u
u Velocity vector
uex Exact solution
uh Approximation for u
ui,e Coefficient for φi in element Ωe
Wi Weights associated with Schwarz subdomain Ωi in the condensed system
wi Quadrature weights for collocation point i
x Coordinate vector
x Coordinates in one dimension
y+ Distance from the wall in wall units
Mathematical Symbols
∂(·) Partial derivative
∆(·) Laplacian operator
∇(·) Vector derivative operator
∥ · ∥ Norm of vector / variable
∥ · ∥∞ Maximum norm of vector / variable
· ⊗ · Tensor product operator
Sub- and superscripts
B Boundary part of a matrix / vector
b Bottom in compass notation
C Variable for a CPU
Cond Condensed
E Eigenspace
e East in compass notation
Fe East face
Fw West face
G Variable for a GPU
I Inner part of a matrix / vector
i Vector / matrix for Schwarz subdomain Ωi
l Variable on multigrid level l
n North in compass notation
(·)n Variable at time step n
Prim Primary
s South in compass notation
T Transpose
t Top in compass notation
w West in compass notation
u / A Vector u or matrix A for the condensed system
(·) ⋆ Variable at intermediate time level
(·) Vector with three components
Chapter 1
Introduction
1.1 Introduction
For millennia, mankind has tried to unravel the mysteries posed by its surroundings, from
the philosophers of ancient Greece through those of the Enlightenment to the large-scale
research institutes of today. Over the centuries we arrived at models describing our environment,
such as the equations of motion and the laws of thermodynamics. These models not only
allow us to gain insights into the inner workings of the world, but also further technological
progress in civil engineering, mechanical engineering, and many other applied sciences by
allowing for accurate predictions, granting us electricity, combustion engines, and flight.
For fluid mechanics, the Navier-Stokes equations allow us to describe the behaviour of
flows. However, most cases possess no known analytical solution. Therefore, experiments
were the dominant method for predicting flow structures at the start of the 20th century.
Nowadays, simulations complement them and are the preferred option when turning to
optimization. They have permeated the whole engineering landscape, opening the field of
computational fluid dynamics (CFD) and ranging from pipe flows [96, 93], turbines [80], and
wings [69] to weather forecasting as well as earthquake [79] and climate predictions [68]. But
while the simulations involve more and more details and their scope widens, they are still
limited by the available compute power. For instance, while the aerodynamics of flight has
long been understood and airfoils have been simulated with large-eddy simulation for more
than 15 years [92], the time-resolving simulation of the flow around an aircraft is only expected
to become possible in 2030 [119]. For this to become reality, however, improvements in
hardware, spatial discretization methods, time-stepping schemes, and turbulence modelling
are required, and this work contributes to enabling such computations.
To reach the goal of simulating whole aircraft, improvements in temporal and spatial dis-
cretization schemes are mandatory. The latter changed significantly in the last decades.
Where once Fourier methods and low-order finite difference discretizations were common,
finite volume schemes are used throughout industry nowadays, and high-order methods are
the current focus of research. These combine high convergence rates with a finite-element
approach, leading to the geometrical flexibility of finite volume schemes while attaining high
convergence orders [23, 74]. The high convergence rate, in turn, allows the same error margin
to be attained with fewer degrees of freedom than low-order techniques require. But
it incurs a higher operator cost, with the operator scaling super-linearly with the number
of degrees of freedom when increasing the polynomial order. This drawback becomes highly
relevant for the solution of elliptic equations, where the iteration count increases with the
polynomial degree as well. The super-linear scaling renders a high polynomial order infea-
sible in practice. To attain the simulation of aircraft by 2030, improvements in the operator
costs and the resulting solvers are required, with linear scaling of the operator and a constant
iteration count being the optimal result.
Where the operator costs limit the attainable convergence rate of the discretization scheme,
the boundaries of physics limit the hardware capabilities. Current processors are fabricated
in the Complementary Metal-Oxide-Semiconductor (CMOS) process, where, typically, silicon
is doped in order to render it semiconducting and create circuits. From the 1960s to 2010,
miniaturization doubled the number of transistors per chip every two years, a heuristic
called Moore's law [95]. The lower structure width in combination with the increasing
transistor count led to more performance without an increase in the required power [22].
These performance gains, sometimes referred to as a "free lunch", allowed programs to
perform better without requiring any change and, in turn, led to ever more intricate simula-
tions being performed. But the free lunch is over: the current transistor width of 10 nm
scrapes at the physical limits of the technology [67], and the end of Moore's law is nigh.
To circumvent the lack of performance gains through miniaturization, different compute
infrastructures are investigated, for instance accelerator cards for numerics or so-called dark
silicon, where only parts of the processor are powered [28]. The resulting computers will
be more heterogeneous than the ones available today [26]. The first herald of this transition
is the increasing number of accelerator cards built into high-performance computers [94],
and that number keeps growing. While accelerators are by now well understood and
employed for CFD simulations [40, 79], more heterogeneity provides further challenges, from
programmability to load balancing. These need to be addressed to allow simulations to
capitalize on future compute structures, as done, for instance, in the Orchestration Path of
the Center for Advancing Electronics Dresden [19, 131], in which this work resided.
This work aims to provide improvements for high-order spatial discretization schemes by
lowering the operation count of elliptic solvers to linear complexity. Furthermore, running
these algorithms on heterogeneous hardware is investigated, once from the programming
side and once from the load-balancing perspective. The combination thereof provides a
contribution to competitive high-order computational fluid dynamics and future-proofs it
for the hardware to come.
1.2 State of the art
1.2.1 High-order discretization methods
The numerical methods utilized in Computational Fluid Dynamics have come a long way
since the field's advent. Today a large variety of schemes exists, ranging from low-order
methods, such as the Finite Volume Method, which offers high spatial flexibility at a low
convergence order [32], to spectral methods, which allow for very high convergence orders at
the expense of spatial flexibility [108]. Current research focuses on more flexible high-order
methods such as the Discontinuous Galerkin method (DG) [55] and the continuous Spectral
Element Method (SEM) [23, 74]. These combine the advantages of both, allowing for a
domain decomposition by using a finite-element approach as well as high convergence orders
via a polynomial ansatz, and are now widely accepted, with general-purpose codes such
as Nek5000 [35], Nektar++ [18], Semtex [11], DUNE [7], and the deal.II library [5]
being readily available.
The main benefit of these high-order methods is their spectral convergence property: the
error scales with h^(p+1), where h is the element width and p the polynomial order. Raising
the polynomial degree allows for a higher convergence rate and, in turn, for attaining the same
error margin using fewer degrees of freedom, assuming that the solution possesses sufficient
smoothness. This fueled a race to ever higher polynomial degrees: where the first publication
on the SEM employed a polynomial degree of p = 6 [105], typical polynomial orders
in current simulations range up to 11 [3, 8, 87, 93], and even polynomial degrees of p = 15
are not unheard of [29]. The higher polynomial degrees, however, come at a cost. While the
convergence order increases with the polynomial degree, the complexities of the operators
do as well. In each element, (p + 1)^3 degrees of freedom are coupled with each other, leading
to O(p^6) multiplications when implementing the operators via matrix-matrix multiplication
and O(p^4) when exploiting tensor-product bases [23]. In both cases, the operation count
increases super-linearly with the number of degrees of freedom when increasing the polynomial
order.
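The difference between the two operation counts can be illustrated with a small sketch. The following Python snippet, a minimal illustration and not the implementation used in this work, applies a generic one-dimensional operator D in all three directions of a single element, once via the full Kronecker-product matrix, costing O(p^6) operations, and once via sum factorization with three one-dimensional contractions, costing O(p^4):

```python
import numpy as np

p = 7                                    # polynomial degree
n = p + 1                                # nodes per direction
rng = np.random.default_rng(0)

D = rng.standard_normal((n, n))          # generic 1D operator matrix
u = rng.standard_normal((n, n, n))       # data of one element, u[i, j, k]

# Naive variant: assemble the (n^3 x n^3) matrix D (x) D (x) D and
# multiply, requiring O(p^6) operations per element.
A = np.kron(np.kron(D, D), D)
v_naive = (A @ u.reshape(-1)).reshape(n, n, n)

# Sum factorization: contract one direction at a time with the small
# n x n matrix D, requiring only O(p^4) operations per element.
v = (D @ u.reshape(n, -1)).reshape(n, n, n)   # direction 1
v = np.einsum('bj,ajk->abk', D, v)            # direction 2
v = np.einsum('ck,abk->abc', D, v)            # direction 3

assert np.allclose(v_naive, v)
```

In the SEM, the same factorization applies to interpolation, mass, and stiffness operators on tensor-product elements, which is why high-order operators can be evaluated in O(p^4) rather than O(p^6) operations.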
The increased convergence rate of high-order methods is bought with more expensive op-
erator evaluations. While the increased costs are bearable for operations occurring once or
twice per time step, the iterative solution of elliptic equations requires tens if not hundreds
of iterations and therefore operator evaluations. Moreover, the iteration count increases with
the polynomial degree as well due to the condition of the system [23, 21]. The combina-
tion of higher operator costs and increasing iteration count leads to the solution process
of the Poisson equation for the pressure occupying up to 90 % of the runtime of a flow
solver [29] and constitutes a major roadblock on the path to high convergence rates.
For elliptic solvers, a constant iteration count when increasing the number of elements is
mandatory for large-scale simulations, and global coupling is required. A two-level method
which triangulates the high-order mesh using linear finite elements provides good precondi-
tioning, even for unstructured grids [34, 126]. For structured grids, combining multigrid with
overlapping Schwarz type smoothers lowers the iteration count, but it is still dependent
upon the polynomial degree [88]. Additional refinement by weight functions reduces the number
to three iterations [123, 122]. While the cost of a residual evaluation is similar to that
of a convection operator, the smoother requires an explicit inverse on a subdomain comprising
multiple elements. However, in three dimensions, all of these methods require O(p^4) operations for both residual evaluation and smoother.
A different avenue to faster solvers lies in lowering the number of degrees of freedom. For
the spectral-element method, the static condensation method eliminates the element-
interior degrees of freedom, leading to a closer coupling of the remaining ones and a better
conditioned system [21]. It was already utilized in the first publication on the SEM [105] and
the hybridizable discontinuous Galerkin method allows for similar gains for DG [76, 138].
The method, however, still requires global coupling. For a static condensed system, iterative
substructuring reduces the number of unknowns on the faces to one, and leads to an even
smaller system [107, 120, 117]. However, the required number of iterations stays high.
Coupling static condensation with p-multigrid allows for the same number of iterations used
for the full case while using fewer unknowns and allowing for faster operators [52, 51]. But
again the smoother requires O(p^4) operations.
So far, all solution methods for elliptic equations require O(p^4) operations per iteration.
Lowering the operation count to O(p^3) while attaining a constant iteration count promises
improvements of one or two orders of magnitude and removes a major obstacle to exploiting
high convergence orders.
1.2.2 High-performance computing
In the last decades, Moore's law, as proclaimed in 1965 [95], allowed the number of
transistors per chip to double every two years. In turn, the peak performance of the top 500 reported
high-performance computers (HPC) doubled every two years, as Figure 1.1 shows. These
gains mostly resulted from miniaturization: Reduction of the structure width of the transis-
tors in conjunction with increased doping allowed for more transistors per chip at the same
power demand [129]. This “free lunch” generated performance gains for programs without
requiring any changes in the code. After 2012, however, the structure width stagnated as
the physical limits were reached in the production of Complementary Metal–Oxide–Semi-
conductor (CMOS) [67]. As shown in Figure 1.1, the structure width of Intel processors
decreased every two years for more than two decades, but stalled in 2012, with the 10 nm
Figure 1.1: Development of processors and high-performance computers over time. Left: Development of peak performance of the top 500 reported high-performance computers in the world over time. Here, "#1" refers to the most performant supercomputer, "#500" to the 500th, and "sum" to the sum of all 500. The data was extracted from [94]. Right: Structure width of transistors in Intel CPUs over time, extracted from [132].
process being delayed until late 2018. As a result, the performance gains in HPCs declined
and, with the end of Moore's law in sight, new avenues to higher performance are being investigated.
One way to achieve higher performance lies in using specialized hardware instead of the
general-purpose CPUs utilized beforehand. For instance, accelerator hardware, such as Field
Programmable Gate Arrays (FPGAs) [121] and Graphics Processing Units (GPUs) [40],
generate the current performance increase in HPCs. At the moment of writing, 7 of the top
10 HPCs in the world incorporate accelerator hardware [94]. With the HPC environment
being utilized for tasks other than simulations, one prime example being machine learning,
of the future will be heterogeneous, consisting of different processing units specialized in
different tasks and CFD needs to adapt if it wants to keep up with the changing hardware.
For CFD, the increasing heterogeneity poses problems. Current codes are developed with-
out heterogeneity in mind. Plenty of codes are optimized for the CPU [105, 17], sometimes
even with scaling up to 300 000 cores [56], and GPU implementations of high-order codes
exist as well [77, 79]. But all of these stick to the model of completely homogeneous hardware.
This is in part due to the different sets of programming languages being utilized. For
programming multi-core CPUs, many options are applicable. Multi-process parallelization can be
facilitated with libraries such as MPI [128] or with partitioned global address space (PGAS)
languages, e.g. Coarray Fortran [111] or Unified Parallel C [20]. Furthermore,
directive-based shared-memory parallelization can be facilitated via OpenMP [102]. When
turning to GPUs, the programming landscape is fractured as well, with the programming
languages ranging from OpenCL [125] and CUDA [1] to directive-based languages such as
OpenACC [101] and meta-languages such as MAPS [6, 9]. These programming languages
are often not compatible with each other, leading to one program being capable of comput-
ing on one set of hardware, whereas a completely different implementation is required for
a different one. And while some languages, such as OpenCL [125] and OmpSs [15], and
libraries such as OP2 [39, 110] allow addressing multiple kinds of hardware, coupling these
can become a problem.
Multiple models already exist to compute on heterogeneous systems. From the computer
science side, task-based parallelism easily tackles heterogeneity by decomposing every op-
eration into small tasks which can then be sent to the hardware queues. Libraries such
as StarPU [4] and Charm++ [71, 113] implement it and the model is well-suited for
molecular dynamics simulations [103], for example. Similarly, decomposing the problem into
operators and distributing these to the hardware best suited for them can lead to perfor-
mance gains for databases [73]. However, these programming patterns do not match the
reality of CFD, where data parallelism is present in the operators but inter-dependencies
exist between consecutive operators and elements. Moreover, with current systems data
movement constitutes the main bottleneck for an algorithm and can render computing on
GPUs inefficient [44]. When using domain decomposition, creating programs addressing
multiple kinds of hardware with one source remains a problem. Using MPI in conjunction
with OpenMP and CUDA is one approach [137], but requires two programming paradigms
in one program. Simple expansions of the hybrid programming concept for shared-memory
programming [70] are needed in order to lower both the required maintenance and programming
effort.
After attaining a running heterogeneous program, the problem of load balancing remains,
as an unbalanced heterogeneous system can be slower than any of its components alone. In the simplest
case, a constant load ratio is established via heuristics, as done for aerodynamics in [137].
Dynamic load balancing is a further option [139, 53, 24, 25], but requires constant reevaluation
and costly redistribution. Furthermore, it remains unclear whether the optimum is
attained. Functional performance models, for instance assuming that the runtime scales linearly
with the number of elements, are employed in applications ranging from parallel matrix multiplication [140]
and lattice Boltzmann simulations [30] to finite volume codes [86] and spectral-element
simulations [59]. However, all these references take only the total runtime into account,
independent of the algorithm. When taking the growing complexity of CFD algorithms into
account, the approach seems too simplistic, and an evaluation of whether the assumptions hold
is required.
The heterogeneity of the hardware constitutes a major challenge, but other problems exist as
well: While the peak performance of the CPUs has still been increasing, the memory band-
width has not kept up, leading to the so-called memory gap [133]. For low-order discretization
schemes, such as Finite Difference, Finite Volume or low-order Finite Element schemes, very
few operations are required per degree of freedom, e.g. 7 for a stencil for the Laplacian on 3D
Cartesian grids. Current CPUs, however, require a factor of 40 in computational intensity
to attain peak performance [48], otherwise the memory remains the bottleneck [134]. As the
gap is widening [116], ever larger portions of the performance remain unutilized for these
algorithms, and improvements for high-order methods need to account for the memory gap
if they want to stay future-proof.
1.3 Goal and structure of the dissertation
For high polynomial degrees, the largest portion of the runtime of solvers for incompressible
fluid flow is spent in the pressure solver. The main goal of this dissertation lies in lowering
the time spent in the pressure solver to the one spent in convection operators. Not only
does this significantly lower the runtime of high-order solvers, but in addition allows for the
usage of high-order time-stepping schemes, which require at least one solution of the pressure
equation per convergence order. The second goal is to harden the resulting algorithms against
changes in the hardware – the increasing heterogeneity as well as the memory gap.
To lower the runtime of the pressure solver, the runtime of elliptic operators needs to be
factorized to linear complexity. However, this alone does not suffice: In the full system, an
operator scaling with n_DOF ≈ p^3 n_e incurs O(n_DOF) loads and stores, and the memory gap limits
the performance in the foreseeable future. Therefore, the static condensation technique is
employed, where only the boundaries of the elements remain in the equation system. As
the number of memory operations then scales with O(p^2 n_e) = O(n_DOF / p), a linearly scaling
operator allows to circumvent the memory gap. After attaining a linearly scaling operation
count for the operator, the overlapping Schwarz methods proposed in [122, 51] are em-
ployed to attain a constant iteration count. However, the smoother in these references scales
with O(p^4) and requires factorization to linear complexity. The combination of linear complexity
in operator and smoother and a constant iteration count results in a solver scaling
linearly with the number of degrees of freedom, independent of the polynomial degree.
After devising solvers which allow one to bridge the growing memory gap, the aspect of
heterogeneity is investigated on the most commonly encountered heterogeneous system: the
CPU-GPU coupling. First, a programming model allowing to compute on it utilizing one
single source is demonstrated. Thereafter, load balancing of the resulting heterogeneous
systems is investigated in order to extract the maximum attainable performance.
The layout of this work is as follows: Chapter 2 will introduce the notation and the spectral-
element method, then Chapter 3 investigates the attainable performance of tensor-product
operators. These serve as baseline to compare with for the remainder of the dissertation.
Chapter 4 investigates the static condensation method and derives a linearly scaling operator,
allowing to achieve a constant runtime per degree of freedom when increasing the polynomial
degree. Thereafter, Chapter 5 expands this concept to a full multigrid solver with linear
complexity in the number of degrees of freedom. Then Chapter 6 provides a method for
orchestrating heterogeneous systems, considering the programming side as well as the load
balancing side. Lastly, Chapter 7 proposes a flow solver incorporating all of these methods,
validates it, and compares the attained performance with that of other available high-order
codes.
Chapter 2
The spectral-element method
2.1 Introduction
This chapter introduces the nomenclature used throughout the remainder of the work by
recapitulating the main points of the spectral-element method. More thorough introductions
can be found in [23, 74].
2.2 A spectral-element method for the Helmholtz equation
2.2.1 Strong and weak form of the Helmholtz equation
For an open domain Ω, the Helmholtz equation reads
∀x ∈ Ω : λu(x)−∆u(x) = f(x) , (2.1)
with u being the solution variable, f the right-hand side and λ a parameter. For λ ≥ 0, the
equation becomes elliptic and constitutes the basic building block for time-stepping of diffu-
sion equations or for pressure treatment in the case of incompressible fluid flow. While the
equation was originally formulated for λ < 0, i.e. the hyperbolic case, the term Helmholtz
equation is still used for the elliptic case of λ ≥ 0 throughout this work. Equation (2.1) is
a partial differential equation (PDE) of second order and, therefore, one boundary condition
per boundary suffices. Both Neumann and Dirichlet boundary conditions can be
imposed, for instance
∀x ∈ ∂ΩD : u(x) = gD(x) (2.2)
for a Dirichlet and
∀x ∈ ∂ΩN : n · ∇u(x) = gN(x) (2.3)
for a Neumann condition. Here, n denotes the outward-pointing normal vector on the
boundary, ∂ΩD and ∂ΩN the respective boundaries and gD and gN the functions of boundary
values on them. To create an equation system, the weighted residual method introduces a
test function v leading to
∀x ∈ Ω : (vλu)(x) − (v∆u)(x) = (vf)(x) (2.4)

⇒ ∫_{x∈Ω} (vλu)(x) dx − ∫_{x∈Ω} (v∆u)(x) dx = ∫_{x∈Ω} (vf)(x) dx . (2.5)
The above is the strong form of the Helmholtz equation. Introducing function spaces for v
and u at this point leads to a minimization problem and, e.g., collocation methods. Integration
by parts, however, allows lowering the differentiability requirement for u beforehand
while raising the one for v:
⇒ ∫_{x∈Ω} (vλu)(x) dx + ∫_{x∈Ω} (∇^T v ∇u)(x) dx = ∫_{x∈Ω} (vf)(x) dx + ∫_{x∈∂Ω} (v n·∇u)(x) dx . (2.6)
When enforcing Dirichlet boundary conditions in a strong fashion, the corresponding test
function v is set to zero on the boundary. The last term, therefore, implements Neumann
boundary conditions:

∫_{x∈Ω} (vλu)(x) dx + ∫_{x∈Ω} (∇^T v ∇u)(x) dx = ∫_{x∈Ω} (vf)(x) dx + ∫_{x∈∂Ω_N} (v g_N)(x) dx . (2.7)
Compared to (2.5) three things changed. First and foremost, the differentiability require-
ment for u is now the same as for v, leading to the term weak form of the PDE. Second, using
a Galerkin formulation, i.e. the same function spaces for u and v, yields a symmetric op-
erator on the left-hand side. And third, the right-hand side incorporates the right-hand side
of the initial equation and the boundary conditions. The terms for the Dirichlet bound-
ary conditions are not present, as the test function is by construction zero on Dirichlet
boundaries.
2.2.2 Finite element approach
The previous section derived the weak form of the Helmholtz equation. Here, a one-
dimensional domain Ω is considered in order to derive the associated element matrices. The
domain is decomposed into ne non-overlapping subdomains Ωe called elements and on every
element Ωe, a polynomial ansatz of order p approximates the solution u with uh
∀x ∈ Ω_e : u_h(x) = Σ_{i=0}^{p} u_{i,e} φ_{i,e}(x) , (2.8)
where φi,e are the basis functions on the element and ui,e the respective coefficients. Typically,
these functions are constructed on the standard element ΩS = [−1, 1], and then mapped
linearly to Ωe, such that
∀x ∈ Ω_e : u_h(x) = Σ_{i=0}^{p} u_{i,e} φ_i(ξ(x)) . (2.9)
With interpolation polynomials, a set of collocation points {ξ_i}_{i=0}^{p} defines the basis functions
and leads to the interpolation property φ_i(ξ_j) = δ_{ij}, where δ_{ij} denotes the Kronecker delta.
Using the ansatz (2.9), the integrals from (2.7) can be evaluated on each element. For
instance, the mass term equates to

∫_{x∈Ω_e} (v_h u_h)(x) dx = ∫_{x∈Ω_e} Σ_{i=0}^{p} v_{i,e} φ_i(ξ(x)) Σ_{j=0}^{p} u_{j,e} φ_j(ξ(x)) dx = v_e^T M_e u_e , (2.10)
where Me is the element mass matrix and ve and ue the respective coefficient vectors for vh
and u_h in Ω_e. Similarly, the stiffness term yields

∫_{x∈Ω_e} (∂_x v_h ∂_x u_h)(x) dx = ∫_{x∈Ω_e} Σ_{i=0}^{p} v_{i,e} ∂_x φ_i(ξ(x)) Σ_{j=0}^{p} u_{j,e} ∂_x φ_j(ξ(x)) dx = v_e^T L_e u_e . (2.11)
Here, Le denotes the element stiffness matrix. On the standard element the components of
these matrices compute to
M_ij = ∫_{−1}^{1} (φ_i φ_j)(ξ) dξ (2.12)

L_ij = ∫_{−1}^{1} (∂_ξ φ_i ∂_ξ φ_j)(ξ) dξ , (2.13)
with the latter being evaluated using the differentiation matrix
Dij = ∂ξφj(ξi) (2.14)
⇒ L = DTMD . (2.15)
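As a minimal illustration of (2.15) (a NumPy sketch; the thesis implementation itself is written in Fortran), consider p = 1: the GLL nodes are ξ_0 = −1, ξ_1 = 1, the lumped mass matrix is the identity, and L = D^T M D reproduces the classic linear-element stiffness matrix, since the stiffness integrand is constant and the quadrature therefore exact:

```python
import numpy as np

# Linear GLL basis on [-1, 1]: phi_0 = (1 - xi)/2, phi_1 = (1 + xi)/2,
# so D_ij = d(phi_j)/d(xi) evaluated at node xi_i is constant
D = np.array([[-0.5, 0.5],
              [-0.5, 0.5]])
# Lumped GLL mass matrix for p = 1: quadrature weights w_0 = w_1 = 1
M = np.diag([1.0, 1.0])

# Stiffness matrix via L = D^T M D, cf. (2.15)
L = D.T @ M @ D
```

This yields L = [[0.5, −0.5], [−0.5, 0.5]], with the zero row sums expected of a stiffness matrix.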
The affine linear mapping from ΩS to Ωe introduces metric factors
M_e = (h_e / 2) M (2.16)

D_e = (2 / h_e) D (2.17)

L_e = (2 / h_e) L , (2.18)
and result in element-local Helmholtz operators
He = λMe + Le (2.19)
and discrete right-hand sides
Fe = Mefe , (2.20)
where the latter can additionally include the effects of Neumann boundary conditions.
Typically, more than one element is desired in the computation. In the continuous spectral-
element method, continuity of the variable uh facilitates coupling between the elements.
With an element-local storage scheme, an equation system of the form
v_L^T Q Q^T H_L u_L = v_L^T Q Q^T F_L (2.21)
results, where HL denotes the block-diagonal matrix of element-local Helmholtz operators
and uL, vL and FL the vectors of discrete solution, test function, and right-hand side,
respectively. Furthermore, Q^T gathers the contributions for the global degrees of freedom
and Q scatters these to the element-local ones. If the set of collocation nodes incorporates
a separated element boundary, the operation QQ^T simplifies to adding contributions from
adjoining elements, as shown in Figure 2.1, and can be implemented in the local system.
While requiring the storage of multiply occurring data points, the element-local storage
allows for faster operator evaluation [17].
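The QQ^T operation in the element-local storage scheme can be sketched in a few lines (a NumPy illustration with invented array names, not the thesis's Fortran implementation): shared boundary nodes of adjoining elements map to the same global index, are summed in the gather, and the sums are written back to both local copies in the scatter:

```python
import numpy as np

ne, p = 4, 4                       # elements and polynomial degree (example values)
n_loc, n_glob = p + 1, ne * p + 1
u_loc = np.arange(ne * n_loc, dtype=float).reshape(ne, n_loc)

# Map (element, local node) -> global degree of freedom; the shared boundary
# nodes of neighboring elements map to the same global index
gmap = np.array([[e * p + i for i in range(n_loc)] for e in range(ne)])

# Q^T: gather by summing all local copies into the global vector
u_glob = np.zeros(n_glob)
np.add.at(u_glob, gmap, u_loc)

# Q: scatter the global values back to the element-local storage
u_loc = u_glob[gmap]
```

Interior nodes pass through unchanged, while each element-boundary node now holds the sum of the two adjoining local contributions, exactly the behavior shown in Figure 2.1.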
Throughout this work, Gauß-Lobatto-Legendre polynomials, as shown in Figure 2.2,
serve as basis functions. They are interpolation polynomials defined by the Gauß-Lobatto
quadrature points, which include the element boundaries. In conjunction with the interpolation
property, the boundary is completely separated, facilitating an easier gather-scatter
operation. Moreover, the respective system matrices possess a low condition number and,
lastly, the quadrature rules inherent to the points can be employed such that
M_ij ≈ δ_{ij} w_i , (2.22)
Figure 2.1: Gather-scatter operation for an element-wise storage scheme in one dimension when using Gauß-Lobatto-Legendre basis functions for polynomial degree p = 4 and 4 elements. The boundary nodes of the elements are drawn larger and arrows denote communication between the elements. Top: First, the gather operation Q^T gathers contributions from neighboring elements, then the result gets scattered to the element-local storage via Q. Bottom: Implementation in a local-element system, foregoing the global system.
Figure 2.2: Gauß-Lobatto-Legendre basis functions for polynomial degree p = 4 on the standard element [−1, 1].
where w_i is the weight for the point ξ_i. While Gauß quadrature allows for exact integration
of order 2p + 1, only order 2p − 1 is attained on Gauß-Lobatto points. This introduces
a discretization error. Its impact, however, diminishes with increasing polynomial degree.
Furthermore, the convergence properties presented in the next section still hold and the
lowered implementation and computational effort outweigh the slightly lower accuracy.
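For illustration, the GLL points and weights can be computed from the Legendre polynomial P_p (a NumPy sketch, not the thesis implementation): the interior nodes are the roots of P_p', and the quadrature weights follow from the known formula w_i = 2 / (p (p+1) P_p(ξ_i)^2):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

p = 4
P = Legendre.basis(p)
# GLL nodes: the element boundaries plus the roots of P_p'
nodes = np.concatenate(([-1.0], np.sort(P.deriv().roots()), [1.0]))
# GLL quadrature weights: w_i = 2 / (p (p + 1) P_p(xi_i)^2)
weights = 2.0 / (p * (p + 1) * P(nodes) ** 2)
```

For p = 4 this gives the nodes 0, ±√(3/7), ±1 with boundary weights 1/10, and the weights sum to the length of the standard element, 2.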
2.2.3 Convergence properties
Let V denote the function space containing the exact solution uex to (2.1), Vh ⊂ V the
function space spanned by the basis functions of the spectral-element mesh using width h
and polynomial degree p, and uh the solution on the mesh. Further, let Ih denote the
interpolant from V to Vh. Then, the error estimate for the spectral element solution is [23]
∥u_ex − u_h∥_V ≤ C min_{u∈V_h} ∥u − u_ex∥_V ≤ C ∥I_h u_ex − u_ex∥_V , (2.23)
where C is a constant. In the above, the interpolation error generates an upper bound
for the discretization error ∥uex − uh∥V . It depends upon the polynomial degree p, element
width h, as well as the smoothness of the solution and the chosen norm. With sufficient
differentiability, the interpolation error in the maximum norm ∥·∥V,∞ approximates to
C ∥I_h u_ex − u_ex∥_{V,∞} ≤ C / (p+1)! · ∥u_ex^{(p+1)}∥_{V,∞} h^{p+1} Λ(p) , (2.24)
where Λ is the Lebesgue constant of the polynomial system. While Λ depends on the
polynomial order, the corresponding value does not change significantly when using GLL
polynomials, e.g. only an increase by a factor of 1.5 is present when the polynomial degree
increases from p = 5 to p = 20. Inserting the above into (2.23) leads to
⇒ ∥u_ex − u_h∥_{V,∞} ≤ C / (p+1)! · ∥u_ex^{(p+1)}∥_{V,∞} h^{p+1} Λ(p) . (2.25)
Equation (2.25) is the so-called spectral-convergence property: Any polynomial degree allows
for convergence. Asymptotically, however, lower polynomial degrees require more degrees of
freedom to attain the same accuracy as a high-order approximation.
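The spectral decay can be observed numerically. The sketch below uses Chebyshev nodes as a readily available stand-in for GLL points in NumPy (an illustration only; the constants differ, but the qualitative behavior matches): interpolating the smooth function e^ξ and raising the degree from 4 to 12 reduces the maximum error by many orders of magnitude.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

f = np.exp
xs = np.linspace(-1.0, 1.0, 2001)        # fine evaluation grid

def interp_error(deg):
    # Maximum interpolation error of f on [-1, 1] for the given degree
    return float(np.max(np.abs(Chebyshev.interpolate(f, deg)(xs) - f(xs))))

err4, err12 = interp_error(4), interp_error(12)
```

For a function with limited smoothness, in contrast, the decay would be only algebraic, which is why the estimate (2.25) requires sufficient differentiability.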
2.3 Tensor-product elements
2.3.1 Tensor-product matrices
Let C ∈ R^{n²×n²} and u, v ∈ R^{n²} with n ∈ N. The evaluation of the matrix-vector product
v = Cu (2.26)
requires n^4 multiplications when implemented as a triple sum. If, however, the matrix C
possesses a substructure, such that for matrices A, B ∈ R^{n×n}

C = ( A B_{1,1}   A B_{1,2}   …   A B_{1,n} )
    ( A B_{2,1}   A B_{2,2}   …   A B_{2,n} )
    (     ⋮           ⋮        ⋱      ⋮     )
    ( A B_{n,1}   A B_{n,2}   …   A B_{n,n} )  =: B ⊗ A , (2.27)
2.3 Tensor-product elements 15
the matrix can be decomposed
C = B⊗A = (B⊗ I) (I⊗A) = (I⊗A) (B⊗ I) . (2.28)
The above allows for
v = Cu = (B⊗ I) (I⊗A)u , (2.29)
which is the consecutive application of one-dimensional matrix products. First applying A,
then B, requires 2n^3 multiplications instead of the prior n^4. The decomposition C = B ⊗ A
denotes a so-called tensor-product matrix with the following properties:
(B ⊗ A)^T = B^T ⊗ A^T (2.30a)
(B ⊗ A)^{−1} = B^{−1} ⊗ A^{−1} (2.30b)
(B ⊗ A)(D ⊗ C) = (BD) ⊗ (AC) , (2.30c)
with further properties being presented in [89, 23]. While only square matrices were discussed
here, the extension to non-square matrices as well as to multiple dimensions is straightforward.
For the d-dimensional case, application of tensor-product matrices utilizes d n^{d+1} multiplications,
whereas the direct matrix multiplication requires n^{2d} multiplications. Hence, casting
matrix multiplications in the form of tensor products lowers the algorithmic complexity and
facilitates structure exploitation while utilization of (2.28) and (2.30) allows for factorization.
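The factorization (2.28)-(2.29) can be verified directly (a NumPy sketch with made-up matrices): applying B ⊗ A through two one-dimensional matrix products gives the same result as forming the Kronecker product explicitly, at 2n^3 instead of n^4 multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
u = rng.standard_normal(n * n)

# Direct evaluation: build C = B (x) A explicitly, n^4 multiplications
v_direct = np.kron(B, A) @ u

# Factorized evaluation: reshape u to an n x n array (first index belonging
# to B, second to A) and apply the 1D matrices, 2 n^3 multiplications
v_fact = (B @ u.reshape(n, n) @ A.T).reshape(-1)
```

Both evaluations agree to machine precision; only the operation count differs.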
2.3.2 Tensor-product bases
Consider the hexahedral standard element Ω_S in three dimensions, i.e. Ω_S = [−1, 1]^3. As in
the one-dimensional case, a function u can be approximated on Ω_S using n_DOF degrees of
freedom
u_h(ξ) = Σ_{m=1}^{n_DOF} u_m φ_m^{3D}(ξ) , (2.31)
with basis functions φ_m^{3D} : Ω_S → R. In general, any kind of ansatz function can be utilized in
multiple dimensions, generating the need to create sets of collocation points, differentiation
matrices, and integration weights associated with them. Tensor-product bases constitute a
general way to generate a particular set of these. A full polynomial ansatz serves as basis in
each direction, and a multi-index maps to a lexicographic indexing, as shown in Figure 2.3:
u_h(ξ) = Σ_{i=0}^{p_1} Σ_{j=0}^{p_2} Σ_{k=0}^{p_3} u_{ijk} φ_{ijk}^{3D}(ξ) , (2.32)
Figure 2.3: Left: Utilization of multi-index (i, j, k) in the plane k = 0 for GLL collocation nodes of polynomial degrees p1 = 3 and p2 = 4. Right: corresponding lexicographic indexing.
where p1, p2, and p3 are the polynomial degrees in the directions ξ1, ξ2, and ξ3, respectively.
Decomposing the basis functions in the respective directions leads to
φ_{ijk}^{3D}(ξ) = φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) (2.33)

⇒ u_h(ξ) = Σ_{i=0}^{p_1} Σ_{j=0}^{p_2} Σ_{k=0}^{p_3} u_{ijk} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) . (2.34)
While it is possible to use different polynomial degrees per direction, this work employs the
same polynomial degree in every direction, i.e. p = p1 = p2 = p3, simplifying the above to
⇒ u_h(ξ) = Σ_{i=0}^{p} Σ_{j=0}^{p} Σ_{k=0}^{p} u_{ijk} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) . (2.35)
For a hexahedral element, the domain Ωe is [a1, b1]× [a2, b2]× [a3, b3] and linear functions
facilitate the mappings:
∀x ∈ Ω_e : u_h(x) = Σ_{i=0}^{p} Σ_{j=0}^{p} Σ_{k=0}^{p} u_{ijk} φ_i(ξ_1(x_1)) φ_j(ξ_2(x_2)) φ_k(ξ_3(x_3)) . (2.36)
The above provides a regular structure consisting of the one-dimensional ansatz utilized in
all three directions. It allows decomposing many operators in the element directions and,
therefore, opens up possibilities for structure exploitation.
2.3.3 Tensor-product operators
Many operations can be decomposed into their action in different spatial directions. For
a tensor-product element, this leads to a decomposition into the actions in the respective
element directions. For instance, the integration of the two functions uh and vh on the
standard element ΩS = [−1, 1]3 can be written as
∫_{ξ∈Ω_S} (v_h u_h)(ξ) dξ = ∫_{−1}^{1} ∫_{−1}^{1} ∫_{−1}^{1} (v_h u_h)(ξ) dξ_1 dξ_2 dξ_3 (2.37)
with the integrand being
(v_h u_h)(ξ) = Σ_{0≤i,j,k≤p} v_{ijk} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) · Σ_{0≤l,m,n≤p} u_{lmn} φ_l(ξ_1) φ_m(ξ_2) φ_n(ξ_3) . (2.38)
Equation (2.37) can be cast into
v^T M^{3D} u = ∫_{ξ∈Ω_S} (v_h u_h)(ξ) dξ , (2.39)
where
M^{3D}_{ijk,lmn} = ∫_{−1}^{1} ∫_{−1}^{1} ∫_{−1}^{1} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) φ_l(ξ_1) φ_m(ξ_2) φ_n(ξ_3) dξ_1 dξ_2 dξ_3 (2.40)

⇔ M^{3D}_{ijk,lmn} = ∫_{−1}^{1} φ_i(ξ_1) φ_l(ξ_1) dξ_1 · ∫_{−1}^{1} φ_j(ξ_2) φ_m(ξ_2) dξ_2 · ∫_{−1}^{1} φ_k(ξ_3) φ_n(ξ_3) dξ_3 = M_il M_jm M_kn (2.41)

⇔ M^{3D} = M ⊗ M ⊗ M . (2.42)
The tensor-product structure of the basis induces a tensor-product structure in the element
matrices. As in the one-dimensional case, the coordinate transformation introduces
metric coefficients into the matrices, e.g.
M_e^{3D} = (h_{1,e} h_{2,e} h_{3,e} / 8) M ⊗ M ⊗ M , (2.43)
where hi,e denotes the element width in direction i.
In a similar fashion, the tensor-product notation of many operators can be derived. For
instance a tensor-product version of the Helmholtz operator in three dimensions reads
H_e = d_{0,e} M ⊗ M ⊗ M + d_{1,e} M ⊗ M ⊗ L + d_{2,e} M ⊗ L ⊗ M + d_{3,e} L ⊗ M ⊗ M , (2.44)
where the coefficients di,e evaluate to
d_e = (h_{1,e} h_{2,e} h_{3,e} / 8) · ( λ, (2/h_{1,e})², (2/h_{2,e})², (2/h_{3,e})² )^T . (2.45)
While the evaluation of the Helmholtz operator in general requires O(p^6) operations, (2.44) allows for an evaluation in O(p^4) operations. The tensor-product structure of the operators constitutes one key element to well-performing spectral-element solvers and will be harnessed throughout
this work.
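A sketch of this O(p^4) evaluation (a NumPy illustration; the function and variable names are invented here): each Kronecker factor in (2.44) is applied as a one-dimensional matrix along one axis of the (p+1)^3 coefficient array, and the result is checked against the explicitly assembled operator.

```python
import numpy as np

def apply_dir(A, u, axis):
    # Apply the 1D matrix A along one axis of the 3D coefficient array u
    return np.moveaxis(np.tensordot(A, u, axes=(1, axis)), 0, axis)

def helmholtz_apply(u, M, L, d):
    # He u per (2.44); u[k, j, i] with i the fastest index, so L in the
    # last Kronecker slot acts along axis 2
    v = d[0] * apply_dir(M, apply_dir(M, apply_dir(M, u, 2), 1), 0)
    v += d[1] * apply_dir(M, apply_dir(M, apply_dir(L, u, 2), 1), 0)
    v += d[2] * apply_dir(M, apply_dir(L, apply_dir(M, u, 2), 1), 0)
    v += d[3] * apply_dir(L, apply_dir(M, apply_dir(M, u, 2), 1), 0)
    return v

# Verify against the assembled operator for a small random example
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)); M = M + M.T
L = rng.standard_normal((n, n)); L = L + L.T
d = [0.7, 1.0, 2.0, 3.0]
u = rng.standard_normal((n, n, n))
H = d[0] * np.kron(M, np.kron(M, M)) + d[1] * np.kron(M, np.kron(M, L)) \
  + d[2] * np.kron(M, np.kron(L, M)) + d[3] * np.kron(L, np.kron(M, M))
ok = np.allclose(H @ u.ravel(), helmholtz_apply(u, M, L, d).ravel())
```

Each directional application costs O(p^4), while assembling and applying H directly costs O(p^6) per element.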
2.3.4 Fast diagonalization method
The fast diagonalization method is a standard technique for fast application of inverse op-
erators [89, 23]. The method is founded on the generalized eigenvalue decomposition of
two matrices, for instance M and L. While typically utilized to facilitate an inverse for
the Helmholtz operator, every tensor-product operator using only these two matrices is
treatable. Here, the Helmholtz operator (2.44)
H_e = d_{0,e} M ⊗ M ⊗ M + d_{1,e} M ⊗ M ⊗ L + d_{2,e} M ⊗ L ⊗ M + d_{3,e} L ⊗ M ⊗ M
is considered. Due to M being symmetric positive definite, a generalized eigenvalue decom-
position of L with regard to M is possible. I.e. search for eigenvalues λ ∈ R such that
∃ v ∈ Rnp : Lv = λMv . (2.46)
As M is invertible, using the notation M^{1/2} with M^{1/2} M^{1/2} = M, the above can be cast
into

⇒ ∃ v ∈ R^{n_p} : M^{−1/2} L M^{−1/2} (M^{1/2} v) = λ (M^{1/2} v) (2.47)

⇔ ∃ v ∈ R^{n_p} : M^{−1/2} L M^{−1/2} v = λ v . (2.48)
The above is an eigenvalue decomposition that additionally transforms the mass matrix to
identity. When storing the scaled eigenvectors in a matrix S and the eigenvalues in a diagonal
matrix Λ, the above can be written as
S^T M S = I (2.49a)
S^T L S = Λ . (2.49b)
2.4 Performance of basic Helmholtz solvers 19
With L being symmetric and positive semi-definite, existence of a solution is guaranteed
and the eigenvalues are non-negative. However, the resulting transformation matrix is non-
orthogonal and possesses the properties
S S^T = M^{−1} (2.50a)
S^{−1} = S^T M . (2.50b)
Using these identities, the operator (2.44) can now be written as
H_e = (S^{−T} S^T) ⊗ (S^{−T} S^T) ⊗ (S^{−T} S^T) · H_e · (S S^{−1}) ⊗ (S S^{−1}) ⊗ (S S^{−1}) , with each factor S^{−T} S^T = S S^{−1} = I , (2.51)

⇒ H_e = (S^{−T} ⊗ S^{−T} ⊗ S^{−T}) D_e (S^{−1} ⊗ S^{−1} ⊗ S^{−1}) , (2.52)
where
D_e = (S^T ⊗ S^T ⊗ S^T) H_e (S ⊗ S ⊗ S) (2.53)

⇒ D_e = d_{0,e} I ⊗ I ⊗ I + d_{1,e} I ⊗ I ⊗ Λ + d_{2,e} I ⊗ Λ ⊗ I + d_{3,e} Λ ⊗ I ⊗ I . (2.54)
Here, De is the diagonal matrix comprising the generalized eigenvalues of the three-dimen-
sional element operator. If De is invertible, for instance due to λ being positive and there-
fore d0,e > 0 or when only the interior of the element is considered, (2.52) can be explicitly
inverted to
H_e^{−1} = (S ⊗ S ⊗ S) D_e^{−1} (S^T ⊗ S^T ⊗ S^T) . (2.55)
While the presence of De leads to the operator not being in tensor-product form anymore,
the method allows construction of explicit inverses for operators and, furthermore, evaluation
of these in O(p^4) operations.
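The construction can be sketched as follows (a NumPy illustration with invented names, for a single 1D factor): the generalized eigenproblem (2.46) is reduced to a standard symmetric one via M^{−1/2}, giving S with S^T M S = I and S^T L S = Λ, from which an explicit inverse of H = λ_0 M + L follows as in (2.55):

```python
import numpy as np

def fast_diag(M, L):
    # Generalized eigendecomposition L v = lambda M v via the symmetric
    # transform M^{-1/2} L M^{-1/2}, cf. (2.47)-(2.48)
    w, V = np.linalg.eigh(M)                 # M = V diag(w) V^T with w > 0
    Mih = (V / np.sqrt(w)) @ V.T             # M^{-1/2}
    lam, Z = np.linalg.eigh(Mih @ L @ Mih)
    S = Mih @ Z                              # S^T M S = I, S^T L S = diag(lam)
    return S, lam

# Small SPD mass matrix and symmetric positive semi-definite stiffness matrix
rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n)); M = M @ M.T + n * np.eye(n)
L = rng.standard_normal((n, n)); L = L @ L.T

S, lam = fast_diag(M, L)
lam0 = 0.3                                   # Helmholtz parameter (example value)
H = lam0 * M + L
H_inv = S @ np.diag(1.0 / (lam0 + lam)) @ S.T    # explicit inverse
ok = np.allclose(H_inv @ H, np.eye(n))
```

In three dimensions, S and the eigenvalues enter through the tensor-product factors of (2.55), so the inverse is applied in O(p^4) operations without ever assembling H_e^{−1}.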
2.4 Performance of basic Helmholtz solvers
2.4.1 Considered preconditioners and solvers
To show-case the behavior of the spectral-element method as well as validate the baseline
implementation used in this work, basic Helmholtz solvers are considered, employing the
preconditioned Conjugate Gradient (pCG) method [118]. The pCG method allows one to directly
link the required number of iterations to the condition number of the system matrix and therefore
enables qualitative and quantitative study of the operator and the effect of preconditioners
on it.
In the end, multigrid-based preconditioners will be utilized as they negate the increase of the
condition number with the number of elements. Similar to the pCG methods, multigrid re-
quires one residual evaluation as well as a smoothing operation on each level [12, 13]. Multiple
options exist for the smoothers. Block-inverses facilitated by the fast diagonalization method
are one choice [88], as are block-Jacobi or block-Gauss-Seidel smoothers [72]. For SEM,
these consist of tensor-product operations, and their behavior for a constant number of elements
can be mimicked by element-local preconditioners. Two main options exist for local
tensor-product preconditioning: A diagonal Jacobi-type preconditioner is efficient and
easily implemented, but limited in effectiveness.
Using the fast diagonalization operator from Subsection 2.3.4 allows for a more intricate pre-
conditioner. When using the generalized eigenvalue decomposition on the interior element
only, the operator maps into the eigenspace of the inner element, edges, and faces, where the
inverse eigenvalues are applied. On the vertices, the diagonal preconditioner remains. The
strategy results in a block-Jacobi preconditioner.
Three solvers result from the techniques described above: an unpreconditioned CG solver
working on the full set of data, called fCG; a diagonally preconditioned CG method,
called dfCG; and a block-preconditioned variant, called bfCG.
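A minimal pCG in the spirit of these solvers (a generic NumPy sketch, not the thesis's Fortran implementation; passing the identity as preconditioner recovers plain CG, while a diagonal inverse mimics the dfCG variant):

```python
import numpy as np

def pcg(apply_A, b, apply_Pinv, tol=1e-10, maxit=500):
    # Preconditioned conjugate gradients for a symmetric positive definite A
    x = np.zeros_like(b)
    r = b.copy()                     # residual for the zero initial guess
    z = apply_Pinv(r)
    p, rz = z.copy(), r @ z
    for it in range(1, maxit + 1):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_Pinv(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, it

# Example: diagonally dominant SPD system with a Jacobi preconditioner
rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
x, iters = pcg(lambda v: A @ v, b, lambda v: v / np.diag(A))
```

In the actual solvers, `apply_A` is the matrix-free tensor-product Helmholtz operator and `apply_Pinv` the diagonal or fast-diagonalization block preconditioner.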
2.4.2 Setup
In a domain Ω = (0, 2π)^3, the manufactured solution

u_ex(x) = cos(µ(x_1 − 3x_2 + 2x_3)) · sin(µ(1 + x_1)) · sin(µ(1 − x_2)) · sin(µ(2x_1 + x_2)) · sin(µ(3x_1 − 2x_2 + 2x_3)) , (2.56)
is considered, which generalizes the one from [52] to three dimensions. The right-hand side
of the Helmholtz equation is evaluated analytically as

f(x) = λ u_ex(x) − ∆u_ex(x) . (2.57)
Inhomogeneous Dirichlet conditions are imposed on the boundary and the stiffness parameter
µ is set to 5, leading to a heavily oscillating right-hand side. Furthermore, the Helmholtz
parameter λ is chosen as λ = 0, corresponding to the Laplace equation and, hence,
a larger condition number of the system matrix and therefore a harder test case. The initial
guess is set to pseudo-random number inside the domain, and to the respective boundary
conditions on the boundary. This specific initial guess prevents overresolution of the resid-
ual, which would lead to fewer required iterations than expected in practice. After attaining
the numerical solution uh, it is interpolated to a mesh using polynomial degree 31, which is
deemed sufficient to resolve the solution, and the maximum error computed on the colloca-
tion nodes.
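As a concrete illustration of this setup, the manufactured solution (2.56) and right-hand side (2.57) can be sketched in a few lines. The following Python snippet is a stand-in for the thesis's Fortran code: it evaluates u_ex and approximates f via second-order central differences instead of the analytical differentiation; the names `u_ex` and `rhs_fd` are illustrative, not from the original implementation.

```python
import math

MU, LAM = 5.0, 0.0  # stiffness and Helmholtz parameters from the setup above

def u_ex(x1, x2, x3):
    """Manufactured solution (2.56)."""
    return (math.cos(MU * (x1 - 3*x2 + 2*x3)) * math.sin(MU * (1 + x1))
            * math.sin(MU * (1 - x2)) * math.sin(MU * (2*x1 + x2))
            * math.sin(MU * (3*x1 - 2*x2 + 2*x3)))

def rhs_fd(x, h=1e-4):
    """f = lambda*u - Laplacian(u), the Laplacian approximated by central differences."""
    lap = 0.0
    for d in range(3):
        xp, xm = list(x), list(x)
        xp[d] += h
        xm[d] -= h
        lap += (u_ex(*xp) - 2.0 * u_ex(*x) + u_ex(*xm)) / h**2
    return LAM * u_ex(*x) - lap
```

Refining h and observing that the computed f stabilizes gives a quick consistency check of the analytic right-hand side used in the actual solver.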
[Figure 2.4: Errors when solving the Helmholtz equation with a spectral-element method and varying the polynomial degree p ∈ {4, 8, 12, 16}. Left: maximum error ‖u − u_ex‖∞ over the number of elements per direction k. Right: error over the number of degrees of freedom per direction.]
The solvers were implemented in Fortran 2008 using double-precision floating-point numbers and compiled with the Intel Fortran compiler, with MPI_Wtime serving for time measurements. The tests were conducted on one node of the HPC system Taurus at ZIH Dresden, consisting of two sockets, each containing an Intel Xeon E5-2680 v3 with twelve cores running at 2.5 GHz. Of these, only one core computed during the tests, so that the algorithms, not the parallelization efficiency, were measured.
Two test cases are considered: In the first, the domain is discretized with spectral elements of degree p ∈ {4, 8, 12, 16}, and the number of elements is scaled from n_e = 4³ to n_e = 256³. This allows investigating the discretization error against the number of degrees of freedom and, therefore, validating the implementation. In the second case, n_e = 8³ spectral elements of polynomial degrees p ∈ {2, …, 16} discretize the domain. With the constant number of elements, the lack of a multigrid preconditioner does not hinder the solvers, allowing investigation of solution times as well as condition numbers. Hence, the number of iterations required to reduce the residual by ten orders of magnitude, n₁₀, was measured. Furthermore, runtime tests were conducted. To attain reproducible runtimes, the solvers were called 11 times, with the runtimes of the last 10 calls being averaged. This precludes measurement of instantiation effects, e.g. library loading, that would only occur in the first time step of a real-world simulation.
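The measurement protocol can be sketched as follows. This is an illustrative Python harness, not the thesis's Fortran driver; `timed_mean` and `solve` are hypothetical names, and `time.perf_counter` plays the role of MPI_Wtime.

```python
import time

def timed_mean(solve, n_warmup=1, n_runs=10):
    """Average runtime over n_runs calls, after discarding n_warmup warm-up calls."""
    for _ in range(n_warmup):
        solve()                  # absorbs one-time effects, e.g. library loading
    runtimes = []
    for _ in range(n_runs):
        t0 = time.perf_counter() # stand-in for MPI_Wtime
        solve()
        runtimes.append(time.perf_counter() - t0)
    return sum(runtimes) / len(runtimes)
```

With n_warmup = 1 and n_runs = 10, this reproduces the 11-call protocol described above.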
2.4.3 Results
Figure 2.4 depicts the maximum error of the discretization over the number of elements and degrees of freedom. For small numbers of elements and p = 4, the error exhibits values two
[Figure 2.5: Number of iterations and runtime per iteration and degree of freedom when solving the Helmholtz equation with locally preconditioned conjugate gradient methods. Three solvers are compared: an unpreconditioned conjugate gradient method (fCG), a diagonally preconditioned one (dfCG), and a block-preconditioned one (bfCG). Left: number of iterations n₁₀ to reduce the residual by ten orders of magnitude. Right: runtime per iteration and number of degrees of freedom (DOF).]
orders of magnitude higher than the maximum value of the solution, which is probably due
to underresolution of the right-hand side leading to aliasing errors. With higher polynomial
degrees and, hence, more degrees of freedom to resolve the boundary and initial conditions,
the error decreases. Increasing the number of elements decreases the error as well, until machine precision is reached at 10⁻⁹. For each tested polynomial degree, the error scales with h^(p+1) after entering the asymptotic regime, replicating the spectral convergence property (2.25) and therefore validating the implementation. When plotting over the number of degrees of freedom per direction, the differences in accuracy become less pronounced. For fewer than 100 degrees of freedom, no significant difference in the errors is present. When increasing the number of degrees of freedom per direction beyond 100, the asymptotic behavior becomes dominant, with a slope of p.
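The h^(p+1) scaling can be verified with a standard two-mesh order estimate: the observed order is log(e_coarse/e_fine)/log(h_coarse/h_fine). The Python helper below is an illustrative sketch; `observed_order` is a made-up name, not part of the thesis code.

```python
import math

def observed_order(e_coarse, e_fine, refinement=2.0):
    """Observed convergence order from errors on two meshes with h_coarse/h_fine = refinement."""
    return math.log(e_coarse / e_fine) / math.log(refinement)

# Errors behaving like C*h^(p+1) with p = 4 should recover an order of 5
# when the mesh is refined by a factor of 2:
p = 4
e_coarse = 1e-3
e_fine = e_coarse / 2 ** (p + 1)
```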
Figure 2.5 compares the required number of iterations for the three solvers as well as their runtime per iteration and degree of freedom for a constant number of elements, n_e = 8³, and a variable polynomial degree. As expected, the unpreconditioned solver requires the largest number of iterations to reduce the residual by ten orders of magnitude. Diagonal preconditioning only slightly decreases this number, exhibiting the same nearly linear increase in the number of iterations. As the elements possess only one interior point at p = 2, diagonal and block preconditioning amount to the same operation and, hence, generate the same number of iterations. For higher polynomial degrees, block-wise preconditioning generates a lower slope, and only half as many iterations are necessary at p = 16. This, however, does not translate to a lower runtime: the block-preconditioned solver has the largest runtime until p = 15 and generates only slight savings thereafter.
For all three solvers, the runtime per degree of freedom exhibits a linear increase with the polynomial degree. In combination with the increasing number of iterations, this amounts to a nearly constant runtime per iteration and degree of freedom, which is in stark contrast to the expectation. While the number of multiplications for the Helmholtz operator and the preconditioner scales with O(p⁴), the asymptotic regime is not reached, indicating that the implementations are not compute-bound. Hence, optimization potential exists even for these baseline solvers, stemming from operators which do not harness the full potential of the hardware. This is a known issue in the spectral-element community: matrix-matrix multiplication often ends up faster than operators exploiting the tensor-product structure, as the former is optimized for the hardware [17]. To attain a baseline variant against which more efficient algorithms can be compared, the current implementations need to be streamlined.
2.5 Summary
This section briefly recapitulated the spectral-element method. Compared to low-order methods, the polynomial ansatz of order p allows the error to scale with h^(p+1) and therefore permits arbitrary convergence orders. However, the higher error reduction comes at a cost. With tensor-product bases, the operators boil down to the consecutive application of 1D matrices in the respective dimensions of the elements, lowering the multiplication count from O(p^(2d)) to O(d·p^(d+1)). While this constitutes a significant improvement over matrix-matrix implementations, the operators still scale super-linearly with the number of degrees of freedom when increasing the polynomial degree. However, the simple implementation of such operators does not follow this prediction due to inefficiencies.
Chapter 3
Performance optimization for tensor-product operators
3.1 Introduction
Explicit time stepping schemes and iterative solvers consist of the recurring application of operators, for instance the gradient, divergence, interpolation, and Laplacian operators, which occupy a large portion of the runtime. With spectral elements and tensor-product bases, the operation count of these scales with O(p^(d+1) n_e), where d is the number of dimensions. The operations decompose into one-dimensional matrix products. While these bear similarity to batched DGEMM [84, 91], they work on non-contiguous dimensions, barring the direct usage of matrix multiplication implementations. Despite the operator complexity scaling with O(p⁴) in 3D, a direct matrix-matrix implementation is often more efficient for reasonable polynomial degrees, as showcased for two dimensions in [17].
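The two cost models can be compared directly. The Python sketch below (with hypothetical helper names) evaluates the multiplications per element for the assembled operator, n_p^(2d), against sum factorization, d·n_p^(d+1), for d = 3.

```python
def mults_matrix(n_p, d=3):
    """Multiplications per element for the assembled element matrix: n_p^(2d)."""
    return n_p ** (2 * d)

def mults_tensor(n_p, d=3):
    """Multiplications per element for sum factorization: d * n_p^(d+1)."""
    return d * n_p ** (d + 1)
```

Already at n_p = 2 the tensor-product count (48) undercuts the assembled one (64), and the gap widens rapidly; the question addressed in this chapter is why the implementations do not reflect this.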
This chapter investigates the performance of tensor-product operations in three dimensions, providing a structured approach toward their optimization. Due to their high relevance for applications in numerous scientific domains, the interpolation operator, Helmholtz operator, and fast diagonalization operator serve as prototypical examples. For these, efficient implementations are proposed and possible performance gains are investigated, from the single operator itself to a full Helmholtz solver. To this end, Section 3.2 investigates the performance of a loop-based implementation of the interpolation operator compared to a library-based implementation of the matrix-matrix multiplication. In Section 3.3, optimization strategies are discussed and their impact on the operator runtime evaluated. The approach is afterwards extended to the Helmholtz and the fast diagonalization operators in Section 3.4. These are then utilized in the Helmholtz solvers from Chapter 2 to showcase the attainable performance gains for the basic building blocks of time-stepping schemes in CFD. This chapter summarizes the work presented in [64], where the approach was subsequently applied to a fully-fledged PDE solver for combustion problems.
3.2 Basic approach for interpolation operator
3.2.1 Baseline operators
Continuous and discontinuous spectral-element formulations typically employ tensor-product elements with n_p points per direction. This approach allows decomposing operations into smaller ones working separately on the directions and, hence, opens up possibilities for structure exploitation. On these elements, interpolation can be required, e.g. for visualization. In the one-dimensional case, applying a matrix mapping from the n_p basis functions to n_p⋆ new ones, A ∈ R^(n_p⋆ × n_p), to the coefficient vector implements the operation. For the multi-dimensional case, the matrices can differ, e.g. A is employed in the first, B in the second, and C in the third dimension. The most prominent use case is n_p ≠ n_p⋆: the number of unknowns is increased or decreased, implementing the prolongation or restriction operation for multigrid. The case n_p = n_p⋆ is relevant as well and can, for instance, be employed for creating a polynomial cutoff filter to stabilize flow simulations [33]. This section addresses the latter case, assuming a constant polynomial degree p = n_p − 1 in every element and the simpler choice A = B = C. The operator is applied to a vector u ∈ R^(n_p³, n_e) in all three dimensions, computing the result vector v ∈ R^(n_p³, n_e):

∀ Ω_e : v_e ← (A ⊗ A ⊗ A) u_e ,   (3.1)

which can be written as

v ← (A ⊗ A ⊗ A) u   (3.2)

when interpreting v and u as matrices. While small, the operator incorporates matrix products in every dimension and, hence, contains all features and components present in larger operators.
As baseline implementations, two variants are considered. The first one assembles the operator into a matrix A ⊗ A ⊗ A =: A_3D ∈ R^(n_p³ × n_p³) and applies it in every element. This allows leveraging highly optimized matrix multiplications from libraries such as BLAS, and the resulting algorithm can be written in one line, i.e. v ← A_3D u. It couples n_p³ values with n_p³ values, requiring a total of n_p⁶ n_e multiplications. As it employs matrix-matrix multiplication implemented via GEMM from BLAS, it is called MMG. The second variant implements the tensor-product decomposition (3.1), consecutively applying three one-dimensional matrix products in separate dimensions via Algorithm 3.1. With strided access in two dimensions, BLAS
Algorithm 3.1: Loop-based implementation of the sum factorization of the interpolation operator using temporary storage arrays.
1: for Ω_e do
2:   for 1 ≤ i, j, k ≤ n_p do
3:     ū_ijk ← Σ_{l=1..n_p} A_il u_{ljk,e}   ▷ ū = (I ⊗ I ⊗ A) u_e
4:   end for
5:   for 1 ≤ i, j, k ≤ n_p do
6:     û_ijk ← Σ_{l=1..n_p} A_jl ū_{ilk}   ▷ û = (I ⊗ A ⊗ A) u_e
7:   end for
8:   for 1 ≤ i, j, k ≤ n_p do
9:     v_{ijk,e} ← Σ_{l=1..n_p} A_kl û_{ijl}   ▷ v_e = (A ⊗ A ⊗ A) u_e
10:  end for
11: end for
is not directly applicable and loops implement the variant. The result is termed “tensor-product loop” (TPL). Due to the three one-dimensional matrix multiplications, 3 n_p⁴ n_e multiplications occur. Hence, it requires fewer multiplications than MMG starting at n_p = 2.
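The equivalence of the two baseline variants can be illustrated in a few lines. The Python sketch below is a conceptual stand-in for the Fortran implementations: it applies both the assembled operator A ⊗ A ⊗ A and the three one-dimensional sweeps of Algorithm 3.1 to the same element data and checks that they agree.

```python
import itertools
import random

n_p = 3
random.seed(1)
A = [[random.uniform(-1.0, 1.0) for _ in range(n_p)] for _ in range(n_p)]
u = {idx: random.uniform(-1.0, 1.0)
     for idx in itertools.product(range(n_p), repeat=3)}

def sweep(w, dim):
    """Apply A along one dimension of the n_p x n_p x n_p cube w."""
    return {idx: sum(A[idx[dim]][l]
                     * w[tuple(l if d == dim else idx[d] for d in range(3))]
                     for l in range(n_p))
            for idx in w}

# Sum factorization: three one-dimensional sweeps, 3*n_p^4 multiplications.
v_tpl = sweep(sweep(sweep(u, 0), 1), 2)

# Assembled operator A (x) A (x) A: n_p^6 multiplications per element.
v_mmg = {idx: sum(A[idx[0]][l] * A[idx[1]][m] * A[idx[2]][n] * u[(l, m, n)]
                  for l, m, n in itertools.product(range(n_p), repeat=3))
         for idx in itertools.product(range(n_p), repeat=3)}
```

Both paths produce identical results; only the operation count, and as the following sections show, the achieved hardware utilization, differ.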
3.2.2 Runtime tests
The two variants MMG and TPL were implemented in Fortran 2008 using double precision, with the Intel Fortran compiler v. 2018 serving as compiler and the corresponding Intel Math Kernel Library (MKL) as BLAS implementation. For time measurements, MPI_Wtime was employed, and the optimization level was set to O3.
To compare the performance of the operators, runtime tests were conducted on the high-performance computer Taurus at ZIH Dresden. One node containing two Intel Xeon E5-2680 CPUs running at a clock speed of 2.5 GHz served as the measuring platform. As this chapter aims to improve the single-core performance, only one core was utilized, lest effects of parallelization, not the performance, be measured. When using AVX2 vector instructions and counting the fused multiply-add instruction (FMA) as two floating-point operations, the maximum available performance of one core computes to 40 GFLOP/s. Furthermore, 64 kB of L1 cache, 256 kB of L2 cache, and 30 MB of L3 cache are available [48].
The operator size was varied from n_p = 2 to n_p = 17, lying in the range of polynomial degrees currently employed in simulations, and the number of elements was changed from n_e = 8³ up to n_e = 16³ = 4096. For these parameters, the operators were run 101 times, with the runtime of the last 100 runs being averaged. This prevents measurement of one-time effects, e.g. the initialization of libraries such as MKL.
Figure 3.1 depicts the computational throughput of the operators, corresponding to the
number of computed entries in the result vector per second, called (Mega) Lattice Updates
[Figure 3.1: Performance of the two implementation variants for the interpolation operator when varying the number of elements n_e for two 1D operator sizes n_p. Left: n_p = 2, right: n_p = 7. In both cases, the performance is measured in mega lattice updates per second (MLUP/s).]
Per Second (MLUP/s). At n_p = 2, a large overhead is present for both variants, leading to a low performance at low numbers of elements, with MMG achieving 100 MLUP/s. Afterwards, the performance increases, saturating at 500 elements with 1300 MLUP/s and experiencing a slight drop at 3000 elements. TPL, however, does not show such behavior and stays slower. The matrix-matrix based implementation is up to a factor of 30 faster than TPL, which results from the operator size: at n_p = 2, the matrix A_3D has size 8 × 8, which is near optimal for the microarchitecture. The instruction set AVX2 utilizes 256 bit = 32 B wide vector registers, allowing for simultaneous computation on four double-precision values [48]. With these, an 8 × 8 matrix product maps well to the architecture and incurs nearly no overhead from loops. After peaking at 1300 MLUP/s, a decline in performance is present for MMG, corresponding to the change from loading data from the L2 cache to loading data from the slower L3 cache. In [48], a read bandwidth of 32 GB/s was measured for the L3 cache. Assuming that two doubles are loaded and one is stored per point and, hence, that the computation of one point requires 24 B, the roofline model predicts 1300 MLUP/s [134]. Hence, the variant MMG attains 4/5 of the maximum possible performance for large numbers of elements. The TPL variant, however, does not, producing only 50 MLUP/s in both cases.
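The quoted roofline bound is a one-line calculation from the stated bandwidth and traffic per point; the sketch below spells it out (the constants are the measured values cited above).

```python
L3_BANDWIDTH = 32e9       # B/s, read bandwidth measured for the L3 cache in [48]
BYTES_PER_POINT = 3 * 8   # two 8-B doubles loaded + one stored per result point

# Bandwidth-limited throughput in mega lattice updates per second:
mlups = L3_BANDWIDTH / BYTES_PER_POINT / 1e6
```

This evaluates to roughly 1333 MLUP/s, i.e. the ≈1300 MLUP/s ceiling observed for MMG.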
In the regime 2 < np < 7, using the assembled element matrix is faster than employing
the loop-based version. For np ≥ 7 TPL becomes faster, which is in agreement with other
studies: Typically, tensor-product variants are efficient only for high polynomial degrees and
matrix-matrix based implementations are utilized for the lower ones [17].
The measurements for ne = 4096 are shown in Figure 3.2, using throughput as well as the
rate of floating point operations measured in GFLOP/s. At np = 6, MMG attains a rate
[Figure 3.2: Performance of the two operators when using n_e = 4096 elements and increasing the 1D operator size n_p. Top: performance measured in GFLOP/s, with the threshold of 40 GFLOP/s being the maximum rate of floating-point operations the CPU core can execute. Bottom: performance in mega lattice updates per second (MLUP/s), with the threshold line corresponding to 50 MLUP/s.]
of 35 GFLOP/s and from then on extracts 90 % of the maximum possible number of floating-point instructions per second. Still, a decreasing number of mesh updates results, as the number of operations required per mesh point increases with n_p³. The variant TPL, on the other hand, utilizes less than an eighth of the available performance. However, it attains an approximately constant throughput of 50 MLUP/s, with n_p = 4, 8, 12 performing better due to the operator width being a multiple of the instruction-set width. With the constant number of updates and low floating-point utilization, the variant is far from compute-bound. This is due to the compiler: as can be seen from the assembly code, many loop-control constructs are present and the reduction loops are vectorized. The compiler optimizes the code for very large loops and is misled by the large number of small, deeply nested ones. The result is code that does not perform well for any relevant polynomial degree, and this chapter illustrates ways to achieve efficient implementations.
3.3 Compiling information for the compiler
3.3.1 Enhancing the interpolation operator
The last section showed what is common knowledge in the SEM community: for low to medium polynomial degrees, loop-based implementations are often slower than versions utilizing optimized libraries [17, 98], by a factor of up to 30. The highly optimized libraries make complexity-wise inferior algorithms excel by exploiting the full potential of the hardware. To transplant the behavior of these libraries to tensor-product operators, one has to understand why they shine. Most BLAS implementations received intensive manual optimization, as documented for GotoBLAS [42]. These optimizations include using input-size-dependent code, cache blocking for large operators, and manual unrolling of loops for known loop bounds. In conjunction with architecture-dependent parameters, further optimization can be achieved [49]. In the following, these techniques are explored, once by manual optimization and once by exploiting the libraries.
Compiling the current implementation of the interpolation operator generates a binary incorporating a large number of loop-control constructs, diminishing the performance. These stem from the limited amount of information presented to the compiler, confining it to optimizing for very large numbers of loop iterations. As a first measure, denoting the data size directly lowers the required number of loop-control operations and, furthermore, enables the compiler to determine which kind of optimization is beneficial [23]. As a first step, Algorithm 3.1 was implemented with compile-time-known loop bounds and data sizes, i.e. once for n_p = 1, 2, …, with the relevant variant being selected at runtime. While leading to a better performance, this implementation requires either code replication or meta-programming, both lowering the maintainability and increasing the binary size. As a parametrization of the operator is performed, the resulting variant of TPL is called TPL-Par. To further enable compiler optimizations, all one-dimensional matrix products were extracted into separate compilation units. Due to the lowered scope, these are easier to analyze, and as they get inlined afterwards, no function-call overhead occurs at runtime. For better cache blocking, the matrix products over the second and third dimension in Algorithm 3.1 were fused into one loop over the last dimension and, hence, one compilation unit. As this code transformation is the opposite of inlining, it is termed outlining here and the variant called TPL-ParO.
Over the last decades, the performance of compiler-generated code has increased significantly. However, the optimizations applied by compilers are not necessarily those that attain the best result. To evaluate the effectiveness of compiler optimizations as well as to gauge possible performance gains, manual refinement was performed on the outlined variant, with the product being TPL-Hand. For instance, the operands in the matrix products occur multiple times, but are recomputed every single time. Reusing them leads to fewer required loads and stores and, in turn, better performance. Furthermore, unroll and jam [49] was applied with widths 2 and 4, with 4 being used for n_p = 4, 8, 16 and 2 for the remaining even polynomial degrees. For n_p = 12, the unroll width was set to two to ensure that at least one data point remains in each cache line when using the stride of 96 B. Furthermore, the compiler tried vectorizing the reduction loops, which proved detrimental to the overall performance. To circumvent this behavior, the innermost non-reduction loop was vectorized by hand. Lastly, for n_p = 16, blocking of the matrix access led to further performance gains. The mentioned optimizations are directly applicable for even operator sizes; the implementations for odd ones in addition treat the remainder of the unroll-and-jam operation, leading to a lower performance for these.
Algorithm 3.2: (“TPG”) Variant of the interpolation operator optimized for contiguous data access patterns in the matrix multiplications. The matrices u and v are interpreted as R^(n_p × n_p² n_e) matrices. Cyclic permutation allows evaluating the operator with matrix-matrix products. One temporary storage, w ∈ R^(n_p × n_p² n_e), is required.
1: w ← P(u)   ▷ w = P(u)
2: v ← A w   ▷ v = P(A ⊗ I ⊗ I u)
3: w ← P(v)   ▷ w = P²(A ⊗ I ⊗ I u)
4: v ← A w   ▷ v = P²(A ⊗ A ⊗ I u)
5: w ← P(v)   ▷ w = A ⊗ A ⊗ I u
6: v ← A w   ▷ v = A ⊗ A ⊗ A u

Algorithm 3.3: (“TPG-Bl”) Variant of Algorithm 3.2. The operator is used on subsets I that fit in the L2 or L3 cache.
1: for each subset I do   ▷ subset I fits into the L2 or L3 cache
2:   w ← P(u_I)   ▷ w = P(u_I)
3:   v_I ← A w   ▷ v_I = P(A ⊗ I ⊗ I u_I)
4:   w ← P(v_I)   ▷ w = P²(A ⊗ I ⊗ I u_I)
5:   v_I ← A w   ▷ v_I = P²(A ⊗ A ⊗ I u_I)
6:   w ← P(v_I)   ▷ w = A ⊗ A ⊗ I u_I
7:   v_I ← A w   ▷ v_I = A ⊗ A ⊗ A u_I
8: end for
The above approach mimics the painstaking optimizations performed to gain well-performing code for matrix multiplications. An orthogonal approach lies in casting the tensor-product operations in a way that makes already-optimized libraries applicable. In (3.1), one matrix product per dimension is applied to the input vector u. While in Fortran the first dimension lies contiguously in memory, the others do not. Working on these results in strided memory accesses, barring the usage of DGEMM. However, cyclic permutation poses a remedy and makes DGEMM applicable again [16]. Here, the permutation operator P(·) moves the last direction to the front:

v = P(u) ⇔ ∀ 1 ≤ i, j, k ≤ n_p : v_{kij,e} = u_{ijk,e}   (3.3)
v = P²(u) ⇔ ∀ 1 ≤ i, j, k ≤ n_p : v_{jki,e} = u_{ijk,e}   (3.4)
v = P³(u) ⇔ ∀ 1 ≤ i, j, k ≤ n_p : v_{ijk,e} = u_{ijk,e} .   (3.5)

Using P(·), the interpolation operator transforms into a cascade of permutations, each followed by a call to DGEMM. This results in a variant utilizing GEMM for tensor products, which is called TPG and shown in Algorithm 3.2.
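The permute-then-multiply cascade of Algorithm 3.2 can be sketched conceptually as follows. In this Python stand-in (not the Fortran/MKL implementation), `apply_first` plays the role of the DGEMM call on the now-contiguous first index, and after three cyclic permutations the indices realign.

```python
import itertools
import random

n_p = 3
random.seed(2)
A = [[random.uniform(-1.0, 1.0) for _ in range(n_p)] for _ in range(n_p)]
u = {idx: random.uniform(-1.0, 1.0)
     for idx in itertools.product(range(n_p), repeat=3)}

def perm(w):
    """Cyclic permutation P: the last direction becomes the first, v_kij = u_ijk."""
    return {(k, i, j): w[(i, j, k)] for (i, j, k) in w}

def apply_first(w):
    """Matrix product over the first (contiguous) index -- the DGEMM step."""
    return {(i, j, k): sum(A[i][l] * w[(l, j, k)] for l in range(n_p))
            for (i, j, k) in w}

v = u
for _ in range(3):            # three permute-then-multiply stages
    v = apply_first(perm(v))  # P^3 = identity, so the result is A (x) A (x) A u
```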
The loop-based variants work on one element after the other, which leads to inherent cache
blocking. The variant TPG, however, operates multiple times on the whole data set. If it
does not fit into the faster caches, it will be loaded multiple times from L3 cache, or even
[Figure 3.3: Performance of the interpolation operator with the new variants (MMG, TPL, TPL-Par, TPL-ParO, TPL-Hand, TPG, TPG-Bl), depending on the 1D operator size n_p ∈ {2, 4, 8, 12} and the number of elements n_e.]
from RAM. To exploit the higher bandwidth of the caches, the data set is split into subsets of elements that fit into the L2 cache or, if not applicable, at least into the core's portion of the L3 cache, resulting in Algorithm 3.3. As a level of cache blocking was added, the variant is called TPG-Bl.
3.3.2 Results
The runtime tests from Section 3.2.2 were repeated including the new variants, resulting in
the data shown in Figure 3.3. While the matrix-matrix based variant remains the fastest
for np = 2, the parametrized variant at least attains a third of the performance. Using
[Figure 3.4: Performance of the variants of the interpolation operator with n_e = 4096 elements and increasing 1D operator size n_p. Left: performance measured in gigaflops (GFLOP/s). Right: performance measured in mega lattice updates per second (MLUP/s).]
smaller compilation units only leads to a slight further increase. However, additionally enforcing vectorization allows TPL-Hand to reach half the performance of MMG. The tensor-product variants employing GEMM use more loads and stores and, hence, do not reach the same levels, but are faster than TPL, with gains of up to a factor of ten. At n_p = 4, cache blocking retains for TPG-Bl the performance that TPG reaches at low numbers of elements. Both are faster than MMG and TPL-Par, with only the outlined variant outperforming them slightly. Further adding hand optimization allows for a factor of three in performance gains compared to TPL-ParO, and nine compared to TPL. For n_p = 8, the operator width fits the architecture very well, allowing the variants using GEMM to attain twice the performance of the parametrized variant. Cache blocking is required to retain a significant performance even for small numbers of elements. While for n_p = 12 the benefit of hand optimization diminishes to one quarter, it is required to attain good performance at n_p = 16.
Figure 3.4 depicts throughput as well as floating-point performance with the new operator variants. As before, the matrix-matrix variant is the fastest for n_p = 2, however only for that one operator size. From n_p = 3 on, the hand-optimized variant extracts the most performance, with distinct peaks present at n_p = 4, 8, 12, 16. There, nearly half of the maximum performance is reached, as no treatment of the remainder from unrolling or of shorter SIMD operations occurs. Using only parametrization and outlining does not attain the same performance: only half the performance of TPL-Hand results, a flop rate between 5 and 10 GFLOP/s, with a peak at n_p = 12. Even with smaller compilation units and known data and loop sizes, the compiler does not generate a well-performing binary, necessitating manual optimization. Compared to the loop-based variants, TPG-Bl quickly
[Figure 3.5: Roofline model for the interpolation operator (MMG, TPL, TPL-Hand), with “vec” indicating vectorization and “FMA” the fused multiply-add capability. The horizontal lines indicate the peak performance with the respective capabilities; the sloped line corresponds to the L3 bandwidth.]
reaches a stable 10 GFLOP/s and becomes faster than the non-optimized loop variants with a far lower optimization effort.
Figure 3.5 shows a roofline analysis of the operators when using n_e = 4096 elements, i.e. a diagram comparing the attainable performance against the achieved one [134]. The rate of floating-point operations is plotted over the computational intensity, the amount of work performed per loaded byte. Two limits bound the floating-point rate: the maximum number of operations the core can perform, marked by the top line, and the bandwidth for loading and storing data, resulting in the sloped line on the left. For the analysis, loading and storing from the L3 cache is assumed, as well as that the result is written only once. While this will over-estimate the computational intensity of the implementations, it shows the optimization potential. The variant MMG is memory-bound only for n_p = 2 and then quickly reaches peak performance. All features of the CPU are utilized, including vectorization and FMA. The hand-optimized variant is memory-bound for n_p ≤ 4 and attains half of the possible performance there. Beyond n_p = 4, the operation is compute-bound, and half of the executed operations contribute to the result. While this is small compared to the performance of the matrix product, operations are required to load the data for matrices and operands into the registers, limiting the potential of further optimization beyond 50 % of peak performance.
3.4 Extension to Helmholtz solver
3.4.1 Required operators
The last section investigated the attainable performance of tensor-product operations using the interpolation operator, as it constitutes the basic building block for larger operators. This section expands from this building block to the main component of incompressible fluid flow solvers, where the solution of the Helmholtz equations consumes a significant part of the runtime. Hence, the main ingredients of such solvers are considered here: operator and preconditioner.
As shown in Subsection 2.3.3, the element Helmholtz operator for hexahedral elements is

H_e = d₀ M ⊗ M ⊗ M + d₁ M ⊗ M ⊗ L + d₂ M ⊗ L ⊗ M + d₃ L ⊗ M ⊗ M .   (3.6)

Extracting the diagonal mass matrix M allows writing the application to a vector u as

v_e = d₀ ū_e + d₁ (I ⊗ I ⊗ L̄) ū_e + d₂ (I ⊗ L̄ ⊗ I) ū_e + d₃ (L̄ ⊗ I ⊗ I) ū_e ,   (3.7)

where

ū_e = (M ⊗ M ⊗ M) u_e   (3.8)
L̄ := L M⁻¹ .   (3.9)

With the diagonal mass matrix, application of the above formulation requires 6 n_p⁴ n_e + 5 n_p³ n_e operations; a loop-based implementation is shown in Algorithm 3.4.
With the operator in place, a preconditioner is still missing. For the sake of simplicity, the local preconditioning strategies from Chapter 2 are considered, one being diagonal and the other relying on the fast diagonalization. In three dimensions, the application of the associated operator (2.55) reads

v_e = (S ⊗ S ⊗ S) D_e (Sᵀ ⊗ Sᵀ ⊗ Sᵀ) u_e   (3.10)

and consists of the consecutive application of an interpolation – or, more precisely, a transformation to the generalized eigenspace of the element – the application of the eigenvalues, and the backward transformation. Hence, the algorithms for the interpolation operator are expanded by the application of a diagonal matrix and the backward interpolation, leading to 12 n_p⁴ n_e + n_p³ n_e operations per application with tensor products and 4 n_p⁶ n_e + n_p³ n_e for a matrix-matrix variant.
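The application (3.10) can be sketched with the same sweep machinery used for the interpolation operator. In the Python stand-in below, S and D are random placeholders rather than actual generalized eigenvectors and eigenvalues; the point is only the structure: forward transform via Sᵀ per direction, diagonal scaling, backward transform via S.

```python
import itertools
import random

n_p = 3
random.seed(4)
S = [[random.uniform(-1.0, 1.0) for _ in range(n_p)] for _ in range(n_p)]
St = [[S[l][i] for l in range(n_p)] for i in range(n_p)]  # S transposed
D = {idx: random.uniform(0.5, 2.0)                        # diagonal D_e, placeholder
     for idx in itertools.product(range(n_p), repeat=3)}
u = {idx: random.uniform(-1.0, 1.0)
     for idx in itertools.product(range(n_p), repeat=3)}

def sweep(w, B, dim):
    """Apply matrix B along one dimension of the cube w."""
    return {idx: sum(B[idx[dim]][l]
                     * w[tuple(l if d == dim else idx[d] for d in range(3))]
                     for l in range(n_p))
            for idx in w}

def tensor3(w, B):
    """(B (x) B (x) B) w via three one-dimensional sweeps."""
    for dim in range(3):
        w = sweep(w, B, dim)
    return w

w = tensor3(u, St)                        # forward transform into the eigenspace
w = {idx: D[idx] * w[idx] for idx in w}   # apply the (inverse) eigenvalues
v = tensor3(w, S)                         # backward transform
```

The two tensor3 calls account for the 12 n_p⁴ term in the operation count; the diagonal scaling contributes the n_p³ term.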
36 3 Performance optimization for tensor-product operators
Algorithm 3.4: Loop-based variant of the Helmholtz operator (3.7) using one temporary storage, ū ∈ R^{n_p^3}.
1: for Ω_e do
2:   for 1 ≤ i, j, k ≤ n_p do
3:     ū_ijk ← M_ii M_jj M_kk u_ijk,e
4:     v_ijk,e ← d_0 ū_ijk
5:   end for
6:   for 1 ≤ i, j, k ≤ n_p do
7:     v_ijk,e ← v_ijk,e + d_1 Σ_{l=1}^{n_p} L̄_il ū_ljk
8:   end for
9:   for 1 ≤ i, j, k ≤ n_p do
10:    v_ijk,e ← v_ijk,e + d_2 Σ_{l=1}^{n_p} L̄_jl ū_ilk
11:  end for
12:  for 1 ≤ i, j, k ≤ n_p do
13:    v_ijk,e ← v_ijk,e + d_3 Σ_{l=1}^{n_p} L̄_kl ū_ijl
14:  end for
15: end for
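The loops of Algorithm 3.4 can be sketched in a few lines of NumPy. The block below is an illustrative stand-in with random matrices (not actual spectral-element data) that checks the factorized form (3.7) against the assembled operator (3.6); note that NumPy's row-major flattening reverses the Kronecker factor order relative to the thesis's column-major notation.

```python
import numpy as np

def helmholtz_apply(d, M, Lbar, u):
    """Apply the factorized element Helmholtz operator (3.7).

    d    -- coefficients d_0..d_3
    M    -- diagonal of the 1D mass matrix (length n_p)
    Lbar -- Lbar = L M^{-1}, dense n_p x n_p
    u    -- element values, shape (n_p, n_p, n_p)
    Cost: three 1D contractions (O(n_p^4)) plus pointwise work (O(n_p^3))."""
    ubar = (M[:, None, None] * M[None, :, None] * M[None, None, :]) * u
    v = d[0] * ubar
    v = v + d[1] * np.einsum('il,ljk->ijk', Lbar, ubar)   # line 7 of Alg. 3.4
    v = v + d[2] * np.einsum('jl,ilk->ijk', Lbar, ubar)   # line 10
    v = v + d[3] * np.einsum('kl,ijl->ijk', Lbar, ubar)   # line 13
    return v

rng = np.random.default_rng(1)
n = 5
M = rng.random(n) + 1.0                       # diagonal GLL mass matrix stand-in
L = rng.standard_normal((n, n)); L = L + L.T  # symmetric stiffness stand-in
u = rng.standard_normal((n, n, n))
d = np.array([0.3, 1.0, 1.0, 1.0])

# reference: assembled operator (3.6); the C-order kron reverses the
# factor order of the thesis's column-major notation
Md = np.diag(M)
He = (d[0] * np.kron(Md, np.kron(Md, Md)) + d[1] * np.kron(L, np.kron(Md, Md))
      + d[2] * np.kron(Md, np.kron(L, Md)) + d[3] * np.kron(Md, np.kron(Md, L)))
v = helmholtz_apply(d, M, L / M[None, :], u)  # Lbar = L M^{-1} for diagonal M
assert np.allclose(v, (He @ u.ravel()).reshape(n, n, n))
```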
3.4.2 Operator runtimes
The tests performed on the interpolation operator were repeated for the Helmholtz and fast
diagonalization operators, with Figure 3.6 depicting the achieved performance for the Helmholtz
operator. As for the interpolation operator, the matrix-matrix variant is faster
for n_p = 2. However, starting from n_p = 4, the optimized loop variants outperform it.
Of these, the parametrized version TPL-Par extracts up to a quarter of the possible performance,
with outlining reliably increasing the performance. Further hand-tuning increases
the performance to 17.5 GFLOP/s, with the deterioration compared to the interpolation
stemming from using one diagonal operation first. While slow at first, TPG-Bl can match
hand-optimized routines for odd polynomial degrees and harnesses a quarter of the floating
point performance. However, its overall performance is lower than for the interpolation
operator. As the Helmholtz operator is additive, the result vector needs to be permuted in
addition to the input vector, requiring further temporary storage and bandwidth. The added
storage only allows for cache-blocking to the L2 cache until n_p = 8, and a performance degradation
is visible in the plot. However, the variant still attains a better performance than
the parametrized version. When comparing the maximal performance of the operators to
the one attained for interpolation, the peak is lower. This is due to the added application
of the mass matrix, which requires just one multiplication per point and is, hence, memory-bound.
Figure 3.7 depicts the attained performance for the fast diagonalization operator. In accordance
with the two other cases, parametrization generates a speedup of two to three
over TPL. Further hand-optimization increases the speedup to seven for operator sizes that
are a multiple of four and allows harvesting half of the peak performance. The more general
variant TPG-Bl, however, offers a fourth of the possible performance without manual optimization.

Figure 3.6: Performance of the implementation variants MMG, TPL, TPL-Par, TPG-Bl, TPL-Hand, and TPL-ParO for the Helmholtz operator (3.7) when using n_e = 4096 elements. Left: performance in GFLOP/s over n_p, right: throughput in MLUP/s over n_p.

Figure 3.7: Performance of the implementation variants MMG, TPL, TPL-Par, TPG-Bl, and TPL-Hand for the fast diagonalization operator (2.52) when using n_e = 4096 elements. Left: performance in GFLOP/s over n_p, right: throughput in MLUP/s over n_p.
The roofline analysis for the Helmholtz and fast diagonalization operators is shown in Figure 3.8.
For all relevant polynomial degrees, the operations are not memory-bound. But
only the matrix-matrix version attains peak performance, whereas the hand-optimized version
only extracts 50 %. As evident from the generated assembly code, the latter solely
utilizes vector instructions and the fused-multiply instruction. However, not all of these
serve the computation: loading the operands and matrices into the respective registers also
occupies instructions that are executed and, hence, cost runtime. With blocking to increase
register reusage, further performance gains are to be expected. These will, however, be
limited to a factor of two at most, not the factor of ten presented here.

Figure 3.8: Roofline models with “vec” indicating vectorization and “FMA” the fused multiply-add capability, showing performance in GFLOP/s over the computational intensity in FLOP/B, with rooflines for the peak with vec+FMA, vec without FMA, FMA without vec, neither, and the L3 bandwidth. Left: Helmholtz operator, right: fast diagonalization operator.

Table 3.1: Number of iterations to reduce the initial residual by ten orders, runtime per degree of freedom and iteration, and achieved speedup when using optimized operators for the solvers dfCG and bfCG.

                            Time per iteration and DOF [ns]
        Iterations          TPL              TPL-Hand           Speedup
 p      dfCG   bfCG         dfCG   bfCG      dfCG   bfCG        dfCG   bfCG
 3       117     92         58.8  117.8      25.2   28.8         2.3    4.1
 7       297    178         41.0   80.9      15.9   22.2         2.6    3.6
11       495    257         49.8   94.9      22.5   31.6         2.2    3.0
15       688    337         49.6   94.9      25.6   36.3         1.9    2.6
3.4.3 Performance gains for solvers
To evaluate the performance gains for solvers, the tests from Section 2.4 were repeated, with
the solvers dfCG and bfCG each implemented twice, once with the operator variant TPL
and once with TPL-Hand, leading to four solvers to compare. Here, n_e = 8^3 = 512 spectral
elements discretized the domain Ω = (0, 2π)^3, with the polynomial degree being varied between
p = 2 and p = 16. The solvers were run 11 times, reducing the initial residual by ten
orders of magnitude, with the last ten runs generating an average runtime; the required
number of iterations was measured as well.
Figure 3.9: Solver runtimes for diagonal and block-Jacobi preconditioning when utilizing the operator variants TPL and TPL-Hand. Left: required number of iterations n_10 to reduce the residual by ten orders of magnitude over the polynomial degree p, right: runtime per degree of freedom in µs over p.
Figure 3.9 depicts the number of iterations to reduce the residual by ten orders, n10, as
well as the runtime per degree of freedom for the two solvers and operator implementation
variants. For a polynomial degree of p = 2, the interior element has no separate degree of
freedom. Hence, diagonal and block preconditioning result in the same operation and, in
turn, the same number of iterations. For low polynomial degrees, the difference in iterations
is small, but continuously grows to a factor of two at p = 16. However, for TPL, this lower
number of iterations does not lead to a lower runtime. At p = 2, the runtime of bfCG is twice
that of dfCG, with the gap slowly shrinking and equal runtimes being present at p = 11.
Afterwards, the difference in runtime is minute. Introducing the optimized operators reduces
the runtime by a factor ranging between two and four, where four occurs for easily
optimizable polynomial degrees and two for the others. Furthermore, block preconditioning leads to
similar runtimes as diagonal preconditioning starting at p = 2, and retains a lower runtime
for p > 2. The optimization results in higher operator performance for operator sizes that
are a multiple of four, which directly translates to the runtimes of the solvers; for instance,
solving for p = 7 leads to a runtime per degree of freedom lower by a factor of two than
for p ∈ {4, . . . , 6}. This performance benefit makes the polynomial degrees 3, 7, 11, and 15
preferable in simulations: the throughput increases at the same number of degrees of freedom
with no incurred penalty.
Table 3.1 lists the test results for operator sizes that are a multiple of four. For both solvers,
the runtime per iteration depends on the preconditioner choice: for TPL, using the block
preconditioner increases the runtime two-fold. Furthermore, the runtime per iteration stagnates,
with the same time being used at p = 11 and p = 15. As mentioned beforehand, this
constitutes evidence that the implementation is memory-bound. For diagonal preconditioning,
TPL-Hand lowers the runtime by a factor of two, with a larger speedup present at low
polynomial degrees. The decline in speedup stems from the time per iteration increasing,
which is due to the constant performance of the operator. Lastly, the speedup when using
block preconditioning ranges from 2.5 to 4, allowing to significantly lower the runtime of
simulations.
3.5 Conclusions
This chapter evaluated the performance of tensor-product spectral-element operators, using
the interpolation, Helmholtz and fast diagonalization operators as examples. For simple
loop-based variants, the findings from [17] hold: A matrix-matrix multiplication based on the
assembled three-dimensional element operator can outperform sum factorization for small
polynomial degrees, with the factorization becoming viable after p = 6. However, a roofline
analysis showed that the operator does not harness the full potential of the hardware and
can be improved significantly.
Optimizations were performed for the loop-based variants. These included using constant
bounds, unroll and jam, blocking, and vectorization, and resulted in improved variants achieving
approximately 50 % of the peak performance. While the assembled matrix-matrix multiplication
remains faster for a polynomial degree of p = 2, this case is even more efficiently
implemented with sparse matrix-vector multiplications using the global degrees of freedom
instead of the element-local ones. The optimization approach was thereafter applied to the fast
diagonalization and Helmholtz operators, with 40 % and 50 % of peak performance being
attained, respectively.
After applying the optimization techniques to the main components of a Helmholtz solver,
the performance gains resulting for the solver itself were investigated. The augmented operators
reduced the overall computation time by a factor of 2 for diagonal preconditioning and
by a factor of up to 4 for tensor-product preconditioning. As the majority of the runtime
for incompressible fluid flow is spent in Helmholtz solvers, using these optimization techniques
leads to a significant reduction in the turnover time of computations. Here, only the case
of a Helmholtz solver was investigated; a study of the impact of the techniques on a
fully-fledged PDE solver can be found in [64].
While the performance gains lead to a significant reduction in the computation time, they
come at a cost: using one variant per polynomial degree leads to code replication, the
optimizations reduce the readability, and the combination of both degrades the maintainability.
To regain readability and maintainability while retaining the speed, automated knowledge-based
systems that generate the operators are required. Hence, further work will focus
on automating the generation of tensor-product operators using a domain-specific language.
First results are available in [127, 112].
Chapter 4
Fast Static Condensation – Achieving
a linear operation count
4.1 Introduction
The last chapter investigated the performance attainable with Helmholtz solvers based
on tensor-product operations. Significant performance gains over the baseline version were
attained. However, the number of operations increases with O(p · n_DOF), rendering the solution
expensive and even infeasible for large polynomial degrees. A solver scaling with O(n_DOF)
could easily reduce the runtime by one order of magnitude. When taking the memory gap
into account, the number of loads and stores is required to decrease as well, as otherwise the
algorithm would be memory-bound.
For the spectral-element method, static condensation allows eschewing the interior degrees
of freedom and provides a standard method to decrease both the number of unknowns and
the condition number of the system matrix. Static condensation is widely employed; for
instance, the first work on SEM capitalizes on it [105], as do more recent ones [21, 138].
In the references above, static condensation significantly increased the performance.
However, the number of operations still scales with O(p^4), as the main operator is not
matrix-free. To remain efficient at high polynomial degrees, linear complexity is required
throughout the entire solver, from the operator execution to the preconditioner to the remaining
operations inside an iteration. Moreover, the resulting element matrix differs between the
elements, necessitating O(p^4 n_e) loads and stores per operator evaluation. With the increasing
gap between memory bandwidth and compute performance, only a matrix-free evaluation
technique secures the future performance of a method.
This chapter develops a linearly scaling, matrix-free variant of the statically condensed Helmholtz
operator, lowering the runtime of Helmholtz solvers tenfold and, furthermore,
addressing the growing memory gap. The work is inspired by the three-dimensional version
of the matrix-matrix-based operator implementation from [52], which was later published
in [51]. The first factorization thereof was published in [57], with solvers being presented
in [60]. While these variants resulted in linear execution times of the iterations, they outperformed
unfactorized versions implemented via dense matrix-matrix multiplications only
for polynomial degrees p > 10. Current simulations, however, tend to use lower polynomial
degrees [8, 138, 87], so that a gain is often not achieved. Further factorizations, presented
in [62], allowed outpacing matrix-matrix variants down to a polynomial degree of p = 2.
The chapter is structured as follows: First, Section 4.2 introduces the main concept and
equations of static condensation for the general and specifically the three-dimensional
case. Then, the resulting operator is factorized into a linearly scaling version that is capable
of outpacing matrix-matrix multiplications in Section 4.3. Lastly, Section 4.4 proposes solvers
founded on these operators and investigates their viability in terms of the resulting condition
number of the system matrix as well as runtime.
4.2 Static condensation
4.2.1 Principal idea of static condensation
Typically, the solution process for a linear equation system involves every degree of freedom.
However, for elliptic equations, the values in the interior solely depend upon the boundary values
of the domain and the right-hand side. Hence, one can choose suitable subdomains and
algebraically eliminate their interior degrees of freedom from the equation system. This
technique is called static condensation, or Schur complement, and leads to a better condition number,
fewer algebraic unknowns, and, hence, a faster solution process.
Let Hu = F denote the Helmholtz problem, with u as solution variable and F as discrete
right-hand side. The values are divided into interior degrees of freedom, uI, and boundary
degrees of freedom, uB, as shown in Figure 4.1. Similarly, the matrix H is decomposed into
interaction between boundary and inner part,

( H_BB  H_BI ) ( u_B )   ( F_B )
( H_IB  H_II ) ( u_I ) = ( F_I ) ,    (4.1)

with the symmetry of the operator allowing for

H_IB = H_BI^T    (4.2)

⇒  ( H_BB    H_BI ) ( u_B )   ( F_B )
   ( H_BI^T  H_II ) ( u_I ) = ( F_I ) .    (4.3)
Figure 4.1: Division of degrees of freedom into inner degrees of freedom (subscript I) and boundary degrees (subscript B) for a two-dimensional element. Left: full tensor-product element of degree p = 5. Middle: only inner degrees of freedom. Right: only boundary degrees of freedom.
As the inner element operator H_II is invertible,

u_I = H_II^{-1} (F_I − H_IB u_B) ,    (4.4)

leading to

H_BB u_B = F_B − H_BI u_I
⇒ H_BB u_B = F_B − H_BI H_II^{-1} (F_I − H_IB u_B)    (4.5)
⇒ (H_BB − H_BI H_II^{-1} H_IB) u_B = F_B − H_BI H_II^{-1} F_I ,    (4.6)

with the condensed operator H̄ := H_BB − H_BI H_II^{-1} H_IB and the condensed right-hand side F̄ := F_B − H_BI H_II^{-1} F_I.
The above is a reduced equation system. It only incorporates the values on the domain
boundary, not the inner ones, and, hence, works on fewer degrees of freedom. However,
the resulting operator, H̄, is more complex. It consists of two parts: the primary part,
H_Prim = H_BB, is the restriction of the Helmholtz operator to the boundary nodes, whereas
the condensed part, H_Cond = H_BI H_II^{-1} H_IB, incorporates the interaction of the boundary
nodes with the inner element, and vice versa.
The method can be utilized in multiple ways. While it is traditionally used on a per-element
basis, e.g. [135], lowering the overall complexity of the algorithm, it can also serve
as a solver for the whole grid, as in [83], be employed for whole subdomains of the problem,
e.g. [117, 45], serve as the basis for p-multigrid techniques [52], or act as a preconditioner for
a DG scheme [50].
4.2.2 Static condensation in three dimensions
In this section, the static condensation method is used on a three-dimensional tensor-product
element utilizing Gauß-Lobatto-Legendre points, leading to the residual evaluation
algorithm utilized in the three-dimensional version of the multigrid solver from [52]. As
investigating one element suffices, the element index e is dropped in favor of readability.
Algorithm 4.1: Solution algorithm with static condensation.
1: for Ω_e do                      ▷ condense right-hand side
2:   F̄_e ← F_B,e − H_BI,e H_II,e^{-1} F_I,e
3:   ū_e ← u_B,e
4: end for
5:
6: ū ← Solution(H̄ ū = F̄)          ▷ solve condensed system
7:
8: for Ω_e do                      ▷ regain inner nodes
9:   u_B,e ← ū_e
10:  u_I,e ← H_II,e^{-1} (F_I,e − H_IB,e u_B,e)
11: end for
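The condensation steps of Algorithm 4.1 can be illustrated on a single random symmetric positive definite system. The NumPy sketch below (a stand-in, with explicit inverses that a production code would avoid) verifies that the condensed solve (4.6) plus the recovery step (4.4) reproduces the monolithic solution:

```python
import numpy as np

rng = np.random.default_rng(2)
nB, nI = 6, 4                               # boundary and interior unknowns
A = rng.standard_normal((nB + nI, nB + nI))
H = A @ A.T + (nB + nI) * np.eye(nB + nI)   # SPD stand-in for the operator
F = rng.standard_normal(nB + nI)

HBB, HBI = H[:nB, :nB], H[:nB, nB:]
HIB, HII = H[nB:, :nB], H[nB:, nB:]
FB, FI = F[:nB], F[nB:]

# condensed system (4.6): (HBB - HBI HII^-1 HIB) uB = FB - HBI HII^-1 FI
HII_inv = np.linalg.inv(HII)                # explicit inverse for illustration
Hbar = HBB - HBI @ HII_inv @ HIB
Fbar = FB - HBI @ HII_inv @ FI
uB = np.linalg.solve(Hbar, Fbar)
uI = HII_inv @ (FI - HIB @ uB)              # recovery of the inner values (4.4)

# the condensed solve reproduces the monolithic solution
u = np.linalg.solve(H, F)
assert np.allclose(np.concatenate([uB, uI]), u)
```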
Figure 4.2: Explosion view of a three-dimensional spectral element using a tensor-product basis with Gauß-Lobatto-Legendre points. Left: three-dimensional tensor-product element using Gauß-Lobatto-Legendre points of polynomial degree p = 3. Right: explosion view of the element with compass notation for the faces (for clarity of presentation, only the element faces are shown).
Due to the tensor-product basis using GLL points, the element boundary is directly separable.
It can be decomposed into three non-overlapping entities: element vertices, edges, and faces.
There are eight vertices in every element, leading to eight data points. Similarly, there are
twelve edges with n_I = p − 1 points per edge and, hence, 12 n_I data points being associated
with the edges. Lastly, there are n_I^2 points per face and thus 6 n_I^2 facial degrees of freedom.
The statically condensed Helmholtz operator consists of a primary and a condensed part. The
primary part is the restriction of the Helmholtz operator to the boundary nodes, i.e. faces,
edges, and vertices. To obtain a specific suboperator, one can multiply the respective
restriction operators onto the Helmholtz operator. For instance, for the faces east and
west, when using the compass notation from Figure 4.2, the respective degrees of freedom u_Fe
and u_Fw correspond to u_ijk with i = p, j ∈ I, k ∈ I and u_ijk with i = 0, j ∈ I, k ∈ I,
respectively, with the index set I = 1 . . . n_I. Hence, the restriction operators for the faces are

R_Fe = (0 I 0) ⊗ (0 I 0) ⊗ (0 . . . 0 1)    (4.7)
R_Fw = (0 I 0) ⊗ (0 I 0) ⊗ (1 0 . . . 0) ,    (4.8)

where I is the identity matrix of appropriate size and 0 denotes row and column vectors
containing only zeroes. For example, the restriction operator for p = 4 reads

       ( 0 1 0 0 0 )   ( 0 1 0 0 0 )
R_Fw = ( 0 0 1 0 0 ) ⊗ ( 0 0 1 0 0 ) ⊗ ( 1 0 0 0 0 ) .    (4.9)
       ( 0 0 0 1 0 )   ( 0 0 0 1 0 )
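The restriction operators (4.7)–(4.9) can be built directly with Kronecker products. The NumPy sketch below (illustrative only) constructs R_Fw for p = 4 and checks that it selects the inner face values; note that NumPy's row-major flattening makes the last Kronecker factor act on the last array index, the reverse of the thesis's column-major convention.

```python
import numpy as np

def inner(n):
    """(0 I 0): restriction to the n - 2 interior 1D nodes."""
    E = np.zeros((n - 2, n))
    E[:, 1:-1] = np.eye(n - 2)
    return E

n = 5                                    # p = 4, as in example (4.9)
e0 = np.zeros((1, n)); e0[0, 0] = 1.0    # (1 0 ... 0): pick boundary node 0

E = inner(n)
RFw = np.kron(E, np.kron(E, e0))         # face restriction as in (4.8)
assert RFw.shape == ((n - 2) ** 2, n ** 3)

u = np.arange(n ** 3, dtype=float)
U = u.reshape(n, n, n)                   # C ordering: the last kron factor
                                         # acts on the last array index
assert np.allclose(RFw @ u, U[1:-1, 1:-1, 0].ravel())
```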
The restrictions to the east and west face, respectively, lead to the operator

H_FwFe = R_Fw H R_Fe^T ,    (4.10)

which equates to

H_FwFe = d_0 M_II ⊗ M_II ⊗ M_0p + d_1 M_II ⊗ M_II ⊗ L_0p
       + d_2 M_II ⊗ L_II ⊗ M_0p + d_3 L_II ⊗ M_II ⊗ M_0p ,    (4.11)

where I is utilized as a short-hand for the inner part of the corresponding matrix or vector,
e.g.

M_Ip = (M_1p . . . M_{n_I p})^T .    (4.12)

The diagonal mass matrix simplifies the above to

H_FwFe = d_1 M_II ⊗ M_II ⊗ L_0p .    (4.13)
Each face-to-face, face-to-edge, edge-to-face, edge-to-edge, edge-to-vertex, vertex-to-edge, and
vertex-to-vertex operator can be derived in a similar fashion. Of these, the face-to-face
operators have the highest operational complexity and need to be investigated. Table 4.1
lists their tensor-product forms. As they utilize at most two-dimensional tensor
products working on n_I^2 data points per face, the primary part H_Prim can be implemented
in O(n_I^3) operations.
Where the primary part inherits the tensor-product structure of the full operator, the condensed
part H_Cond is more convoluted. It incorporates not only the interaction between the
boundary nodes and the inner element, but the inverse Helmholtz operator in the inner
element, H_II^{-1}, as well. The inner element is associated with the basis functions φ_ijk
with i, j, k ∈ I, and the operator can be retrieved via restriction. The boundary, however,
is more complicated. For dense matrices M and L, the tensor-product structure of
the Helmholtz operator (2.44) couples every data point with every other data point in
the element. But the approximated GLL mass matrix M is diagonal. Hence, the mass
term d_0 M ⊗ M ⊗ M only maps one point onto itself, whereas the other tensor products,
e.g. L ⊗ M ⊗ M, only work along mesh lines in the element, in this example in the ξ_3 direction.
The diagonal mass matrix leads to vertices only mapping to vertices and edges, edges
only mapping to vertices, edges, and faces, and faces only mapping to edges, faces, and the
inner element. Hence, the operators H_BI and H_IB consist of the interaction between the
interior of the element and the element faces.

The inner element Helmholtz operator H_II can be written in tensor-product form. However,
the inverse of a tensor-product operator is not necessarily a tensor-product as well.
In general, it takes O(n_I^6) operations to apply the inverse, and even the fast diagonalization
technique from Subsection 2.3.4 only reduces this to O(n_I^4). Compared to a matrix-matrix
implementation of the condensed part, no complexity gain is present when expressing the
inverse with fast diagonalization. Hence, implementations typically realize the face-to-face
interaction with matrix multiplication, as e.g. done in [52, 51, 21]. The suboperators from
face to face are precomputed, stored, and reutilized in every application of the operator. The
resulting algorithm for the evaluation of the condensed part is shown in Algorithm 4.2. It
utilizes 36 matrix multiplications to map from each face to each face, leading to 36 n_I^4
required multiplications. When storing the face values in a single array, one sole call to a
well-optimized matrix-matrix multiplication, for instance DGEMM from BLAS, suffices as
implementation.

Algorithm 4.2: Evaluation of the condensed part using matrix-matrix products with I = {e, w, n, s, t, b}.
1: for i ∈ I do
2:   v_Fi ← Σ_{j∈I} H_Cond_FiFj u_Fj
3: end for
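The remark on DGEMM can be illustrated as follows: stacking the six face-value vectors turns the 36 block products of Algorithm 4.2 into one large matrix product, i.e. a single optimized GEMM/GEMV call. The blocks below are random stand-ins for the precomputed condensed suboperators.

```python
import numpy as np

rng = np.random.default_rng(5)
n_i = 4
nf = n_i ** 2                              # n_I^2 points per face
# random stand-ins for the precomputed condensed blocks H_Cond_FiFj
blocks = rng.standard_normal((6, 6, nf, nf))
u_faces = rng.standard_normal((6, nf))

# Algorithm 4.2 as written: 36 separate block products
v_loop = np.zeros((6, nf))
for i in range(6):
    for j in range(6):
        v_loop[i] += blocks[i, j] @ u_faces[j]

# stacking the face values turns the loop into one large product,
# i.e. a single optimized GEMM/GEMV call on a 6 nf x 6 nf matrix
Hcond = blocks.transpose(0, 2, 1, 3).reshape(6 * nf, 6 * nf)
v_stacked = (Hcond @ u_faces.ravel()).reshape(6, nf)
assert np.allclose(v_loop, v_stacked)
```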
4.3 Factorization of the statically condensed Helmholtz operator
The operator evaluation technique from Algorithm 4.2 possesses two main drawbacks: First,
it scales super-linearly with the number of degrees of freedom, i.e. with O(n_p^4). Hence,
static condensation only leads to a better conditioned, reduced equation system [21]. For
a system consisting of O(n_I^2) values instead of O(n_I^3), fewer operations are to be expected.
Second, the matrices are precomputed and need to be stored. For non-homogeneous meshes,
the element widths and, hence, the coefficients d_i,e vary, requiring one set of matrices per
element. At a polynomial degree of p = 15, these occupy 2.1 GB of memory at double
precision – per element. While the operator complexity is an expensive inconvenience, the
memory requirement makes computations for inhomogeneous meshes next to impossible.
This section will first lift the latter restriction by introducing a matrix-free, tensor-product
formulation of the operator, then the former by factorizing it to linear complexity, and
lastly demonstrate the attained linearity via runtime tests.

Table 4.1: Suboperators of the primary part, face-to-face interaction, when assuming Gauß-Lobatto-Legendre points. Only non-zero entries are listed. Due to the symmetry of the GLL collocation nodes, the identities M_00 = M_pp, L_0p = L_p0, and L_00 = L_pp were applicable.

i  j   H_FiFj
w  w   d_0 M_II ⊗ M_II ⊗ M_00 + d_1 M_II ⊗ M_II ⊗ L_00 + d_2 M_II ⊗ L_II ⊗ M_00 + d_3 L_II ⊗ M_II ⊗ M_00
e  w   d_1 M_II ⊗ M_II ⊗ L_p0
e  e   H_FwFw
w  e   H_FeFw^T = H_FeFw
s  s   d_0 M_II ⊗ M_00 ⊗ M_II + d_1 M_II ⊗ M_00 ⊗ L_II + d_2 M_II ⊗ L_00 ⊗ M_II + d_3 L_II ⊗ M_00 ⊗ M_II
n  s   d_2 M_II ⊗ L_p0 ⊗ M_II
n  n   H_FsFs
s  n   H_FnFs^T = H_FnFs
b  b   d_0 M_00 ⊗ M_II ⊗ M_II + d_1 M_00 ⊗ M_II ⊗ L_II + d_2 M_00 ⊗ L_II ⊗ M_II + d_3 L_00 ⊗ M_II ⊗ M_II
t  b   d_3 L_p0 ⊗ M_II ⊗ M_II
t  t   H_FbFb
b  t   H_FtFb^T = H_FtFb
4.3.1 Tensor-product decomposition of the operator
The condensed part H_Cond consists of three suboperators: first, the boundary-to-inner
part H_IB; second, the inverse of the inner element Helmholtz operator, H_II^{-1}; third, the
inner-to-boundary interaction H_BI. For the operator to be cast into a tensor-product form,
and hence allow for a better complexity, each sub-part requires a tensor-product notation.
The operator from the boundary to the inner element, H_IB, is constructed by restricting
the Helmholtz operator to the boundary and the inner element, respectively. The inner
element is associated with the basis functions φ_ijk with i, j, k ∈ I. Hence, the restriction
operator for it is

R_I = (0 I 0) ⊗ (0 I 0) ⊗ (0 I 0) .    (4.14)

As explained above, only the faces map to the interior of the element. The operators for all
six faces are similar. Without loss of generality, the east face is utilized for the derivation of the
suboperator. The other five variants can be derived in the same fashion.
Multiplying the restriction operators (4.14) and (4.7) onto the Helmholtz operator leads
to

H_IFe = R_I H R_Fe^T    (4.15)

⇒ H_IFe = [(0 I 0) ⊗ (0 I 0) ⊗ (0 I 0)] H [(0 I 0)^T ⊗ (0 I 0)^T ⊗ (0 . . . 0 1)^T] .    (4.16)

When using the tensor-product representation of the Helmholtz operator (2.44), the above
equates to

H_IFe = d_0 M_II ⊗ M_II ⊗ M_Ip + d_1 M_II ⊗ M_II ⊗ L_Ip
      + d_2 M_II ⊗ L_II ⊗ M_Ip + d_3 L_II ⊗ M_II ⊗ M_Ip ,    (4.17)

where I is utilized as a short-hand for the inner part of the corresponding matrix or vector,
e.g.

M_Ip = (M_1p . . . M_{n_I p})^T .    (4.18)

As the mass matrix associated with the GLL points is diagonal, M_Ip = 0 and the operator
further simplifies to

H_IFe = d_1 M_II ⊗ M_II ⊗ L_Ip .    (4.19)
The above can be applied to a vector u_Fe in n_I^3 + O(n_I^2) multiplications when using the
following evaluation order:

H_IFe u_Fe = d_1 (M_II ⊗ M_II ⊗ L_Ip) u_Fe = d_1 (I ⊗ I ⊗ L_Ip)(M_II ⊗ M_II) u_Fe .    (4.20)

Due to the diagonal mass matrix, the right tensor product is a diagonal matrix applied
to the face using n_I^2 operations. The second operator, I ⊗ I ⊗ L_Ip, expands from the face into
the element, i.e. from n_I^2 to n_I^3 points, and as it uses one multiplication per generated data
point, n_I^3 multiplications are required.
The Helmholtz operator is symmetric. Hence, the operator from the inner element back
to the face, H_FeI, is the transpose of the above, i.e.

H_FeI = d_1 M_II ⊗ M_II ⊗ L_pI ,    (4.21)

and can be evaluated in O(n_I^3) multiplications as well when reversing the evaluation order:

H_FeI u_I = d_1 (M_II ⊗ M_II ⊗ L_pI) u_I = d_1 (M_II ⊗ M_II)(I ⊗ I ⊗ L_pI) u_I .    (4.22)

First, the dimension reduction on n_I^3 values requires n_I^3 multiplications, then the diagonal
tensor product utilizes n_I^2 multiplications, so that with this reverse evaluation order – first
restrict to the face, then apply the diagonal – the operator can be evaluated in n_I^3 + O(n_I^2)
multiplications. The other operators can be derived in a similar fashion and are listed in Table 4.2.

Table 4.2: Operators from the element faces to the inner element and vice versa, with the face index i corresponding to the compass notation shown in Figure 4.2.

i   H_FiI                          H_IFi
w   d_1 M_II ⊗ M_II ⊗ L_0I        d_1 M_II ⊗ M_II ⊗ L_I0
e   d_1 M_II ⊗ M_II ⊗ L_pI        d_1 M_II ⊗ M_II ⊗ L_Ip
s   d_2 M_II ⊗ L_0I ⊗ M_II        d_2 M_II ⊗ L_I0 ⊗ M_II
n   d_2 M_II ⊗ L_pI ⊗ M_II        d_2 M_II ⊗ L_Ip ⊗ M_II
b   d_3 L_0I ⊗ M_II ⊗ M_II        d_3 L_I0 ⊗ M_II ⊗ M_II
t   d_3 L_pI ⊗ M_II ⊗ M_II        d_3 L_Ip ⊗ M_II ⊗ M_II
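The two-step evaluation of a face-to-interior operator can be sketched in NumPy. The matrices below are random stand-ins for M_II and L_Ip, and the C-order Kronecker convention is used (last factor acting on the last index), so only the structure, not the thesis's index ordering, is reproduced:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                      # n_I interior points per direction
M = rng.random(n) + 1.0                    # diagonal of M_II (stand-in)
L_Ip = rng.standard_normal(n)              # column L_Ip coupling face to interior
d1 = 0.7
u_face = rng.standard_normal((n, n))       # n_I^2 values on the east face

# factorized evaluation as in (4.20): diagonal scaling on the face
# (n_I^2 multiplies), then expansion into the element (n_I^3 multiplies)
scaled = d1 * (M[:, None] * M[None, :]) * u_face
v = np.einsum('c,ab->abc', L_Ip, scaled)

# reference: assembled H_IFe = d1 M_II (x) M_II (x) L_Ip
H_IFe = d1 * np.kron(np.diag(M), np.kron(np.diag(M), L_Ip[:, None]))
assert np.allclose(v.ravel(), H_IFe @ u_face.ravel())
```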
To gain a tensor-product formulation for the inverse of the inner element Helmholtz operator,
a tensor-product formulation of the operator itself is required. The inner element
operator is obtained by restricting the Helmholtz operator to the inner element, i.e. applying
the restriction operator R_I from (4.14) from the left and right:

H_II = R_I H R_I^T .    (4.23)

Substituting the tensor-product formulation of the operator (2.44) into the above leads to

H_II = d_0 M_II ⊗ M_II ⊗ M_II + d_1 M_II ⊗ M_II ⊗ L_II
     + d_2 M_II ⊗ L_II ⊗ M_II + d_3 L_II ⊗ M_II ⊗ M_II .    (4.24)

Using the fast diagonalization from Subsection 2.3.4, the inner element Helmholtz operator
can be expressed using (2.52) as

H_II = [(M_II S_II) ⊗ (M_II S_II) ⊗ (M_II S_II)] D_II [(S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T M_II)] ,    (4.25)

where S_II is the transformation matrix for the inner element and D_II the diagonal matrix
containing the eigenvalues of the inner element Helmholtz operator computed by (2.54):

D_II = d_0 I ⊗ I ⊗ I + d_1 I ⊗ I ⊗ Λ + d_2 I ⊗ Λ ⊗ I + d_3 Λ ⊗ I ⊗ I .    (4.26)
As the homogeneous Dirichlet problem is solvable, the operator is invertible, leading to
the inverse (2.55)

H_II^{-1} = (S_II ⊗ S_II ⊗ S_II) D_II^{-1} (S_II^T ⊗ S_II^T ⊗ S_II^T) .    (4.27)

While (4.25) is a tensor-product operator, (4.27) is not. The inverse of the diagonal is not a
tensor-product anymore, but can still be applied in n_I^3 multiplications. However, the three-dimensional
tensor-product S_II ⊗ S_II ⊗ S_II requires 3 n_I^4 multiplications, as does its transpose.
While no order reduction is present, the explicit formulation for the inverse allows a
factorization of the operator.
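The fast diagonalization underlying (4.24)–(4.27) can be checked numerically: with a diagonal mass matrix, the generalized eigenvectors S with S^T M S = I follow from a symmetric standard eigenproblem for M^{-1/2} L M^{-1/2}. The 1D matrices below are random stand-ins, and the C-order Kronecker convention is used:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3                                    # n_I interior points per direction
M = rng.random(n) + 1.0                  # diagonal 1D mass matrix (stand-in)
A = rng.standard_normal((n, n))
L = A @ A.T + n * np.eye(n)              # SPD stand-in for the 1D stiffness
d = np.array([0.5, 1.0, 1.0, 1.0])

# generalized eigenproblem L S = M S Lam with S^T M S = I, solved via the
# symmetric standard problem for M^{-1/2} L M^{-1/2} (M is diagonal)
lam, Q = np.linalg.eigh(L / np.sqrt(M)[:, None] / np.sqrt(M)[None, :])
S = Q / np.sqrt(M)[:, None]

Md, I = np.diag(M), np.eye(n)
kron3 = lambda X, Y, Z: np.kron(X, np.kron(Y, Z))
HII = (d[0] * kron3(Md, Md, Md) + d[1] * kron3(Md, Md, L)
       + d[2] * kron3(Md, L, Md) + d[3] * kron3(L, Md, Md))
# eigenvalues of HII in the tensor-product eigenspace, cf. (4.26)
D = (d[0] * kron3(I, I, I) + d[1] * kron3(I, I, np.diag(lam))
     + d[2] * kron3(I, np.diag(lam), I) + d[3] * kron3(np.diag(lam), I, I))

St = kron3(S, S, S)
HII_inv = St @ np.diag(1.0 / np.diag(D)) @ St.T   # the inverse as in (4.27)
assert np.allclose(HII_inv @ HII, np.eye(n ** 3))
```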
4.3.2 Sum-factorization of the operator
With the suboperators mapping from the faces to the inner element from Table 4.2 and the
inverse of the inner element Helmholtz operator (4.27), explicit formulations with tensor
products are present for all operators required for the condensed part H_Cond. However,
evaluating just one face-to-face operator utilized in Algorithm 4.2, H_Cond_FiFj, in the form

H_Cond_FiFj = H_FiI H_II^{-1} H_IFj    (4.28)

requires n_I^3 multiplications for H_IFj, 3 n_I^4 for the transformation to the inner element
eigenspace, S_II^T ⊗ S_II^T ⊗ S_II^T, n_I^3 for the application of the eigenvalues, a further 3 n_I^4
for the backward transformation, and n_I^3 for the inner-to-face operator. Hence, directly using
the derived tensor-product operators in Algorithm 4.2 leads to 6 · 6 · (6 n_I^4 + 3 n_I^3 + O(n_I^2))
multiplications. Instead of 36 n_I^4 for the matrix-matrix implementation, 216 n_I^4 multiplications
are now required, not only rendering the algorithm itself inferior, but additionally losing the
benefit of using optimized matrix-matrix multiplications from libraries. However, all of these
operators share one similarity: They first map to the inner element, then transform into the
inner element eigenspace, apply the diagonal, transform back, and map back to a face. As
all suboperators, except the application of the diagonal, are present in tensor-product form,
factorization of the operator is possible. For instance, the operator from face east to face
west is

H_Cond_FwFe = d_1 (M_II ⊗ M_II ⊗ L_0I) H_II^{-1} d_1 (M_II ⊗ M_II ⊗ L_Ip) ,    (4.29)

with the first factor being H_FwI and the last H_IFe.
Expanding the inverse of the inner element Helmholtz operator via (4.27) yields

H_Cond_FwFe = d_1 (M_II ⊗ M_II ⊗ L_0I)(S_II ⊗ S_II ⊗ S_II) D_II^{-1} (S_II^T ⊗ S_II^T ⊗ S_II^T) d_1 (M_II ⊗ M_II ⊗ L_Ip)
⇒ H_Cond_FwFe = d_1 [(M_II S_II) ⊗ (M_II S_II) ⊗ (L_0I S_II)] D_II^{-1} d_1 [(S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T L_Ip)] ,    (4.30)

where the first bracket is the eigenspace-to-face operator H_FwE and the last the face-to-eigenspace operator H_EFe.
Table 4.3: Operators from the element faces to the inner element eigenspace and vice versa, with the face index i corresponding to the compass notation shown in Figure 4.2.

i   H_FiE                                          H_EFi
w   d_1 (M_II S_II) ⊗ (M_II S_II) ⊗ (L_0I S_II)    d_1 (S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T L_I0)
e   d_1 (M_II S_II) ⊗ (M_II S_II) ⊗ (L_pI S_II)    d_1 (S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T L_Ip)
s   d_2 (M_II S_II) ⊗ (L_0I S_II) ⊗ (M_II S_II)    d_2 (S_II^T M_II) ⊗ (S_II^T L_I0) ⊗ (S_II^T M_II)
n   d_2 (M_II S_II) ⊗ (L_pI S_II) ⊗ (M_II S_II)    d_2 (S_II^T M_II) ⊗ (S_II^T L_Ip) ⊗ (S_II^T M_II)
b   d_3 (L_0I S_II) ⊗ (M_II S_II) ⊗ (M_II S_II)    d_3 (S_II^T L_I0) ⊗ (S_II^T M_II) ⊗ (S_II^T M_II)
t   d_3 (L_pI S_II) ⊗ (M_II S_II) ⊗ (M_II S_II)    d_3 (S_II^T L_Ip) ⊗ (S_II^T M_II) ⊗ (S_II^T M_II)
The above directly maps from the face to the inner element eigenspace, denoted by the subscript E, via H_EFw, applies the inverse eigenvalues, and maps back with H_FwE. As the utilized matrices are now not diagonal anymore, the application of H_EFw and H_FwE each uses 3n_I^3 multiplications, and the diagonal n_I^3, leading to 7n_I^3 multiplications per face-to-face operator. This factorization lowers the number of multiplications for the condensed part when using Algorithm 4.2 from 216n_I^4 + 108n_I^3 + O(n_I^2) down to 252n_I^3 + O(n_I^2), i.e. to linear complexity. The operators H_FiE and H_EFi required for the evaluation are listed in Table 4.3.
While the new operator formulation achieves linear complexity, the leading factor for the
algorithm remains prohibitively high: The matrix-matrix variant still uses fewer multiplica-
tions until p = 8, and has the benefit of optimized libraries. Further factorization is required
to attain a competitive operator for polynomial degrees utilized in practice.
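The crossover claimed above can be checked with a short count; the following sketch assumes only the leading terms from the text (36n_I^4 for the matrix-matrix variant, 252n_I^3 for the factorized one, with n_I = p − 1):

```python
# Leading-order multiplication counts from the text (lower-order terms dropped):
# matrix-matrix variant: 36 n_I^4, factorized tensor-product variant: 252 n_I^3,
# with n_I = p - 1 inner nodes per direction.
def mm_count(p):
    return 36 * (p - 1) ** 4

def factorized_count(p):
    return 252 * (p - 1) ** 3

# Both counts coincide at p = 8; the factorization uses strictly fewer
# multiplications from p = 9 onwards.
crossover = next(p for p in range(2, 64) if factorized_count(p) < mm_count(p))
print(crossover)  # → 9
```

The counts are equal at p = 8, matching the statement that the matrix-matrix variant uses fewer multiplications until that degree.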
Hitherto, factorization was only performed on single suboperators. The next step consists of factorizing out common factors in the condensed part. The result of applying it to a vector u_F is

v_Fi = Σ_{j∈I} H^Cond_FiFj u_Fj (4.31)

⇒ v_Fi = Σ_{j∈I} H_FiE D_II^{-1} H_EFj u_Fj . (4.32)

As the left part does not depend upon j, a sum factorization yields

v_Fi = H_FiE D_II^{-1} Σ_{j∈I} H_EFj u_Fj . (4.33)
The first term, Σ_{j∈I} H_EFj u_Fj, corresponds to the residual in the inner element eigenspace induced by the boundary nodes and is, hence, named v_E. The application of the diagonal transforms it into the corresponding degrees of freedom in the inner element and is named u_E. The third term maps these back onto the face. As the calculation of u_E is the same for every face, it can be stored and reused, culminating in Algorithm 4.3.

52 4 Fast Static Condensation – Achieving a linear operation count

Algorithm 4.3: Evaluation of the condensed part using a sum factorization of the tensor-product operators with I = {e, w, n, s, t, b}.
1: v_E ← Σ_{j∈I} H_EFj u_Fj
2: u_E ← D_II^{-1} v_E
3: for i ∈ I do
4:     v_Fi ← H_FiE u_E
5: end for

The algorithm
maps from all six faces into the inner element eigenspace, using 3n_I^3 multiplications per face to compute the residual v_E induced by the boundary nodes. Then the diagonal is applied to compute the resulting solution in the inner element, u_E, requiring a further n_I^3 multiplications, and u_E is stored. The impact of the solution in the inner element eigenspace results from mapping back with a further 3n_I^3 multiplications per face. In total, 37n_I^3 multiplications occur and the algorithm, hence, is of lower complexity than the matrix-matrix multiplication variant of Algorithm 4.2 starting from p = n_I + 1 = 3.
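Algorithm 4.3 can be sketched in a few lines of NumPy. The suboperator factors below are random stand-ins for the actual matrices (M_II S_II, S_II^T L_I0, and so on), and d_inv plays the role of D_II^{-1}; each face-to-eigenspace map costs 3n_I^3 multiplications, two tangential matrix products plus one outer product in the normal direction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # n_I: inner nodes per direction

# Hypothetical stand-ins for the suboperator factors of one face pair
# (in the text these would be M_II S_II, S_II^T L_I0, ...; random here):
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
col = rng.standard_normal(n)            # column factor of H_EFj
row = rng.standard_normal(n)            # row factor of H_FiE
d_inv = 1.0 / (1.0 + rng.random((n, n, n)))  # inverse eigenvalues D_II^{-1}

def to_eigenspace(u_face):
    """Apply A (x) B (x) col: face (n x n) -> eigenspace (n x n x n)."""
    return np.multiply.outer(A @ u_face @ B.T, col)

def to_face(u_e):
    """Apply A^T (x) B^T (x) row: eigenspace (n x n x n) -> face (n x n)."""
    return A.T @ (u_e @ row) @ B

# Algorithm 4.3: one accumulation, one diagonal scaling, one map per face.
u_faces = [rng.standard_normal((n, n)) for _ in range(6)]
v_e = sum(to_eigenspace(u) for u in u_faces)   # step 1
u_e = d_inv * v_e                              # step 2
v_faces = [to_face(u_e) for _ in range(6)]     # step 3
```

The factorized maps agree with the dense Kronecker operators they replace, at O(n_I^3) instead of O(n_I^4) cost per application.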
4.3.3 Product-factorization of the operator
The operator evaluation method derived in Subsection 4.3.2 already attains two of the main goals: It possesses a linear operator complexity, i.e. it scales with O(p^3), and requires a linearly scaling amount of memory. However, it only uses fewer multiplications than matrix-matrix based variants starting from p = 3, and cannot easily harness optimized libraries. Thus, it is to be expected that it will only be faster in practice when using very high polynomial degrees. And, as shown in [60], that is the case. Further factorization is required to become competitive for polynomial degrees relevant in practice.
Algorithm 4.3 utilizes each of the operators H_FiE and H_EFi listed in Table 4.3. All of these suboperators share a similar structure: In two directions the matrix S_II^T M_II, or the transpose of it, is applied, while in the direction orthogonal to the face either S_II^T L_Ip or S_II^T L_I0 prolong into the eigenspace, or L_pI S_II or L_0I S_II restrict to the face. For an evaluation of the condensed part, 24n_I^3 multiplications out of 37n_I^3 are spent applying S_II^T M_II or its transpose. To further improve the efficiency of the condensed part, these matrices need to be removed from the operator.
One way to eliminate the extraneous matrix products from the operator is a coordinate transformation. The transformation is chosen such that S_II^T M_II S_II equates to the identity, eliminating these matrices from the tensor-product operators, while not transforming the vertices. To this end, the matrix

    S = ( 1   0     0
          0   S_II  0
          0   0     1 )                            (4.34)
is applied in all three directions to the Helmholtz operator (2.44), leading to the transformed Helmholtz operator H̃:

H̃ := (S^T ⊗ S^T ⊗ S^T) H (S ⊗ S ⊗ S) (4.35)

H̃ = d0 (S^T M S ⊗ S^T M S ⊗ S^T M S) + d1 (S^T M S ⊗ S^T M S ⊗ S^T L S)
  + d2 (S^T M S ⊗ S^T L S ⊗ S^T M S) + d3 (S^T L S ⊗ S^T M S ⊗ S^T M S) . (4.36)

The transformed mass and stiffness matrices are defined as

M̃ := S^T M S (4.37)
L̃ := S^T L S . (4.38)
Using (2.49), they compute to

M̃ = ( M_00   0                  0
      0      S_II^T M_II S_II   0
      0      0                  M_pp )
   = ( M_00   0   0
      0      I   0
      0      0   M_pp )   (4.39)

L̃ = ( L_00          L_0I S_II          L_0p
      S_II^T L_I0   S_II^T L_II S_II   S_II^T L_Ip
      L_p0          L_pI S_II          L_pp )
   = ( L_00          L_0I S_II   L_0p
      S_II^T L_I0   Λ            S_II^T L_Ip
      L_p0          L_pI S_II    L_pp ) . (4.40)
Thus, the inner element mass and stiffness matrices in the new coordinate system are M̃_II = I and L̃_II = Λ. They are diagonal and reduce the face-to-eigenspace operators from Table 4.3 to those in Table 4.4. Where previously every tensor product consisted of one row or column matrix in addition to two dense matrices, now only the row and column matrices remain. Each suboperator now requires n_I^3 instead of 3n_I^3 multiplications, reducing the number of multiplications in the condensed part further, from 37n_I^3 down to 13n_I^3.
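In the transformed system, the only remaining non-identity factor is the row or column coupling the face to the interior, so one application collapses to a single outer product. A small sketch with a hypothetical column factor standing in for the transformed L_I0:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
d1 = 0.5                               # metric coefficient
col = rng.standard_normal(n)           # hypothetical transformed column factor
u_face = rng.standard_normal((n, n))

# H_EFw u_Fw = d1 (I (x) I (x) col) u_Fw: one outer product, n^3 multiplications
v_e = d1 * np.multiply.outer(u_face, col)

# The dense equivalent, built only to illustrate what the shortcut replaces:
dense = d1 * np.kron(np.eye(n * n), col.reshape(n, 1))
```

The identity tangential factors cost nothing, which is exactly the reduction from 3n_I^3 to n_I^3 multiplications per suboperator.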
Table 4.4: Operators from the element faces to the inner element eigenspace and vice versa with the face index i corresponding to the compass notation as shown in Figure 4.2.

i   H_FiE                H_EFi
w   d1 I ⊗ I ⊗ L_0I      d1 I ⊗ I ⊗ L_I0
e   d1 I ⊗ I ⊗ L_pI      d1 I ⊗ I ⊗ L_Ip
s   d2 I ⊗ L_0I ⊗ I      d2 I ⊗ L_I0 ⊗ I
n   d2 I ⊗ L_pI ⊗ I      d2 I ⊗ L_Ip ⊗ I
b   d3 L_0I ⊗ I ⊗ I      d3 L_I0 ⊗ I ⊗ I
t   d3 L_pI ⊗ I ⊗ I      d3 L_Ip ⊗ I ⊗ I

While the operation reduction in the condensed part is significant, the primary part is simplified as well. In the condensed system it includes face-to-face interaction, and the operators are shown in Table 4.1. For these, two types exist. The first maps from one
face onto itself, e.g. for face east

H_FeFe = d0 M_II ⊗ M_II ⊗ M_II + d1 M_II ⊗ M_II ⊗ L_00
       + d2 M_II ⊗ L_II ⊗ M_00 + d3 L_II ⊗ M_II ⊗ M_00 . (4.41)

In the transformed system these read

H_FeFe = d0 I ⊗ I ⊗ I + d1 I ⊗ I ⊗ L_00 + d2 I ⊗ Λ ⊗ M_00 + d3 Λ ⊗ I ⊗ M_00 . (4.42)
While the application of (4.41) requires two tensor products containing the densely populated stiffness matrix and, hence, uses 2n_I^3 + O(n_I^2) multiplications, the application of (4.42) only requires O(n_I^2) multiplications. In the second case, the faces map to opposing faces. These operators are diagonal to begin with and the transformed system retains this, resulting in a complexity of O(n_I^2). Therefore, the transformed system not only reduces the number of multiplications in the condensed part, it also reduces the primary part from 12n_I^3 + O(n_I^2) to O(n_I^2) multiplications. The primary part face-to-face operators are listed in Table 4.5. Lastly, the transformation can be applied on each face and edge after condensing the right-hand side in Algorithm 4.1 and before recomputing the solution. As the transformation is a tensor-product operation on the faces, only O(p^3) operations are added to the complexity of pre- and post-processing, adding a runtime similar to that of an operator evaluation.
Table 4.5: Suboperators of the primary part, face-to-face interaction in the transformed system. Only non-zero entries are listed.

i  j   H_FiFj
w  w   d0 I ⊗ I ⊗ I + d1 I ⊗ I ⊗ L_00 + d2 I ⊗ Λ ⊗ M_00 + d3 Λ ⊗ I ⊗ M_00
e  w   d1 I ⊗ I ⊗ L_p0
e  e   H_FwFw
w  e   H^T_FeFw = H_FeFw
s  s   d0 I ⊗ I ⊗ I + d1 I ⊗ M_00 ⊗ Λ + d2 I ⊗ L_00 ⊗ I + d3 Λ ⊗ M_00 ⊗ I
n  s   d2 I ⊗ L_p0 ⊗ I
n  n   H_FsFs
s  n   H^T_FnFs
b  b   d0 I ⊗ I ⊗ I + d1 M_00 ⊗ I ⊗ Λ + d2 M_00 ⊗ Λ ⊗ I + d3 L_00 ⊗ I ⊗ I
t  b   d3 L_p0 ⊗ I ⊗ I
t  t   H_FbFb
b  t   H^T_FtFb

4.3.4 Extensions to variable diffusivity

In Section 4.3 a new variant of the statically condensed Helmholtz operator was proposed. It is capable of applying the operator with linear complexity. However, this is only the
case for a constant Helmholtz parameter λ, i.e. a constant diffusivity µ. For non-constant diffusivities the Helmholtz equation can be written as

λu − ∇ · (µ∇u) = f , (4.43)

which corresponds, when disregarding the boundary terms, to the weak form

∫_{x∈Ω} (λuv)(x) dx + ∫_{x∈Ω} ((∇v)^T µ∇u)(x) dx = ∫_{x∈Ω} (vf)(x) dx . (4.44)
With the diffusivity being approximated with a polynomial of degree p, the GLL quadrature of degree p is not sufficient to fully evaluate the term, as it only integrates exactly up to polynomial order 2p − 1. Still, in the literature the GLL quadrature of degree p is employed, hence committing a “variational crime”. With it, the element Helmholtz operator of the above is
H_e = d0,e M ⊗ M ⊗ M
    + d1,e (M^{T/2} ⊗ M^{T/2} ⊗ D^T M^{T/2}) diag(µ_e) (M^{1/2} ⊗ M^{1/2} ⊗ M^{1/2} D)
    + d2,e (M^{T/2} ⊗ D^T M^{T/2} ⊗ M^{T/2}) diag(µ_e) (M^{1/2} ⊗ M^{1/2} D ⊗ M^{1/2})
    + d3,e (D^T M^{T/2} ⊗ M^{T/2} ⊗ M^{T/2}) diag(µ_e) (M^{1/2} D ⊗ M^{1/2} ⊗ M^{1/2}) , (4.45)
where diag(µ_e) is the diagonal matrix consisting of the diffusivity vector µ_e. Mind that for µ_e = 1 the diffusion operators yield the expected stiffness matrix, as, due to the symmetric mass matrix,

D^T M^{T/2} M^{1/2} D = D^T M D = L . (4.46)
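The quadrature exactness invoked above can be verified numerically. The sketch below builds the GLL rule on p + 1 nodes from the roots of P_p′ (a standard construction; the weight formula 2/(p(p+1) P_p(x_i)^2) is assumed) and shows that degree 2p − 1 is integrated exactly while degree 2p is not:

```python
import numpy as np
from numpy.polynomial import legendre

def gll(p):
    """Nodes and weights of the Gauss-Lobatto-Legendre rule on p + 1 points."""
    c = np.zeros(p + 1)
    c[p] = 1.0                                     # coefficients of P_p
    inner = legendre.legroots(legendre.legder(c))  # interior nodes: roots of P_p'
    x = np.concatenate(([-1.0], inner, [1.0]))
    w = 2.0 / (p * (p + 1) * legendre.legval(x, c) ** 2)
    return x, w

x, w = gll(4)                 # 5 nodes: exact up to degree 2*4 - 1 = 7
q6 = np.dot(w, x**6)          # degree 6 <= 7: exact, equals 2/7
q8 = np.dot(w, x**8)          # degree 8 > 7: inexact (exact value is 2/9)
```

For p = 4 the rule reproduces ∫x^6 dx = 2/7 to machine precision but misses ∫x^8 dx = 2/9 by roughly one percent, illustrating the variational crime.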
The factorization of the statically condensed Helmholtz operator is not applicable to (4.45), as the diagonal matrix diag(µ_e) disrupts the tensor-product structure of the operator. While the whole operator cannot easily be condensed, a reduced version thereof can. Application of the fast diagonalization requires a tensor-product structure for the operator. This structure can be recovered by approximating the diffusivity with a tensor-product decomposition, i.e.

diag(µ_e) ≈ diag(µ_{3,e}) ⊗ diag(µ_{2,e}) ⊗ diag(µ_{1,e}) . (4.47)

The above only generates a crude approximation, but suffices to capture the main features inside an element. These can be treated implicitly, whereas high-frequency components can be treated explicitly in a time-marching scheme.
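A tensor-product approximation of the form (4.47) can be obtained, for instance, from the leading singular vectors of the three unfoldings of the diffusivity array. The following is only an illustrative sketch with a random smooth field, not the procedure used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
mu = 1.0 + 0.1 * rng.random((n, n, n))   # positive diffusivity samples

# Rank-1 factors from the leading left singular vector of each unfolding:
factors = []
for axis in range(3):
    unfolding = np.moveaxis(mu, axis, 0).reshape(n, -1)
    lead = np.linalg.svd(unfolding, full_matrices=False)[0][:, 0]
    factors.append(np.abs(lead))         # fix the sign for a positive field
mu3, mu2, mu1 = factors

# Scale the rank-1 tensor to fit mu in the least-squares sense:
approx = np.einsum('i,j,k->ijk', mu3, mu2, mu1)
approx *= np.vdot(approx, mu) / np.vdot(approx, approx)
```

For a field dominated by its mean, as above, the rank-1 tensor already captures the bulk of the variation; the remainder would be handled explicitly.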
After introducing the tensor-product structure, fast diagonalization and, in turn, the operator evaluation technique from Subsection 4.3.2 are applicable. The main difference stems from one generalized eigenvalue decomposition being required per direction, such that, e.g. for the first direction,

S^T (D^T M^{T/2} diag(µ_{1,e}) M^{1/2} D) S = Λ (4.48a)
S^T M S = I . (4.48b)
After exchanging the sole transformation matrix with the appropriate transformation matrices per direction, the operators from Table 4.3 can be evaluated at the same cost. This allows applying the condensed part using Algorithm 4.3 with the same number of multiplications. Hence, the above technique allows for static condensation with treatment of varying diffusivities, albeit at a reduced resolution.
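The per-direction decomposition (4.48) is a generalized symmetric eigenvalue problem. It can be computed, for example, via a Cholesky reduction of the mass matrix; the matrices below are random symmetric positive definite stand-ins, not actual spectral-element matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
X = rng.standard_normal((n, n))
K = X @ X.T + n * np.eye(n)        # stand-in for D^T M^(T/2) diag(mu) M^(1/2) D
Y = rng.standard_normal((n, n))
M = Y @ Y.T + n * np.eye(n)        # stand-in for the mass matrix

# Reduce K S = M S Lambda to a standard problem with M = C C^T:
C = np.linalg.cholesky(M)
Ci = np.linalg.inv(C)
lam, Q = np.linalg.eigh(Ci @ K @ Ci.T)
S = Ci.T @ Q                       # then S^T K S = diag(lam), S^T M S = I
```

The resulting S simultaneously diagonalizes the stiffness-like matrix and reduces the mass matrix to the identity, which is precisely the property (4.48) requires.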
4.3.5 Runtime comparison of operators
Several variants for the application of the condensed Helmholtz operator were derived in the previous sections. The first one maps from each face of the element to every other, leading to the 36 face-to-face interactions in Algorithm 4.2. One large matrix multiplication constitutes the most efficient way to implement this algorithm. The variant implementing it is labeled MMC1; it requires 36n_I^4 multiplications during the application and incorporates the primary part for the face-to-face interaction as well.
Table 4.6: Number of multiplications of leading terms of application and precomputation steps for the different variants of the condensed Helmholtz operator.

Variant   Precomputation   Primary part   Condensed part
TPF       O(n_I^3 n_e)     3n_I^4 n_e     –
MMC1      O(n_I^5 n_e)     56n_I^2 n_e    36n_I^4 n_e
MMC2      O(n_I^5 n_e)     12n_I^3 n_e    12n_I^5 n_e
TPC       O(n_I^3 n_e)     12n_I^3 n_e    37n_I^3 n_e
TPT       O(n_I^3 n_e)     68n_I^2 n_e    13n_I^3 n_e
Algorithm 4.3 can be evaluated in multiple fashions. The first one is a tensor product in the condensed system, requiring 37n_I^3 multiplications for the condensed part and 12n_I^3 for the primary part. The associated variant is called TPC. The second variant employs the same algorithm, but with a coordinate transformation that streamlines the number of multiplications to 13n_I^3, and is called TPT. The last variant, MMC2, uses a matrix multiplication to map from the faces of the element into the inner element eigenspace instead of a tensor product. The operator requires 12n_I^5 + O(n_I^3) multiplications, leading to a higher complexity than the matrix-matrix variant MMC1. However, the memory requirement does not scale with the number of elements anymore, enabling it to be used for non-homogeneous grids. The last considered operator is called TPF and implements a tensor-product version of the Helmholtz operator in the full, uncondensed system. It is enhanced with the techniques described in Chapter 3 to constitute an efficient implementation of the full operator and, hence, a measure of whether using static condensation is beneficial runtime-wise.
Table 4.6 summarizes the different operator complexities as well as the required precomputation costs. While the transformed variant exhibits the lowest multiplication count starting from p = 1, the runtimes of the implementations can and will differ in practice. Multiplication counts do not directly transfer to runtime, as, e.g., loading, storing, execution, and cache effects have an influence as well. Hence, tests were conducted to compare the efficiency of the five operator variants.

The operators were implemented using Fortran 2008, with the matrix multiplications being delegated to DGEMM from BLAS. The Intel Fortran compiler v. 2018 compiled the program using the Intel Math Kernel Library (MKL) as BLAS implementation.

On one CPU core of an Intel Xeon E5-2680 v3 running at 2.5 GHz, the operators, as well as the required precomputations, were repeated 100 times and the runtimes, measured via MPI_Wtime, averaged. The polynomial degree was varied between 2 ≤ p ≤ 32 for a constant number of elements n_e = 512 and a Helmholtz parameter λ = π, allowing for insights on the runtime for technically relevant polynomial degrees, as well as the asymptotic behaviour of the operator variants.
Figure 4.3: Runtimes for the different operator variants. Top left: precomputation times for the operator, top right: runtimes of the operators, bottom left: runtime per equivalent degree of freedom, computed as (p + 1)^3 n_e, bottom right: achieved performance in Giga Floating Point Operations per Second (GFLOP/s), computed using the leading terms from Table 4.6.
Figure 4.3 depicts the resulting runtimes and precomputation times for the different operator variants. The precomputation times of the operators scale with the expected order: For MMC2, O(p^5) results, leading to the largest total precomputation time starting from p = 4, and for MMC1 O(p^5) as well, but with a lower coefficient and, hence, a slightly lower precomputation time. The tensor-product variants only require the generalized eigenvalue decomposition, scaling with O(p^3), and setting the elementwise inverse eigenvalues. Thus, they have the lowest precomputation time of the condensed operators, though the precomputation time of TPF is two orders of magnitude lower still, as it does not require the generalized eigenvalue transformation and the calculation of D_II^{-1}. One has to bear in mind that, due to memory restrictions, the precomputation of MMC1 is done for only one element, whereas every other condensed variant computed the matrix D_II^{-1}. For non-homogeneous meshes the precomputation time of MMC1 would increase n_e-fold.
The runtimes of the operators also exhibit the expected order: MMC2 scales with O(p^5), TPF and MMC1 with O(p^4), and the factorized variants with O(p^3). At p = 2, the tensor-product variant TPF has the lowest runtime, with TPT and MMC1 sharing a slightly larger one and TPC being the slowest variant after MMC2. The matrix-matrix variant MMC1 stays faster than TPC until p > 9, and the full variant TPF generates runtimes comparable to TPC until p = 20. Overall, the linearly scaling operator in the condensed system grants a linearly scaling runtime, but the operator is slower than the one in the full system until p = 17. Computing in the transformed system, however, remedies the situation: With the operator being faster than MMC1 starting from p = 2 and a constant factor of three compared to evaluation in the condensed system, the operator is capable of being faster than the optimized TPF for p > 7. The low performance of the new variants stems from two factors. First and foremost, the primary part comprises a multitude of smaller operators, e.g. mapping from vertices to edges. Most of these work on the whole dataset, are memory-bound, and therefore have a low optimization potential. As their number of loads, stores, and operations scales with O(n_I^2) for the transformed case, the operator can attain a better performance. And second, while the condensed part is a monolithic operator, it is memory-bound as well due to loading the eigenvalues of the inner element operator. The combination of these two effects leads to a low performance for low polynomial degrees while limiting the maximum performance for high ones. Nevertheless, the transformed variant outpaces the hand-optimized version for the full system for every relevant polynomial degree but p = 7, by a factor of two for p > 11 and by a factor of six for p = 31, polynomial degrees the latter is optimized for. Compared to the condensed operators, TPT has the largest throughput starting at p = 2, rendering it preferable for all polynomial degrees.
In Figure 4.3, the performance in GFLOP/s is shown as well. For low polynomial degrees, only TPF is capable of extracting a significant amount of the maximum performance. This stems from the many suboperators present in the primary part, which limit the performance. After p = 9, however, the matrix-matrix based variants nearly reach peak performance, as the condensed part dominates their runtime. The loop-based implementations for the condensed and transformed system do not even extract a third of that performance. However, for large polynomial degrees they reach 10 GFLOP/s whenever the operator width is divisible by the SIMD width of 4. The achieved performance corresponds to half of the expected optimum with loop-based tensor-product implementations, as TPF shows. Hence, the current implementations still have optimization potential; however, due to the plethora of small operators, it is not as exploited as with the monolithic tensor-product operators investigated in Chapter 3.
4.4 Efficiency of pCG solvers employing fast static condensation
In the previous sections, linearly scaling operators were derived as a prerequisite for linearly scaling solvers. However, a solver comprises more operations, and good iteration schemes and preconditioners are required for the fast solution of the equation set. This section utilizes the conjugate gradient (CG) method [54, 118] to investigate the impact of the polynomial degree on the condition of the system matrices and, hence, the solution process. As linearly scaling preconditioners are required, these are derived first. Then, solvers are proposed and their efficiency compared.
4.4.1 Element-local preconditioning strategies
In a preconditioned conjugate gradient solver, the preconditioner is called once in every iteration, just as the operator. Hence, it induces an overhead into the solver that cannot be neglected. If the preconditioner were to scale super-linearly with the number of degrees of freedom, for instance with O(p^4 n_e), even the best operator evaluation technique would not fix the super-linear iteration time. With static condensation, this requirement limits the preconditioners to being either diagonal in nature or exhibiting a tensor-product form.
In [21], three cases of preconditioning were investigated for the static condensation method in two dimensions. The first one did not apply preconditioning at all, the second utilized a diagonal preconditioner, and the third utilized the block-wise exact inverse on the faces, edges, and vertices. It will be called block preconditioner in this work. The condition κ of the unpreconditioned case scaled with κ = O(p^2), the diagonally preconditioned version with κ = O(p), and for the block preconditioner κ = O(1) was reported. The same three preconditioners will be utilized here: the identity, i.e. no preconditioning, a diagonal preconditioner, and a block-wise preconditioner. The application of the former two trivially scales with the number of degrees of freedom. However, the calculation of the diagonal preconditioner can be expensive. And the last preconditioner incorporates the exact inverse of the operator from a face onto itself. Hence, these need to be factorized to linear complexity prior to their applicability.
The diagonal preconditioner for the condensed case consists of the inverse main diagonal of the operator. It can be directly computed by restricting the primary and condensed parts to a specific point on a face, an edge, or to a vertex. As the degrees of freedom are present on multiple elements, the contributions from adjoining elements are summed and then inverted. Due to the diagonal mass matrix, the condensed part works exclusively on the faces. Hence, the main diagonal is the same as in the uncondensed case for edges and vertices. For the faces, however, additional terms arise from the condensed part. For instance, for face east to face east, these are
H^Cond_FeFe = H_FeI H_II^{-1} H_IFe , (4.49)

which can be expressed as

H^Cond_FeFe = H_FeE D_II^{-1} H_EFe (4.50)

and expands with Table 4.3 to

H^Cond_FeFe = d1^2 (M_II S_II ⊗ M_II S_II ⊗ L_pI S_II) D_II^{-1} (S_II^T M_II ⊗ S_II^T M_II ⊗ S_II^T L_Ip) . (4.51)

With the definitions

M̄_II := S_II^T M_II (4.52)
L̄_Ip := S_II^T L_Ip , (4.53)

the above simplifies to

H^Cond_FeFe = d1^2 (M̄_II^T ⊗ M̄_II^T ⊗ L̄_Ip^T) D_II^{-1} (M̄_II ⊗ M̄_II ⊗ L̄_Ip) (4.54)

⇒ H^Cond_FeFe = d1^2 (M̄_II ⊗ M̄_II)^T (I ⊗ I ⊗ L̄_Ip^T) D_II^{-1} (I ⊗ I ⊗ L̄_Ip) (M̄_II ⊗ M̄_II) . (4.55)
The outer two tensor products map the face onto itself and do not interfere with the computation of the interior operator. The middle three terms are a restriction of the diagonal matrix D_II^{-1} to two dimensions. As it is constant in an element, it needs to be computed once per face, not once per point, leading to O(p^3) multiplications and, hence, a linearly scaling preconditioner initialization. The diagonal for the other faces can be derived in the same fashion, leading to a preconditioner that can be computed in linear runtime and evaluated in O(n_I^2 n_e) operations. The inverse of the operator itself can be computed similarly, only requiring the further computation of M̄^{-1}. Moreover, in the transformed system M̄ = I holds, leading to a diagonal preconditioner.
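That the diagonal of a tensor-product operator can be assembled from the factor diagonals alone, without ever forming the full matrix, is a one-line identity and the reason the preconditioner setup stays linear. A quick check with random factors:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

# diag(A (x) B (x) C) = diag(A) (x) diag(B) (x) diag(C):
d_full = np.diag(np.kron(np.kron(A, B), C))                    # needs the n^3 x n^3 matrix
d_fact = np.kron(np.kron(np.diag(A), np.diag(B)), np.diag(C))  # needs only three n-vectors
```

Summing such factorized diagonals over the tensor-product terms of the operator yields the main diagonal at linear cost.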
4.4.2 Considered solvers and test conditions
Five solvers are considered for testing. The block-preconditioned solver in the full system, bfCG, from Section 2.4 serves as baseline and includes the faster operators from Chapter 3. The second one, cCG, solves in the condensed system, with dcCG adding diagonal and bcCG block preconditioning. Lastly, dtCG applies diagonal preconditioning in the transformed system, lowering the amount of work for the operator as well as for the preconditioner.
For these five solvers, the tests from Section 2.4 are repeated, with the domain being discretized with n_e = 8 × 8 × 8 elements of polynomial degrees p ∈ {2, …, 32}. From the utilized number of iterations n_10, the minimum possible condition number is calculated via

κ ≥ κ* = ( 2n_10 / ln(2/ε) )^2 , (4.56)

where ε is the achieved residual reduction, approximated by 10^{-10}. To attain reproducible runtime results, the solvers are called 11 times, with the runtimes of the last 10 solver calls being averaged. This precludes measurement of instantiation effects, e.g. library loading, that would not be present in a real-world simulation.
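Equation (4.56) follows from the standard CG error bound; a minimal helper evaluating it (reading (4.56) as κ* = (2n_10 / ln(2/ε))^2) could look as follows:

```python
import math

def kappa_star(n_iter, eps=1e-10):
    """Lower bound on the condition number implied by the CG iteration count."""
    return (2.0 * n_iter / math.log(2.0 / eps)) ** 2

# e.g. the bound implied by 100 iterations for a residual reduction of 1e-10:
k = kappa_star(100)
```

More iterations for the same residual reduction imply a larger lower bound on the condition number, which is how the iteration counts below are converted to κ*.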
As done for the operators, the tests were conducted on one node of the HPC system Taurus at ZIH Dresden. It consisted of two sockets, each containing an Intel Xeon E5-2680 v3 with twelve cores running at 2.5 GHz. Only one of the cores was utilized during the tests, leading to the algorithms, not the parallelization efficiency, being measured. The solvers were implemented in Fortran 2008, compiled with the Intel Fortran compiler, and MPI_Wtime was used for time measurements.
4.4.3 Solver runtimes for homogeneous grids
Figure 4.4 depicts the number of iterations and the upper limit of the condition numbers computed thereof. The solvers can be classified into three distinct categories: The first one consists of the uncondensed block-preconditioned solver bfCG. It exhibits the highest iteration count, with a slope slightly lower than 1. The unpreconditioned and diagonally preconditioned condensed solvers are the second category, generating less than half the number of iterations at p = 32. Both start at the same number, with a lower slope for the diagonal preconditioner, albeit not significantly, as even at the highest polynomial degree the iteration count is only reduced by one fifth. The third category consists of the block-preconditioned condensed solver bcCG and the diagonally preconditioned solver in the transformed system dtCG. They use significantly fewer iterations than dcCG, less than half at p = 32, and have virtually the same number of iterations. Moreover, differing from the other solvers, they do not have a constant slope after p = 8. Rather, the slope slightly decreases when increasing the polynomial degree.
The number of iterations directly translates into the condition number, also shown in Figure 4.4. With the condition number, the categories become more distinct: The block-preconditioned uncondensed system exhibits a slope of O(p^{7/4}) and the highest condition number. The unpreconditioned condensed system performs, as does the diagonally preconditioned one, with slopes between p^{3/2} and p after p = 8. Lastly, the block-preconditioned version does not have a constant gradient, but a decreasing one matching
the poly-logarithmic bound [61]. While the resulting condition numbers do not match those reported in [21], one has to keep in mind that condition numbers change when changing the number of dimensions, e.g. [118].

Figure 4.4: Number of iterations, approximated condition number κ*, and runtimes for the five solvers when varying the polynomial degree p at a constant number of elements n_e = 8 × 8 × 8. Top left: number of iterations, top right: resulting condition number, bottom left: runtimes, bottom right: runtimes per degree of freedom (DOF).
The number of iterations does not directly translate to the runtime. For the uncondensed solver, the number of iterations scales linearly with the polynomial degree and the operator complexity with O(p^4), leading to the runtime scaling with O(p^5). Different from the full system, the condensed and transformed solvers seem to possess a linear runtime, i.e. O(p^3), with the unpreconditioned condensed version having the highest, followed by the diagonally preconditioned version, and then the block-preconditioned one. However, their runtimes do not differ substantially. Using the block preconditioning only gains a factor of 1.5 in the
Table 4.7: Approximated runtime per iteration and runtime per degree of freedom (DOF) of the condensed solvers for certain polynomial degrees p.

      Runtime per iteration and DOF [ns]      Runtime per DOF [µs]
p     cCG     dcCG    bcCG    dtCG            cCG    dcCG   bcCG   dtCG
5     32.60   33.27   41.44   21.04           3.49   3.43   3.11   1.58
9     18.58   19.13   23.75   11.70           2.77   2.58   2.16   1.06
13    17.45   18.00   22.73   10.72           3.21   2.88   2.27   1.09
17    16.10   16.43   20.79   10.48           3.44   3.04   2.27   1.15
21    14.94   15.21   19.01    9.27           3.66   3.16   2.17   1.08
25    14.35   14.62   18.62    8.69           3.89   3.36   2.23   1.05
29    13.59   14.03   18.47    7.86           3.98   3.40   2.29   0.99
runtime compared to diagonal preconditioning. This is due to the high preconditioning cost compared to the operator evaluation: the preconditioner takes nearly as many operations as the operator itself. Even a factor of two in the number of iterations does not offset this runtime disadvantage. The transformed solver, on the other hand, is capable of employing a diagonal preconditioner in addition to a faster operator, resulting in a factor of three compared to the runtime of the block-preconditioned version.
The runtime per degree of freedom stagnates for the last three solvers in the range of 1 µs to 5 µs. The diagonally preconditioned and unpreconditioned versions exhibit a minute increase from p = 8 to p = 32, and for the block-preconditioned and transformed versions none is present. Slight kinks occur when p + 1 is a multiple of four, i.e. when n_I is a multiple of four, matching the vector instruction width of the architecture [48]. The stagnation of the runtime can be explained by the increase in the number of iterations being small, and CG-type solvers incorporating a large number of array instructions. As the data set scales with O(p^2) for the static condensation method, their relative runtime decreases, offsetting the increase in iterations. Furthermore, the runtime of the operator does not directly scale with p^3 but exhibits a slightly lower slope, as shown in Subsection 4.3.5.
Table 4.7 lists the runtimes per iteration in combination with the runtimes per degree of freedom for the condensed solvers. As expected, the unpreconditioned solver has the lowest runtime per iteration of the three condensed system solvers. The addition of a diagonal preconditioner increases the runtime per iteration by approximately 2 %, making the increased effort worthwhile: The number of iterations decreases significantly and, as a result, the runtime by a quarter. The block preconditioner takes up to a third of the runtime of the solver. For it, the large increase in runtime per iteration counters the decrease in iterations, diminishing the potential runtime savings to only a third. Due to the cheaper operator, the transformed solver possesses the lowest runtime per iteration and, as it shares the number of iterations with the block-preconditioned version, also the lowest runtime. One iteration of dtCG only costs
bfCG cCG dcCG bcCG dtCG
23 43 83 163
Number of elements ne
101
102
103
Number
ofiterationsn10
3
1
1
3
23 43 83 163
Number of elements ne
10−2
10−1
100
101
102
103
Runtime[s]
3
4
4
3
Figure 4.5: Number of iterations and corresponding runtimes for the five solvers when varyingthe number of elements ne at a constant polynomial degree p = 16. Top left: numberof iterations, top right: resulting condition number, bottom left: runtimes, bottomright: runtimes per degree of freedom (DOF).
a third of one iteration for bcCG, rendering it the fastest solvers. While these savings seem
small compared to the savings from the operator, the remainder of the solver has to be taken
into account: Solvers based on the conjugate gradient method heavily depend upon array
operations and scalar products. With ever more efficient operators, these occupy a significant
amount of the runtime. For instance, for the transformed system solver, the largest portion
of the runtime is spent in just these operations, not the operator itself. This constitutes a
hard limit on the runtime per iteration and limits the potential for further factorizations.
Lastly, when comparing to the data from Table 3.1, solving with dtCG at p = 29 requires a
third of the runtime per degree of freedom and iteration compared to solving with bfCG at p = 7,
where bfCG was most efficient. Therefore, the main goal was accomplished: The performance
of the full system was extended to high polynomial orders. Moreover, the current solvers are
a factor of three faster per iteration and degree of freedom.
To investigate the robustness of the solvers against the number of elements, the tests were
repeated using a constant polynomial degree p = 16 and increasing the number of elements
from n_e = 2^3 to n_e = 16^3. Figure 4.5 depicts the required number of iterations and the
resulting runtimes per degree of freedom. In three dimensions, the runtime of CG-based
finite element solvers without global coupling scales with n_e^{4/3} [118]. Here, however, the
number of iterations exhibits a slightly lower slope than 1/3. The effect probably stems from
not computing in the asymptotic regime. The main conclusion, however, is that the solvers
are not robust against increases in the number of elements. This is to be expected, as global
preconditioning is required to achieve this feat, e.g. with low-order finite elements [90, 50]
or even multigrid [130].
Figure 4.6: Cut through the x3 = 0 plane for the three meshes with constant expansion factor α. Left: α = 1, middle: α = 1.5, right: α = 2.
4.4.4 Solver runtimes for inhomogeneous grids
In real-life simulations, homogeneous grids are seldom applicable. To capture all relevant
features of a flow, the grid needs to adapt to the solution, the most common case being
refinement near the wall.
To investigate the behavior of the solvers for inhomogeneous meshes, the test case from Section 2.4 was adapted. Instead of a homogeneous grid, grids generated using a constant
expansion factor α are employed. Three cases are considered: α = 1, α = 1.5, and α = 2,
resulting in the meshes depicted in Figure 4.6. While an expansion factor of α = 1 yields a
homogeneous grid, α = 1.5 stretches every cell in the grid and results in a maximum aspect
ratio of AR_max = 17. Applying α = 2 further magnifies this effect and leads to AR_max = 128.
The grids are populated by a large variety of elements, with shapes ranging from cubes
through pancakes to needles. With non-matrix-free solution techniques, these meshes are not
treatable at high polynomial degrees due to stifling memory requirements. The solvers
presented in this section, however, are matrix-free, leading to comparably small memory
requirements and allowing computations with high polynomial degrees such as 32.
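The quoted aspect ratios follow directly from the geometric stretching: with element widths growing by a constant factor α, the widest and narrowest elements differ by α raised to the number of elements per direction minus one. A minimal sketch, assuming eight elements per direction (a hypothetical value, chosen only because it reproduces the quoted numbers):

```python
def max_aspect_ratio(alpha: float, n_elem: int = 8) -> float:
    """Maximum aspect ratio of a 1D geometric grid: widths w_k = w_0 * alpha**k,
    so the widest and narrowest elements differ by alpha**(n_elem - 1)."""
    return alpha ** (n_elem - 1)

# alpha = 1.5 gives ARmax of roughly 17, alpha = 2 gives ARmax = 128
```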
Figure 4.7 compares the runtimes and number of iterations for the three grids. Independent
of the expansion factor, the runtime of bfCG scales with O(p^5), and that of the condensed
solvers with O(p^4). However, the number of iterations increases with the expansion factor. When
raising it from α = 1 to α = 1.5, the number of iterations of cCG increases by a factor of
four, whereas the preconditioned solvers only incur an increase of one quarter. The situation
gets more pronounced when increasing the maximum aspect ratio to 128 at α = 2. There,
using cCG is not feasible anymore: it already requires 650 iterations at p = 2, and 6000
are needed at p = 30. With diagonal preconditioning, these numbers lower to 85 and 360,
respectively, an overall increase of only 50 %, and the block-preconditioned variants exhibit
similarly high robustness. While not attaining the full robustness permitted by iterative
substructuring methods [107], the solvers are near impervious to the aspect ratio of the
mesh.
Figure 4.7: Number of iterations and runtime per degree of freedom for the five solvers when using stretched meshes with constant expansion factor α. Top: α = 1, middle: α = 1.5, bottom: α = 2.
4.5 Summary
This chapter investigated the static condensation method as a way to lower both the operation
count and the memory bandwidth requirements of elliptic solvers for the spectral-element method.
To lower the operation count, a tensor-product formulation of the static condensed Helm-
holtz operator was derived and factorized to linear complexity, with a further coordinate
transformation streamlining the multiplication count. The resulting operators are not only
matrix-free, circumventing the growing memory bandwidth gap, but their runtime
also scales linearly with the number of degrees of freedom in the grid. The new evaluation
technique outpaces variants employing highly optimized libraries for matrix-matrix multipli-
cations as well as the optimized tensor-product variants for the full system from Chapter 3
by a factor of 20 and 5, respectively. Moreover, the linear scaling unlocks computations
at high polynomial degrees which were previously infeasible due to the staggering operator
costs for linear solvers, removing the barrier between spectral and spectral-element methods.
After comparing the efficiency of the operators, solvers based on the two fastest evalua-
tion techniques were investigated, with the solvers from Chapter 3 serving as baseline. As
in Chapter 3, block-Jacobi type preconditioners significantly lower the iteration count com-
pared to pure diagonal preconditioning. Moreover, they lead to an astounding robustness
with the condition number being nearly independent of the aspect ratio of the elements. For
instance, raising the maximum aspect ratio from AR_max = 1 to AR_max = 128 only leads to
an increase of 50 % in the iteration count. While the block-Jacobi preconditioning is as
expensive as the condensed Helmholtz operator itself, a coordinate transformation stream-
lines both and results in a diagonalization of the preconditioner. The combination of linear
scaling operators and low increase in the condition number leads to a linearly scaling runtime
for the solver with respect to the number of degrees of freedom. This allows the solvers
presented in Chapter 3 to be outperformed by a factor of 50 at p = 32, with a runtime per degree of
freedom just over 1 µs. Lastly, the effect is not only due to a decreased number of iterations:
when comparing a condensed solver at p = 29 with the full system at p = 7, the runtime of one
iteration is a third of that of the full case. This extends the performance of the high-order
methods towards very high polynomial degrees and, therefore, convergence rates.
While the resulting solvers scale close to optimally with respect to the polynomial degree, as
investigated in [61], the performance degrades with the number of elements. Global informa-
tion exchange is required to gain robustness with respect to the number of elements, which
the block-Jacobi preconditioners considered here do not provide. This will be addressed in
the next chapter.
Chapter 5
Scaling to the stars – Linearly scaling
spectral-element multigrid
5.1 Introduction
In the previous chapter, a linearly scaling operator for the Helmholtz equation was derived.
It allows for linearly scaling iteration times and in conjunction with a pCG method attains
solution times per unknown near 1 µs. But while the solver incurs only minute increases in
the number of iterations while raising the polynomial degree, it is not robust with regard
to the number of elements. Long-range coupling is required to attain a constant number of
iterations and to allow the solvers to stay competitive.
For low-order schemes, h-multigrid has been established as an efficient solution technique,
increasing and decreasing the element width h between the levels [13, 12, 47]. With high-
order methods, p-multigrid allows to lower and raise the polynomial degree instead of the
element width h. Both kinds of multigrid require a smoother to eliminate high-frequency
components which cannot be represented on the coarser grids, and overlapping Schwarz
smoothers have proven to be a very effective choice [88, 52, 122], lowering the iteration count
below 10.
With an overlapping Schwarz method, the domain is decomposed into small overlapping
subdomains, typically blocks of multiple elements. On these, the inverse operator is applied
and the results combined into a new solution. However, while the approach generates
exceptional convergence rates, application of the smoother is expensive. In the full equation
system, the inverse is often facilitated using fast diagonalization [88]. For the condensed
case, in contrast, no matrix-free inverse is known, and using a precomputed inverse leads to a
linearly scaling method in two dimensions, but not in three [52, 51]. In the full as well as in the
condensed case, the operator scales super-linearly when increasing the polynomial degree. The increasing
smoother costs inherent to current Schwarz methods limit the efficiency of high-order
methods at high polynomial orders and therefore convergence rates. Again, linearization of
the operation count would allow for significant performance gains.
The goal of this chapter lies in attaining a linearly scaling multigrid solver, combining the
convergence properties from [123, 51] with the operator derived in the last chapter. To-
wards this end, Section 5.2 recapitulates the main ideas of Schwarz methods, extracts
the main kernel and then factorizes it to linear complexity. Then, Section 5.3 discusses
multigrid solvers founded on the static condensation operators proposed in Chapter 4 and
the Schwarz smoother. Lastly, the runtime tests conducted in Section 5.4 prove that the
convergence properties from [123, 51] are retained while the runtime scales linearly with the
number of degrees of freedom.
The results presented in this chapter are available in [63] and are inspired by the work
presented in [52, 51].
5.2 Linearly scaling additive Schwarz methods
5.2.1 Additive Schwarz methods
Overlapping Schwarz decompositions are a standard solution technique in continuous as
well as discontinuous Galerkin spectral-element methods [31, 88, 27, 117]. Instead of
solving the whole equation system, a Schwarz method determines a correction of the current
approximation by combining local results obtained from overlapping subdomains. Repeating
the process leads to convergence.
If the current approximation u is not the exact solution, it leaves a residual r:

r = F − H u .    (5.1)
The goal lies in lowering the residual below a certain tolerance. Towards this end,
corrections ∆u to u are sought with H∆u ≈ r. Gaining the exact solution in one step
requires H∆u = r and, hence, a full solve, which is global and expensive. On small subdomains,
however, solution is relatively cheap. Such local corrections ∆u_i are computed on multiple
overlapping subdomains Ω_i and afterwards combined into a global solution.
To attain the operator on subdomain Ω_i, the Helmholtz operator H is restricted to the
subdomain using the Boolean restriction R_i:

H_i = R_i H R_i^T    (5.2)
Figure 5.1: Block utilized for the star smoother. Left: subdomain consisting of 2^3 elements, middle: full system including Dirichlet nodes, right: collocation nodes corresponding to the star smoother setup.
and a solution to the residual computed:

H_i ∆u_i = R_i r .    (5.3)

These corrections are local to the subdomain and need to be combined into a global correction.
With an additive Schwarz method, all corrections are simultaneously computed,
weighted, and then added, resulting in a global correction ∆u satisfying

∆u = Σ_i R_i^T W_i H_i^{-1} R_i r ,  with  ∆u_i = H_i^{-1} R_i r .    (5.4)
In the above, W_i is the weight matrix on the respective subdomain. Multiple options exist
for the weighting: Traditional additive Schwarz methods employ the identity matrix as
weight, and a relaxation factor is required to ensure convergence [37]. For spectral-element
methods, using the inverse multiplicity instead, i.e. weighting each grid point with the inverse
of the number of subdomains it occurs in, removes this restriction [88], and further refinement with
distance-based weight functions yields convergence rates where only two or three iterations
are required with multigrid [123].
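The update (5.4) can be illustrated with a small self-contained sketch: a one-dimensional Laplace problem, two overlapping subdomains, local direct solves, and inverse-multiplicity weighting. All sizes and index ranges are illustrative choices, not values from the thesis.

```python
import numpy as np

n = 31
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian, Dirichlet BCs
subs = [np.arange(0, 20), np.arange(11, 31)]           # two overlapping subdomains

mult = np.zeros(n)                                     # multiplicity of each node
for s in subs:
    mult[s] += 1
W = 1.0 / mult                                         # inverse-multiplicity weights

def schwarz_correction(r):
    """Additive Schwarz correction, cf. (5.4): sum of weighted local solves."""
    du = np.zeros(n)
    for s in subs:
        du[s] += W[s] * np.linalg.solve(A[np.ix_(s, s)], r[s])
    return du

# Repeated application as a fixpoint (Richardson) iteration drives the residual down.
b = np.ones(n)
u = np.zeros(n)
for _ in range(400):
    u += schwarz_correction(b - A @ u)
```

Without a coarse grid the iteration count grows with the number of subdomains, which is exactly the robustness problem multigrid addresses later in this chapter.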
For Schwarz methods, the choice of subdomain dictates both the time to solution on the
subdomain and the convergence rate of the whole algorithm. While smaller subdomains
are preferable for the former, larger subdomains and, correspondingly, large overlaps are of
paramount importance for the latter [123]. Typically, element-centered subdomains are utilized,
which overlap into every neighboring element. However, with static condensation
the number of remaining degrees of freedom is large. In [52, 51] a vertex-based Schwarz
smoother was constructed with a 2^d element block as subdomain in R^d. Using static
condensation on this block leaves only the three planes interconnecting the elements remaining.
Compared to an element-centered subdomain with half an element overlap, fewer degrees of
freedom are present, rendering the method preferable; hence, it is used here. As it
resembles a star in the two-dimensional case, it is called star smoother.
In Figure 5.1 the 2^3 element block is depicted. With a residual-based formulation, the Dirichlet
problem is homogeneous, allowing the boundary nodes to be eschewed, thus resulting in n_S =
2p − 1 points per dimension. For the condensed system only the three planes connecting
the elements remain, each with n_S^2 degrees of freedom. The resulting star operator,
H_i, can be expressed with tensor-products when using the technique from Chapter 4, leading
to an operation count scaling with O(n_S^3). However, the operator consists of a primary
and a condensed part and, furthermore, is assembled over multiple elements. This intricate
operator structure renders solution hard, and a matrix-free, linearly scaling inverse is yet
unknown. While explicit matrix inversion scales linearly with the number of unknowns
in two dimensions [52], the three-dimensional version scales with O(n_S^4) [51]. This section
will develop a linearly scaling inverse for the three-dimensional case.
5.2.2 Embedding the condensed system into the full system
Execution of the Schwarz method in the condensed system requires solution on subdo-
mains Ωi as key component. However, even with tensor-product operators, finding a matrix-
free inverse has so far eluded the community, leaving only matrix inversion on the table.
The application scales with O(n_S^4) = O(p^4) and is, by definition, not matrix-free. This not
only increases the operation count super-linearly when going to high polynomial degrees, but
also results in overwhelming memory requirements. While the inverse for one star at p = 16
requires 63 MB, over 1 GB is utilized at p = 32, rendering the method infeasible for
inhomogeneous meshes. Both drawbacks, runtime and memory requirement, constitute a hard
limit on the polynomial degree used for the method and need to be circumvented. This
section lifts both restrictions by first deriving a matrix-free inverse and then linearizing the
operation count. The linear operation count directly extends to a linearly scaling smoother
which, in turn, allows for a linearly scaling multigrid cycle. To this end, the condensed
system is embedded into the full equation system. Then, a solution technique from the full
case is exploited to attain a matrix-free inverse, which is afterwards factorized.
To investigate embedding the star subdomains into their respective full 2 × 2 × 2 systems,
considering one block suffices, allowing the subscript i to be dropped in favor of readability.
The 2^3 element block is reordered into degrees of freedom remaining on the star,
u_S, and element-interior ones, u_I, leading to an operator structure similar to the condensed
case (4.6):

( H_SS  H_SI ) ( u_S )   ( F_S )
( H_IS  H_II ) ( u_I ) = ( F_I )    (5.5)

⇒  H̄ u_S = F_S − H_SI H_II^{-1} F_I ,  with  H̄ = H_SS − H_SI H_II^{-1} H_IS .    (5.6)
For the condensed system to be embedded in the full system, the right-hand side of the
condensed system needs to generate the same solution when used in the full system. Here,
the modified right-hand side

F̃ = ( F̃_S , F̃_I )^T = ( F_S − H_SI H_II^{-1} F_I , 0 )^T    (5.7)

is considered. It leads to a solution ũ with

( H_SS  H_SI ) ( ũ_S )   ( F̃_S )
( H_IS  H_II ) ( ũ_I ) = (  0  )    (5.8)

⇒  ( H_SS − H_SI H_II^{-1} H_IS ) ũ_S = F̃_S = F_S − H_SI H_II^{-1} F_I .    (5.9)
Due to H being positive definite, the system possesses a unique solution and ũ_S = u_S. However,
F and F̃ differ and, hence, the overall solution differs as well, generating a difference
in the interior points, i.e. ũ_I ≠ u_I. As the solution on the interior is not required for
the Schwarz method in the condensed system, the inverse full operator applied to F̃ generates the
desired solution. This allows solution methods from the full system to be carried over to the
condensed system.
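The identity ũ_S = u_S from (5.7)–(5.9) is easy to check numerically. The following sketch uses a random symmetric positive definite matrix partitioned into star and interior blocks; the block sizes are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nI = 5, 7                        # star and interior block sizes (illustrative)
n = nS + nI
B = rng.standard_normal((n, n))
H = B @ B.T + n * np.eye(n)          # symmetric positive definite full operator
HSS, HSI = H[:nS, :nS], H[:nS, nS:]
HIS, HII = H[nS:, :nS], H[nS:, nS:]
F = rng.standard_normal(n)
FS, FI = F[:nS], F[nS:]

# condensed system (5.6)
FS_mod = FS - HSI @ np.linalg.solve(HII, FI)
Hbar = HSS - HSI @ np.linalg.solve(HII, HIS)
uS = np.linalg.solve(Hbar, FS_mod)

# full solve with the modified right-hand side (5.7): interior part set to zero
u_tilde = np.linalg.solve(H, np.concatenate([FS_mod, np.zeros(nI)]))
# the star components coincide, while the interior components generally differ
```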
5.2.3 Tailoring fast diagonalization for static condensation
To solve the condensed system, a method for solving in the full system can be used. Hence,
investigating the operator on the full star block is required. The collocation nodes on the
full star Ωi exhibit a tensor-product structure, allowing for the operator Hi to be written in
a similar fashion as (2.44):
H_i = d_0 M_i ⊗ M_i ⊗ M_i + d_1 M_i ⊗ M_i ⊗ L_i + d_2 M_i ⊗ L_i ⊗ M_i + d_3 L_i ⊗ M_i ⊗ M_i .    (5.10)
Here, the matrices Mi and Li are the one-dimensional mass and stiffness matrices restricted
to the full star Ωi, which correspond to the assembly of the respective one-dimensional
matrices from the elements, as shown in Figure 5.1. In practice these will differ per direction
to allow for varying element widths inside one block. However, to improve readability, the
same stiffness and mass matrices are utilized in all three dimensions.
As M_i and L_i are symmetric positive definite, the fast diagonalization technique from
Subsection 2.3.4 is applicable. The inverse can be expressed with tensor products

H_i^{-1} = ( S_i ⊗ S_i ⊗ S_i ) D_i^{-1} ( S_i^T ⊗ S_i^T ⊗ S_i^T ) ,    (5.11)
where S_i is a non-orthogonal transformation matrix and D_i is the diagonal matrix
comprising the eigenvalues of the block operator. The tensor-product application of the above
requires 12 n_S^4 operations, which still scales super-linearly with the number of degrees of
freedom when increasing p. However, using the reduced right-hand side from (5.7) is sufficient
to attain the solution on the condensed star. The application of (5.11) consists of three steps:
mapping into the three-dimensional eigenspace, applying the inverse eigenvalues, and
mapping back. The right-hand side F̃_i is zero in element-interior regions, allowing factorization
of the operator, and, furthermore, the values in the interior are not sought, further increasing
the potential. The forward operation, computing F_E = ( S_i^T ⊗ S_i^T ⊗ S_i^T ) F̃_i, now only works
on three planes rather than the whole three-dimensional data. Hence, only two-dimensional
tensor products remain when expanding to the star eigenspace last, which require O(n_S^3)
operations. This leads to the forward operation scaling linearly. The application of the inverse
eigenvalues consists of one multiplication with a diagonal matrix and scales linearly as well.
Lastly, mapping back is only required for the faces of the star, not for interior degrees. When
not computing these interior degrees, the mapping from block eigenspace to the faces is the
transpose of the forward operation and uses O(n_S^3) operations as well. The combination of
all three operations yields an inverse that can be applied with linear complexity.
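A dense sketch of the fast diagonalization (5.11) for the separable operator (5.10): the generalized eigenproblem L s = λ M s is normalized so that S^T M S = I, after which the operator becomes diagonal in the tensor-product eigenbasis. Matrix sizes and coefficients below are illustrative, and a production version would replace the einsum calls with blocked matrix-matrix products and exploit the zero interior of the right-hand side.

```python
import numpy as np

def gen_eigh(L, M):
    """Generalized eigenproblem L s = lam M s with S^T M S = I (Cholesky route)."""
    R = np.linalg.cholesky(M)                      # M = R R^T
    Rinv = np.linalg.inv(R)
    lam, Q = np.linalg.eigh(Rinv @ L @ Rinv.T)
    return lam, Rinv.T @ Q

def fast_diag_solve(M, L, d, F):
    """Solve (d0 MxMxM + d1 MxMxL + d2 MxLxM + d3 LxMxM) u = F
    (x denoting the Kronecker product) via fast diagonalization;
    F has shape (n, n, n)."""
    lam, S = gen_eigh(L, M)
    one = np.ones(M.shape[0])
    # eigenvalues of the operator in the tensor-product eigenbasis
    D = (d[0] * np.einsum('i,j,k->ijk', one, one, one)
         + d[1] * np.einsum('i,j,k->ijk', one, one, lam)
         + d[2] * np.einsum('i,j,k->ijk', one, lam, one)
         + d[3] * np.einsum('i,j,k->ijk', lam, one, one))
    FE = np.einsum('ia,jb,kc,ijk->abc', S, S, S, F)   # (S^T x S^T x S^T) F
    return np.einsum('ia,jb,kc,abc->ijk', S, S, S, FE / D)
```

Applying the three transposed transformations, the diagonal scaling, and the three forward transformations reproduces the structure of (5.11); the factorized star version additionally exploits that the right-hand side vanishes in the element interiors.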
The algorithm applying the linearly scaling inverse is summarized in Algorithm 5.1, where for
clarity of presentation the star index i is dropped. The first step consists of extracting the
right-hand side for the subdomains and storing it on the three faces, which are perpendicular to
the x1, x2, and x3 directions and named F_1, F_2, and F_3, respectively. To retain the tensor-product
structure, the three faces are stored separately as matrices of extent n_S × n_S, with an
index range of I = −p . . . p, where index 0 corresponds to the 1D index of the face in the
full system. However, this leads to the edges and the center vertex being stored multiple
times. Hence, as a first action, the inverse multiplicity M^{-1} is applied to account for these
multiply stored values. Then, the two-dimensional transformation matrix S^T ⊗ S^T is used
on the three faces. Using permutations of I ⊗ I ⊗ S_{I0}^T, the results are mapped into the star
eigenspace, where the inverse eigenvalues are applied. Lastly, the reverse order of operations
maps back from the eigenspace to the star faces. In Algorithm 5.1, 18 one-dimensional
matrix products are utilized: 9 to map into the eigenspace, 9 to map back from it, plus one
further multiplication in the eigenspace. In total, 37 n_S^3 operations are utilized.
The additive Schwarz-type smoother resulting from using the above algorithm is shown
in Algorithm 5.2. First, the data is gathered for every subdomain. On these subdomains, the inverse is
applied via Algorithm 5.1, leading to local corrections ∆u_i, which are weighted and prolonged
to the global domain.

Algorithm 5.1: Inverse of the star operator using the right-hand side on the three faces of the star, F_1, F_2, and F_3. For clarity of presentation, the star index i is dropped.

    F_F ← M^{-1} F_F                        ▷ account for multiply occurring points
    F_E ← ( S^T ⊗ S^T ⊗ S_{I0}^T ) F_{F1}   ▷ contribution from face perpendicular to x1
        + ( S^T ⊗ S_{I0}^T ⊗ S^T ) F_{F2}   ▷ contribution from face perpendicular to x2
        + ( S_{I0}^T ⊗ S^T ⊗ S^T ) F_{F3}   ▷ contribution from face perpendicular to x3
    u_E ← D^{-1} F_E                        ▷ application of the inverse in the eigenspace
    u_{F1} ← ( S ⊗ S ⊗ S_{0I} ) u_E         ▷ map solution from eigenspace to x1 face
    u_{F2} ← ( S ⊗ S_{0I} ⊗ S ) u_E         ▷ map solution from eigenspace to x2 face
    u_{F3} ← ( S_{0I} ⊗ S ⊗ S ) u_E         ▷ map solution from eigenspace to x3 face

Algorithm 5.2: Schwarz-type smoother using star subdomains corresponding to the n_V vertices.

    function Smoother(r)
        for i = 1, n_V do
            F_i ← M_i^{-1} R_i r            ▷ extraction of data
            ∆u_i ← H_i^{-1} F_i             ▷ inverse on stars
        end for
        ∆u ← Σ_{i=1}^{n_V} R_i^T W_i ∆u_i   ▷ contributions from 8 vertices per element
        return ∆u
    end function

The weight matrix W_i is inferred by restricting the tensor-product
of one-dimensional diagonal weight matrices W = W_{1D} ⊗ W_{1D} ⊗ W_{1D} to the condensed
system. As in [123], diagonal matrices populated by smooth polynomials of degree p_W
are used. These are constructed to be one on the vertex of the star and zero on all other
vertices. For polynomial degrees larger than one, higher derivatives are set to zero at the
vertices, smoothing the transition, as shown in Figure 5.2. In preliminary studies, p_W = 7
produced the best results; hence, this value is utilized here as well.
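The construction can be mimicked with the classical degree-7 smoothstep polynomial, which is one candidate for p_W = 7; whether this is the exact polynomial of [123] is an assumption of this sketch. Its first three derivatives vanish at both ends of the transition, and shifted copies form a partition of one.

```python
import numpy as np

def smoothstep7(t):
    """Degree-7 smoothstep: s(0) = 0, s(1) = 1, and the first three
    derivatives vanish at t = 0 and t = 1, since s'(t) = 140 t^3 (1-t)^3."""
    t = np.clip(t, 0.0, 1.0)
    return t**4 * (35.0 - 84.0 * t + 70.0 * t**2 - 20.0 * t**3)

def weight(x):
    """1D weight for the vertex at x = 0 on the subdomain [-1, 1]:
    one at the own vertex, zero at the neighboring vertices."""
    return smoothstep7(1.0 - np.abs(x))
```

Since smoothstep7(t) + smoothstep7(1 − t) = 1, the weights of neighboring vertices sum to one everywhere, i.e. they form a partition of one.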
In Algorithm 5.2 the inverse is applied in the condensed system and then the results are
weighted. In nodal space, i.e. when using the condensed system, the weight matrix is diagonal
and cheap to apply. In modal space, however, the weight matrix is dense, limiting
the applicability of the transformed operator and, hence, of a faster residual evaluation. To
circumvent the increased operator cost, the tensor-product structure of transformation and
weight matrix can be exploited. As shown in Subsection 4.3.3, the transformation works
separately on faces, edges, and vertices, and therefore also separately on the planes of the
stars. This allows the transformation to be interchanged with the application of the multi-
plicity, and be merged with the forward application of mapping into the eigenspace. Vice
versa, the backward operator is merged with applying the weights in nodal space and the
tensor product of mapping to transformed space. As a result, the operator in Algorithm 5.1
Figure 5.2: One-dimensional subdomain corresponding to one center vertex at x = 0, ranging from x = −1 to x = 1. The weight functions corresponding to the left (dark grey), center (black), and right (light grey) vertex are shown. As they are one on their respective vertex and zero on the other vertices, the weight functions are a partition of one.
Figure 5.3: Implementation of boundary conditions for the one-dimensional case on the right domain boundary. Utilized data points are drawn in black, non-utilized ones in white. Left: used smoother block consisting of two elements of polynomial degree p = 4. Middle: a Neumann boundary condition decouples the right element. Right: homogeneous Dirichlet boundary condition decouples the middle vertex as well.
can be utilized for the transformed system when using separate sets of matrices for forward
and backward operation. Furthermore, these matrices can include the weighting, lowering
the amount of loads and stores and leading to performance gains in the smoother.
5.2.4 Implementation of boundary conditions
The additive Schwarz smoother outlined above is vertex-based and uses the 2^3 element
block surrounding a vertex as subdomain. This approach, however, does not work at the
boundaries. There, the subdomain corresponding to a vertex reduces in size, in the worst
case to one element, and the boundary condition implementation needs to account for this.
When implementing every possible combination of boundary conditions, 5^3 = 125 variants
are required. Here, an approach is presented which utilizes the same operator for interior as
well as boundary vertices, hence lowering the required implementation effort.
When parallelizing overlapping Schwarz methods with domain decomposition, data from
partitions on other processes is required. A typical implementation is adding a layer of ghost
elements to the partition, generating an overlap with the surrounding partitions. With ghost
elements, changing the matrices for the stars instead of the structure of the subdomains is
possible. As structured meshes are assumed, the subdomains are always a tensor-product of
one-dimensional domains, rendering investigation of the one-dimensional case sufficient.
5.2 Linearly scaling additive Schwarz methods 77
Figure 5.3 depicts a boundary vertex in combination with the boundary element and the
ghost element. For the periodic case, the ghost element is part of the domain and no change
in the operator is required. But when applying Neumann boundary conditions, the ghost
element is not part of the domain anymore, and the subdomain for the vertex reduces to
one element. To regain the operator size, the corresponding transformation matrices and
eigenvalues are padded with identity so that the same operator size as for the two-element
subdomain results. Furthermore, the right-hand side is set to zero outside of the domain, such
that the correction computes to zero. In addition to the treatment of the Neumann case, the
boundary vertex is decoupled for Dirichlet boundaries, the corresponding matrices padded,
and the right-hand side zeroed out. This leads to a correction of zero on the Dirichlet
point and, hence, retains the initial value for inhomogeneous boundary conditions.
The treatment expands from one dimension to multiple via the tensor-product structure of
the operator. Due to the transformation matrices working in their respective directions,
applying the treatment for the one-dimensional case in each direction is sufficient. Hence,
either no change is required as no boundary condition is present, or only the matrices of
one dimension, two dimensions, or three dimensions are changed according to the method
outlined above. As the transformation matrices are padded with identity, decoupled parts
only map onto themselves, computing a correction of zero.
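The identity-padding trick can be sketched in one dimension: the transformation matrix and eigenvalues of the reduced operator are embedded into the full operator size with identity and ones, and the right-hand side is zeroed outside the domain, so decoupled points receive a correction of exactly zero. As a simplification, the sketch uses an orthogonal eigendecomposition of a standard symmetric operator instead of the generalized one; all sizes are illustrative.

```python
import numpy as np

def padded_inverse(S, lam, active, n):
    """Return a function applying the subdomain inverse with identity padding.
    S, lam: eigenvectors/eigenvalues of the reduced operator on the `active`
    indices; all other indices are padded and receive a zero correction."""
    Spad = np.eye(n)
    Spad[np.ix_(active, active)] = S
    lam_pad = np.ones(n)                     # pad eigenvalues with ones
    lam_pad[active] = lam
    padded = np.setdiff1d(np.arange(n), active)

    def apply(F):
        F = F.copy()
        F[padded] = 0.0                      # zero right-hand side outside the domain
        return Spad @ ((Spad.T @ F) / lam_pad)
    return apply
```

Because the padded rows and columns of the transformation are identity and the padded right-hand side is zero, the decoupled degrees of freedom map onto themselves and receive no correction, while the active block reproduces the exact reduced inverse.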
5.2.5 Extension to element-centered block smoothers
In the above, a vertex-based smoother was considered due to the lower number of degrees
of freedom remaining on the star in the condensed case. However, many algorithms in the
literature utilize element-centered smoothers [88, 123]. This section derives a linearly scaling,
matrix-free inverse for element-centered subdomains.
Figure 5.4 shows an element-centered subdomain overlapping into the neighboring elements
in every direction. Here, an overlap of one whole element per side is assumed. As with the
star block, the collocation points exhibit a tensor-product structure and the operator can be
written in tensor-product form (2.44), with the matrices being replaced by those restricted
to the subdomain. The condensed block system can be embedded into the full block with
the same arguments used for the star block in Subsection 5.2.3. Hence, as in the former case,
fast diagonalization on the full 3^3 element block can be restricted to gain a solution method
in the condensed system. As the right-hand side is still zero in the element interiors and six
faces are present, the number of operations is 73 n_S^3, where n_S = 3p − 1. Compared to fast
diagonalization in the full system, with 12 n_S^4 operations, the new algorithm is more efficient
starting from p = 4.
Figure 5.4: Block utilized for the element-centered smoothers. Left: full system including Dirichlet nodes. Right: collocation nodes corresponding to the condensed setup.
5.3 Multigrid solver
5.3.1 Multigrid algorithm
For finite volume and finite difference methods, h-multigrid is one of the most prominent
solution techniques. The element width h is changed on each level to exploit faster con-
vergence on these [14, 12, 47]. With spectral-element methods, changing the polynomial
degree instead is an option, leading to so-called p-multigrid, a standard building block for
higher-order methods [114, 88, 104]. Levels L to 0 are introduced, with their polynomial
degrees being defined as
∀ 0 ≤ l < L :  p_l = p_0 · 2^l ,    (5.12a)
p_L = p ,    (5.12b)

where p_0 is the polynomial degree utilized on the coarse mesh.
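A small helper illustrating (5.12); how the level count is chosen when p is not a power-of-two multiple of p_0 is an assumption of this sketch:

```python
def level_degrees(p: int, p0: int = 2) -> list:
    """Polynomial degrees from coarsest to finest level: p_l = p0 * 2**l
    below the finest level, and p_L = p on the finest one, cf. (5.12)."""
    degrees = []
    l = 0
    while p0 * 2**l < p:
        degrees.append(p0 * 2**l)
        l += 1
    degrees.append(p)          # finest level uses the target degree
    return degrees
```

For example, p = 16 yields the level degrees 2, 4, 8, 16.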
A multigrid method requires four main components. The first one is the operator for residual
evaluation. Then, a smoother is needed to smooth out high-frequency residual components, which do
not converge as fast on coarser levels. A grid transfer operator restricts the residual to the
coarser grid and prolongs it back, and, lastly, a solver solves on the coarsest grid. Here, the
condensed system is utilized, with the operator derived in Chapter 4 as residual evaluation
technique. The grid transfer operator from level l − 1 to level l, J_l, is implemented with the
embedded interpolation, restricted from the tensor-product of one-dimensional operators to
the condensed system, and the restriction from level l to l − 1, J_{l−1}^T, is its transpose.
Algorithm 5.3: Multigrid V-cycle for the condensed system using ν_pre pre- and ν_post post-smoothing steps.

    function MultigridCycle(u, F)
        u_L ← u
        F_L ← F
        for l = L, 1, −1 do
            if l ≠ L then
                u_l ← 0
            end if
            for i = 1, ν_pre,l do
                u_l ← u_l + Smoother(F_l − H_l u_l)   ▷ presmoothing
            end for
            F_{l−1} ← J_{l−1}^T ( F_l − H_l u_l )     ▷ restriction of residual
        end for
        Solve(H_0 u_0 = F_0)                          ▷ coarse grid solve
        for l = 1, L do
            u_l ← u_l + J_l u_{l−1}                   ▷ prolongation of correction
            for i = 1, ν_post,l do
                u_l ← u_l + Smoother(F_l − H_l u_l)   ▷ postsmoothing
            end for
        end for
        return u ← u_L
    end function
Lastly, a conjugate gradient method for the condensed system at p0 = 2 serves as coarse grid
solver.
The components suffice to construct a V-cycle, as shown in Algorithm 5.3, with a varying
number of pre- and post-smoothing steps ν_pre,l and ν_post,l on level l. Two variants are
considered for these. The first utilizes one pre- and one post-smoothing step, leading to a
traditional V-cycle, whereas the second doubles both with each level, i.e. ν_pre,l = ν_post,l
= 2^(L−l). The increasing number of smoothing steps can stabilize the method
for stretched and non-uniform meshes [124].
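The recursive structure of Algorithm 5.3 can be made concrete with a toy example. The sketch below is an illustrative stand-in, not the thesis implementation: it uses a damped-Jacobi smoother and a 1D Poisson hierarchy in place of the Schwarz smoother and the condensed SEM operators, but follows the same pre-smooth / restrict / coarse-solve / prolong / post-smooth pattern.

```python
import numpy as np

def v_cycle(A_levels, P_levels, F, u, l, nu_pre, nu_post, omega=2.0 / 3.0):
    """Recursive V-cycle with the structure of Algorithm 5.3 (toy version:
    damped Jacobi instead of the Schwarz smoother of the thesis)."""
    A = A_levels[l]
    if l == 0:
        return np.linalg.solve(A, F)                  # coarse grid solve
    D = np.diag(A)
    for _ in range(nu_pre[l]):                        # presmoothing
        u = u + omega * (F - A @ u) / D
    P = P_levels[l]                                   # prolongation l-1 -> l
    Fc = P.T @ (F - A @ u)                            # restriction of residual
    uc = v_cycle(A_levels, P_levels, Fc,
                 np.zeros(P.shape[1]), l - 1, nu_pre, nu_post, omega)
    u = u + P @ uc                                    # prolongation of correction
    for _ in range(nu_post[l]):                       # postsmoothing
        u = u + omega * (F - A @ u) / D
    return u

# toy hierarchy: 1D Poisson with 7 fine and 3 coarse interior points
def poisson(n):
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

n, nc = 7, 3
P = np.zeros((n, nc))                                 # linear interpolation
for j in range(nc):
    P[2 * j + 1, j] = 1.0
    P[2 * j, j] += 0.5
    P[2 * j + 2, j] += 0.5
A_fine = poisson(n)
A_coarse = P.T @ A_fine @ P                           # Galerkin coarse operator

F = np.ones(n)
u = v_cycle([A_coarse, A_fine], [None, P], F, np.zeros(n),
            l=1, nu_pre=[0, 1], nu_post=[0, 1])
# one cycle already reduces the residual substantially
assert np.linalg.norm(F - A_fine @ u) < 0.5 * np.linalg.norm(F)
```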
A simple multigrid method constructed from the V-cycle is shown in Algorithm 5.4. To
switch from the full equation system to the condensed one, the initial guess is restricted
and the right-hand side condensed. Then, V-cycles are performed until convergence is
reached, after which the interior degrees of freedom are recomputed. As the performance
of multigrid can deteriorate on anisotropic meshes, a second algorithm implements Krylov
acceleration [100, 136], which uses the V-cycle as a preconditioner instead of an iterator.
Traditionally, preconditioned conjugate gradient (pCG) methods are utilized to this end.
However, the weighting in the smoother does not lead to a symmetric preconditioner, and
standard pCG is not guaranteed to converge [54]. The inexact preconditioner CG method (ipCG) extends
Algorithm 5.4: Multigrid algorithm for the condensed system.

function MultigridSolver(u = (u_B, u_I)^T, F = (F_B, F_I)^T)
    u ← u_B                                  ▷ restrict to element boundaries
    F ← F_B − H_BI H_II^(−1) F_I             ▷ restrict and condense inner element RHS
    while √(r^T r) > ε do
        u ← MultigridCycle(u, F)             ▷ fixpoint iteration with multigrid cycle
    end while
    u ← (u, H_II^(−1) (F_I − H_IB u))^T      ▷ regain interior degrees of freedom
    return u
end function
the pCG method towards non-symmetric preconditioners [41] and was employed for multigrid
before [122]. Algorithm 5.5 shows the Krylov-accelerated multigrid for the condensed
system.
5.3.2 Complexity of the resulting algorithms
The solvers presented in the last section have three phases. In the first one, the initial
guess and right-hand side are condensed, using W_pre operations. Then, the solution process
takes place with W_sol operations. The third phase computes the interior degrees of freedom
with W_post operations. Pre- and post-processing utilize three-dimensional tensor products
to map to the eigenspace of the element interior and back, leading to W_pre + W_post =
O(p^4 n_e). The solution process, on the other hand, consists of applying the multigrid
cycle. On each level, smoother and operator are the main contributors to the operation
count, with both scaling with O(p_l^3). On the finest level this results in O(p^3). Each
coarsening step halves the polynomial degree, so that with a constant number of smoothing
steps the cost per level drops by a factor of (1/2)^3 = 1/8. A geometric series results,
limiting the operation count of the cycle to the operation count on the fine level times
a factor of 8/7. The second variant uses ν_pre,l = ν_post,l = 2^(L−l). Here, the effort
per level drops by 2 · (1/2)^3 = 1/4, resulting in a factor of 4/3. In both cases, the
amount of work for the branches of the V-cycle is a constant factor times the work on the
fine grid. The last portion of the solution time stems from the coarse grid solver. Its
cost scales with O(p_0^3 n_e^α) = O(n_e^α), where α depends on the solution method. In
the present case with a CG solver, α is 4/3, but α = 1 is possible with appropriate
low-order multigrid solvers [14]. Hence, in practice the cost per cycle is O(p^3 n_e),
i.e. linear in the number of degrees of freedom, and the required work for solving the
Helmholtz equation computes to
W_total = W_pre + N_cycle W_cycle + W_post (5.13)
⇒ W_total = O(p^4 n_e) + N_cycle O(p^3 n_e) . (5.14)
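The geometric-series argument above can be checked numerically. The small helper below sums the per-level work relative to one sweep on the finest level; it is an illustrative sketch of the reasoning, not thesis code.

```python
def cycle_work_factor(n_levels, smoothing_growth=1.0):
    """Work of one V-cycle branch relative to a single sweep on the finest
    level. Each coarsening halves p, so per-level cost drops by
    (1/2)^3 = 1/8; doubling the smoothing steps per level
    (smoothing_growth = 2) changes the drop to 2/8 = 1/4."""
    return sum(smoothing_growth ** l * (1.0 / 8.0) ** l
               for l in range(n_levels))

# constant smoothing converges to 8/7, doubled smoothing to 4/3
assert abs(cycle_work_factor(40) - 8.0 / 7.0) < 1e-9
assert abs(cycle_work_factor(40, smoothing_growth=2.0) - 4.0 / 3.0) < 1e-9
```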
Algorithm 5.5: Krylov-accelerated multigrid algorithm for the condensed system.

function ipCGMultigridSolver(u = (u_B, u_I)^T, F = (F_B, F_I)^T)
    u ← u_B                                  ▷ restrict to element boundaries
    F ← F_B − H_BI H_II^(−1) F_I             ▷ restrict and condense inner element RHS
    r ← F − H u                              ▷ initial residual
    s ← r                                    ▷ ensures β = 0 on first iteration
    p ← 0                                    ▷ initialization
    δ = 1                                    ▷ initialization
    while √(r^T r) > ε do
        z ← MultigridCycle(0, r)             ▷ preconditioner
        γ ← z^T r
        γ_0 ← z^T s
        β = (γ − γ_0)/δ
        δ = γ
        p ← β p + z                          ▷ update search vector
        q ← H p                              ▷ compute effect of p
        α = γ/(q^T p)                        ▷ compute step width
        s ← r                                ▷ save old residual
        u ← u + α p                          ▷ update solution
        r ← r − α q                          ▷ update residual
    end while
    u ← (u, H_II^(−1) (F_I − H_IB u))^T      ▷ regain interior degrees of freedom
    return u
end function
The main contribution to the work stems from smoothing on the fine grid, not from pre-
or post-processing. When assuming just one V-cycle, the respective work on the fine grid
is 2 · 37 n_S^3 ≈ 2 · 37 (2p)^3 = 592 p^3, whereas pre- and post-processing require
12p^4 + 25p^3 each. Hence, for only one V-cycle, the cost of the smoother dominates the
runtime until p = 48. Thus, for all practical purposes, the multigrid algorithm scales
linearly with the degrees of freedom, when increasing the number of elements as well as
when increasing the polynomial degree.
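The crossover degree follows directly from the two cost models. The snippet below treats the operation counts from the text as hypothetical cost functions and searches for the first degree at which pre-/post-processing outweighs the fine-grid smoothing.

```python
# cost models taken from the operation counts quoted in the text
def smoother_fine(p):
    return 592 * p ** 3                  # 2 * 37 * (2p)^3, fine-level smoothing

def pre_or_post(p):
    return 12 * p ** 4 + 25 * p ** 3     # condensation / recovery, each

p = 2
while smoother_fine(p) > pre_or_post(p):
    p += 1
# p == 48: the first degree at which pre-/post-processing dominates
assert p == 48
```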
5.4 Runtime tests
5.4.1 Runtimes for the star inverse
To evaluate the efficiency of the star inverse, runtime tests were conducted with three
implementations. The first is a fast diagonalization variant working on the full 2^3
element block. As it was implemented with tensor products and works in the full system,
it is called “TPF”. The second is a matrix-matrix multiplication in the condensed system,
applying a precomputed inverse for the star operator, called “MMC”. Lastly, the new variant
implementing Algorithm 5.1 was considered. As it computes in the condensed system with
tensor products, it is named “TPC”. All three variants were implemented in Fortran
using double precision. As the data size is n_S = 2p − 1 and thus odd by definition,
operators and data were padded by one to attain an even operator width, allowing for
better code optimization. One call to DGEMM from BLAS implemented variant “MMC”, whereas
the tensor-product variants were realized with loops. The outermost loop corresponded to
a subdomain, leading to inherent cache blocking, whereas the innermost non-reduction loop
was treated with the Intel-specific single-instruction multiple-data (SIMD) compiler
directive !dir$ simd, enforcing vectorization of the loops. Furthermore, the variant “TPC”
was refined with the techniques from Chapter 3, using parametrization, unroll and jam, and
blocking for matrix accesses.
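The tensor-product evaluation underlying “TPF” and “TPC” can be illustrated compactly. The sketch below is illustrative only (NumPy, not the Fortran kernels of the thesis): it applies a one-dimensional operator A along each direction of a data block, which costs O(n^4) per block instead of the O(n^6) of an assembled matrix, and checks the result against the Kronecker product A ⊗ A ⊗ A.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                    # 1D operator width (n_S padded to even)
A = rng.standard_normal((n, n))          # generic 1D operator factor
u = rng.standard_normal((n, n, n))       # data on one subdomain block

# sum-factorization: apply A along each of the three directions in turn
v = np.einsum('ai,ijk->ajk', A, u)
v = np.einsum('bj,ajk->abk', A, v)
v = np.einsum('ck,abk->abc', A, v)

# reference: the assembled Kronecker product applied as one big matrix
K = np.kron(np.kron(A, A), A)
v_ref = (K @ u.reshape(-1)).reshape(n, n, n)
assert np.allclose(v, v_ref)
```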
The tests were performed on one node of the high-performance computer Taurus at ZIH
Dresden, which consisted of two Xeon E5-2590 v3 processors with twelve cores each, running
at 2.5 GHz. For testing purposes, only one core was utilized, allowing for a maximum
floating point performance of 40 GFLOP/s [48]. Furthermore, the code was compiled with the
Intel Fortran Compiler v. 2018, with the corresponding Intel Math Kernel Library (MKL)
serving as BLAS implementation for “MMC”.
As test case, the inversion on 500 star subdomains was considered, corresponding to 500
vertices being present on the processor. Each subdomain was set to Ω_i = (0, π) × (0, π/4) × (0, π/3), on which the exact solution

u_ex(x) = sin(µ_1 x_1) sin(µ_2 x_2) sin(µ_3 x_3) , (5.15)

with parameters µ_1 = 3/2, µ_2 = 10, and µ_3 = 9/2 is employed. From the exact solution
on the collocation points, the Helmholtz residual with λ = π served as right-hand side.
The three operators were applied, the runtime was measured, and the maximum error compared
to (5.15) was computed on the collocation nodes. The polynomial degree was varied from p = 3
to p = 32, and each application was repeated 101 times, with only the last 100 repetitions
being measured via MPI_Wtime, excluding instantiation effects from the measurement.
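The measurement protocol (discard warm-up calls, average the rest) can be sketched generically. The helper below is an assumption-laden Python analogue of the Fortran/MPI_Wtime harness, not the actual test code.

```python
import time

def time_operator(apply_op, data, n_warm=1, n_meas=100):
    """Average runtime of an operator application, discarding warm-up
    calls, mirroring the 101-call / last-100-measured protocol."""
    for _ in range(n_warm):
        apply_op(data)                       # instantiation / cache warm-up
    t0 = time.perf_counter()
    for _ in range(n_meas):
        apply_op(data)
    return (time.perf_counter() - t0) / n_meas

# usage with a placeholder operator
t = time_operator(lambda d: sum(d), list(range(1000)))
```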
Figure 5.5 shows the runtime, runtime per degree of freedom, rate of floating point
operations, and maximum error for the three operators. The errors are below 10^−12 and
differ little between the variants, but a slow increase from 10^−15 to 10^−13 occurs.
This is an artifact of the test: first, the input to the operators is calculated using
the Helmholtz operator, then the inverse is applied. Bounding the Euclidean norm of this
concatenation leads to the definition of the condition number. Hence, the increased error
stems from an increase in the condition of the system, and the variants achieving the same
error validates that they are correctly implemented.
Figure 5.5: Results for the application of the inverse star operator. Top left: operator runtimes when using the fast diagonalization in the full system (TPF), applying the inverse via a matrix-matrix product in the condensed system (MMC), and using the inverse via tensor-product factorization (TPC). Top right: runtimes per equivalent number of degrees of freedom (DOF), computed as p^3 per block. Bottom left: rate of floating point operations, measured in GFLOP/s. Bottom right: maximum error of the solution compared to (5.15).
The runtimes of “TPF” as well as “MMC” exhibit slope four, whereas “TPC” achieves a
slope of three, i.e. linear scaling with the number of degrees of freedom. This translates
to a constant runtime per degree of freedom for “TPC”, and increasing ones for “TPF”
and “MMC”. For every tested polynomial degree, the matrix-matrix based variant is faster
than the tensor-product variant in the full system. Furthermore, it is the fastest
implementation for p < 5. However, starting from p = 6, “TPC” becomes faster, reaching a
factor of three at p = 8 and a factor of 20 at p = 32. As to be expected, the matrix-matrix
based implementation nearly attains peak performance with 35 GFLOP/s, whereas the
loop-based variants range between 5 and 20 GFLOP/s. However, where “MMC” has a constant
performance, the performance of the loop-based variants heavily depends upon the operator
size. For even p the operator widths are a multiple of four and, hence, a multiple of the
SIMD width. There, double the performance of odd p is achieved, an artifact of compiler
optimization, where the treatment of remainder loops and smaller SIMD operations is
detrimental to the performance.
5.4.2 Solver runtimes for homogeneous meshes
To verify that the new multigrid solvers scale linearly with the number of degrees of
freedom, and that they can be more efficient than the previously developed solvers, the
tests from Section 2.4 were repeated using the multigrid solvers. Again, the domain is set
to Ω = (0, 2π)^3 with inhomogeneous Dirichlet boundary conditions and a Helmholtz
parameter λ = 0, leading to the harder to solve Poisson equation.
Four solvers are considered for testing. The baseline solver is dtCG, presented in
Section 4.4, a conjugate gradient (CG) solver using diagonal preconditioning in the
transformed system. The second is a multigrid solver implementing Algorithm 5.4, called
tMG, using one pre- and one post-smoothing step; the third a Krylov-accelerated version
thereof based on Algorithm 5.5, called ktMG; and, lastly, a Krylov-accelerated version
with a varying number of smoothing steps, called ktvMG. All multigrid solvers utilize the
residual evaluation technique in the transformed system from Section 4.3.
For n_e = 8^3, the polynomial degree was varied between p = 2 and p = 32. To preclude
measurement of instantiation effects, the solvers were run 11 times, and for the last 10
runs the number of iterations required to reduce the residual by a factor of 10^−10,
called n_10, as well as the runtime were measured. From the number of iterations, the
initial residual ‖r_0‖, and the reached residual ‖r_n10‖, the convergence rate

ρ = (‖r_n10‖ / ‖r_0‖)^(1/n_10) (5.16)
is computed, which reflects the residual reduction achieved per iteration.
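Definition (5.16) translates directly into code; the helper below is an illustrative sketch of that formula.

```python
def convergence_rate(r0_norm, rn_norm, n_iter):
    """Mean residual reduction per iteration,
    rho = (||r_n10|| / ||r_0||) ** (1 / n_10), following (5.16)."""
    return (rn_norm / r0_norm) ** (1.0 / n_iter)

# reducing the residual by ten orders of magnitude in 10 iterations
# corresponds to rho = 0.1
assert abs(convergence_rate(1.0, 1e-10, 10) - 0.1) < 1e-12
# three iterations for the same reduction give rho near 5e-4, matching
# the rates near 1e-4..1e-3 reported for the multigrid solvers
assert 1e-4 < convergence_rate(1.0, 1e-10, 3) < 1e-3
```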
Figure 5.6 shows the results of the tests. For dtCG the number of iterations increases
with the polynomial degree, which is countered by more efficient operators, leading to a
near-constant runtime per unknown of about 1 µs for p > 8. For the multigrid solvers,
however, the number of iterations is mostly constant, starting at four iterations and
decreasing with the polynomial degree. Every time a new multigrid level is introduced,
e.g. at p = 5 and p = 17, the convergence rate decreases, leading to faster convergence
and, possibly, fewer required iterations. The solvers tMG and ktMG attain convergence
rates near 10^−4 for polynomial degrees lower than 16, and introducing varying smoothing
significantly lowers the convergence rate to 10^−5, and later below 10^−6. This leads to
the two former solvers using three iterations for p < 17 and ktvMG using only two, with
ktMG also using only two iterations for p > 17. The attained convergence rate matches the
one for solvers with similar overlap, e.g. [123].
Figure 5.6: Results for homogeneous meshes of n_e = 8^3 elements when varying the polynomial degree p. Top left: runtime of the solvers. Top right: number of iterations required to reduce the residual by 10 orders of magnitude. Bottom left: runtimes per degree of freedom (DOF). Bottom right: convergence rates of the solvers.
When comparing the runtime, all multigrid solvers are slower than the CG-based solver
until p = 10. This is a result of the rather large overhead inherent to the multigrid
algorithm as well as the low number of elements used in the test, favoring the CG solver.
From p = 10 on, the multigrid solvers become more efficient and are faster than dtCG,
with p = 17 being an exception, as it introduces a new level and, hence, creates further
overhead. Every one of the multigrid solvers is capable of breaching the 1 µs mark per
unknown. For mid-range polynomial degrees, the Krylov-accelerated solver with varying
smoothing uses only two iterations, requiring 0.6 µs per unknown at p = 16. Afterwards,
the plain Krylov-accelerated version uses two iterations as well and incurs less overhead,
allowing it to be slightly faster and attaining 0.5 µs per unknown at p = 32.
As verification that the solvers attain a constant number of iterations when varying the
number of elements, the tests were repeated at p = 16 with the number of elements per
direction k being varied from k = 4 to k = 28. Figure 5.7 shows the resulting convergence
Figure 5.7: Convergence rates and runtimes per degree of freedom for the four solvers for homogeneous meshes of n_e = k^3 elements of polynomial degree p = 16.
rate and runtime per degree of freedom. As expected, the solver dtCG has an increasing
runtime per degree of freedom with slope n_e^(1/3), stemming from an increase in the number
of iterations. The multigrid solvers, on the other hand, exhibit a slight increase in the
convergence rate, i.e. they converge more slowly, due to the boundary conditions having a
decreased impact on the domain. However, they reach a constant convergence rate from k = 8
onwards. While the convergence rate stagnates, the runtime per degree of freedom does not:
it decreases for the multigrid solvers. This is an artifact of the vertex-based smoother:
while the number of elements is k^3, the number of vertices computes to (k + 1)^3. Hence,
increasing k lowers the ratio of vertices to elements, and the smoother becomes more
efficient when increasing the number of elements. The increase in runtime at k = 28,
however, is due to occupying the whole RAM of one socket and using non-uniform memory
access afterwards, with lower bandwidth and, hence, a larger overall runtime.
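The shrinking share of vertex work can be quantified with a one-liner; this is an illustrative sketch of the (k + 1)^3 / k^3 argument above, not thesis code.

```python
def vertex_element_ratio(k):
    """Vertices per element, (k + 1)**3 / k**3, for a partition of k^3
    elements; the vertex-based smoother's relative cost follows this ratio."""
    return (k + 1) ** 3 / k ** 3

# the ratio decreases monotonically towards 1 as partitions grow
assert abs(vertex_element_ratio(4) - 125 / 64) < 1e-12
assert vertex_element_ratio(4) > vertex_element_ratio(8) > vertex_element_ratio(28)
```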
5.4.3 Solver runtimes for anisotropic meshes
In the above section, only homogeneous meshes were considered. In simulations, however,
the resolution often needs to be adapted to fully resolve the solution at reasonable
computational cost, e.g. to capture high gradients in the boundary layer near a wall.
This leads to anisotropic or even stretched meshes with considerably higher condition
numbers and, in turn, decreased performance of the solvers. To investigate the influence
of high aspect ratios, the test case from [124] was combined with the test from
Section 4.4. For a given aspect ratio AR, the domain is set to

Ω = (0, 2π · AR) × (0, π ⌈AR/2⌉) × (0, 2π) , (5.17)
Figure 5.8: Number of iterations and runtimes per degree of freedom for the four solvers for anisotropic meshes of n_e = 8^3 elements of polynomial degree p = 16.
allowing the use of homogeneous meshes consisting of anisotropic brick-shaped elements.
Here, n_e = 8^3 elements of polynomial degree p = 16 discretized the domain, and the
aspect ratio was varied from AR = 1 to AR = 48.
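Expression (5.17) is easy to evaluate; the sketch below (illustrative, not thesis code) returns the domain extents for a given AR and checks that, with equal element counts per direction, the ratio of x_1 to x_3 element widths equals AR.

```python
import math

def domain_extents(AR):
    """Extents of the domain (5.17) for a given nominal aspect ratio AR."""
    return (2 * math.pi * AR, math.pi * math.ceil(AR / 2), 2 * math.pi)

# AR = 1 recovers a (2 pi) x (pi) x (2 pi) box
assert domain_extents(1) == (2 * math.pi, math.pi, 2 * math.pi)
# with 8 elements per direction, x1/x3 element widths scale exactly with AR
assert abs(domain_extents(48)[0] / domain_extents(48)[2] - 48) < 1e-12
```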
In Figure 5.8, the number of iterations and runtimes per degree of freedom of the solvers
are depicted. The multigrid solvers show a large increase in the number of iterations,
leading to a higher runtime in turn: for tMG the number of iterations increases from three
to sixty. The introduction of Krylov acceleration mitigates the effect and stabilizes the
number of iterations until AR = 4. Additionally varying the number of smoothing steps
stabilizes the iteration count until AR = 8, but the deterioration is still present. While
the solvers are very capable for homogeneous meshes, their efficiency rapidly deteriorates
for high aspect ratios. None of the multigrid solvers attains the high robustness of
the locally-preconditioned solver dtCG, which only takes twice as long for an aspect ratio
of AR = 48. With the Schwarz smoothers, only a higher spatial overlap guarantees efficient
solution [124], requiring an increase in the amount of overlap between elements, which
is not considered here.
5.4.4 Solver runtimes for stretched meshes
Anisotropic meshes already allow lowering the number of degrees of freedom in one or two
directions. However, even in simple geometries, such as a plane channel flow, local mesh
refinement can significantly lower the number of degrees of freedom. Typically, stretched
meshes are utilized to refine near walls while keeping the element width in the center
constant. The test case from Section 4.4.4 allows studying the robustness against varying
aspect ratios in one grid and is, hence, utilized here.
Table 5.1: Number of iterations required for reducing the residual by 10 orders of magnitude for the stretched grids using a constant expansion factor α.

                              p
  α      Solver      4      8     16     32
  1      dtCG       71     87    108    129
  1      tMG         5      3      3      3
  1      ktMG        4      3      3      2
  1      ktvMG       4      3      2      2
  1.5    dtCG       98    117    126    144
  1.5    tMG        21     11      7      5
  1.5    ktMG       11      8      6      4
  1.5    ktvMG      11      8      5      3
  2      dtCG      105    133    158    180
  2      tMG        36     26     18     12
  2      ktMG       15     13     10      8
  2      ktvMG      15     13     10      8
The domain Ω = (0, 2π)^3 is considered and discretized with three grids consisting of
8^3 elements. These are constructed using a constant expansion factor α ∈ {1, 1.5, 2} in
all three directions and are shown in Figure 4.6. Setting α = 1 results in a homogeneous
mesh, as utilized in the previous sections, and serves as baseline. Using α = 1.5 stretches
the grid in all three directions, leading to small cells near one end of the domain and
very large cells on the opposite end; as a result, the maximum aspect ratio is AR_max = 17.
The third grid is constructed with α = 2, further magnifying the effect and leading to
AR_max = 128. The latter two grids are populated by a large variety of elements, typical
of large-scale simulations in computational fluid dynamics, with their shapes ranging from
boxfish, over plaice, to eels. Most high-order solvers are not capable of computing on such
meshes at all: as the operators differ in every element, matrix-free methods are required
for the preconditioner as well as the operator. Furthermore, the performance of most
solvers degrades when increasing the aspect ratios in the grid.
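The quoted maximum aspect ratios follow from the geometric expansion of the element widths. The sketch below (illustrative, assuming widths w_i = w_0 · α^i over the 8 elements per direction) reproduces AR_max = 17 and AR_max = 128.

```python
def max_width_ratio(alpha, k=8):
    """Ratio of the largest to the smallest element width for k elements
    per direction with a constant expansion factor alpha: w_i = w_0 * alpha**i."""
    return alpha ** (k - 1)

# alpha = 2 gives 2**7 = 128; alpha = 1.5 gives 1.5**7, about 17
assert max_width_ratio(2) == 128
assert round(max_width_ratio(1.5)) == 17
assert max_width_ratio(1) == 1          # homogeneous baseline
```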
The numbers of iterations required to lower the residual by ten orders of magnitude are
presented in Table 5.1. The most robust solver is the locally preconditioned dtCG, for
which the number of iterations increases only by 50 % when raising the expansion factor
to 2. The multigrid algorithms do not fare as well. For tMG, the number of iterations
increases seven-fold for p = 4 and four-fold for p = 32. Using Krylov acceleration
significantly lowers the factor, with only half the number of iterations being required
at p = 4 and a third less at p = 32. The intended decrease in condition from introducing
varying smoothing has a limited effect at low polynomial degrees for α = 1 and α = 1.5,
and none at α = 2. While the convergence rate gets lowered, the effect is not large enough
to decrease the number of iterations, so that increasing the number of smoothing steps is,
again, not beneficial. This indicates, as in the previous section, that the geometrical
overlap, not the number of smoothing steps, is the main ingredient for convergence.
5.4.5 Parallelization
In the previous sections, every test was performed on one core of one node. However,
real-life simulations run on hundreds, if not thousands, of cores. The decomposition of
the domain and efficient sharing of relevant information are of paramount importance in
the parallel case, lest the parallelization diminish the performance.
The presented solver utilizes a vertex-based smoother, which requires data from the
surrounding elements. The amount of work and, hence, the runtime scales with the number
of vertices per partition, which in turn scales with the number of elements per direction k
as n_v = (k + 1)^3. Moreover, data from neighboring processes is required. In finite
difference and finite volume methods, ghost cells facilitate the sharing at the partition
boundaries: on each process, the domain is padded by one layer of grid cells, which contain
the data of other processes. As a result, small partitions of the domain are not beneficial
for the parallel efficiency, as the amount of work as well as the required amount of
communication increases. For finite difference or finite volume methods, the difference is
negligible, as the number of data points is large compared to the number of added cells.
With the SEM, however, it is not: at k = 4 and p = 16, still 64 degrees of freedom are
present per direction, but there are three times more vertices than elements in a
subdomain and, thus, also a factor of three more work.
To attain a good parallel efficiency, a simple domain decomposition approach does not
suffice. A hybrid parallelization is required to retain large partition sizes on multi-core
systems [49, 70]. The coarse level implements domain decomposition via MPI, whereas OpenMP
facilitates the second layer, which exploits data parallelism.
To verify the efficiency of the OpenMP parallelization, the tests from Section 5.4.2 were
repeated on one node using two processes. The domain was Ω = (0, 2 · 2π) × (0, 2π)^2, with
the number of elements varying from n_e = 2 · 8 × 8 × 8 to n_e = 2 · 16 × 16 × 16 at a
polynomial degree of p = 16. This setup allows for isotropic elements when decomposing in
the x_1-direction, ensuring a comparable convergence rate. In addition to the number of
elements, the number of threads and, hence, the number of cores per process was varied
from 1 to the number of cores per processor, 12, and the parallel speedup over using only
one core as well as the parallel efficiency were measured.
Table 5.2: Parallel speedup and efficiency for the four solvers defined in Section 5.4.2 when increasing the number of threads per process and varying the number of elements per direction k.
                       Speedup                 Efficiency [%]
                 Number of threads           Number of threads
  k     Solver      4      8     12             4      8     12
  8     dtCG     3.51   5.73   6.13            88     72     51
  8     tMG      3.48   5.80   6.48            87     73     54
  8     ktMG     3.49   6.06   7.06            87     76     59
  8     ktvMG    3.54   5.75   7.04            89     72     59
  12    dtCG     3.57   5.69   6.35            89     71     53
  12    tMG      3.62   6.31   7.95            91     79     66
  12    ktMG     3.61   6.28   7.82            90     78     65
  12    ktvMG    3.59   6.21   7.74            90     78     65
  16    dtCG     3.63   5.02   5.42            91     63     45
  16    tMG      3.66   6.24   7.68            91     78     64
  16    ktMG     3.61   6.07   7.46            90     76     62
  16    ktvMG    3.60   6.10   7.43            90     76     62
Table 5.2 lists the measured speedups and efficiencies. At k = 8, the parallel efficiency
for four threads is at 80 %, but quickly deteriorates to only 40 % at twelve cores for
dtCG and 50 % for the multigrid solvers. Slight differences are present: the
Krylov-accelerated versions have a higher efficiency. At k = 12, the efficiency is higher
still, with 90 % when using four threads and 65 % for twelve threads for the multigrid
solvers, and 44 % for dtCG. When increasing the number of elements per direction in a
subdomain to k = 16, no further gains in parallel efficiency are present. This indicates
that the low efficiency stems neither from latency nor from communication.
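The efficiency column of Table 5.2 is simply the measured speedup relative to ideal scaling; the one-line helper below makes that relation explicit (illustrative sketch, not thesis code).

```python
def parallel_efficiency(speedup, n_cores):
    """Parallel efficiency in percent: measured speedup over ideal speedup."""
    return 100.0 * speedup / n_cores

# dtCG at k = 8: speedup 3.51 on 4 threads is roughly the 88 % of Table 5.2
assert abs(parallel_efficiency(3.51, 4) - 87.75) < 1e-6
# and speedup 6.13 on 12 threads gives about 51 %
assert abs(parallel_efficiency(6.13, 12) - 51.0833) < 1e-3
```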
The efficiency of the on-node parallelization is acceptable, but not extraordinary.
However, this mainly stems from the architecture: for the utilized node configuration,
the L3-cache bandwidth does not scale linearly with the number of utilized cores, but
saturates at ten times the bandwidth of a single core [48]. Moreover, the memory bandwidth
for twelve cores is only five times that of a single core, limiting the performance of
operators and scalar products in both conjugate gradient and multigrid methods.
The previous test measured the speedup on one node by increasing the number of cores. It
showed that utilizing 12 elements per direction for each partition yields an efficient
setup. To investigate the scaling beyond one node, the scaleup of the code is measured,
using 12^3 elements of polynomial degree p = 16 per partition and scaling the domain size
Table 5.3: Parallel efficiencies for the three solvers defined in Section 5.4.2 when increasing the number of processes with a constant number of elements per direction k = 12 on each process and 12 threads per process. Here, n_N refers to the number of compute nodes, n_P to the number of processes, and n_C to the number of cores employed in the simulation. The parallel efficiency is computed based on the run using one whole node, i.e. n_N = 1, n_P = 2, and n_C = 24.
                     With coarse grid solver      Without coarse grid solver
  nN    nP     nC    dtCG    tMG    ktMG          dtCG    tMG    ktMG
  1     1      12    1.40    1.05   1.07          1.40    1.02   1.04
  1     2      24    1.00    1.00   1.00          1.00    1.00   1.00
  2     4      48    0.86    0.87   0.90          0.86    0.96   0.97
  4     8      96    0.78    0.74   0.82          0.78    0.91   0.92
  8     16     192   0.63    0.63   0.75          0.63    0.87   0.90
  16    32     384   0.50    0.49   0.65          0.50    0.83   0.86
  32    64     768   0.48    0.42   0.55          0.48    0.79   0.80
accordingly. The number of nodes is varied from 1 to 32, with twelve threads per process.
As the domain shape varies, the convergence rate varies as well. This leads to ktvMG
exhibiting a convergence rate slightly over or under 10^−5 and, hence, either two or three
iterations being used. It is therefore excluded from this test.
Table 5.3 lists the parallel efficiency computed from the scaleup over using one full node,
i.e. the run using n_N = 1, n_P = 2, and n_C = 24. The numbers are presented twice, once
based on the total runtime, and once excluding the coarse grid solver. When increasing the
number of utilized nodes from one to four, the parallel efficiency drops from one to 90 %
and 87 % for tMG and ktMG, respectively. The deterioration continues, to only 63 % and
55 % efficiency at 32 nodes. When excluding the runtime of the coarse grid solver, the
efficiency increases significantly, to 80 %. This shows that two separate effects are
present: first, the coarse grid solver is not efficient in the parallel case; second, the
remaining solver components scale, but not to multiple thousands of cores. The former
stems from using a CG solver, leading to an increase in the number of iterations. In
addition, the coarse grid incorporates far fewer degrees of freedom, and the optimal
distribution among the nodes is not necessarily the node count used for the fine grid.
This can be remedied, for instance, by redistributing the grid on coarser levels to a
lower number of processes. The decline in the efficiency when excluding the coarse grid
solver can be attributed to the blocking nature of the implementation of the communication:
most of the communication stems from boundary data exchange, not scalar products.
Therefore, the communication pattern can be improved by overlapping communication and
computation, hiding latency and transfer time. When implemented correctly, this allows
for ideal speedups even when using one element per core [56]. However, the implementation
effort is not negligible. While the parallel scalability achieved here is not stellar, it
is sufficient as a proof of concept and can be improved with the methods outlined above.
5.5 Summary
This chapter investigated p-multigrid for the statically condensed Helmholtz equation in
order to attain a constant iteration count for the solvers presented in Chapter 4.
Schwarz-type smoothers were considered due to their excellent convergence properties.
However, the main component of the smoother, the inverse of the 2 × 2 × 2 element block,
scales with O(p^4) and is not matrix-free in the condensed case. To generate a matrix-free
solution technique, the statically condensed system was embedded into the full one,
allowing the application of fast diagonalization as solution technique. Subsequently, the
operator was factorized to a linear complexity, i.e. to scale with O(p^3). The linearly
scaling inverse allowed for a linearly scaling Schwarz-type smoother, which, in turn, led
to a completely linearly scaling multigrid cycle.
Multigrid solvers building on the linearly scaling multigrid cycle were proposed and their
efficiency evaluated using the test case from Chapter 2. With homogeneous meshes, at most
three iterations are required to reduce the residual by ten orders of magnitude, matching
solvers with a similar overlap [123], and for p ≥ 16 the number of iterations is constant,
independent of the number of elements. Combined with the linearly scaling smoother and
residual evaluation, this results in a high efficiency when computing at high polynomial
degrees: over wide ranges of polynomial degrees, the solver requires less than one
microsecond per degree of freedom, using only 0.6 µs at p = 16 and 0.5 µs at p = 32. This
throughput is a factor of three higher than in the Schwarz method which stimulated this
research [51] and a factor of four higher than in highly optimized multigrid methods,
such as [82], when comparing the method proposed in this chapter at p = 32 to the
throughput from [82] at p = 8. Hence, the multigrid method developed in this chapter
extends the applicability of high-order methods from p = 8 to p = 32 without any loss
in performance.
After investigating the performance for homogeneous meshes, inhomogeneous meshes were
considered. While the solvers perform very well in the homogeneous case, the performance
degrades for inhomogeneous ones. Krylov acceleration mitigates the deterioration, but
still a factor of four in the number of iterations occurs when increasing the maximum
aspect ratio in the grid from AR_max = 1 to AR_max = 128, leading to a worse performance
than attained with the solvers from Chapter 4. A larger overlap is required to maintain
the efficiency of Schwarz methods for high aspect ratios, and iterative substructuring
promises a different avenue to attain robustness against the aspect ratio [27, 106].
To investigate the parallel scalability, runtime measurements were conducted, on one node as well as across
several nodes. On one node, memory bandwidth limits the scalability of the algorithm;
nevertheless, a parallel efficiency of 65 % is achieved when increasing the number of threads
per process to 12. When parallelizing over multiple nodes, acceptable efficiencies are attained
until 32 nodes and, hence, 768 cores are utilized. The coarse grid solver limits the efficiency,
as no redistribution of the domain is performed on coarser levels.
Here, only the continuous spectral-element method was considered. For flow simulations,
the discontinuous Galerkin method is often preferred for its inherent diffusivity and, therefore,
stabilization. In order to raise the applicability of the solver, future work will expand
the multigrid solver towards the discontinuous Galerkin method, exploiting the similar
structure provided by the hybridizable discontinuous Galerkin method [76, 138]. Furthermore,
efficiency gains are to be expected from fusing operators and using cache-blocking to
decrease the impact of the limited memory bandwidth.
Chapter 6
Computing on wildly heterogeneous systems
6.1 Introduction
The last chapters developed algorithmic improvements for high-order methods using the homogeneous
hardware found in workstations and many HPC systems. However, many of the latter already
incorporate accelerators [94] and future HPC systems will be even more heterogeneous [26]. To
future-proof the developed algorithms, this chapter investigates orchestration of the heterogeneous
system most widely deployed in current high-performance computers: one node comprising
CPUs and GPUs [94].
Many programming concepts can be applied to program heterogeneous systems. Decomposing
the application into many small subtasks, which can then be served to the compute
units, leads to task-based parallelism. It is, for instance, implemented in StarPU [4] and
allows harnessing large portions of the hardware without load balancing concerns. Using a
decomposition based on the operators and applying these on the hardware most suited for
the task leads to intra-operator parallelism and can yield significant performance gains for
databases [73]. However, CFD applications fit neither of these programming models, as
data parallelism, not task parallelism, is present.
In CFD, domain decomposition is the main method to attain parallelism. Inside the resulting
partitions, data parallelism can still be exploited, e.g. via the shared-memory capabilities of
OpenMP [70]. Similarly, GPU applications facilitate a two-level parallelization, combining
domain decomposition with data parallelization inside a domain. This chapter will expand
the concept to address multiple types of hardware using one single source code.
Programming the heterogeneous system is only one side of the problem. After attaining computation,
the issue of load balancing remains. The compute units need to be orchestrated to
compute in concert, not in disarray. While a perfect load balancing can harness the full performance,
a bad one can lower the performance below the one attained with only parts of the
hardware. Here, only static load balancing is considered. It allows for insights into the static
state for dynamic load balancing as well as the same performance after all information is
reviewed. Proportional load balancing models allow for simple hardware models parametrized
with heuristics [137] or by auto-tuning [25], whereas functional performance models aim at
decomposing the runtime into contributors and even allow for incorporation of the communication
time [30]. The above models are applicable to programs ranging from small matrix
products [140] over small CFD codes [59] to whole CFD solvers [86]. In the above references,
the runtime of the application was treated as a black box, averaging over all iterations or
time steps. While this simplified model can work well for some applications, it can prove
too simple in others and degrade the performance obtained in high-performance computations.
Hence, after proposing a programming model, load balancing between different
compute units is investigated using the solvers from Chapter 2 as examples.
This chapter summarizes the work presented in [59] and [58] and is structured as follows:
First, Section 6.2 discusses the model problem and develops a programming model for
heterogeneous hardware. Then, Section 6.3 investigates load balancing to harvest the full
potential of the hardware.
6.2 Programming wildly heterogeneous systems
6.2.1 Model problem
To investigate programming and the prospective performance gains unlocked through computing
on wildly heterogeneous systems, the heart of an incompressible flow solver is examined:
the Helmholtz solver. It occupies the largest portion of the runtime, solving the
pressure and diffusion equations occurring in the time stepping scheme. Here, the solvers
from Chapter 2 and Chapter 3 are revisited to allow for an easier analysis of the load balancing.
The solver bfCG from Section 2.4 implements a preconditioned conjugate gradient (pCG)
method [118], with the preconditioner using the fast-diagonalization technique to apply
the exact inverse in the interior of the elements. Algorithm 6.1 summarizes the solution
method. Each iteration starts with a search vector p that is H-orthogonal to the space spanned
by previous search vectors. The first step computes the effect of p, called q, by applying
the element-wise Helmholtz operator (2.44) and assembling the result over the elements.
The optimal step width α is computed, requiring a scalar product, and afterwards the solution
as well as the residual are updated. Then, the preconditioner applies an approximate inverse to
the residual, and the next search vector results via orthogonalization. The communication
Algorithm 6.1: Preconditioned conjugate gradient method for an element-wise formulated
spectral-element method solving the Helmholtz equation Hu = F. Here, M⁻¹ denotes the
inverse multiplicity of the data points.

 1: for all Ωe : re ← Fe − He ue           ▷ local initial residual
 2: r ← QQᵀ r                              ▷ scatter-gather operation
 3: for all Ωe : ze ← Pe⁻¹ re              ▷ preconditioner application
 4: for all Ωe : pe ← ze                   ▷ set initial search vector
 5: ρ ← Σe zeᵀ Me⁻¹ re                     ▷ scalar product
 6: ε ← Σe reᵀ Me⁻¹ re                     ▷ scalar product
 7: while ε > εtarget do
 8:     for all Ωe : qe ← He pe            ▷ local Helmholtz operator
 9:     q ← QQᵀ q                          ▷ scatter-gather operation
10:     α ← ρ / (Σe qeᵀ Me⁻¹ pe)           ▷ scalar product
11:     for all Ωe : ue ← ue + α pe        ▷ update solution vector
12:     for all Ωe : re ← re − α qe        ▷ update residual vector
13:     ε ← Σe reᵀ Me⁻¹ re                 ▷ scalar product
14:     for all Ωe : ze ← Pe⁻¹ re          ▷ preconditioner application
15:     ρ0 ← ρ
16:     ρ ← Σe zeᵀ Me⁻¹ re                 ▷ scalar product
17:     for all Ωe : pe ← pe ρ/ρ0 + ze     ▷ update search vector
18: end while
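The control flow of Algorithm 6.1 can be condensed into a short sketch. The following pure-Python code is an illustration, not the thesis implementation: a Jacobi diagonal stands in for the block inverse Pe⁻¹, and the element loops, gather-scatter operation, and inverse multiplicity M⁻¹ collapse into plain vector operations on one small assembled system.

```python
# Minimal pCG sketch mirroring Algorithm 6.1 for a small SPD system H u = F.
# A diagonal (Jacobi) preconditioner stands in for the fast-diagonalization
# block inverse; gather-scatter and element loops are collapsed.

def pcg(H, F, tol=1e-10, max_iter=200):
    n = len(F)
    matvec = lambda v: [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))     # scalar product (a reduction)
    precond = lambda v: [v[i] / H[i][i] for i in range(n)]  # stand-in for Pe^-1

    u = [0.0] * n
    r = list(F)                      # initial residual for u = 0
    z = precond(r)                   # preconditioner application
    p = list(z)                      # initial search vector
    rho = dot(z, r)
    eps = dot(r, r)
    it = 0
    while eps > tol and it < max_iter:
        q = matvec(p)                # local Helmholtz operator + assembly
        alpha = rho / dot(q, p)      # optimal step width
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        eps = dot(r, r)
        z = precond(r)
        rho, rho0 = dot(z, r), rho
        p = [pi * rho / rho0 + zi for pi, zi in zip(p, z)]  # orthogonalization
        it += 1
    return u, it
```

For H = [[4, 1], [1, 3]] and F = [1, 2], the sketch converges to (1/11, 7/11) within two iterations, as expected for conjugate gradients on a 2 × 2 system.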
structure of the solver resembles that of a typical linear solver: Operator evaluation is
followed by boundary data exchange and control variables, such as the current norm of the
residual, are computed via reductions over all elements and, hence, processes. Therefore,
the performance gains obtained for bfCG can be seen as representative for a larger class of
solvers.
6.2.2 Two-level parallelization of the model problem
Domain decomposition is a standard technique in computational fluid dynamics, where locality
inherent to the physical problem transfers to locality of the operators. The locality of the
operator, in turn, allows for decomposition of the domain into non-overlapping partitions
and working on these in parallel. With a spectral-element method, the domain is already
decomposed into elements. At this stage, domain decomposition amounts to defining partitions
of the set of elements and decomposing the operations onto these so that they can
run in parallel. In Algorithm 6.1 most operations are element-wise operations, such as the
local Helmholtz operator in line 8, updates of solution and residual in lines 11 and 12, or
the application of the preconditioner in line 14. While these operations remain unchanged,
the gather-scatter operation in line 9 and the scalar products (lines 10, 13 and 16) require
communication, as either boundary data exchange or reduction operations occur. The added
communication between partitions constitutes the key change.
Typically, MPI serves as message-passing layer [128], as it is widely employed and allows for
fine-tuning of communication patterns [49]. For instance, latency hiding can be implemented,
overlapping communication and computation as done in [56], leading to near-linear scaling up
to hundreds of thousands of cores. As similar patterns often occur, libraries such as PETSc
and OP2 implement domain decomposition layers that can be directly utilized [2, 110]. Furthermore,
partitioned global address space languages, such as Coarray Fortran and Unified
Parallel C, allow for message passing with constructs inherent to the language [111, 20]. As
language constructs, instead of library calls, implement the message passing, the compiler
can optimize the communication structure. While traditionally utilized for exploitation of
shared-memory capabilities, OpenMP can attain speedups similar to those of MPI-based
domain decomposition layers [10]. Of these methods, MPI was chosen as implementation of
the message-passing layer, as availability and compiler support were superior to the other
variants.
Domain decomposition implements a rather coarse level of parallelization, which furthermore
incurs the need for boundary data exchange. When increasing the number of utilized cores,
the domains get ever smaller and the ratio of communication to computation increases,
leading to parallel inefficiencies. All operations in Algorithm 6.1 work on every element in
the partition, albeit some with communication involved, opening up further possibilities for
acceleration. To lower the amount of communication, larger partitions are required. Hence,
parallelism inside the partitions is exploited, using the shared-memory capabilities of multi-core
components or many-core devices, leading to a second, fine-grained, hardware-specific
parallelization.
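The deteriorating ratio can be quantified with a simple estimate, an illustration rather than a measurement from the thesis: a cubic partition of m³ elements performs work proportional to m³ but exchanges boundary data over at most 6m² element faces, so the communication-to-computation ratio grows as 6/m when partitions shrink.

```python
def comm_to_comp(m):
    """Ratio of boundary faces to elements for a cubic partition of m^3 elements."""
    faces = 6 * m ** 2   # at most six sides of the partition cube
    work = m ** 3        # element-wise operations scale with the volume
    return faces / work  # = 6 / m

# halving the partition edge length doubles the relative communication cost
print([comm_to_comp(m) for m in (16, 8, 4)])
```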
One possibility to gain acceleration inside a subdomain lies in the exploitation of multi-core
hardware, for which OpenMP provides a thread-based parallelization [102]. The programmer
states how the work should be shared between the threads using compiler directives, also
referred to as pragmas. For Algorithm 6.1 this results in encapsulation of each occurring
element-wise operation in pragmas which designate the iterations to be shared among the
executing threads. To minimize initialization overhead, the threads are generated once at
the start of the program and kept alive through the whole computation, i.e. every solver call
uses the same threads. To prevent non-uniform memory accesses (NUMA), such as reading
data from the memory of a different socket, the data is initialized using the “first touch
principle”, enforcing that data lies in the memory of the core computing on it [49]. The
approach described above results in a hybrid parallelization consisting of a coarse-grained
MPI layer handling data distribution and communication, and a fine-grained OpenMP layer
exploiting the shared-memory capabilities of the hardware as, e.g., done in [70].
Similarly to multi-core CPUs, GPUs can be harnessed to exploit data parallelism. While traditionally
programmed via OpenCL or CUDA, the pragma-based language extension OpenACC
allows for a programming style similar to OpenMP [101]. The extension was based
on OpenMP and therefore shares a lot of the syntax as well as the process of code
Left (implementation for single cores):

do e = 1, n_e
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      M_u(i,j,k) = M_123(i,j,k) * u(i,j,k,e)
      r(i,j,k,e) = M_u(i,j,k) * d(0,e)
   end do; end do; end do

   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,i) * M_u(l,j,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(1,e)
   end do; end do; end do

   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,j) * M_u(i,l,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(2,e)
   end do; end do; end do

   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,k) * M_u(i,j,l)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(3,e)
   end do; end do; end do
end do

Right (augmented with OpenMP and OpenACC directives):

!$omp do
!$acc parallel present(u, r) &
!$acc&   num_workers(1) async(1)
!$acc loop gang worker private(M_u)
do e = 1, n_e
   !$acc cache(M_u)

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      M_u(i,j,k) = M_123(i,j,k) * u(i,j,k,e)
      r(i,j,k,e) = M_u(i,j,k) * d(0,e)
   end do; end do; end do

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      !$acc loop seq
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,i) * M_u(l,j,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(1,e)
   end do; end do; end do

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      !$acc loop seq
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,j) * M_u(i,l,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(2,e)
   end do; end do; end do

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      !$acc loop seq
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,k) * M_u(i,j,l)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(3,e)
   end do; end do; end do
end do
!$acc end parallel
!$omp end do
Figure 6.1: Fortran implementation of the element Helmholtz operator with the factorization
from (3.7). In the code, n_e is the number of elements, N_P the one-dimensional
operator size np, constant during compilation, M_123 the diagonal of the three-dimensional
mass matrix, L_tilde_T the transpose of the matrix in (3.9), d the metric
factors, u the input vector, and r the output vector. Left: Implementation for single
cores. Right: Single-core implementation augmented with directives for OpenMP
and OpenACC.
extension: Directives encapsulate the operations, designating them to be offloaded to the
accelerator. However, where CPUs typically incorporate tens of cores, GPUs are built with
thousands of them. Hence, exploitation of parallelism inside the elements is required. This
can be facilitated by encapsulating loops inside the elements in pragmas. Contrary to
OpenMP, where initialization of threads and false sharing of data leads to performance
degradation, the low-bandwidth interconnect to the GPU mandates minimization of memory
transfers [44], lest the performance ends up being limited by the transfer bandwidth and not
the compute performance. Hence, the data is created once at the start of the algorithm, and kept
on the GPU for the whole duration, only updating boundary data. Finally, asynchronous
execution of GPU kernels further enhances the performance.
Figure 6.1 shows a loop-based implementation of the element Helmholtz operator in the
factorized form (3.9). The unaugmented variant consists of a loop over all elements. First,
the precomputed three-dimensional mass matrix weights the input vector and the diagonal
contribution from the Helmholtz parameter is saved. Then, the stiffness matrix is applied
in all three coordinate directions, with one loop nest each. Overall, 27 lines of code are
used for the implementation. For OpenMP, adding one “!$omp do” statement before and
an “!$omp end do” statement behind the loop suffices for acceleration. As the threads are
created outside of the operator, every variable created inside is located on the stack of the
respective thread and is therefore local to the thread. No statements for privatization are
needed. For OpenACC, further code is required. First, a “!$acc parallel” region instantiates
a kernel and states whether the data already resides on the GPU or needs to be copied.
Then, “!$acc loop” enclosing the element loop instantiates a GPU kernel, designates it for
the “gang” and “worker” execution units, and states that the temporary M_u shall not be
shared among these. Every interior loop nest is collapsed to allow for parallelism for the
“vector” compute unit. Lastly, the “!$acc loop seq” directive prevents the compiler
from parallelizing the reduction loops. In total, 14 lines of code are required to accelerate
the kernel.
Both OpenMP and OpenACC implement data parallelism using compiler directives.
These are comments and, hence, the program can be compiled without them. This, in turn,
allows handling both types of fine-level parallelization pragmas inside one code, compiling
only for the hardware at hand. Furthermore, with both variants inside one single source code,
code maintenance is easier than when supporting two completely separate implementations
of the same problem. Where the implementation of key operators, as described in Chapter 3,
might differ, the remainder of the code stays the same, with slight additions (“glue code”)
for data handling and thread instantiation.
The matching MPI communication patterns generate a second benefit: The multiple-program
multiple-data capabilities of MPI allow coupling different versions of the program, each
compiled for different hardware. For instance, an OpenMP and an OpenACC version of
the program can be used in a single simulation, generating a heterogeneous parallel program
working on different types of hardware. In this way, a heterogeneous runtime environment
utilizing all available processing units can be generated directly from a single source without
any further work, eventually enabling a better speedup of the application. Moreover, other
compute units, for instance FPGAs, can be incorporated in a similar fashion, if pragma-based
language extensions allow programming them.
6.2.3 Performance gains for homogeneous systems
To gauge the performance of the parallelization, the solver bfCG was enhanced to be able to
compute on nodes comprising multi-core CPUs and GPUs: Each operator was augmented
Table 6.1: Speedup over using a single core when computing on either multiple cores with OpenMP
or GK210 GPU chips on nS sockets. The corresponding runtimes per iteration
are located in Table A.1 in the appendix.

                      p = 7               p = 11              p = 15
nS   Setup       ne = 8³  12³  16³   ne = 8³  12³  16³   ne = 8³  12³  16³
1 4 cores 3.8 3.9 3.4 3.9 3.8 3.8 3.9 3.8 3.8
1 8 cores 7.3 7.2 6.0 5.4 6.9 6.8 7.0 6.9 6.9
1 12 cores 10.1 9.9 8.9 10.3 9.3 9.2 9.5 9.1 9.3
1 1 GK210 13.0 20.3 24.7 17.9 22.4 24.1 16.7 18.7 19.5
1 2 GK210 15.9 31.8 42.2 28.4 39.0 44.1 29.6 34.5 36.6
2 4 cores 7.2 8.0 7.8 8.0 7.8 7.7 7.9 7.6 7.6
2 8 cores 13.2 15.1 14.3 15.4 14.3 13.8 14.6 13.7 11.9
2 12 cores 17.3 21.3 17.3 21.5 19.5 18.6 19.7 18.5 18.0
2 1 GK210 14.8 31.2 41.9 28.2 39.2 45.1 29.6 34.6 37.2
2 2 GK210 15.9 43.9 66.5 44.6 64.5 81.0 51.2 62.8 70.5
with OpenACC and OpenMP directives, and conditional compilation leads to a process
either computing on one GPU via OpenACC, or on a set of CPU cores addressed with
OpenMP. To couple the processes, domain decomposition using static, block-structured
grids was implemented, with the inter-process communication realized with MPI.
Runtime tests were conducted for the homogeneous case of using either one CPU or one
GPU. As in previous chapters, one node of the HPC system Taurus served as measurement
platform. The node utilized in this chapter incorporates two NVIDIA K80 cards, consisting
of two Kepler GK210 GPU chips each, in addition to the two Intel Xeon E5-2680 processors.
To obtain programs for multi- and many-cores, the PGI compiler v. 17.7 compiled the program,
once with OpenMP and once with OpenACC. The runtime and performance data
presented here differ from those presented in Section 3.4 as a different compiler is utilized.
The periodic domain Ω = (−1, 1)³ was discretized with ne = 8³, ne = 12³, and ne = 16³ elements,
with bfCG solving the Helmholtz equation. The process was repeated ten times, first
computing on one socket with 1, 4, 8, or 12 cores or 1 or 2 GPUs, and then on two sockets
to evaluate the MPI implementation. As runtime libraries, such as the implementation of
OpenACC, can autotune, the solver was called 50 times, with the time per iteration of the
last 20 solver calls being averaged, leading to a reproducible runtime.
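The measurement protocol can be sketched as follows. This is a minimal illustration whose function and parameter names are invented, not taken from the thesis code: the first calls are discarded so that autotuning runtime libraries have settled before the average is formed.

```python
import time

def time_per_call(solver, n_calls=50, n_avg=20):
    """Call the solver n_calls times and average the wall time of the
    last n_avg calls, discarding the warm-up phase."""
    times = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        solver()                     # one full solver invocation
        times.append(time.perf_counter() - t0)
    return sum(times[-n_avg:]) / n_avg
```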
Table 6.1 compiles the speedups over one core, with the corresponding runtimes residing
in Table A.1 in the appendix. For p = 7, an increase of the core count to four yields a near-linear
speedup, which slightly degrades at ne = 16³. This is to be expected: the data
set for the CG solver fits into the L3 cache, whose bandwidth scales near linearly with
the number of utilized threads [48]. The RAM bandwidth does, however, not scale linearly
with the number of utilized cores, leading to lower speedups at larger data sizes, where the
performance degradation becomes more apparent with higher thread counts. For 8 cores,
the speedup is lowered from 7 to 6 and for 12 cores from 10 to 9. That the same amount of
performance is lost stems from the CPU layout [48]: The first eight cores are located on a ring
interconnect with its own RAM interface, as are the last four. However, the latter are only
added when switching from 8 to 12 threads, leading to near double the performance in
some cases.
For p = 7, computing on one GK210 chip generates a slightly better performance than one
whole CPU at ne = 8³, with the speedup increasing with larger data sizes to 24, i.e. the
compute power of two whole CPUs. One part of this higher speedup stems from the CPU
not computing in the L3 cache, but rather loading data from RAM, leading to a larger
runtime for the CPU. However, adding a second GPU chip does not significantly increase
the performance at ne = 8³, only raising the speedup from 13 to 16, but pays off at ne = 16³,
where the 24 get complemented by a further 18. While for the two GPUs the memory bandwidth
scales linearly, the overhead of computing on GPUs is significant, leading to an offset in the
runtime.
When increasing the polynomial degree, the speedup obtained with OpenMP stays constant.
For the GPU it does not, as the larger data sizes allow circumventing the offset in the runtime.
This becomes even more pronounced when using the two available GK210 chips, with the
speedup at ne = 8³ increasing from 16 at p = 7 to 30 at p = 15.
When computing on both sockets, a near-linear increase in speedup is present for the CPU
runs. With the utilized architecture, the memory bandwidth and compute power double,
allowing for the linear increase while incurring very slight inefficiencies due to communication
of boundary data. The GPU computations, however, show little performance gain for p = 7.
Using one GK210 chip per socket nets the performance of two on one socket, and increasing
to four chips only gains 50 % more performance for high element counts, but near none at
low numbers of elements. This, again, indicates that large offsets in the runtime of the GPU
are present and large data sizes are required for the implementation to become efficient.
While not generating a linear speedup, the implementation still lies inside the expected
ranges, both for CPUs as well as for GPUs. Furthermore, the adopted concept allows
addressing different kinds of hardware with the same code base, attaining speedups in the
expected range. However, when the GPUs compute, the CPUs stay underutilized and vice
versa. This will be addressed in the next section.
Figure 6.2: Socket layout and utilized mapping of CPU and GPU resources to MPI, OpenMP,
and OpenACC. Left: Two MPI processes, one steering 10 cores using OpenMP (light
grey), with the first core steering the first GK210 chip (grey). Right: Whole CPU
gets utilized by adding a further OpenACC process (dark grey).
6.3 Load balancing model for wildly heterogeneous systems
6.3.1 Single-step load balancing for heterogeneous systems
To attain effective load balancing of the computation, runtime models and, therefore, runtime
data are required. As static decompositions are sought here, the information is required prior
to heterogeneous computation. Hence, runtime tests were conducted for the homogeneous
case of using either one CPU or one GPU. Figure 6.2 depicts the setup of one socket, as
described in [48]. As the two GPU chips per GPU card are addressed separately, one process
per chip is utilized, allocating two of the twelve available cores to harness the GPUs. The
remaining ten cores were fused into one process via OpenMP.
Runtime tests were conducted using either ten consecutive cores or one GK210 chip. For a
polynomial degree of p = 7, the number of elements varied between ne = 100 and ne = 1000,
with the solver being called 50 times and the last 20 resulting runtimes averaged. Block-
structured grids restrict the possible mesh decompositions, leading to a low granularity
available for load balancing. Hence, a two-dimensional problem was considered with only
one element discretizing the third direction to increase the granularity available for load
balancing. The employed grids and their decompositions reside in Table A.2 in the appendix,
alongside the measured runtimes.
Figure 6.3: Computation using either one GPU or ten cores of a CPU. Left: Times per iteration,
t_Iter, for the homogeneous setup. Right: Speedups over one GK210 chip.
Figure 6.3 depicts the runtimes in combination with the speedups. As the goal lies in adding
the compute power of the CPU to that of the GPU, the speedup is computed based on the
performance of one GK210 chip. For the CPU, the iteration time is 0.5 ms at ne = 100 and,
except for some small spikes at ne = 500 and ne = 800, increases linearly to 4 ms. The GPU
chip is slower at first, using twice the time per iteration at ne = 100,
but attains equality at ne = 300 and is near twice as fast for large numbers of elements.
For both CPU and GPU processes, the iteration times t_Iter behave mostly linearly with regard
to ne and can be approximated using a slope and an offset, similar to the one-dimensional
case discussed in [59]. For compute unit m, the runtime is approximated with constants
C_0^(m) and C_1^(m) such that

    t_Iter^(m) = C_0^(m) + C_1^(m) n_e^(m).                                (6.1)
The approximation holds as long as the compute unit stays in the same state, e.g. loads data
from the same cache level. Furthermore, side effects such as memory contention are not
accounted for.
To gain a decomposition of the mesh, the load balancing model from [30, 59, 140] is applied.
The respective iteration times of the processes are approximated using (6.1), with the
maximum iteration time of both units dominating the iteration time as

    t_Iter = max_m ( t_Iter^(m) ).                                         (6.2)

Minimization yields the optimal runtime

    t_Iter* = max_m ( t_Iter^(m) ) → min   with   n_e = Σ_m n_e^(m).       (6.3)
Figure 6.4: Iteration times and speedups for computations using one GPU and ten cores.
Left: Times per iteration, t_Iter, for the heterogeneous setup. Right: Speedups over
one GK210 chip.
The above results in both units having approximately the same time for one iteration.
For load balancing, the runtime data from Figure 6.3 was approximated with (6.1) and
a decomposition in the x1 direction computed via (6.3). As the utilized solver requires
block-structured grids, only certain decompositions are possible and the realized element
distribution deviates slightly from the optimum. For these, heterogeneous computations
were carried out, using the ten cores in conjunction with the GPU chip. The employed
grid decompositions are shown in Table A.3 in the appendix, along with the modeled and
measured runtimes.
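The single-step procedure can be condensed into a few lines. The numbers below are illustrative, chosen to roughly match Figure 6.3, and are not the measured data: the constants of (6.1) are fitted per compute unit by least squares, and (6.3) is solved for two units by equalizing the predicted iteration times.

```python
def fit_linear(ns, ts):
    """Least-squares fit of t_Iter = C0 + C1 * n_e to measured (n_e, t) pairs."""
    k = len(ns)
    mn, mt = sum(ns) / k, sum(ts) / k
    c1 = sum((n - mn) * (t - mt) for n, t in zip(ns, ts)) \
        / sum((n - mn) ** 2 for n in ns)
    return mt - c1 * mn, c1  # (C0, C1)

def optimal_split(ne, cpu, gpu):
    """Solve (6.3) for two units: equal predicted times
    C0_c + C1_c * n_c = C0_g + C1_g * (ne - n_c), clamped to [0, ne]."""
    (c0c, c1c), (c0g, c1g) = cpu, gpu
    n_c = (c0g - c0c + c1g * ne) / (c1c + c1g)
    return max(0.0, min(ne, n_c))

# hypothetical measurements: CPU 0.5 ms at 100 elements, 4.0 ms at 1000;
# GPU 1.0 ms at 100 elements, 2.2 ms at 1000
cpu = fit_linear([100, 1000], [0.5, 4.0])
gpu = fit_linear([100, 1000], [1.0, 2.2])
print(optimal_split(1000, cpu, gpu))  # CPU share of the 1000 elements
```

With these illustrative numbers, the model assigns roughly 400 elements to the CPU and 600 to the GPU; the realized decomposition then has to be rounded to an admissible block-structured split.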
The runtimes and speedups are shown in Figure 6.4. The runtime model predicts that heterogeneous
computation is inefficient at ne = 100 and, hence, no data is presented. At ne = 200,
the load balancing model deems heterogeneous computing efficient, with a predicted factor
of 1.7 over using just one GK210, remaining faster for all numbers of elements and predicting
a speedup of over 1.5 beyond ne = 500. However, a stark difference is present between
model and measurement. Where a speedup of over 1.5 was expected, the heterogeneous
computation hardly outperforms a single GPU chip. As Table A.3 summarizes, the relative
error between model and measurement is forty percent for all tests. While the approach is
well suited for homogeneous systems and simple algorithms for heterogeneous systems, e.g.
explicit time stepping or lattice Boltzmann codes [30, 59], it evidently does not work for
the present problem.
6.3.2 Problem analysis
The last section showcased that a single-step model does not suffice to solve the load balancing
problem for the Helmholtz solver. To glean some insight into why it does not work, a closer
Figure 6.5: Influence of communication barriers on load balancing for one CPU (C) collaborating
with one GPU (G) for the present solver with ne = 1000. Left: Load balancing when
disregarding communication barriers using the single-step model (n_e^(C) = 400,
n_e^(G) = 600). Middle: Same case with communication barriers retained. Right:
Runtimes when accounting for barriers (n_e^(C) = 225, n_e^(G) = 775).
look onto Algorithm 6.1 is required. In the iteration process, most operations work on an
element-by-element basis. They work on local data, do not generate side effects, require no
communication and are easy to parallelize. For a load balancing model they only result in
a larger runtime to account for and, therefore, different model parameters. The operations
not adhering to this pattern are the scalar products and the gather-scatter operation. The
scalar products in lines 10, 13, and 16 implement reduction operations over all data points
and, hence, processes. As they compute, e.g., step widths that are required for further
steps in the algorithm, their evaluation constitutes a synchronization barrier. Furthermore,
the gather-scatter operation in line 9 adds the residual on adjoining elements and, hence,
requires boundary data exchange with neighboring subdomains. For computing the scalar
product in line 10, the result is required, necessitating a fourth communication barrier.
When modelling the communications steps as barriers, Algorithm 6.1 decomposes into four
distinct substeps: The first starts after the scalar product in line 16 and ends before the
boundary data exchange in line 9. It incorporates the update of the search vector p, ap-
plication of the element-wise Helmholtz operator, the local part of the gather-scatter
operation, and collection and communication of boundary data. After the boundary data
exchange, the second substep begins at line 9 and ends after the local part of the scalar
product in line 10. It inserts the received boundary data into the subdomain and evaluates
the local part of the scalar product. The third substep, line 10 to line 13, consists of the
update of solution and residual vector. The last substep incorporates application of the
preconditioner and another scalar product. It extends from line 13 to 16.
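The substep structure described above can be mirrored in a compact sketch. The following minimal, diagonally preconditioned CG is illustrative only: the small dense "matrix" stands in for the element-wise Helmholtz operator plus gather-scatter, a Jacobi preconditioner stands in for the fast-diagonalization preconditioner, and the barrier comments mark where the global synchronization points of Algorithm 6.1 would fall.

```python
# Minimal preconditioned CG illustrating the four substeps between
# communication barriers. All names are illustrative, not solver code.

def dot(x, y):
    # local scalar product; globally this is an all-reduce, i.e. a barrier
    return sum(a * b for a, b in zip(x, y))

def pcg(A, b, tol=1e-20, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual for x = 0
    M_inv = [1.0 / A[i][i] for i in range(n)]    # Jacobi preconditioner
    z = [mi * ri for mi, ri in zip(M_inv, r)]
    p = list(z)
    rz = dot(r, z)                               # barrier: scalar product
    for _ in range(max_iter):
        # substep 1: operator application, then boundary data exchange
        Ap = [dot(row, p) for row in A]
        # substep 2: insert boundary data, local part of the scalar product
        pAp = dot(p, Ap)                         # barrier: scalar product
        alpha = rz / pAp
        # substep 3: update solution and residual
        for i in range(n):
            x[i] += alpha * p[i]
            r[i] -= alpha * Ap[i]
        # substep 4: preconditioner and another scalar product
        z = [mi * ri for mi, ri in zip(M_inv, r)]
        rz_new = dot(r, z)                       # barrier: scalar product
        if rz_new < tol:
            break
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Every `dot` call corresponds to a synchronization point, so load imbalance in any one substep accumulates directly into the iteration time.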
To investigate the effect of these barriers, bfCG was augmented with runtime measurements
for the four substeps. One time stamp is taken before and one after each communication
and the accumulated runtime averaged over the number of iterations. The runtime measurements for the
homogeneous case were repeated, this time recording the runtimes of the substeps and approximating
these using (6.1). For ne = 1000, Figure 6.5 demonstrates the effect of the communication
barriers. The single-step load balancing does not take the barriers into account and tries
to optimize for the left case. In practice, the communication barriers lead to the middle
case, where a severe load imbalance occurs in each substep. This increases the runtime and
leads to a deviation from the prediction. As shown on the right, decreasing the number of
elements for the CPU can mitigate parts of the imbalances and decrease the overall runtime.
Obviously, load balancing with regard to solver communication barriers is required to achieve
optimal results.
6.3.3 Multi-step load balancing for heterogeneous systems
Using the insights from the last section, the load balancing model from Subsection 6.3.1 is
readily extended: For each substep of the solution process, a linear fit approximates the
runtime, i.e. for compute unit m and substep i

t_i^{(m)} = C_{0,i}^{(m)} + C_{1,i}^{(m)} n_e^{(m)} ,   (6.4)

where C_{0,i}^{(m)} is a non-negative constant, C_{1,i}^{(m)} a positive slope, and n_e^{(m)} the number of elements assigned to the compute unit. Similarly to the single-step model, the approximation
requires the compute unit to stay in the same state, e.g. using the same cache hierarchy,
and side effects such as memory saturation are excluded. The end of each substep invokes
synchronization, e.g. collectively calculating a scalar product or exchanging boundary data.
With communication in CFD typically being latency-bound [49], differences in communication time are neglected here.
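Fitting the constants of the per-substep runtime model (6.4) is a small least-squares problem per compute unit and substep. A self-contained sketch, with synthetic measurements in place of the real time stamps:

```python
# Simple linear least squares for the runtime model t = C0 + C1 * n_e.
# The measurement values below are synthetic and for illustration only.

def fit_linear(n_elems, runtimes):
    """Return (C0, C1) minimizing the squared misfit of C0 + C1 * n."""
    m = len(n_elems)
    sx, sy = sum(n_elems), sum(runtimes)
    sxx = sum(n * n for n in n_elems)
    sxy = sum(n * t for n, t in zip(n_elems, runtimes))
    c1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    c0 = (sy - c1 * sx) / m
    return c0, c1

# synthetic data: 0.2 ms offset plus 0.003 ms per element
ns = [100, 200, 400, 800]
ts = [0.2 + 0.003 * n for n in ns]
C0, C1 = fit_linear(ns, ts)
```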
As before, the slowest process dominates the runtime of a substep t_i:

t_i = max_m ( t_i^{(m)} ) = max_m ( C_{0,i}^{(m)} + C_{1,i}^{(m)} n_e^{(m)} ) ,   (6.5)

and with multiple steps to consider, the optimum runtime now computes to

t_Iter^* = Σ_i max_m ( t_i^{(m)} ) → min   with   n_e = Σ_m n_e^{(m)} .   (6.6)
Figure 6.6: Runtimes predicted by the single-step and multi-step load balancing model for ne = 1000 compared to measurements. Here, n_e^{(C)} denotes the number of elements assigned to the CPU. Left: modeled runtimes for every possible distribution. Right: closeup of the transition.
Where the single-step model from Subsection 6.3.1 utilized one linear system, here each
substep replicates one. The resulting coupled system is, however, non-linear and not easily
solved. Introducing auxiliary variables d_i for each substep with

∀m : d_i ≥ t_i^{(m)} ≥ 0 ,   (6.7)

casts it into a linear one,

t_Iter^* = Σ_i d_i → min   with   n_e = Σ_m n_e^{(m)} ,   (6.8)

which standard linear programming techniques solve, allowing for a short time to solution.
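For the two compute units considered here, the optimum of the multi-step objective can also be found by a direct search over all integer element distributions, which makes the structure of (6.6) explicit without a linear-programming solver. The substep coefficients below are invented for illustration; in practice they are fitted per compute unit and substep via (6.4).

```python
# Multi-step load balancing for one CPU (C) and one GPU (G) by direct
# search: minimize sum_i max(t_i^C(n_C), t_i^G(ne - n_C)) over n_C.
# Each substep is described by an (offset, slope) pair, cf. (6.4).

def iter_time(n_c, ne, coeffs_cpu, coeffs_gpu):
    n_g = ne - n_c
    return sum(max(c0c + c1c * n_c, c0g + c1g * n_g)
               for (c0c, c1c), (c0g, c1g) in zip(coeffs_cpu, coeffs_gpu))

def balance(ne, coeffs_cpu, coeffs_gpu):
    """Return the (n_C, t_iter) pair with minimal summed substep maxima."""
    return min(((n, iter_time(n, ne, coeffs_cpu, coeffs_gpu))
                for n in range(ne + 1)), key=lambda pair: pair[1])

# invented coefficients: four substeps, GPU with larger offsets but
# smaller per-element cost
cpu = [(0.02, 0.004), (0.01, 0.001), (0.01, 0.001), (0.02, 0.004)]
gpu = [(0.10, 0.001), (0.05, 0.0003), (0.05, 0.0003), (0.10, 0.001)]
n_c, t_opt = balance(1000, cpu, gpu)
```

Because every substep contributes its own maximum, the optimizer trades a small imbalance in one substep against a larger one in another, exactly the effect visible in Figure 6.5.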
To investigate whether the derived model more accurately replicates reality, computations
for every decomposition of the ne = 1000 mesh were conducted. Figure 6.6 depicts the result
in combination with predictions from the single-step and multi-step models. With the single-step
model, the runtime decreases when increasing the number of elements on the CPU,
n_e^{(C)}, as the CPU takes work from the GPU. After reaching equilibrium between CPU and
GPU at n_e^{(C)} = 400, the iteration time increases, with the CPU now being slower. The
measurements, however, exhibit a different behaviour: While in agreement with the single-step
model at first, the runtime starts increasing at n_e^{(C)} = 225. A large discrepancy between
single-step model and measurement results, especially at the equilibrium point predicted by
the single-step model. With the extended model, the equilibria of the substeps divide the
plot into five distinct zones, each exhibiting a different slope. Overall, the modeled runtime
is higher, stemming from more restrictions in the optimization, but so is the accuracy:
Figure 6.7: Runtimes for one iteration and speedups over one GPU with the new load balancing model. Left: iteration times. Right: speedups.
near-perfect agreement with the measurement is present until n_e^{(C)} = 350. The multi-step
model yields a more accurate reproduction of the measurements than the single-step version.
Furthermore, the minimum iteration time at n_e^{(C)} = 225 constitutes the most relevant feature
of the runtime, and the multi-step model is capable of capturing it, while the single-step model
is not.
6.3.4 Performance with new load balancing model
The runtime tests from Subsection 6.3.1 were repeated including heterogeneous computation
using the multi-step load balancing model. Figure 6.7 depicts the resulting runtimes and
speedups. As the multi-step model introduces more restrictions, the predicted speedup is
lower compared to the single-step model: Where previously a speedup of up to 1.5 was
computed, only 1.3 is estimated now. However, the predictions match the experiments. The
model is four times as accurate and the error is less than ten percent. Furthermore, the
attained speedup is higher for all utilized number of elements. Hence, the multi-step model
is preferable for this configuration.
To ensure that the behaviour of the two load balancing schemes is not limited to the specific
test case and data size, the test was repeated for p = 11 and p = 15 in addition to the already
computed case of p = 7. Figure 6.8 depicts the resulting speedups over using one GPU and
the relative errors of the load balancing models versus their respective measured runtimes.
In all three cases, the GPU does not compute significantly faster than the CPU for low
numbers of elements, but attains speedups near two for large numbers of elements. For p = 11,
a spike is present at ne = 300, stemming from a relatively high runtime for the CPU. For
all polynomial degrees, the single-step model predicts that the whole computational power
Figure 6.8: Performance gains with heterogeneous computing. Left: speedups over using only 10 cores. Right: relative errors of the two considered load balancing models. Top: p = 7, middle: p = 11, bottom: p = 15.
Figure 6.9: Speedups over two GK210 chips when additionally using 10 cores of each of the 2 CPUs, when using four GPUs, and when using all compute resources of a node.
of the CPU can be added to the one of the GPU, netting a speedup of 1.5 for all polynomial
degrees. However, this is not achieved in practice. For low numbers of elements the error
is near 50 %. It decreases with the number of elements and saturates at 40 % for p = 7,
20 % for p = 11, and 10 % for p = 15. This decrease stems from the scaling of the operator
costs: Where the element Helmholtz operator and preconditioner scale with O(p^4), the
remaining operations scale with O(p^3), increasing the relevance of substeps 1 and 4 and
diminishing the impact of substeps 2 and 3. In effect, only two substeps remain relevant
for load balancing, decreasing the difference between the models. Moreover, Helmholtz
and fast diagonalization operator exhibit a similar runtime behaviour, further decreasing the
error. The multi-step model predicts more modest performance gains. For p = 7, the speedup
of the GPU is increased to 1.3, with the model closely fitting experiments after ne = 400 and
achieving an error lower than 5 percent. For lower numbers of elements, the elements allocated
to the CPU are cached in L3 and different constants are required for accurate predictions
in this regime. In all three cases, the multi-step model outperforms the single-step model,
both in terms of prediction accuracy and attained speedup, adding 30 %, 50 % and 80 % of
the CPU performance to the one of the GPU for p = 7, 11, 15, respectively. Furthermore, no
performance degradation is encountered when using heterogeneous computing.
Until now, only parts of the node were utilized: only 11 of 24 cores and only one of four
GPU chips. To test the model in a real-life scenario, a simulation using 40 × 40 × 1 = 1600
elements of polynomial degree p = 11 was performed on one node, leading to approximately
2.1M degrees of freedom. Four setups were investigated: The reference setup uses
two of the four GK210 chips, i. e. one per socket. The second one adds ten cores per socket
to the two chips. The third setup utilizes all four GK210 chips present, and the last one
leverages the whole node to solve the problem (see Figure 6.2). The time per iteration was
measured, averaged over 20 solver runs, and the resulting speedup over two GK210 chips
computed.
Figure 6.9 depicts the speedups over two GPUs. The setup employing all four GK210 chips
achieves a speedup of 1.8, which can be explained by the large offsets present in the runtimes
of the substeps on the GPU. Adding the compute power of twenty cores to the two GPUs,
the computation is 25% faster, reasonably matching the prediction. Similarly, the same
absolute speedup gain of 0.25 is achieved when using the 20 cores in conjunction with the
four GPUs.
6.4 Conclusions
In this chapter, the orchestration of heterogeneous systems was investigated on the main
ingredient of a flow solver, the Helmholtz solver. Following the traditional parallelization
concepts in CFD, a two-level parallelization was introduced. MPI realized domain decom-
position, while pragma-based language extensions exploited the data parallelism inside the
partitions. The model allowed utilizing one single source to generate code running
on a heterogeneous system, lowering the maintenance effort.
Thereafter, load balancing for the resulting heterogeneous system consisting of one CPU
and one GPU was investigated. Many references employ a single-step model to compute the
resulting runtime. For the considered pCG solver, the model led to a relative error of 40 % in
the runtime. The synchronization barriers resulting from scalar products and boundary data
exchange introduced further constraints that the simplified model did not account for and,
in turn, led to large load imbalances. An improved model accounting for these barriers was
derived. With the new model, runtime and prediction matched closely, with the error lying
below 5 % for most cases. While the barriers restricted the model and, therefore, lowered the
attainable speedup, the heterogeneous computation was 30 % faster than solely computing
on the GPU for every tested polynomial degree.
Afterwards, computations harnessing the whole computational node were conducted. For
these, the runtime prediction matched the measurement well and the heterogeneous system
generated 25 % more performance than only using the GPUs would have provided. While
the attained speedups seem small, one has to keep in mind that while current hardware only
exhibits mild heterogeneity, future HPCs will be more heterogeneous [26].
Here, only the runtime of a pCG-based Helmholtz solver with local preconditioning was
considered. With multigrid, a multi-level approach to load balancing is required. Hence,
further work will consist of expanding the new load balancing model to multigrid, possibly
including redistribution between the levels.
Chapter 7
Specht FS – A flow solver computing
on heterogeneous hardware
7.1 Introduction
In the previous chapters, large increases in the performance of operators and consecutively
elliptic solvers were demonstrated. These consisted of algorithms lowering the operations
count, optimizations of the operators themselves, and exploitation of heterogeneous hard-
ware. However, it is yet unclear whether these improvements carry over to a full flow solver.
Hence, this chapter will combine these techniques into a flow solver, validate it, and showcase
the attainable performance for flow simulations. The structure of this chapter follows this
order, with Section 7.2 introducing the notation and presenting the flow solver, Section 7.3
validating the implementation, and Section 7.4 investigating the performance of the solver.
A preliminary validation was performed in [109] and excerpts of the performance tests are
available in [63].
7.2 Flow solver
7.2.1 Incompressible fluid flow
Incompressible fluid flow is governed by the Navier-Stokes equations, see e.g. [81, 115]. In
non-dimensional form these can be written as

∂_t u + ∇^T(u u^T) = −∇P + (1/Re) Δu + f   (7.1a)

∇ · u = 0   (7.1b)
with u being the velocity vector, P the pressure, Re the Reynolds number, t the time vari-
able and f the body force. Equation (7.1a) constitutes the momentum balance of the fluid,
whereas the continuity equation (7.1b) restricts the velocity field to be free of divergence.
Multiple problems arise when discretizing the Navier-Stokes equations: First, the con-
vection terms are non-linear and typically lead to non-symmetrical equation systems, raising
the costs of implicit treatment. Second, the pressure possesses no evolution equation, barring
the usage of traditional discretization methods. And third, the continuity equation (7.1b) is
just a constraint on the velocity field, rather than a separate equation to solve.
7.2.2 Spectral Element Cartesian HeTerogeneous Flow Solver
This section presents the flow solver Spectral Element Cartesian HeTerogeneous Flow Solver,
in short Specht FS. The solver is the culmination of the previous chapters: It incorporates
the algorithmic advances for static condensation from Chapter 4 and the linearly scaling
multigrid solver from Chapter 5, as well as the hardware-optimized operators from Chapter 3, and
is parallelized with the concept from Chapter 6. The goal of this code is not to generate
another general-purpose code, but rather to evaluate the performance attainable with high-
order methods. Therefore, the code serves as testbed for parallelization techniques, operator
optimization and algorithmic factorization techniques.
Specht FS solves the incompressible Navier-Stokes equations, using a spectral-element
method in space while using projection schemes in time. Multiple time stepping schemes
are implemented. These include velocity correction [75], consistent splitting schemes and
rotational incremental pressure-correction schemes [46]. While the former two are only im-
plemented using the same polynomial degree in pressure and velocity, the implementation
of the latter allows to use a lower polynomial order in the pressure. Furthermore, additional
variables, such as passive scalars, can be added.
The spatial discretization of Specht FS is in accordance with the previous chapters: A
spectral-element discretization implements the weak forms resulting from the time-stepping
schemes. The discretization uses structured Cartesian grids composed of tensor-product ele-
ments using GLL points. On these, solution of Helmholtz equations is facilitated by any
of the solvers from Chapter 2, Chapter 4, and Chapter 5. The associated operators, rang-
ing from Helmholtz operator to preconditioner, were implemented using the techniques
described in Chapter 3, leading to hardware-optimized operators which extract at least a
quarter of the available performance. Furthermore, four different versions of the transport
operator are implemented: The convective, quasi-linear form, the conservative form, the
skew-symmetric form, and the overintegration form [23]. Lastly, Specht FS allows stabiliza-
tion of the time-stepping scheme by either polynomial filtering [33] or the spectral vanishing
viscosity (SVV) model, which can be included in the diffusion solvers without any overhead.
The solver is written in modern Fortran using the parallelization concept from Chapter 6:
MPI implements domain decomposition with OpenMP and OpenACC providing a fine-
grain parallelization inside the partitions to exploit data parallelism. Therefore, conditional
compilation can be utilized to generate a single-core runtime computing solely with MPI,
one exploiting shared-memory via OpenMP, GPUs via OpenACC, or both to create a
heterogeneous runtime environment. For instance, the runtime measurements in Chapter 6
were conducted in a flow testcase in Specht FS.
7.2.3 Pressure correction scheme in Specht FS
Algorithm 7.1 summarizes the rotational incremental pressure-correction scheme [46]. The
scheme utilizes backward-differencing formulas of order k (BDFk) as basis, treating convection
terms and the body forces explicitly in time, whereas the diffusion terms are treated
implicitly. In each time step, the pressure, convection terms, and body forces get extrapo-
lated from the previous time steps, with the body force being extrapolated using order k and
the pressure being extrapolated with order k−1 for stability reasons. Therefore, the scheme
requires the information from k previous time levels and is only self-starting for BDF1. For
BDF2, one prior step of BDF1 is required.
The first operation of the time step consists of extrapolating the pressure and velocities to
the end of the time step. Then, the pressure derivative, convection terms, and body forces
are treated explicitly by serving as right-hand side for the implicit treatment of the diffu-
sion terms. During this treatment, the boundary conditions for the velocities are enforced.
The resulting velocity u ⋆ incorporates the correct boundary conditions but is not free of di-
vergence. Therefore, a potential ∆ϕ ⋆ is computed with homogeneous Neumann boundary
conditions and the velocity is corrected to be divergence-free. Lastly, the pressure is updated
using the potential.
With the BDF1 scheme treating diffusion implicitly, only the time step restriction for the
convective term remains. Therefore, the time step is limited by the CFL criterion [74]

Δt = (C_CFL / p²) · min_e |h_e| / max_Ω |u| ,   (7.2)
where h_e denotes the extent of the element Ω_e, and C_CFL the CFL number. For the BDF2
scheme and higher orders of BDFk, an in-depth stability analysis can be found in [66].
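A minimal helper evaluating the CFL criterion (7.2); the uniform element extent and the maximum velocity magnitude are assumed values for illustration:

```python
# Time step from the CFL criterion (7.2): dt = C_CFL / p^2 * min|h_e| / max|u|.

def cfl_time_step(c_cfl, p, h_min, u_max):
    return c_cfl / p**2 * h_min / u_max

# illustrative values: CFL number 0.125, degree 16, element extent 2/12,
# and an assumed maximum velocity magnitude of 1.2
dt = cfl_time_step(c_cfl=0.125, p=16, h_min=2.0 / 12, u_max=1.2)
```

The p² in the denominator reflects the clustering of the GLL points towards the element boundaries, which shrinks the smallest point spacing quadratically with the degree.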
Algorithm 7.1: Pressure correction method of kth order in rotational form. For sake of readability, the convection operator was short-handed to N(u) = ∇^T(u u^T).

for n = 1, nt do
    P⋆ ← Σ_{i=1}^{k−1} β_i^{k−1} P^{n−i}   ▷ pressure extrapolation of order k − 1
    F ← −∇P⋆ − (1/Δt) Σ_{i=1}^{k} α_i^k u^{n−i} + Σ_{i=1}^{k} β_i^k ( f^{n−i} − N(u^{n−i}) )
    Solve: (α_0^k Re / Δt) u⋆ − Δu⋆ = Re · F   ▷ Helmholtz equations for velocities
    Solve: −Δϕ⋆ = −(1/Δt) ∇ · u⋆   ▷ Laplace equation for correction
    u^n ← u⋆ − Δt ∇ϕ⋆   ▷ correct velocities
    P^n ← P⋆ + ϕ⋆ − (1/Re) ∇ · u⋆   ▷ update pressure
end for
Table 7.1: Utilized coefficients α_i^k for backward-differencing formulas of order k and weights β_i^k for extrapolation of order k.

k    α_0^k   α_1^k   α_2^k   β_1^k   β_2^k
1    1       −1      —       1       —
2    3/2     −4/2    1/2     2       −1
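The coefficients of Table 7.1 satisfy the usual consistency conditions: the BDFk weights α sum to zero and reproduce the first derivative, while the extrapolation weights β sum to one. A sketch storing them with exact rational arithmetic; the data layout is illustrative, not solver code:

```python
from fractions import Fraction

# Coefficients of Table 7.1; index 0 of alpha[k] is alpha_0^k (the weight
# of the new time level), entry i of beta[k] is beta_{i+1}^k.
alpha = {1: [Fraction(1), Fraction(-1)],
         2: [Fraction(3, 2), Fraction(-4, 2), Fraction(1, 2)]}
beta = {1: [Fraction(1)],
        2: [Fraction(2), Fraction(-1)]}

def extrapolate(values, k):
    """Extrapolate to the new time level from the k previous values,
    values[0] being the most recent one."""
    return sum(b * v for b, v in zip(beta[k], values))
```

For linear-in-time data the order-2 extrapolation is exact, e.g. previous values 3 and 1 yield 2·3 − 1·1 = 5 at the new time level.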
7.3 Validation
7.3.1 Test regime
To validate the flow solver Specht FS, three tests are conducted. The first one investigates
the behavior for periodic domains using the Taylor-Green vortex. Due to the periodicity,
the implementation of convection terms and coupling between velocity and pressure can be
evaluated separately from the implementation of boundary conditions and the corresponding
splitting error. The second one adds boundary conditions and tests these. After validating
for small analytical test cases, the turbulent plane channel flow at Reτ = 180 serves to
validate simulation of turbulent flows.
Preliminary versions of the tests mentioned above were carried out in [109] for the velocity
correction scheme from [75]. Here, only the incremental pressure-correction scheme is tested,
with the same polynomial order for pressure and velocity. Furthermore, exact integration of
the convection terms is performed.
Figure 7.1: Taylor-Green vortex for a vanishing transport velocity and a spatial frequency of ω = π at t = 0. Left: streamlines. Right: contour plot of the pressure.
7.3.2 Taylor-Green vortex in a periodic domain
In a periodic domain Ω = (−1, 1)³ with a body force of f(x, t) = 0,

u_ex,1(x, t) = u_1,0 + F(t) cos(ω(x_1 − t·u_1,0)) · sin(ω(x_2 − t·u_2,0))   (7.3a)

u_ex,2(x, t) = u_2,0 − F(t) sin(ω(x_1 − t·u_1,0)) · cos(ω(x_2 − t·u_2,0))   (7.3b)

u_ex,3(x, t) = u_3,0   (7.3c)

P_ex(x, t) = −(F(t))²/4 · (cos(2ω(x_1 − t·u_1,0)) + cos(2ω(x_2 − t·u_2,0)))   (7.3d)

with

F(t) = exp(−2ω²t / Re) ,   (7.3e)

is a solution of the Navier-Stokes equations of the class called Taylor-Green vortices [43]. In the above, ω is the spatial frequency of the problem, with ω ∈ π·ℤ, and (u_1,0, u_2,0, u_3,0)^T the transport velocity of the vortices. Figure 7.1 depicts the streamlines of the flow and the pressure field for ω = π, t = 0, and a transport velocity of u_0 = 0.
The test case is suited to investigate the correctness of the implementation of the time-
stepping scheme regarding transport and diffusion terms, as well as the pressure coupling,
whereas the implementation of the body force and boundary conditions can not be tested.
In the following tests the solution parameters are set to ω = π and (u_1,0, u_2,0, u_3,0)^T = (2, 3, 4)^T.
The Reynolds number is Re = 10 and lies in the stable regime of the flow as Re/ω < 50 [43].
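The exact solution (7.3) translates directly into code. A sketch for the parameter values used in this test; the function names are illustrative, not solver code:

```python
import math

# Taylor-Green vortex (7.3) with omega = pi, transport velocity (2, 3, 4),
# and Re = 10, as used in the convergence test.
OMEGA, U0, RE = math.pi, (2.0, 3.0, 4.0), 10.0

def decay(t):
    """Viscous amplitude decay F(t) = exp(-2 omega^2 t / Re), cf. (7.3e)."""
    return math.exp(-2.0 * OMEGA**2 * t / RE)

def velocity(x1, x2, t):
    a = OMEGA * (x1 - t * U0[0])
    b = OMEGA * (x2 - t * U0[1])
    u1 = U0[0] + decay(t) * math.cos(a) * math.sin(b)
    u2 = U0[1] - decay(t) * math.sin(a) * math.cos(b)
    return u1, u2, U0[2]

def pressure(x1, x2, t):
    a = OMEGA * (x1 - t * U0[0])
    b = OMEGA * (x2 - t * U0[1])
    return -decay(t)**2 / 4.0 * (math.cos(2 * a) + math.cos(2 * b))
```

The field is divergence-free by construction: ∂u_1/∂x_1 = −Fω sin(a) sin(b) cancels ∂u_2/∂x_2 = Fω sin(a) sin(b), and u_3 is constant.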
Figure 7.2: Error of the two time-stepping schemes BDF1 and BDF2 for the Taylor-Green vortex in a periodic domain when using ne = 8 × 8 × 2 spectral elements of degree p = 8. Left: L2 error for u1 and the pressure P. Right: errors in the H1 semi-norm.
Two computations are performed, one using the scheme of first order, one with the second
order scheme. For these, the initial condition is set at t = 0 and the numerical solution
at t = 0.01 computed using ne = 8× 8× 2 spectral elements of polynomial degree p = 8. The
initial CFL number was scaled from 0.5 · 10^0 down to 0.5 · 10^−3, leading to smaller and smaller time
step widths and allowing to investigate the convergence rates of the time stepping scheme.
After attaining a solution on the mesh, the L2 and H1 errors for the velocities

‖ε‖²_{L²} = ∫_{x∈Ω} (u − u_ex)²(x) dx   (7.4)

|ε|²_{H¹} = ∫_{x∈Ω} (∇(u − u_ex) · ∇(u − u_ex))(x) dx   (7.5)

and similarly for the pressure are computed on the same grid after interpolating to p = 31,
which is deemed sufficient for resolving the exact solution.
Figure 7.2 depicts the L2 and H1 errors of the first velocity component as well as of
the pressure. For the velocity as well as the pressure, the BDF1 method exhibits a slope of
one, lowering the error by three orders. The BDF2 scheme attains second order and starts
with a lower error and is, furthermore, able to lower it to 10−9, where rounding errors stall
further convergence. Thus, both time-stepping schemes are implemented correctly regarding
pressure coupling, transport terms, and diffusion. While only the results for a homogeneous
x3 direction are shown, computations with permutations of the coordinate directions were
performed with similar results.
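The slopes annotated in such convergence plots correspond to the observed order of convergence, computable from two (Δt, error) pairs; the error values below are synthetic:

```python
import math

# Observed order of convergence from two step sizes and their errors.
def observed_order(dt1, e1, dt2, e2):
    return math.log(e1 / e2) / math.log(dt1 / dt2)

# a second-order scheme: halving dt reduces the error by a factor of four
order = observed_order(1e-4, 4.0e-6, 5e-5, 1.0e-6)
```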
Figure 7.3: Error of the two time-stepping schemes BDF1 and BDF2 for the Taylor-Green vortex with Dirichlet boundary conditions when using ne = 8 × 8 × 2 spectral elements of degree p = 8. Left: L2 error for u1 and the pressure P. Right: errors in the H1 semi-norm.
7.3.3 Taylor-Green vortex with Dirichlet boundary conditions
To validate the boundary condition implementation, the tests from the last section were
repeated, but imposing no-slip walls in the first and second direction, whereas the third
direction stayed periodic. Figure 7.3 depicts the L2 and H1 errors resulting from the com-
putations. As in the periodic case, the velocity converges with the expected order for BDF1
and BDF2 and attains similar error margins. The pressure, however, does not converge: The
error stagnates at 10−5 in the L2 norm, with the error being localized in edge modes. These
can be eliminated by further post processing of the solution after the projection step [108].
However, as the convergence rate of the velocities is not impeded by these modes, the issue
is disregarded here.
7.3.4 Turbulent plane channel flow
Until now, analytical test cases were utilized to validate the time-stepping scheme and the
spatial discretization thereof. These, however, do not fully capture the behavior in large-scale
flow simulations, where both low errors and low computing time are key. To validate
that Specht FS is capable of resolving turbulent flows, both temporally as well as
spatially, this section considers the turbulent channel flow at Reτ = 180 [96]. The flow is
computed in a domain Ω = (0, 4π) × (0, 2) × (0, 2π), with periodicity in x1 and x3 direction. The fluid flows
in positive x1 direction and is bounded by walls at x2 = 0 and x2 = 2, with the resulting
boundary layer necessitating a body force to drive the flow. This body force is implemented
with a PI controller fixing the mean velocity to 1. In combination with a bulk Reynolds
number of 5600, and therefore Re = 2800, a turbulent channel flow develops with Reτ = 180.
Here, the linear profile from [99] serves as initial condition:
u1(x2) = 2 (1− |1− x2|) (7.6a)
u2(x2) = 0 (7.6b)
u3(x2) = 0 . (7.6c)
This distribution is perturbed using element-wise pseudo-random numbers with a maximum
value of 0.2, with one pseudo-random number being used per direction, allowing for a maximum
variation of 0.4 between the elements.
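The perturbed initial condition can be sketched as follows; the helper names and the way the per-element offsets are drawn are illustrative, not solver code:

```python
import random

# Triangular streamwise profile (7.6) plus one bounded pseudo-random
# offset per velocity direction and element.

def initial_u1(x2):
    return 2.0 * (1.0 - abs(1.0 - x2))

def perturbed_velocity(x2, rng, max_pert=0.2):
    """Initial velocity in one element: base profile plus random offsets."""
    return tuple(base + rng.uniform(-max_pert, max_pert)
                 for base in (initial_u1(x2), 0.0, 0.0))

rng = random.Random(42)              # fixed seed for reproducibility
u = perturbed_velocity(1.0, rng)     # channel centre, base value 2
```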
A grid consisting of 32 × 12 × 12 spectral elements of degree p = 16 discretized the channel,
leading to the first mesh point lying at y_1^+ ≈ 0.4 and the first seven points below y^+ < 10, which is a
bit coarser than advocated [74], but still lies in the regime of direct numerical simulation [36].
The resulting grid contains approximately 19 million grid points, and therefore 75 million
degrees of freedom.
Here, the time-stepping scheme of second order is utilized, with the CFL number chosen
as CCFL = 0.125 and consistent integration of the convection terms ensured via overintegra-
tion. The Helmholtz equations were solved to a residual of 10−10 in every time step, with
the multigrid solver ktvMG solving the pressure equation. The diffusion equations occurring
in the time stepping scheme are, however, well conditioned and in addition have a good
initial guess. As a result dtCG is faster in solving these equations than the multigrid solver
and therefore used here. Furthermore, the spectral vanishing viscosity model (SVV) was
employed to stabilize the simulation. Here, the power kernel by Moura [97] is utilized, with
the parameters set to the proposed values p_SVV = p/2 and ε_SVV = 0.01.
Figure 7.4 depicts the temporal evolution of kinetic energy and dissipation rate. The pertur-
bation of the initial condition leads to a higher mean energy, represented in high frequency
modes which dissipate until t = 20. Afterwards, turbulence slowly develops, leading to an
increase in the kinetic energy, which stagnates after t = 70. The dissipation rate exhibits
two peaks. At t = 0, the element-wise perturbation leads to a high gradient at the element
boundaries, which quickly gets smoothed out. Thereafter, a drop occurs until the triangu-
lar initial distribution starts transitioning into the velocity profile of a turbulent channel
flow, leading to the peak near t = 10. At t = 70, a mostly constant dissipation rate results.
However, small changes remain, for instance near t = 100 and t = 150. These indicate
that while the domain is large, it is not sufficient to contain a statistically homogeneous
flow at Reτ = 180. Furthermore, the mean dissipation rate does not completely match the
target value of 4.14 · 10−3 but lies below it, resulting in Reτ = 177 instead of the target value
of Reτ = 180. Figure 7.5 depicts the vortex structures occurring in the flow as well as the
instantaneous velocity field at t = 200. Vortices are being generated in the boundary layer.
These lead to variations in the velocity field seen in the background, the boundary layer
Figure 7.4: Temporal evolution of kinetic energy and dissipation rate in the turbulent channel flow in combination with the time intervals utilized for averaging. Left: mean kinetic energy over time, with the large rectangle marking the interval of temporal averaging for the velocities and the pressure and the inner rectangle marking the interval of averaging for the Reynolds stresses. Right: mean dissipation rate, with the line denoting the mean energy input required for attaining Reτ = 180 in this setup, ε = 4.14 · 10^−3.
Figure 7.5: Vortex structures in the turbulent plane channel at t = 200. The fluid flows from the front left to the back right. In the foreground, instantaneous isosurfaces of the λ2 vortex criterion at λ2 = −1.5 are shown, whereas the background planes of the channel are coloured with the magnitude of the velocity vector. Red corresponds to the maximum velocity and blue to zero.
breaking up, and low-velocity regions reaching towards the middle of the channel. The flow
is turbulent, which was the goal of the simulation.
After reaching a statistically steady state at t = 100, averages for the velocities and the
pressure were computed in the time interval t ∈ [100, 200]. For averaging of the Reynolds
stresses, high quality averages of the velocities are required to accurately compute the fluc-
tuations. Therefore, averaging of the Reynolds stresses began later, at t = 150. The
averaging intervals are shown in Figure 7.4.
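The stresses follow from the identity ⟨u′iu′j⟩ = ⟨uiuj⟩ − ⟨ui⟩⟨uj⟩, which can be accumulated without storing the time series. A minimal single-point sketch of this accumulation (function names are illustrative, not taken from Specht FS, which computes fluctuations against a previously converged mean — hence the later start of the stress averaging):

```python
# Reynolds stresses from accumulated moments: <u_i'u_j'> = <u_i u_j> - <u_i><u_j>.
# Illustrative sketch for a single grid point; not the Specht FS routine.

def accumulate(samples):
    """First and second moments of (u1, u2) samples at one grid point."""
    n = len(samples)
    m1 = sum(u1 for u1, _ in samples) / n
    m2 = sum(u2 for _, u2 in samples) / n
    m11 = sum(u1 * u1 for u1, _ in samples) / n
    m12 = sum(u1 * u2 for u1, u2 in samples) / n
    return m1, m2, m11, m12

def reynolds_stresses(samples):
    """Return (<u1'u1'>, <u1'u2'>) for one grid point."""
    m1, m2, m11, m12 = accumulate(samples)
    return m11 - m1 * m1, m12 - m1 * m2
```

The identity form is exact only if the means are taken over the same window; averaging the stresses over a later, shorter window against a high-quality mean, as done here, avoids contaminating the fluctuations with an unconverged mean.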
Figure 7.6 depicts the temporal averages, which were further reduced by averaging in the
statistically homogeneous x1 and x3 direction, and compares these to the reference data
from [96]. Both u1 and the pressure show a good agreement with the reference, with the
velocity being slightly higher in the lower regions of the channel. The Reynolds stresses
Figure 7.6: Averages for the plane channel flow at Reτ = 180 computed with Specht FS, denoted by the markers, compared to the data from Kim, Moin, and Moser (KMM, denoted by the lines) [96]. Here ⟨·⟩ denotes temporal averaging and ·′ the corresponding fluctuation u′i = ui − ⟨ui⟩. The velocity average was normalized with uτ, and the pressure and Reynolds stresses with uτ². Left: mean values for the x1-velocity and the pressure. As the spatial mean value of the pressure is arbitrary, the pressure from Specht FS at x2 = 0 was corrected to the value from [96]. Right: Reynolds stresses.
show deviations from the reference: While ⟨u′1u′2⟩ and ⟨u′2u′2⟩ fit the data perfectly, ⟨u′1u′1⟩
exhibits a deviation in the middle of the channel and ⟨u′3u′3⟩ is slightly underpredicted.
However, the overall agreement is good, validating Specht FS for simulation of turbulent
flows.
7.4 Performance evaluation
7.4.1 Turbulent plane channel flow
So far, only the solution error from Specht FS was inspected. While low errors are key to
capturing the main features of the flow, in practice the time to solution is just as important. This
section inspects the efficiency of the components of Specht FS using the turbulent plane
channel flow at Reτ = 180 from Subsection 7.3.4 as test case. The measuring platform was
one node of the high-performance computer Taurus, consisting of two Intel Xeon E5-2680
CPUs with twelve cores each, running at a fixed frequency of 2.5 GHz. To enable runtime
measurements inside the code, Specht FS was instrumented with Score-P [78], allowing
for gathering of in-depth data for the runtime contribution of the different operators in the
time-stepping scheme. After instrumentation, the Intel Fortran compiler v. 2018 compiled
the source with OpenMP enabled.
Table 7.2: Speedup for the plane channel flow test case. Setup data and total runtime obtained with the flow solver Specht FS when using the new multigrid solver for computing 0.1 dimensionless units in time. The number of unknowns is computed as nDOF = 4p³ne.
Number of threads 1 12
Number of time steps ntime 429 429
Number of degrees of freedom nDOF 18 874 368 18 874 368
Number of processes 2 2
Number of cores ncores 2 24
Runtime tWall [s] 3336 491
(nDOF · ntime)/(tWall · ncores) [1/s] 1 213 600 687 126
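The throughput figures in the last row of Table 7.2 can be reproduced directly from the table entries; a quick sanity check using the values from the table:

```python
# Throughput metric from Table 7.2: degrees of freedom times time steps
# per core-second, (n_DOF * n_time) / (t_wall * n_cores).
def throughput(n_dof, n_time, t_wall, n_cores):
    return n_dof * n_time / (t_wall * n_cores)

# One thread per process: 2 processes * 1 core each.
t1 = throughput(18_874_368, 429, 3336.0, 2)    # ~1.21e6 per core and second
# Twelve threads per process: 2 processes * 12 cores each.
t12 = throughput(18_874_368, 429, 491.0, 24)   # ~6.9e5 per core and second
```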
Table 7.3: Accumulated runtimes for the time-stepping procedure of Specht FS when using the new multigrid solver for computing a time interval of 0.1 for the channel flow over the number of threads per process. Two MPI processes were used and only components directly in the time-stepping procedure were profiled with Score-P.
1 thread 12 threads
Component [s] [%] [s] [%]
Convection terms 2138 34.6 2294 22.0
Diffusion solver 1454 23.6 3160 30.3
Poisson solver 2407 39.0 4551 43.6
Other 175 2.8 395 4.1
Total 6174 100.0 10 400 100.0
Runtime data was gathered using a run on a smaller domain of Ω = (0, 2π)× (0, 2)× (0, π)
which was discretized using 16× 12× 6 elements and therefore comprises a quarter of the
grid points and degrees of freedom of the grid from Subsection 7.3.4. After attaining a
statistically steady state, 0.1 dimensionless time units were computed, requiring nt = 429
time steps. Two processes decomposed the grid in the x1-direction and, for these, two runs
were performed. The first used only one core per socket, such that the maximum memory
bandwidth per core was available, whereas the second used all twelve cores of each socket
to harness the full available compute power.
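The unknown counts in Table 7.2 follow from the formula in its caption, nDOF = 4p³ne: three velocity components plus the pressure on p³ unique points per element. The entries are consistent with polynomial degree p = 16 for this grid (the degree is set in Subsection 7.3.4 and not restated in this excerpt); a quick check:

```python
# Degrees of freedom of the spectral-element discretization (Table 7.2
# caption): n_DOF = 4 * p^3 * n_e, i.e. three velocities plus the pressure
# on p^3 unique points per element in the continuous formulation.
def n_dof(p, ne):
    return 4 * p**3 * ne

ne_channel = 16 * 12 * 6      # reduced channel grid from this section
print(n_dof(16, ne_channel))  # 18874368, matching Table 7.2
```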
Table 7.2 summarizes the wall clock time tWall as well as the runtime per degree of freedom,
whereas Table 7.3 lists the contributions of the different components of the flow solver.
When computing with one thread per process, the runtime decomposes into three main
contributions: The diffusion step requires one quarter of the runtime, the calculation of
the convection terms 35 %, and the pressure treatment the remaining 40 % of the runtime.
For this test case, the main goal of the thesis has been accomplished: Treating the pressure
becomes as cheap as the explicit treatment of the convection terms. When using one thread
per process, Specht FS attains a throughput of over 1 200 000 time steps times degrees of
Figure 7.7: Isosurfaces of the λ2 vortex criterion for the Taylor-Green vortex at Re = 1600 with λ2 = −1.5. The isosurfaces are colored with the magnitude of the velocity vector, where blue corresponds to zero and red to a magnitude of U0. The data was taken from a simulation using ne = 16³ spectral elements of polynomial degree p = 16. Left: t = 5T0, middle: t = 7T0, right: t = 9T0.
freedom per core and second. However, the number halves when increasing the core count
to twelve. While the convection terms require few memory accesses and parallelize well,
the Helmholtz solvers do not, as shown in Subsection 5.4.5. The low parallel efficiency
results in an increase of the CPU time spent in the pressure solver. In consequence, the
percentage of the runtime for the pressure treatment increases to 44 %, twice the value for
the convection terms, and the throughput per core drops to 680 000 degrees of freedom per
core and second.
7.4.2 Turbulent Taylor-Green vortex benchmark
The previous benchmark investigated the runtime contributions of the different components
of the flow solver. Here, the overall performance of the flow solver is compared to other flow
solvers using the under-resolved turbulent Taylor-Green vortex benchmark [38, 29]. For
a length scale L, a reference velocity U0, and a periodic domain Ω = (−Lπ, Lπ)³, the initial
velocity components are

u1(x) = +U0 sin(x1/L) cos(x2/L) cos(x3/L) ,   (7.7a)
u2(x) = −U0 cos(x1/L) sin(x2/L) cos(x3/L) ,   (7.7b)
u3(x) = 0 .                                   (7.7c)
Here, L = 1 and U0 = 1. Furthermore, the Reynolds number is set to Re = 1600, resulting
in an unstable flow that quickly transitions to turbulence, as shown in Figure 7.7.
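The initial condition (7.7) is divergence-free, a property any implementation should reproduce pointwise; a minimal evaluation sketch with L = U0 = 1 as in the text (not the Specht FS initialization routine):

```python
import math

# Taylor-Green initial velocity, Eq. (7.7), with L = U0 = 1 as in the text.
def tg_velocity(x1, x2, x3, L=1.0, U0=1.0):
    u1 = +U0 * math.sin(x1 / L) * math.cos(x2 / L) * math.cos(x3 / L)
    u2 = -U0 * math.cos(x1 / L) * math.sin(x2 / L) * math.cos(x3 / L)
    u3 = 0.0
    return u1, u2, u3

def divergence(x, h=1e-6):
    """Central-difference divergence; analytically zero for this field."""
    x1, x2, x3 = x
    d1 = (tg_velocity(x1 + h, x2, x3)[0] - tg_velocity(x1 - h, x2, x3)[0]) / (2 * h)
    d2 = (tg_velocity(x1, x2 + h, x3)[1] - tg_velocity(x1, x2 - h, x3)[1]) / (2 * h)
    d3 = (tg_velocity(x1, x2, x3 + h)[2] - tg_velocity(x1, x2, x3 - h)[2]) / (2 * h)
    return d1 + d2 + d3
```

The two nonzero components cancel in the divergence, ∂1u1 + ∂2u2 = 0, so the numerical divergence is zero up to finite-difference error.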
For solution, four meshes are considered. The first one discretizes the domain with ne = 16³
elements of polynomial degree p = 8, with the second one refining to ne = 32³. The third
and fourth grid use p = 16 with ne = 8³ and 16³ elements, respectively, leading to the same
Figure 7.8: Results for the turbulent Taylor-Green vortex. Left: time derivative of the mean kinetic energy, middle: mean dissipation rate captured by the grid, right: numerical dissipation. Reference data courtesy of M. Kronbichler [29].
number of degrees of freedom utilized for p = 8. For all four meshes, simulations were car-
ried out until a simulation time T = 20L/U0 = 20T0 was reached. The time-stepping scheme
of second order was utilized with consistent integration of the convection terms and the
time step width imposed by the CFL condition using CCFL = 0.125. As the grids for this
benchmark are deliberately chosen to be very coarse, not every feature of the flow can be
captured by them. Hence, a subgrid-scale (SGS) model is required. For the discontinuous
Galerkin methods utilized in [38, 29], the flux formulation induces an implicit SGS
model which generates the required dissipation and stabilization. The continuous Galerkin
formulation in Specht FS, however, does not. As a remedy, the SVV model was employed,
using the Moura kernel with the parameters set to pSVV = p/2 and εSVV = 0.01.
Figure 7.8 compares the derivative of the mean kinetic energy and dissipation to DNS data
from [29]. The coarsest grid with ne = 16³ and p = 8 resolves the flow until t = 4T0, where
it starts deviating from the reference solution. It is not capable of resolving the peak in the time
derivative and resolves only two thirds of the occurring dissipation, with the remaining third
stemming from numerical dissipation which, hence, attributes dissipation to the wrong modes.
With a larger number of elements, more dissipation is resolved, lowering the maximum
numerical dissipation by half. However, raising the polynomial degree at a constant number
of degrees of freedom leads to similar accuracy gains, and the grid with p = 16 and 16³
elements has the highest accuracy in the dissipation rate and, in turn, the lowest numerical
dissipation.
Table 7.4: Grids utilized for the turbulent Taylor-Green benchmark in conjunction with the respective number of degrees of freedom nDOF, number of data points n⋆DOF = 4(p + 1)³ne, number of time steps nt, wall clock time tWall, number of cores nC, CPU time, and computational throughput (nt · n⋆DOF)/(tWall · nC).
p = 8 p = 16
ne 16³ 32³ 4³ 8³ 16³
nDOF 8 388 608 67 108 864 1 048 576 8 388 608 67 108 864
n⋆DOF 11 943 936 95 551 488 1 257 728 11 943 936 80 494 592
nt 26 076 52 152 26 076 52 152 104 304
e²Ek · 10³ 3.57 0.487 17.8 1.37 0.229
Runtime tWall [s] 13 964 74 104 2297 25 855 110 058
nP 2 8 2 2 8
nC 24 96 24 24 96
CPU time [CPUh] 93 1976 15 172 2935
Throughput [1/s] 929 356 700 480 594 827 845 664 794 650
The relative L2-error of the time derivative of the kinetic energy Ek serves as quantification
of the accuracy. It computes to

e²Ek = ∫₀ᵀ (∂tEk(τ) − ∂tEk,ref(τ))² dτ / ∫₀ᵀ (∂tEk,ref(τ))² dτ ,   (7.8)
where Ek,ref is the kinetic energy from the reference data. Table 7.4 lists the meshes, the
respective accuracy attained by simulations on them as well as the computational through-
put. For comparability with [29], the throughput is computed based on the number of
element-local grid points, (p + 1)³ne. Both the continuous and the discontinuous formulation
converge towards the same smooth result. The extra degrees of freedom allowed for in the
discontinuous case vanish and, hence, using the same basis for comparison is reasonable.
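On sampled curves, (7.8) reduces to two quadratures; a sketch using the trapezoidal rule (the thesis does not state which quadrature was used, and the sample arrays are placeholders):

```python
# Relative L2 error of dE_k/dt, Eq. (7.8), approximated with the
# trapezoidal rule on sampled curves.
def trapz(f, t):
    return sum(0.5 * (f[i] + f[i + 1]) * (t[i + 1] - t[i])
               for i in range(len(f) - 1))

def rel_l2_error(dEdt, dEdt_ref, t):
    num = trapz([(a - b) ** 2 for a, b in zip(dEdt, dEdt_ref)], t)
    den = trapz([b ** 2 for b in dEdt_ref], t)
    return num / den    # the squared relative error e^2_Ek from Eq. (7.8)
```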
For a constant number of time steps and, hence, the same time step width, using more
degrees of freedom leads to a smaller error. For instance, using p = 8 with ne = 32³ leads to
a factor of three lower error than p = 16 and ne = 8³. This indicates that the test case
is spatially under-resolved, as designed, since the error does not solely depend on the time
step width. Except for the coarsest mesh using 4³ elements, the computational throughput
is near 800 000 time steps times data points per second and core, with a slightly
higher value for the coarse mesh with p = 8 and a slightly lower one for the same polynomial
degree but 32³ elements. The throughput is nearly constant for both p = 8 and p =
16. With the multigrid solver using only two iterations for the pressure equation in all
simulations, the constant throughput is a result of the higher computational cost for the
convection terms at higher polynomial degrees offsetting the more efficient multigrid cycle.
Furthermore, slight parallelization losses are present when increasing the number of nodes
from one to four, leading to a higher throughput for computations with fewer
elements. Due to the homogeneous meshes utilized, the computational throughput is
higher than in the previous section.
In [29], the test case was run on comparable hardware using deal.II. The computational
throughput attained by Specht FS is a factor of two higher. To the knowledge of the
author, this makes Specht FS the fastest solver for incompressible flow employing high
polynomial degrees at the time of writing.
7.4.3 Parallelization study
To investigate the efficiency of the parallelization of Specht FS, the turbulent Taylor-
Green vortex from Subsection 7.4.2 is revisited. With the pressure solver relying on ghost
elements for boundary data exchange, the attainable strong-scaling speedup is limited.
Therefore, the weak scalability is investigated, increasing the domain size with the number
of processes. The number of nodes is scaled from 1 to 128, using two processes per node, each processing 8³
spectral elements of degree p = 16 using twelve threads. To keep the problem size constant
per process, the domain size is scaled as
Ω = (−nP,1Lπ, nP,1Lπ)× (−nP,2Lπ, nP,2Lπ)× (−nP,3Lπ, nP,3Lπ) , (7.9)
where nP,i refers to the number of processes decomposing the domain in direction xi. Due
to the 2π periodicity of the problem, the choice of domain suffices to ensure that the same
computations are carried out. The other parameters are kept as in the previous section. The
runtime for computing until t = T0 is measured, and the scaleup compared to one node
as well as the resulting parallel efficiency are computed [49].
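For weak scaling with fixed work per node, the ideal runtime is flat, so scaleup and efficiency follow directly from the measured wall clock times; a sketch reproducing the Table 7.5 entries (times copied from the table):

```python
# Weak-scaling metrics: with constant work per node, the ideal runtime is
# flat, so scaleup S = n_N * t(1) / t(n_N) and efficiency E = t(1) / t(n_N).
def weak_scaling(n_nodes, t_wall, t_ref):
    scaleup = n_nodes * t_ref / t_wall
    efficiency = 100.0 * t_ref / t_wall
    return scaleup, efficiency

t1 = 2231.0                        # one-node reference run from Table 7.5
s, e = weak_scaling(2, 2345.0, t1)
print(round(s, 1), round(e))       # 1.9 95, matching the two-node row
```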
Table 7.5 lists the grid decompositions and the measured runtimes, whereas Figure 7.9 depicts
the scaleup and parallel efficiency compared to the optimal linear case. The parallel efficiency
stays above 80% until using eight nodes. Due to the choice of grid decomposition, this is the
point where every process communicates with at least one neighbour in a different node and
the communication in that direction exhibits a higher latency as well as a lower bandwidth.
Afterwards, the parallel efficiency decreases, to 70% when using 32 nodes, 48% at 64 and
then 37% at 128. While the amount of runtime spent in the coarse grid solver increases
from 1% on one node to 3% at 32 and 7% at 128 nodes, it is not large enough to explain
the low efficiency. The most probable cause for the low scalability is the communication
structure in the multigrid solver. While the halo data is only sent to neighbouring processes,
Table 7.5: Runtime data for the turbulent Taylor-Green benchmark using 8³ elements per process. Here nN refers to the number of nodes, nP,i to the number of processes in each coordinate direction, nC to the total number of cores utilized, tWall to the elapsed wall clock time during the computation, and tCoarse to the amount of wall clock time spent in the coarse grid solver. Both scaleup and parallel efficiency are based on the run using one node.
nN nP,1 nP,2 nP,3 nC nt tWall [s] Scaleup Efficiency [%] tCoarse/tWall [%]
1 2 1 1 24 2608 2231 1.0 100 0.7
2 2 2 1 48 2608 2345 1.9 95 0.9
4 2 2 2 96 2608 2475 3.6 90 1.1
8 4 2 2 192 2608 2681 6.7 83 1.5
16 4 4 2 384 2608 2985 12.0 75 2.1
32 4 4 4 768 2608 3207 22.3 70 2.9
64 8 4 4 1536 2608 4623 30.9 48 3.7
128 8 8 4 3072 2608 6095 46.8 37 6.3
Figure 7.9: Scaleup and resulting parallel efficiency for the turbulent Taylor-Green benchmark using 8³ elements per process.
there are 28 neighbours in a structured grid. Sending one halo layer to a facial neighbour leads
to 80 MB of data being transferred at p = 16 and 8³ elements on the partition. With one pre-
and one postsmoothing step and two iterations, every partition sends more than 2 GB of data
per solver call on the finest grid level alone. At the time of writing, this communication
is implemented in an explicit fashion, encapsulating the communication in a subroutine
to allow for verification of the communication structure. Overlapping of computation and
communication, i.e. overlapping the smoother with the halo transfer, is expected to increase
the scalability significantly and will be investigated in future work.
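The quoted volume can be checked with a back-of-envelope estimate from the figures given in the text; the sketch below assumes six facial neighbours and one halo exchange per smoother application (both assumptions, not stated above), which already yields close to the quoted 2 GB from facial transfers alone:

```python
# Back-of-envelope halo volume on the finest multigrid level, using the
# figure quoted in the text: 80 MB per facial neighbour at p = 16 with
# 8^3 elements per partition. Assumes six facial neighbours and one halo
# exchange per smoother application; edge and corner messages come on top.
mb_per_face = 80
faces = 6
smoother_applications = (1 + 1) * 2        # (pre + post) * two iterations
total_gb = mb_per_face * faces * smoother_applications / 1024
print(f"{total_gb:.1f} GB")                # 1.9 GB from faces alone
```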
7.5 Conclusions
This chapter presented the flow solver Specht FS. It combines the algorithmic improve-
ments for the spectral-element method developed in Chapter 4 and Chapter 5 with the
optimized operators from Chapter 3 and the parallelization scheme proposed in Chapter 6.
Therefore, it can solve elliptic equations with the runtime scaling linearly in
the number of degrees of freedom, independent of the polynomial degree, exploit hardware-
optimized operators, and harness heterogeneous hardware. Furthermore, it is written in an
extensible fashion, allowing further variables such as passive and active scalars to be included.
Using multiple test cases, Specht FS was validated. First, analytical test cases served to
validate the coupling between velocity and pressure, the convection terms, and then the boundary
conditions and the implementation of right-hand sides. In each of these, the time-stepping
scheme converged with the expected order for the velocities, whereas the pressure had a
higher remaining error due to the boundary conditions. This, however, does not impede
the convergence of the velocities. Afterwards, the turbulent plane channel flow at Reτ = 180
served to validate Specht FS for the simulation of turbulent flows. While minor differences
were present between reference and computed Reynolds stresses, the overall agreement
is good, and the remaining differences can probably be decreased by higher resolution and longer averaging times.
Therefore, Specht FS can be seen as validated for simulations of turbulent flows.
After the validation, the runtime behaviour of Specht FS was analyzed. For a polynomial
degree of p = 16 and the maximum RAM bandwidth per core, the explicitly treated con-
vection terms take as much time as the solution of the pressure equation does. This parity
in runtime achieves a goal of this dissertation: The large portion of the runtime spent in
elliptic solvers typically renders time-stepping schemes with more substeps too expensive.
The usage of the multigrid solvers presented in this work allows for the efficient solution of elliptic
equations and reduces this portion to that of the explicit terms.
The turbulent Taylor-Green vortex benchmark was then utilized to evaluate the efficiency.
For this benchmark, Specht FS achieves twice the throughput of the well-tuned deal.II
code on similar hardware. The performance difference stems mainly from the linear scaling
of the multigrid solvers and makes Specht FS, to the best knowledge of the author, the
fastest high-order flow solver at the time of writing. Lastly, a scalability study showed
that Specht FS scales reasonably well up to 1000 cores, albeit only for weak scalability.
Overlapping of communication and computation is required to improve the scalability and
will be implemented in further work.
Chapter 8
Conclusions and outlook
This work contributes to high-order methods. It aimed at improving the algorithms as well
as future-proofing these with regard to the growing memory gap and the expected het-
erogeneity in upcoming high-performance computers. To this end, the central component for
computing incompressible fluid flow, the Helmholtz solver, was investigated as it occupies
up to 95% of the runtime [29]. First, regarding the attainable performance using standard
tensor-product operators, then considering static condensation to lower the operation count,
multigrid to lower the number of iterations, and, lastly, exploiting heterogeneity. There-
after, the improvements were implemented in a fully-fledged flow solver for high order and
the attained performance measured.
Chapter 3 investigated the attainable performance with tensor-product operators in three
dimensions. While sum factorization allows evaluating these with O(p⁴) operations, matrix
multiplication often ends up being faster [17], and straightforward implementations
exhibited the same behavior for the interpolation operator. Adding parametrization
and hardware-adapted optimizations improved these, leading to 20 GFLOP s⁻¹, with half of
the performance available per core being extracted and the remainder being spent on
loading and storing data. The methods were thereafter applied to larger operators, such
as Helmholtz and fast diagonalization operators. The resulting operators showed a high
degree of efficiency and served as baseline for the further chapters.
Chapter 4 investigated static condensation as a means to bridge the growing memory gap
and to lower the number of operations. With static condensation, the equation system is reduced to
the boundary nodes of the elements. This decreases the number of unknowns in the equation
system to O(p²). However, these are coupled more tightly, resulting in the operators often
being implemented using matrix-matrix computation and therefore scaling with O(p⁴) [21, 51]. Moreover, the result is not matrix-free by definition and even raises the required memory
bandwidth for inhomogeneous meshes. A tensor-product decomposition of the operator led
to a matrix-free variant of the Helmholtz operator and applying sum factorization as well
as product factorization reduced it to linear complexity, i.e. O(p³). The implementation
thereof outpaced highly optimized matrix-matrix products for any polynomial degree and a
tensor-product operator for the full case for p > 7.
To investigate the prospective performance gains obtained by the linearly scaling operator,
solvers using the preconditioned conjugate gradient method were devised. For these,
an increasing operator efficiency offsets the growing number of iterations, leading to a linearly
scaling runtime when increasing the polynomial degree. The solvers scratched the 1 µs mark
per degree of freedom when using only one CPU core and standard programming
techniques. However, while the solvers were faster than expected and exhibited an exceptional
robustness against the aspect ratio, the number of iterations still increased with the number
of elements, rendering them infeasible as pressure solvers for large-scale computations.
To attain robustness with regard to the number of elements, Chapter 5 investigated p-
multigrid. Previous studies showed that overlapping Schwarz methods can generate a
constant iteration count [88, 52]. Moreover, using only six smoothing steps can suffice to
lower the residual by ten orders of magnitude [123] and this holds for the condensed case [51].
However, the overlapping Schwarz smoothers require inversion on small subdomains. Fa-
cilitating these with the fast diagonalization method in the full case, or with matrix inversion
in the condensed one, results in O(p⁴) operations. Embedding the statically condensed system into
the full one allowed capitalizing on fast diagonalization as a tensor-product inverse and gen-
erated a matrix-free inverse, albeit with super-linear scaling. Further factorization led to
a linearly scaling inverse on the Schwarz domains and, in turn, a smoother achieving a
constant runtime per degree of freedom.
The combination of linearly scaling operator and smoother allowed for a multigrid cycle with
a constant runtime per degree of freedom. Furthermore, the resulting multigrid method in-
herited the convergence properties from [122, 51] and therefore attained a constant number
of iterations. This, in turn, led to a constant runtime per degree of freedom, regardless of
the polynomial degree or the number of elements. The method improved upon the condensed
solvers and attained 0.5 µs as runtime per unknown when computing on a single core of a
standard CPU. Moreover, the solver outperformed the deal.II library by a factor of four
when comparing their p = 8 to p = 32 with the proposed multigrid solver [82]. There-
fore, the proposed method unlocked computation with extremely high orders. Moreover, it
constituted a three-fold improvement over the work that stimulated this research [51].
To address the increasing heterogeneity in the high-performance computers, Chapter 6
investigated heterogeneous computing on the CPU-GPU coupling. The hybrid program-
ming model [49] was expanded to a two-level parallelization with the coarse layer decom-
posing the mesh and the fine layer harnessing the capabilities of the specific hardware.
When using pragma-based language extensions, such as OpenMP [102], OpenACC [101],
or OmpSs [15], the model keeps variants for every kind of hardware inside a single source.
Not only does this increase the maintainability, the different variants can also share
the same communication pattern. Usage of the same communication pattern allows the
domain decomposition layer to fuse the different variants into a single heterogeneous
system.
With a heterogeneous system available, load balancing was investigated on the Helmholtz
solvers. While using a linear fit as functional performance model and treating the runtime
as a black box suffices for many applications [30, 140, 85], this was not the case for the pCG
solver. Modelled and resulting runtimes exhibited large differences, with the prediction
error ranging up to 40%. These errors were induced by disregarding the communication
pattern: Exchange of boundary data and scalar products incurred synchronization, splitting
the computation into multiple parts and diverging from the assumption of a monolithic
runtime. A load balancing model accounting for these additional synchronizations was devised.
While leading to further restrictions and therefore a lower predicted speedup, the runtime
expectation and result were in agreement: Where formerly the error ranged up to 40%, it was
now 5-10%, allowing accurate predictions for the heterogeneous system and outperforming
both homogeneous configurations.
Chapter 7 combined the previous advances into the flow solver Specht FS: The static
condensed Helmholtz solvers from Chapter 4, the multigrid solver from Chapter 5, and
the programming model from Chapter 6. Therefore, the resulting solver is capable of solv-
ing Helmholtz equations in linear runtime and can compute on heterogeneous systems.
Analytical test cases validated components of Specht FS, whereas the turbulent channel
flow served as validation for time-resolved simulation of turbulent flows. Thereafter, the
runtime distribution inside the solver was investigated. Treating the pressure can account
for 90% of the runtime; in Specht FS, however, the implicit treatment of the pressure
required only as much runtime as the explicit treatment of the convection terms. Lastly, the under-
resolved turbulent Taylor-Green benchmark investigated the performance of the proposed
flow solver, where Specht FS achieved twice the throughput on similar hardware compared
to deal.II [29].
The algorithmic improvements presented in this work allow for solutions of the Helmholtz
equation in linear runtime. Not only does this allow for the implicit treatment of the pressure
equation to take as long as the explicit treatment of the convection terms, it also allows the
use of far higher polynomial degrees and therefore higher convergence rates. Furthermore, the
algorithms bridge the growing memory gap, as loading and storing in the statically condensed,
matrix-free solvers scale with O(p²), whereas the number of operations scales with O(p³). However, the methods are currently restricted to structured Cartesian meshes and the con-
tinuous spectral-element method. Therefore, future work will expand these to discontinuous
spectral-element methods and investigate similar structure exploitation for curvilinear ele-
ments; first results for the former are available in [65].
Appendix A
Further results for wildly heterogeneous
systems
Table A.1: Iteration time per degree of freedom measured in ns when computing on either multiple cores with OpenMP or GK210 GPU chips on ns sockets.
p = 7 p = 11 p = 15
ne ne ne
ns Setup 83 123 163 83 123 163 83 123 163
1 1 core 97.1 102.8 106.5 119.5 122.7 123.8 113.1 114.9 114.8
1 4 cores 25.4 26.6 31.4 30.6 32.1 32.4 29.4 30.1 29.9
1 8 cores 13.4 14.3 17.8 22.2 17.8 18.1 16.1 16.6 16.6
1 12 cores 9.7 10.4 11.9 11.6 13.1 13.5 11.9 12.6 12.4
1 1 GK210 7.5 5.1 4.3 6.7 5.5 5.1 6.8 6.1 5.9
1 2 GK210 6.1 3.2 2.5 4.2 3.1 2.8 3.8 3.3 3.1
2 1 core 49.3 49.1 52.0 57.7 60.7 61.9 55.8 57.3 57.4
2 4 cores 13.4 12.9 13.6 14.9 15.8 16.2 14.4 15.1 15.1
2 8 cores 7.3 6.8 7.4 7.8 8.6 9.0 7.8 8.4 9.7
2 12 cores 5.6 4.8 6.1 5.6 6.3 6.6 5.7 6.2 6.4
2 1 GK210 6.6 3.3 2.5 4.2 3.1 2.7 3.8 3.3 3.1
2 2 GK210 6.1 2.3 1.6 2.7 1.9 1.5 2.2 1.8 1.6
Table A.2: Grids utilized for the homogeneous computations and resulting runtimes per iteration tIter when using either ten cores of the CPU (CPU) or one GK210 chip (GPU).
tIter [ms]
ne Decomposition CPU GPU
100 10× 10× 1 0.49 1.02
200 20× 10× 1 0.88 1.20
300 20× 15× 1 1.26 1.32
400 20× 20× 1 1.63 1.47
500 25× 20× 1 2.24 1.58
600 30× 20× 1 2.39 1.73
700 35× 20× 1 2.79 1.93
800 40× 20× 1 3.35 2.11
900 36× 25× 1 3.62 2.29
1000 40× 25× 1 4.11 2.47
Table A.3: Element distribution for the heterogeneous case using the single-step model for load balancing and ten cores of the CPU (C) as well as one GK210 chip (G), with modelled and measured times per iteration tIter as well as the resulting relative error.
Distribution tIter [ms]
ne n(C)e n(G)e Measurement Model Relative error [%]
200 160 40 1.01 0.71 43.7
300 180 120 1.17 0.83 40.8
400 220 180 1.35 0.95 42.0
500 240 260 1.53 1.08 41.8
600 280 320 1.72 1.20 43.2
700 300 400 1.83 1.33 37.4
800 340 460 2.00 1.45 37.8
900 350 550 2.23 1.60 39.6
1000 400 600 2.34 1.70 38.0
Bibliography
[1] NVIDIA CUDA programming guide (version 1.0). NVIDIA: Santa Clara, CA, 2007.
[2] S. Abhyankar, J. Brown, E. M. Constantinescu, D. Ghosh, B. F. Smith, and
H. Zhang. PETSc/TS: A modern scalable ODE/DAE solver library. arXiv preprint
arXiv:1806.01437, 2018.
[3] M. Atak, A. Beck, T. Bolemann, D. Flad, H. Frank, and C.-D. Munz. High fi-
delity scale-resolving computational fluid dynamics using the high order discontinuous
Galerkin spectral element method. In High Performance Computing in Science and
Engineering '15, pages 511–530. Springer, 2016.
[4] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: a unified plat-
form for task scheduling on heterogeneous multicore architectures. Concurrency and
Computation: Practice and Experience, 23(2):187–198, 2011.
[5] W. Bangerth, R. Hartmann, and G. Kanschat. deal.II — a general-purpose object-
oriented finite element library. ACM Transactions on Mathematical Software (TOMS),
33(4):24, 2007.
[6] A. Barak, T. Ben-Nun, E. Levy, and A. Shiloh. A package for OpenCL based het-
erogeneous computing on clusters with many GPU devices. In Cluster Computing
Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Con-
ference on, pages 1–7. IEEE, 2010.
[7] P. Bastian, C. Engwer, D. Göddeke, O. Iliev, O. Ippisch, M. Ohlberger, S. Turek,
J. Fahlke, S. Kaulmann, S. Müthing, and D. Ribbrock. EXA-DUNE: Flexible PDE
solvers, numerical methods and applications. In L. Lopes, J. Zilinskas, A. Costan,
R. G. Cascella, G. Kecskemeti, E. Jeannot, M. Cannataro, L. Ricci, S. Benkner, S. Pe-
tit, V. Scarano, J. Gracia, S. Hunold, S. L. Scott, S. Lankes, C. Lengauer, J. Car-
retero, J. Breitbart, and M. Alexander, editors, Euro-Par 2014: Parallel Processing
Workshops, pages 530–541, Cham, 2014. Springer International Publishing.
[8] A. D. Beck, T. Bolemann, D. Flad, H. Frank, G. J. Gassner, F. Hindenlang, and C.-
D. Munz. High-order discontinuous Galerkin spectral element methods for transitional
and turbulent flow simulations. International Journal for Numerical Methods in Fluids,
76(8):522–548, 2014.
[9] T. Ben-Nun, E. Levy, A. Barak, and E. Rubin. Memory access patterns: the missing
piece of the multi-GPU puzzle. In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, page 19. ACM, 2015.
[10] M. Berger, M. Aftosmis, D. Marshall, and S. Murman. Performance of a new CFD
flow solver using a hybrid programming paradigm. Journal of Parallel and Distributed
Computing, 65:414–423, 2005.
[11] H. M. Blackburn and S. Sherwin. Formulation of a Galerkin spectral element–Fourier
method for three-dimensional incompressible flows in cylindrical geometries. Journal
of Computational Physics, 197(2):759–778, 2004.
[12] J. Bramble. Multigrid methods. Pitman Res. Notes Math. Ser. 294. Longman Scientific
& Technical, Harlow, UK, 1995.
[13] A. Brandt. Multi-level adaptive technique (MLAT) for fast numerical solution to
boundary value problems. In H. Cabannes and R. Temam, editors, Proceedings of
the 3rd International Conference on Numerical Methods in Fluid Dynamics (Berlin),
pages 82–89. Springer-Verlag, 1973.
[14] A. Brandt. Guide to multigrid development. In Multigrid Methods, volume 960 of
Lecture Notes in Mathematics, pages 220–312. Springer Berlin/Heidelberg, 1982.
[15] J. Bueno, J. Planas, A. Duran, R. Badia, X. Martorell, E. Ayguadé, and J. Labarta.
Productive programming of GPU clusters with OmpSs. In Parallel Distributed Process-
ing Symposium (IPDPS), 2012 IEEE 26th International, pages 557–568, May 2012.
[16] P. E. Buis and W. R. Dyksen. Efficient vector and parallel manipulation of tensor
products. ACM Transactions on Mathematical Software (TOMS), 22(1):18–23, 1996.
[17] C. Cantwell, S. Sherwin, R. Kirby, and P. Kelly. From h to p efficiently: Strategy
selection for operator evaluation on hexahedral and tetrahedral elements. Computers
& Fluids, 43(1):23–28, 2011.
[18] C. D. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo,
D. De Grazia, S. Yakovlev, J.-E. Lombard, D. Ekelschot, et al. Nektar++: An open-
source spectral/hp element framework. Computer Physics Communications, 192:205–
219, 2015.
[19] J. Castrillon, M. Lieber, S. Klüppelholz, M. Völp, N. Asmussen, U. Assmann,
F. Baader, C. Baier, G. Fettweis, J. Fröhlich, A. Goens, S. Haas, D. Habich, H. Härtig,
M. Hasler, I. Huismann, T. Karnagel, S. Karol, A. Kumar, W. Lehner, L. Leuschner,
S. Ling, S. Märcker, C. Menard, J. Mey, W. Nagel, B. Nöthen, R. Peñaloza, M. Raitza,
J. Stiller, A. Ungethüm, A. Voigt, and S. Wunderlich. A hardware/software stack
for heterogeneous systems. IEEE Transactions on Multi-Scale Computing Systems,
4(3):243–259, July 2018.
[20] C. Coarfa, Y. Dotsenko, J. Mellor-Crummey, F. Cantonnet, T. El-Ghazawi,
A. Mohanti, Y. Yao, and D. Chavarría-Miranda. An evaluation of global address space
languages: Co-array Fortran and Unified Parallel C. In Proceedings of the tenth ACM
SIGPLAN symposium on Principles and practice of parallel programming, pages 36–47.
ACM, 2005.
[21] W. Couzy and M. Deville. A fast Schur complement method for the spectral element
discretization of the incompressible Navier-Stokes equations. Journal of Computational
Physics, 116(1):135–142, 1995.
[22] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design
of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of
Solid-State Circuits, 9(5):256–268, 1974.
[23] M. Deville, P. Fischer, and E. Mund. High-Order Methods for Incompressible Fluid
Flow. Cambridge University Press, 2002.
[24] T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, and J. Dongarra. Hydrodynamic
computation with hybrid programming on CPU-GPU clusters. Innovative Computing
Laboratory, University of Tennessee, 2013. Online available at http://citeseerx.
ist.psu.edu/viewdoc/download?doi=10.1.1.423.4016&rep=rep1&type=pdf.
[25] T. Dong et al. A step towards energy efficient computing: Redesigning a hydrodynamic
application on CPU-GPU. In Parallel and Distributed Processing Symposium, 2014
IEEE 28th International, pages 972–981. IEEE, 2014.
[26] J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J.-C. André, D. Barkai,
J.-Y. Berthou, T. Boku, B. Braunschweig, et al. The international exascale software
project roadmap. International Journal of High Performance Computing Applications,
25(1):3–60, 2011.
[27] M. Dryja, B. F. Smith, and O. B. Widlund. Schwarz analysis of iterative
substructuring algorithms for elliptic problems in three dimensions. SIAM Journal on
Numerical Analysis, 31(6):1662–1694, 1994.
[28] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark
silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th
Annual International Symposium on, pages 365–376. IEEE, 2011.
[29] N. Fehn, W. A. Wall, and M. Kronbichler. Efficiency of high-performance discontinuous
Galerkin spectral element methods for under-resolved turbulent incompressible flows.
International Journal for Numerical Methods in Fluids, 2018.
[30] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, and T. Aoki. Performance modeling
and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters.
Parallel Computing, 46:1–13, 2015.
[31] X. Feng and O. A. Karakashian. Two-level additive Schwarz methods for a discon-
tinuous Galerkin approximation of second order elliptic problems. SIAM Journal on
Numerical Analysis, 39(4):1343–1365, 2001.
[32] J. H. Ferziger and M. Perić. Computational Methods for Fluid Dynamics.
Springer-Verlag, 2002.
[33] P. Fischer and J. Mullen. Filter-based stabilization of spectral element methods.
Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 332(3):265–270,
2001.
[34] P. F. Fischer. An overlapping Schwarz method for spectral element solution of the
incompressible Navier–Stokes equations. Journal of Computational Physics, 133(1):84–
101, 1997.
[35] P. F. Fischer, J. W. Lottes, and S. G. Kerkemeier. Nek5000 web page, 2008.
[36] J. Fröhlich. Large Eddy Simulation turbulenter Strömungen. Teubner, Wiesbaden,
2006. In German.
[37] M. J. Gander et al. Schwarz methods over the course of time. Electronic Transactions
on Numerical Analysis, 31:228–255, 2008.
[38] G. J. Gassner and A. D. Beck. On the accuracy of high-order discretizations for
underresolved turbulence simulations. Theoretical and Computational Fluid Dynamics,
27(3-4):221–237, 2013.
[39] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. Kelly. Performance
analysis of the OP2 framework on many-core architectures. ACM SIGMETRICS Per-
formance Evaluation Review, 38(4):9–15, 2011.
[40] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski,
and S. Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced
cluster. Parallel Computing, 33(10-11):685–699, 2007.
[41] G. H. Golub and Q. Ye. Inexact preconditioned conjugate gradient method with inner-
outer iteration. SIAM Journal on Scientific Computing, 21(4):1305–1320, 1999.
[42] K. Goto and R. Van De Geijn. High-performance implementation of the level-3 BLAS.
ACM Transactions on Mathematical Software, 35(1), July 2008.
[43] A. E. Green and G. I. Taylor. Mechanism of the production of small eddies from larger
ones. Proceedings of the Royal Society of London. Series A, 158, 1937.
[44] C. Gregg and K. Hazelwood. Where is the data? Why you cannot debate CPU
vs. GPU performance without the answer. In Proceedings of the IEEE International
Symposium on Performance Analysis of Systems and Software, ISPASS ’11, pages
134–144, Washington, DC, USA, 2011. IEEE Computer Society.
[45] L. Grinberg, D. Pekurovsky, S. Sherwin, and G. E. Karniadakis. Parallel performance of
the coarse space linear vertex solver and low energy basis preconditioner for spectral/hp
elements. Parallel Computing, 35(5):284–304, 2009.
[46] J.-L. Guermond, P. Minev, and J. Shen. An overview of projection methods for
incompressible flows. Computer methods in applied mechanics and engineering,
195(44):6011–6045, 2006.
[47] W. Hackbusch. Multigrid Methods and Applications, volume 4 of Computational Math-
ematics. Springer, 1985.
[48] D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer. An
energy efficiency feature survey of the Intel Haswell processor. In Parallel Distributed
Processing Symposium Workshops (IPDPSW), 2015 IEEE International, 2015.
[49] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists
and Engineers. CRC Press, Boca Raton, FL, USA, 1st edition, July 2010.
[50] R. Hartmann, M. Lukáčová-Medviďová, and F. Prill. Efficient preconditioning for the
discontinuous Galerkin finite element method by low-order elements. Applied
Numerical Mathematics, 59(8):1737–1753, 2009.
[51] L. Haupt. Erweiterte mathematische Methoden zur Simulation von turbulenten
Strömungsvorgängen auf parallelen Rechnern. PhD thesis, Centre for Information
Services and High Performance Computing (ZIH), TU Dresden, Dresden, 2017.
[52] L. Haupt, J. Stiller, and W. E. Nagel. A fast spectral element solver combining static
condensation and multigrid techniques. Journal of Computational Physics, 255:384–395,
2013.
[53] E. Hermann et al. Multi-GPU and multi-CPU parallelization for interactive physics
simulations. Euro-Par 2010-Parallel Processing, pages 235–246, 2010.
[54] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear
systems. Journal of Research of the National Bureau of Standards, 49(6):409–436,
1952.
[55] J. S. Hesthaven and T. Warburton. Nodal discontinuous Galerkin methods: algorithms,
analysis, and applications. Springer Science & Business Media, 2007.
[56] F. Hindenlang, G. J. Gassner, C. Altmann, A. Beck, M. Staudenmaier, and C.-D.
Munz. Explicit discontinuous Galerkin methods for unsteady problems. Computers &
Fluids, 61:86–93, 2012.
[57] I. Huismann, L. Haupt, J. Stiller, and J. Fröhlich. Sum factorization of the static
condensed Helmholtz equation in a three-dimensional spectral element discretization.
PAMM, 14(1):969–970, 2014.
[58] I. Huismann, M. Lieber, J. Stiller, and J. Fröhlich. Load balancing for CPU-GPU
coupling in computational fluid dynamics. In Parallel Processing and Applied
Mathematics, pages 371–380. Springer, 2017.
[59] I. Huismann, J. Stiller, and J. Fröhlich. Two-level parallelization of a fluid mechanics
algorithm exploiting hardware heterogeneity. Computers & Fluids, 117:114–124, 2015.
[60] I. Huismann, J. Stiller, and J. Fröhlich. Fast static condensation for the Helmholtz
equation in a spectral-element discretization. In Parallel Processing and Applied
Mathematics, pages 371–380. Springer, 2016.
[61] I. Huismann, J. Stiller, and J. Fröhlich. Building blocks for a leading edge high-order
flow solver. PAMM, 17(1), 2017.
[62] I. Huismann, J. Stiller, and J. Fröhlich. Factorizing the factorization – a spectral-
element solver for elliptic equations with linear operation count. Journal of
Computational Physics, 346:437–448, 2017.
[63] I. Huismann, J. Stiller, and J. Fröhlich. Scaling to the stars – a linearly scaling elliptic
solver for p-multigrid. Journal of Computational Physics, 398:108868, 2019.
[64] I. Huismann, J. Stiller, and J. Fröhlich. Efficient high-order spectral element
discretizations for building block operators of CFD. Computers & Fluids, 197:104386,
2020.
[65] I. Huismann, J. Stiller, and J. Fröhlich. Linearizing the hybridizable discontinuous
Galerkin method: a linearly scaling operator. arXiv preprint arXiv:2007.11891, 2020.
Submitted.
[66] W. Hundsdorfer and S. J. Ruuth. IMEX extensions of linear multistep methods with
general monotonicity and boundedness properties. Journal of Computational Physics,
225(2):2016–2042, 2007.
[67] H. Iwai. Future of nano CMOS technology. Solid-State Electronics, 112:56–67, 2015.
[68] D. Jacob, J. Petersen, B. Eggert, A. Alias, O. B. Christensen, L. M. Bouwer, A. Braun,
A. Colette, M. Déqué, G. Georgievski, et al. EURO-CORDEX: new high-resolution
climate change projections for European impact research. Regional Environmental
Change, 14(2):563–578, 2014.
[69] A. Jameson. Time dependent calculations using multigrid, with applications to un-
steady flows past airfoils and wings. In 10th Computational Fluid Dynamics Confer-
ence, page 1596, 1991.
[70] H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, and B. Chapman. High per-
formance computing using MPI and OpenMP on multi-core parallel systems. Parallel
Computing, 37:562–575, 2011.
[71] L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system
based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.
[72] G. Kanschat. Robust smoothers for high-order discontinuous Galerkin discretizations
of advection–diffusion problems. Journal of Computational and Applied Mathematics,
218(1):53–60, 2008.
[73] T. Karnagel, D. Habich, and W. Lehner. Limitations of intra-operator parallelism
using heterogeneous computing resources. In ADBIS 2016, pages 291–305, 2016.
[74] G. Karniadakis and S. Sherwin. Spectral/hp Element Methods for CFD. Oxford Uni-
versity Press, 1999.
[75] G. E. Karniadakis, M. Israeli, and S. A. Orszag. High-order splitting methods for the
incompressible Navier-Stokes equations. Journal of Computational Physics, 97(2):414–
443, 1991.
[76] R. M. Kirby, S. J. Sherwin, and B. Cockburn. To CG or to HDG: a comparative study.
Journal of Scientific Computing, 51(1):183–212, 2012.
[77] A. Klöckner, T. Warburton, J. Bridge, and J. Hesthaven. Nodal discontinuous Galerkin
methods on graphics processors. Journal of Computational Physics, 228(21):7863–7882,
2009.
[78] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer,
M. Gerndt, D. Lorenz, A. Malony, W. E. Nagel, Y. Oleynik, P. Philippen, P. Saviankou,
D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf. Score-P: A
joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU,
and Vampir. In H. Brunst, M. S. Müller, W. E. Nagel, and M. M. Resch, editors,
Tools for High Performance Computing 2011, pages 79–91, Berlin, Heidelberg, 2012.
Springer Berlin Heidelberg.
[79] D. Komatitsch, G. Erlebacher, D. Göddeke, and D. Michéa. High-order finite-element
seismic wave propagation modeling with MPI on a large GPU cluster. Journal of
Computational Physics, 229(20):7692 – 7714, 2010.
[80] D. Koschichow, J. Fröhlich, R. Ciorciari, and R. Niehuis. Analysis of the influence
of periodic passing wakes on the secondary flow near the endwall of a linear LPT
cascade using DNS and U-RANS. In ETC2015-151, Proceedings of the 11th European
Conference on Turbomachinery Fluid Dynamics and Thermodynamics, 2015.
[81] E. Krause. Fluid mechanics. Springer, 2005.
[82] M. Kronbichler and W. A. Wall. A performance comparison of continuous and dis-
continuous Galerkin methods with fast multigrid solvers. SIAM Journal on Scientific
Computing, 40(5):A3423–A3448, 2018.
[83] Y.-Y. Kwan and J. Shen. An efficient direct parallel spectral-element solver for sepa-
rable elliptic problems. Journal of Computational Physics, 225(2):1721 – 1735, 2007.
[84] F. Lemaitre and L. Lacassagne. Batched Cholesky factorization for tiny matrices. In
Design and Architectures for Signal and Image Processing (DASIP), 2016 Conference
on, pages 130–137. IEEE, 2016.
[85] C. Liu and J. Shen. A phase field model for the mixture of two incompressible fluids
and its approximation by a Fourier-spectral method. Physica D: Nonlinear Phenomena,
179(3–4):211–228, 2003.
[86] X. Liu et al. A hybrid solution method for CFD applications on GPU-accelerated
hybrid HPC platforms. Future Generation Computer Systems, 56:759–765, 2016.
[87] J.-E. W. Lombard, D. Moxey, S. J. Sherwin, J. F. A. Hoessler, S. Dhandapani, and
M. J. Taylor. Implicit large-eddy simulation of a wingtip vortex. AIAA Journal, pages
1–13, Nov. 2015.
[88] J. W. Lottes and P. F. Fischer. Hybrid multigrid/Schwarz algorithms for the spectral
element method. Journal of Scientific Computing, 24(1):45–78, 2005.
[89] R. Lynch, J. Rice, and D. Thomas. Direct solution of partial difference equations by
tensor product methods. Numerische Mathematik, 6(1):185–199, 1964.
[90] M. Manna, A. Vacca, and M. O. Deville. Preconditioned spectral multi-domain dis-
cretization of the incompressible Navier–Stokes equations. Journal of Computational
Physics, 201(1):204 – 223, 2004.
[91] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Don-
garra. High-performance matrix-matrix multiplications of very small matrices. In
European Conference on Parallel Processing, pages 659–671. Springer, 2016.
[92] C. P. Mellen, J. Fröhlich, and W. Rodi. Lessons from LESFOIL project on large-eddy
simulation of flow around an airfoil. AIAA journal, 41(4):573–581, 2003.
[93] E. Merzari, W. Pointer, and P. Fischer. Numerical simulation and proper orthogonal
decomposition of the flow in a counter-flow T-junction. Journal of Fluids Engineering,
135(9):091304, 2013.
[94] H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon. Top500 list – June 2018, 2018.
Available online at www.top500.org.
[95] G. E. Moore. Cramming more components onto integrated circuits. Electronics,
38(8):114–117, 1965.
[96] R. D. Moser, J. Kim, and N. N. Mansour. Direct numerical simulation of turbulent
channel flow up to Reτ = 590. Physics of Fluids, 11(4):943–945, 1999.
[97] R. Moura, S. Sherwin, and J. Peiró. Eigensolution analysis of spectral/hp continuous
Galerkin approximations to advection–diffusion problems: Insights into spectral
vanishing viscosity. Journal of Computational Physics, 307:401–422, 2016.
[98] D. Moxey, C. Cantwell, R. Kirby, and S. Sherwin. Optimising the performance of
the spectral/hp element method with collective linear algebra operations. Computer
Methods in Applied Mechanics and Engineering, 310:628–645, 2016.
[99] K. Nelson and O. Fringer. Reducing spin-up time for simulations of turbulent channel
flow. Physics of Fluids, 29(10):105101, 2017.
[100] C. W. Oosterlee and T. Washio. An evaluation of parallel multigrid as a solver and a
preconditioner for singularly perturbed problems. SIAM Journal on Scientific Com-
puting, 19(1):87–110, 1998.
[101] OpenACC-Standard.org. The OpenACC application programming interface version
2.6, 2017. Specification available online.
[102] OpenMP Architecture Review Board. OpenMP application program interface version
4.5, 2015. Specification available online.
[103] S. Páll et al. Tackling exascale software challenges in molecular dynamics simulations
with GROMACS. In EASC 2014, pages 3–27. 2015.
[104] R. Pasquetti and F. Rapetti. p-multigrid method for Fekete-Gauss spectral element
approximations of elliptic problems. Communications in Computational Physics, 5(2-
4):667–682, Feb 2009.
[105] A. T. Patera. A spectral element method for fluid dynamics: laminar flow in a channel
expansion. Journal of Computational Physics, 54(3):468 – 488, 1984.
[106] L. F. Pavarino and O. B. Widlund. A polylogarithmic bound for an iterative
substructuring method for spectral elements in three dimensions. SIAM Journal on
Numerical Analysis, 33(4):1303–1335, 1996.
[107] L. F. Pavarino and O. B. Widlund. Iterative substructuring methods for spectral
elements: Problems in three dimensions based on numerical quadrature. Computers
& Mathematics with Applications, 33(1):193–209, 1997.
[108] R. Peyret. Spectral methods for incompressible viscous flow, volume 148. Springer
Science & Business Media, 2013.
[109] C. Rahm. Validierung und Erweiterung eines Strömungslösers hoher Ordnung für
heterogene HPC-Systeme. Master's thesis, Institut für Strömungsmechanik, TU Dresden,
Dresden, Germany, 2017. In German.
[110] I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. Kelly, and
D. Radford. Acceleration of a full-scale industrial CFD application with OP2. IEEE
Transactions on Parallel and Distributed Systems, 27(5):1265–1278, 2016.
[111] J. Reid. Coarrays in Fortran 2008. In Proceedings of the Third Conference on Parti-
tioned Global Address Space Programing Models, PGAS ’09, pages 4:1–4:1, New York,
NY, USA, 2009. ACM.
[112] N. A. Rink, I. Huismann, A. Susungi, J. Castrillon, J. Stiller, J. Fröhlich, and
C. Tadonki. CFDlang: High-level code generation for high-order methods in fluid
dynamics. In Proceedings of the Real World Domain Specific Languages Workshop 2018,
RWDSL2018, pages 5:1–5:10, New York, NY, USA, 2018. ACM.
[113] M. P. Robson, R. Buch, and L. V. Kale. Runtime coordinated heterogeneous tasks in
Charm++. In Extreme Scale Programming Models and Middleware (ESPM2),
International Workshop on, pages 40–43. IEEE, 2016.
[114] E. M. Rønquist and A. T. Patera. Spectral element multigrid. I. Formulation and
numerical results. Journal of Scientific Computing, 2(4):389–406, 1987.
[115] H. Schlichting and K. Gersten. Boundary-layer theory. Springer, 9th edition, 2017.
[116] J. Schmidt. NAM – network attached memory. In Doctoral Showcase summary
and poster at International Conference for High Performance Computing, Network-
ing, Storage and Analysis, SC, 2016.
[117] S. J. Sherwin and M. Casarin. Low-energy basis preconditioning for elliptic substruc-
tured solvers based on unstructured spectral/hp element discretization. Journal of
Computational Physics, 171(1):394–417, 2001.
[118] J. R. Shewchuk. An introduction to the conjugate gradient method without the
agonizing pain. Technical report, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
[119] J. Slotnick, A. Khodadoust, J. Alonso, D. Darmofal, W. Gropp, E. Lurie, and
D. Mavriplis. CFD vision 2030 study: a path to revolutionary computational aero-
sciences. 2014.
[120] B. F. Smith. A parallel implementation of an iterative substructuring algorithm for
problems in three dimensions. SIAM Journal on Scientific Computing, 14(2):406–423,
1993.
[121] T. Steinke, A. Reinefeld, and T. Schütt. Experiences with high-level programming of
FPGAs on Cray XD1. Cray User Group (CUG 2006), 2006.
[122] J. Stiller. Robust multigrid for high-order discontinuous Galerkin methods: A fast
Poisson solver suitable for high-aspect ratio Cartesian grids. Journal of Computational
Physics, 327:317–336, 2016.
[123] J. Stiller. Nonuniformly weighted Schwarz smoothers for spectral element multigrid.
Journal of Scientific Computing, 72(1):81–96, 2017.
[124] J. Stiller. Robust multigrid for Cartesian interior penalty DG formulations of the
Poisson equation in 3D. In Spectral and High Order Methods for Partial Differential
Equations ICOSAHOM 2016, pages 189–201. Springer, 2017.
[125] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for
heterogeneous computing systems. Computing in science & engineering, 12(3):66–73,
2010.
[126] H. Sundar, G. Stadler, and G. Biros. Comparison of multigrid algorithms for high-order
continuous finite element discretizations. Numerical Linear Algebra with Applications,
22(4):664–680, 2015.
[127] A. Susungi, N. A. Rink, J. Castrillon, I. Huismann, A. Cohen, C. Tadonki, J. Stiller,
and J. Fröhlich. Towards compositional and generative tensor optimizations. In Conf.
Generative Programming: Concepts & Experience (GPCE’17), pages 169–175, 2017.
[128] The MPI Forum. MPI: A message passing interface version 3.1, 2015.
[129] T. N. Theis and H.-S. P. Wong. The end of Moore’s law: A new beginning for infor-
mation technology. Computing in Science & Engineering, 19(2):41–50, 2017.
[130] U. Trottenberg, C. Oosterlee, and A. Schüller. Multigrid. Academic Press, 2001.
[131] M. Völp, S. Klüppelholz, J. Castrillon, H. Härtig, N. Asmussen, U. Assmann,
F. Baader, C. Baier, G. Fettweis, J. Fröhlich, A. Goens, S. Haas, D. Habich, M. Hasler,
I. Huismann, T. Karnagel, S. Karol, W. Lehner, L. Leuschner, M. Lieber, S. Ling,
S. Märcker, J. Mey, W. Nagel, B. Nöthen, R. Peñaloza, M. Raitza, J. Stiller,
A. Ungethüm, and A. Voigt. The Orchestration Stack: The impossible task of designing
software for unknown future post-CMOS hardware. In Proceedings of the 1st Interna-
tional Workshop on Post-Moore’s Era Supercomputing (PMES), Co-located with The
International Conference for High Performance Computing, Networking, Storage and
Analysis (SC16), Salt Lake City, USA, Nov. 2016.
[132] Wikipedia contributors. List of Intel CPU microarchitectures — Wikipedia, the free
encyclopedia, 2018. [Online available at https://en.wikipedia.org/w/index.php?
title=List_of_Intel_CPU_microarchitectures&oldid=864965902; accessed 27th
October 2018].
[133] M. V. Wilkes. The memory gap and the future of high performance memories. ACM
SIGARCH Computer Architecture News, 29(1):2–7, 2001.
[134] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual perfor-
mance model for multicore architectures. Communications of the ACM, 52(4):65–76,
2009.
[135] E. L. Wilson. The static condensation algorithm. International Journal for Numerical
Methods in Engineering, 8(1):198–203, 1974.
[136] C. S. Woodward. A Newton-Krylov-multigrid solver for variably saturated flow prob-
lems. WIT Transactions on Ecology and the Environment, 24, 1998.
[137] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang,
Z. Wang, W. Liu, and X. Cheng. Collaborating CPU and GPU for large-scale high-
order CFD simulations with complex grids on the TianHe-1A supercomputer. Journal
of Computational Physics, 278:275–297, 2014.
[138] S. Yakovlev, D. Moxey, R. Kirby, and S. Sherwin. To CG or to HDG: A comparative
study in 3D. Journal of Scientific Computing, pages 1–29, 2015.
[139] C. Yang et al. Adaptive optimization for petascale heterogeneous CPU/GPU com-
puting. In Cluster Computing (CLUSTER), 2010 IEEE International Conference on,
pages 19–28. IEEE, 2010.
[140] Z. Zhong, V. Rychkov, and A. Lastovetsky. Data partitioning on multicore and
multi-GPU platforms using functional performance models. IEEE Transactions on
Computers, 64(9):2506–2518, 2015.