Schriftenreihe aus dem Institut für Strömungsmechanik
Herausgeber
J. Fröhlich, R. Mailach
Institut für Strömungsmechanik
Technische Universität Dresden
D-01062 Dresden
Band 31
TUDpress 2020
Immo Huismann
Computational fluid dynamics
on wildly heterogeneous systems
Bibliografische Information der Deutschen Nationalbibliothek
Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der
Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind
im Internet über http://dnb.d-nb.de abrufbar.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche
Nationalbibliografie; detailed bibliographic data are available in the
Internet at http://dnb.d-nb.de.
ISBN 978-3-95908-424-6
© 2020 TUDpress
bei Thelem Universitätsverlag
und Buchhandlung GmbH & Co. KG
Dresden
http://www.tudpress.de
Alle Rechte vorbehalten. | All rights reserved.
Gesetzt vom Autor. | Typeset by the author.
Printed in Germany.
Die vorliegende Arbeit wurde am 27. November 2018 an der Fakultät Maschinenwesen
der Technischen Universität Dresden als Dissertation eingereicht und am 29. Januar 2020
erfolgreich verteidigt.
This work was submitted as a PhD thesis to the Faculty of Mechanical Science and
Engineering of TU Dresden on 27 November 2018 and successfully defended on
29 January 2020.
Gutachter | Reviewers
1. Prof. Dr.-Ing. habil. Jochen Fröhlich
2. Prof. Spencer J. Sherwin
Technische Universität Dresden
Faculty of Mechanical Science and Engineering
Computational fluid dynamics
on
wildly heterogeneous systems
Dissertation
in order to obtain the degree
Doktor-Ingenieur (Dr.-Ing.)
by
Immo Huismann
born on 1st October 1988 in Hamburg
Referees: Prof. Dr.-Ing. habil. Jochen Fröhlich
Technische Universität Dresden
Prof. Spencer J. Sherwin
Imperial College London
Date of submission: 27th November 2018
Date of defence: 29th January 2020
Abstract
In the last decade, high-order methods have gained increased attention. These combine
the convergence properties of spectral methods with the geometrical flexibility of low-order
methods. However, the time step is restrictive, necessitating the implicit treatment of diffu-
sion terms in addition to the pressure. Therefore, efficient solution of elliptic equations is of
central importance for fast flow solvers. As the operators scale with O(p · nDOF), where nDOF
is the number of degrees of freedom and p the polynomial degree, the runtime of the best
available multigrid algorithms scales with O(p · nDOF) as well. This super-linear scaling lim-
its the applicability of high-order methods to mid-range polynomial orders and constitutes
a major road block on the way to faster flow solvers.
This work reduces the super-linear scaling of elliptic solvers to a linear one. First, the static
condensation method improves the condition of the system, then the associated operator
is cast into matrix-free tensor-product form and factorized to linear complexity. The low
increase in the condition and the linear runtime of the operator lead to linearly scaling solvers
when increasing the polynomial degree, albeit with low robustness against the number of
elements. A p-multigrid with overlapping Schwarz smoothers regains the robustness, but
requires inverse operators on the subdomains and in the condensed case these are neither
linearly scaling nor matrix-free. Embedding the condensed system into the full one leads to
a matrix-free operator and factorization thereof to a linearly scaling inverse. In combination
with the previously gained operator a multigrid method with a constant runtime per degree
of freedom results, regardless of whether the polynomial degree or the number of elements
is increased.
Computing on heterogeneous hardware is investigated as a means to attain higher perfor-
mance and future-proof the algorithms. A two-level parallelization extends the traditional
hybrid programming model by using a coarse-grain layer implementing domain decomposi-
tion and a hardware-specific fine-grain parallelization. Thereafter, load balancing is inves-
tigated for a preconditioned conjugate gradient solver, and functional performance models
are adapted to account for the communication barriers in the algorithm. With the new
model, runtime prediction and measurement agree within an error margin of about 5%.
The devised methods are combined into a flow solver which attains the same throughput
when computing with p = 16 as with p = 8, preserving the linear scaling. Furthermore, the
multigrid method reduces the cost of implicit treatment of the pressure to the one for explicit
treatment of the convection terms. Lastly, benchmarks confirm that the solver outperforms
established high-order codes.
Kurzzusammenfassung
In den letzten Jahrzehnten lagen Methoden höherer Ordnung im Fokus der Forschung. Diese
kombinieren die Konvergenzeigenschaften von Spektralmethoden mit der geometrischen
Flexibilität von Methoden niedriger Ordnung. Dabei entsteht eine restriktive Zeitschritt-
begrenzung, die die implizite Behandlung von Diffusionstermen zusätzlich zu der des Druckes
erfordert. Aus diesem Grund ist die effiziente Lösung elliptischer Gleichungen von zentralem
Interesse für schnelle Strömungslöser. Da die Operatoren mit O(p · nDOF) skalieren,
wobei nDOF die Anzahl der Freiheitsgrade und p der Polynomgrad ist, skaliert die Laufzeit
der besten derzeit verfügbaren Mehrgitterlöser ebenso mit O(p · nDOF). Diese super-lineare
Skalierung beschränkt die Anwendbarkeit von Methoden höherer Ordnung auf mittlere
Polynomgrade und stellt eine große Hürde auf dem Weg zu schnelleren Strömungslösern dar.
Diese Arbeit senkt die super-lineare Skalierung elliptischer Löser auf eine lineare. Zuerst
verbessert die statische Kondensation die Kondition des Systems. Der dazu benötigte
Operator wird in Matrix-freier Tensorproduktform dargestellt und auf lineare Komplexität
faktorisiert. Die Kombination aus langsam wachsender Kondition und linearer Operator-
laufzeit erzeugt Löser, die linear skalieren, wenn der Polynomgrad angehoben wird, aller-
dings nicht, wenn die Anzahl der Elemente erhöht wird. Eine p-Mehrgittermethode mit
überlappendem Schwarz-Glätter stellt die Robustheit gegenüber der Anzahl der Elemente
her, benötigt allerdings den inversen Operator auf Teilgebieten, und im kondensierten Fall
sind diese weder linear skalierend noch Matrix-frei. Eine Einbettung des kondensierten
Systems in das volle System liefert einen Matrix-freien Operator, der anschließend auf
lineare Komplexität faktorisiert wird. In Kombination mit dem Operator resultiert eine
Mehrgittermethode mit konstanter Laufzeit pro Freiheitsgrad, egal ob der Polynomgrad
oder die Anzahl der Elemente gesteigert wird.
Heterogenes Rechnen wird zur Steigerung der Rechenleistung und zur Zukunftssicherung
der Algorithmen untersucht. Eine Zweischicht-Parallelisierung erweitert die traditionelle
hybride Parallelisierung, wobei eine grobe Schicht die Gebietszerlegung implementiert und
die feine Hardware-spezifisch ist. Daraufhin wird die Lastverteilung auf solchen Systemen
anhand einer präkonditionierten konjugierten Gradientenmethode untersucht und ein funk-
tionales Leistungsmodell adaptiert, um mit den auftretenden Kommunikationsbarrieren
umzugehen. Mit dem neuen Modell liegen Laufzeitvoraussage und -messung nahe beiein-
ander, mit einem Fehler von 5 %.
Die entwickelten Methoden werden zu einem Strömungslöser kombiniert, der den gleichen
Durchsatz bei Rechnungen mit p = 16 und p = 8 erreicht, also die lineare Skalierung
beibehält. Des Weiteren reduziert der Mehrgitterlöser die Rechenzeit zur impliziten Be-
handlung des Druckes auf die der expliziten für die Konvektion. Zu guter Letzt zeigen
Benchmarks, dass der Löser eine höhere Performanz erreicht als etablierte Codes.
Acknowledgements
First and foremost, I want to thank those who made this work possible: first, Prof. Jochen
Fröhlich, who not only supported this work but also endeared fluid mechanics to me in the
first place and allowed it to flourish at the chair, from curvilinear beginnings to Cartesian
endings; and, second, Dr. Jörg Stiller, who was always an encouraging supervisor and
contributed immensely with stimulating discussions as well as in-depth knowledge.
Thereafter, I would like to thank those who accompanied me over the years: My friends
and family who were always supportive in the long roller coaster of ups and downs that
culminated in this thesis. But directly after, I want to thank the whole Chair of Fluid
Mechanics, without whose help, support, and enticing discussions this endeavour would
not have been even half as productive and not even half as much fun. Furthermore, the
contribution of the productive environment of the Orchestration path of the Center for
Advancing Electronics Dresden requires explicit mentioning.
Then, I would like to thank those who took it upon themselves to help this work along the
last miles: Prof. Spencer Sherwin, who agreed to be co-referee of the thesis, and Prof. Jeronimo
Castrillon, who offered to be on the commission.
But let us not forget those who drew me into mechanics in the first place and, over multiple
detours, to fluid mechanics: Prof. Balke and Prof. Ulbricht, who not only gave joyful lectures
ranging from statics through strength of materials to fracture mechanics, but always accompanied
these with applications, such as the fracture mechanics of a pressurized bratwurst.
Contents
Acknowledgements v
List of symbols xi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 High-order discretization methods . . . . . . . . . . . . . . . . . . . . 3
1.2.2 High-performance computing . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Goal and structure of the dissertation . . . . . . . . . . . . . . . . . . . . . . 7
2 The spectral-element method 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 A spectral-element method for the Helmholtz equation . . . . . . . . . . . 9
2.2.1 Strong and weak form of the Helmholtz equation . . . . . . . . . . 9
2.2.2 Finite element approach . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Convergence properties . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Tensor-product elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Tensor-product matrices . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Tensor-product bases . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Tensor-product operators . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Fast diagonalization method . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Performance of basic Helmholtz solvers . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Considered preconditioners and solvers . . . . . . . . . . . . . . . . . 19
2.4.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Performance optimization for tensor-product operators 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Basic approach for interpolation operator . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Baseline operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Runtime tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Compiling information for the compiler . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Enhancing the interpolation operator . . . . . . . . . . . . . . . . . . 29
3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Extension to Helmholtz solver . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Required operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Operator runtimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Performance gains for solvers . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Fast Static Condensation – Achieving a linear operation count 41
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Static condensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Principal idea of static condensation . . . . . . . . . . . . . . . . . . 42
4.2.2 Static condensation in three dimensions . . . . . . . . . . . . . . . . . 43
4.3 Factorization of the statically condensed Helmholtz operator . . . . . . . . 46
4.3.1 Tensor-product decomposition of the operator . . . . . . . . . . . . . 47
4.3.2 Sum-factorization of the operator . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Product-factorization of the operator . . . . . . . . . . . . . . . . . . 52
4.3.4 Extensions to variable diffusivity . . . . . . . . . . . . . . . . . . . . 54
4.3.5 Runtime comparison of operators . . . . . . . . . . . . . . . . . . . . 56
4.4 Efficiency of pCG solvers employing fast static condensation . . . . . . . . . 60
4.4.1 Element-local preconditioning strategies . . . . . . . . . . . . . . . . 60
4.4.2 Considered solvers and test conditions . . . . . . . . . . . . . . . . . 61
4.4.3 Solver runtimes for homogeneous grids . . . . . . . . . . . . . . . . . 62
4.4.4 Solver runtimes for inhomogeneous grids . . . . . . . . . . . . . . . . 66
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Scaling to the stars – Linearly scaling spectral-element multigrid 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Linearly scaling additive Schwarz methods . . . . . . . . . . . . . . . . . . 70
5.2.1 Additive Schwarz methods . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Embedding the condensed system into the full system . . . . . . . . . 72
5.2.3 Tailoring fast diagonalization for static condensation . . . . . . . . . 73
5.2.4 Implementation of boundary conditions . . . . . . . . . . . . . . . . . 76
5.2.5 Extension to element-centered block smoothers . . . . . . . . . . . . 77
5.3 Multigrid solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Multigrid algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 Complexity of the resulting algorithms . . . . . . . . . . . . . . . . . 80
5.4 Runtime tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.1 Runtimes for the star inverse . . . . . . . . . . . . . . . . . . . . . . 81
5.4.2 Solver runtimes for homogeneous meshes . . . . . . . . . . . . . . . . 84
5.4.3 Solver runtimes for anisotropic meshes . . . . . . . . . . . . . . . . . 86
5.4.4 Solver runtimes for stretched meshes . . . . . . . . . . . . . . . . . . 87
5.4.5 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Computing on wildly heterogeneous systems 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Programming wildly heterogeneous systems . . . . . . . . . . . . . . . . . . 96
6.2.1 Model problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Two-level parallelization of the model problem . . . . . . . . . . . . . 97
6.2.3 Performance gains for homogeneous systems . . . . . . . . . . . . . . 100
6.3 Load balancing model for wildly heterogeneous systems . . . . . . . . . . . . 103
6.3.1 Single-step load balancing for heterogeneous systems . . . . . . . . . 103
6.3.2 Problem analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.3 Multi-step load balancing for heterogeneous systems . . . . . . . . . . 107
6.3.4 Performance with new load balancing model . . . . . . . . . . . . . . 109
6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Specht FS – A flow solver computing on heterogeneous hardware 113
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Flow solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.1 Incompressible fluid flow . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.2 Spectral Element Cartesian HeTerogeneous Flow Solver . . . . . . . . 114
7.2.3 Pressure correction scheme in Specht FS . . . . . . . . . . . . . . . 115
7.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.1 Test regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.2 Taylor-Green vortex in a periodic domain . . . . . . . . . . . . . . 117
7.3.3 Taylor-Green vortex with Dirichlet boundary conditions . . . . 119
7.3.4 Turbulent plane channel flow . . . . . . . . . . . . . . . . . . . . . . 119
7.4 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4.1 Turbulent plane channel flow . . . . . . . . . . . . . . . . . . . . . . 122
7.4.2 Turbulent Taylor-Green vortex benchmark . . . . . . . . . . . . . 124
7.4.3 Parallelization study . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8 Conclusions and outlook 131
A Further results for wildly heterogeneous systems 135
Bibliography 137
List of symbols
Abbreviations
AVX2 Advanced Vector Extension 2 (instruction set)
BDF Backward Differencing Formula
BLAS Basic Linear Algebra Subroutines (software)
CFD Computational Fluid Dynamics
CFL Courant-Friedrichs-Lewy
CG Conjugate Gradient Method
CMOS Complementary Metal–Oxide–Semiconductor
CPU Central Processing Unit
DG Discontinuous Spectral-Element Method
DGEMM Double precision General Matrix Matrix Multiplication from BLAS
DOF Degrees Of Freedom
FEM Finite Element Method
FMA Fused Multiply Add (instruction)
FPGA Field-Programmable Gate Array
GEMM General Matrix Matrix Multiplication from BLAS
GLL Gauß-Lobatto-Legendre
GPU Graphics Processing Unit
HPC High-Performance Computer
ipCG Inexact Preconditioned Conjugate Gradient Method
LES Large-Eddy Simulation
MKL Intel Math Kernel Library (software)
MPI Message Passing Interface (software)
PDE Partial Differential Equation
pCG Preconditioned Conjugate Gradient Method
RAM Random Access Memory
RHS Right-Hand Side
SEM Continuous Spectral-Element Method
SIMD Single Instruction Multiple Data (instruction)
SVV Spectral Vanishing Viscosity
Greek Symbols
α Expansion factor
αki Coefficient for BDFk time derivative for time level n− i
βki Extrapolation coefficient for order k and time level n− i
δij Component ij of Kronecker delta
∆u Correction for u in the condensed system
∆ui Correction for u on Schwarz subdomain Ωi in the condensed system
ε Mean dissipation rate
εSVV Viscosity parameter for the SVV model
κ Condition number
Λ Diagonal matrix of eigenvalues
λ Helmholtz parameter λ ≥ 0
νpost Number of post-smoothing steps
νpre Number of pre-smoothing steps
Ω Computational Domain
Ωe Domain of element e
Ωi Schwarz domain i
ΩS Standard element ΩS = [−1, 1]
ϕ Pressure potential
φi i-th basis function
ρ Convergence rate of multigrid method
ξ coordinates in one-dimensional standard element
Latin Symbols
AR Aspect ratio
ARmax Maximum aspect ratio in a mesh
C Constant
CCFL CFL number
D Standard element differentiation matrix
De Diagonal matrix comprising the eigenvalues in three dimensions
De Differentiation matrix for element Ωe
d Number of dimensions
di,e Geometry coefficient for direction i in element Ωe
Ek Mean kinetic energy
e Element index
f Body force
f Right-hand side for the Helmholtz equation
F Discrete right-hand side
H Static condensed Helmholtz operator
H Helmholtz operator
H1 Sobolev (norm)
He Helmholtz operator for element Ωe
Hi Helmholtz operator for Schwarz subdomain Ωi in the condensed system
h Element width in one dimension
hi Element width in direction i
I Set of face indices I = {e, w, n, s, t, b}
I Identity matrix
i Integer
J l Grid transfer operation from level l − 1 to l
j Integer
k Integer
L Standard element stiffness matrix
L Number of levels for the multigrid method minus one
L2 Euclidean (norm)
Le Stiffness matrix for element Ωe
l Integer
M Standard element mass matrix
Me Mass matrix for element Ωe
M−1 Diagonal matrix containing the inverse multiplicity for the data points
m Integer
n10 Number of iterations to reduce the residual by ten orders
nDOF Number of degrees of freedom
ne Number of elements
nI Number of inner points nI = p − 1
np Number of points np = p + 1
nS Number of points in a Schwarz subdomain Ωi
nt Number of time steps
nv Number of vertices in a mesh
O() Landau symbol
P (Pseudo)-Pressure
p Polynomial degree
pSVV Polynomial degree for the SVV model
pW Polynomial degree of the weight function
Q Scatter operation mapping global to local data
Re Reynolds number
Reτ Wall Reynolds number
r Residual vector
S Transformation matrix
t Time variable
tIter Time per iteration
u Solution variable for the Helmholtz equation
u Vector containing the (current) solution u
u Velocity vector
uex Exact solution
uh Approximation for u
ui,e Coefficient for φi in element Ωe
Wi Weights associated with Schwarz subdomain Ωi in the condensed system
wi Quadrature weights for collocation point i
x Coordinate vector
x Coordinates in one dimension
y+ Distance from the wall in wall units
Mathematical Symbols
∂(·) Partial derivative
∆(·) Laplacian operator
∇(·) Vector derivative operator
∥ · ∥ Norm of vector / variable
∥ · ∥∞ Maximum norm of vector / variable
· ⊗ · Tensor product operator
Sub- and superscripts
B Boundary part of a matrix / vector
b Bottom in compass notation
C Variable for a CPU
Cond Condensed
E Eigenspace
e East in compass notation
Fe East face
Fw West face
G Variable for a GPU
I Inner part of a matrix / vector
i Vector / matrix for Schwarz subdomain Ωi
l Variable on multigrid level l
n North in compass notation
(·)n Variable at time step n
Prim Primary
s South in compass notation
T Transpose
t Top in compass notation
w West in compass notation
u / A Vector u or matrix A for the condensed system
(·) ⋆ Variable at intermediate time level
(·) Vector with three components
Chapter 1
Introduction
1.1 Introduction
For millennia, mankind has tried to unravel the mysteries posed by its surroundings, from
the philosophers of ancient Greece through those of the Enlightenment to the large-scale
research institutes of today. Over the centuries we arrived at models describing our environment,
such as the equations of motion and the laws of thermodynamics. These models not only
allow us to gain insights into the inner workings of the world, but also further technological
progress in civil engineering, mechanical engineering, and many other applied sciences by
allowing for accurate predictions, granting us electricity, combustion engines, and flight.
For fluid mechanics, the Navier-Stokes equations allow us to describe the behaviour of
flows. However, most cases possess no known analytical solution. Therefore, experiments
were the dominant method for predicting flow structures at the start of the 20th century.
Nowadays, simulations complement them and are the preferred option when turning to
optimization. They have permeated the whole engineering landscape, opening the field of
computational fluid dynamics (CFD) and ranging from pipe flows [96, 93], turbines [80], and
wings [69] to weather forecasting as well as earthquake [79] and climate predictions [68]. But
while the simulations involve more and more details and their scope widens, they are still
limited by the available compute power. For instance, while the aerodynamics of flight has
long been understood and airfoils have been simulated with large-eddy simulation for more
than 15 years [92], the time-resolving simulation of the flow around an aircraft is only expected
to become possible in 2030 [119]. For this to become reality, however, improvements in
hardware, spatial discretization methods, time-stepping schemes, and turbulence modelling
are required, and this work contributes to enabling such computations.
To reach the goal of simulating whole aircraft, improvements in temporal and spatial dis-
cretization schemes are mandatory. The latter changed significantly in the last decades.
Where once Fourier methods and low-order finite difference discretizations were common,
finite volume schemes are used throughout industry nowadays, and high-order methods are
the current focus of research. These combine high convergence rates with a finite-element
approach, leading to the geometrical flexibility of finite volume schemes while attaining high
convergence orders [23, 74]. The high convergence rate, in turn, allows the same error margin
to be attained with fewer degrees of freedom than low-order techniques require. But
it incurs a higher operator cost, with the operator scaling super-linearly with the number
of degrees of freedom when increasing the polynomial order. This drawback becomes highly
relevant for the solution of elliptic equations, where the iteration count increases with the
polynomial degree as well. The super-linear scaling renders a high polynomial order infea-
sible in practice. To attain the simulation of aircraft by 2030, improvements in the operator
costs and the resulting solvers are required, with linear scaling of the operator and a constant
iteration count being the optimal result.
Where the operator costs limit the attainable convergence rate of the discretization scheme,
the boundaries of physics limit the hardware capabilities. Current processors are fabricated
in the Complementary Metal-Oxide-Semiconductor (CMOS) process, where, typically, silicon
is doped in order to render it semiconducting and create circuits. From the 1960s to 2010,
miniaturization doubled the number of transistors per chip every two years, a heuristic
called Moore's law [95]. The lower structure width in combination with the increasing
transistor count led to more performance without an increase in the required power [22].
These performance gains, sometimes referred to as a "free lunch", allowed programs to
perform better without requiring any change and, in turn, led to ever more intricate simula-
tions being performed. But the free lunch is over: the current transistor width of 10 nm
scrapes at the physical limits of the technology [67], and the end of Moore's law is nigh.
To circumvent the lack of performance gains through miniaturization, different compute
infrastructures are investigated, for instance accelerator cards for numerics or so-called dark
silicon, where only parts of the processor are powered [28]. The resulting computers will
be more heterogeneous than the ones available today [26]. The first herald of this transition
is the increasing number of accelerator cards built into high-performance computers [94],
and that number keeps growing. While accelerators are by now well understood and
employed for CFD simulations [40, 79], more heterogeneity provides further challenges, from
programmability to load balancing. These need to be addressed to allow simulations to
capitalize on future compute structures, as done, for instance, in the Orchestration Path of
the Center for Advancing Electronics Dresden [19, 131], in which this work resided.
This work aims to provide improvements for high-order spatial discretization schemes by
lowering the operation count of elliptic solvers to linear complexity. Furthermore, running
these algorithms on heterogeneous hardware is investigated, once from the programming
side and once from the load-balancing perspective. The combination thereof provides a
contribution to competitive high-order computational fluid dynamics and future-proofs it
for the hardware to come.
1.2 State of the art
1.2.1 High-order discretization methods
The numerical methods utilized in Computational Fluid Dynamics have come a long way
since the field's advent. Today a large variety of schemes exists, ranging from low-order
methods, such as the Finite Volume Method, which offers high spatial flexibility at a low
convergence order [32], to spectral methods, which allow for very high convergence orders at
the expense of spatial flexibility [108]. Current research focuses on more flexible high-order
methods such as the Discontinuous Galerkin method (DG) [55] and the continuous Spectral
Element Method (SEM) [23, 74]. These combine the advantages of both, allowing for a
domain decomposition by using a finite-element approach as well as high convergence orders
via a polynomial ansatz, and are now widely accepted, with general-purpose codes such
as Nek5000 [35], Nektar++ [18], Semtex [11], DUNE [7], and the deal.II library [5]
being readily available.
The main benefit of these high-order methods is their spectral convergence property: the
error scales with h^(p+1), where h is the element width and p the polynomial order. Raising
the polynomial degree allows for a higher convergence rate and, in turn, for attaining the same
error margin using fewer degrees of freedom, assuming that the solution possesses sufficient
smoothness. This fueled a race to ever higher polynomial degrees: where the first publication
on the SEM employed a polynomial degree of p = 6 [105], typical polynomial orders
in current simulations range up to 11 [3, 8, 87, 93], and even polynomial degrees of p = 15
are not unheard of [29]. The higher polynomial degrees, however, come at a cost. While the
convergence order increases with the polynomial degree, the complexities of the operators
do as well. In each element, (p + 1)^3 degrees of freedom are coupled with each other, leading
to O(p^6) multiplications when implementing the operators via matrix-matrix multiplication
and O(p^4) when exploiting tensor-product bases [23]. In both cases, the operation count
increases super-linearly with the number of degrees of freedom when increasing the polynomial
order.
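The difference between the two operation counts can be illustrated with a small sketch. The following Python snippet, a minimal illustration and not the implementation used in this work, applies a generic one-dimensional operator D in all three directions of a single element, once via the full Kronecker-product matrix, costing O(p^6) operations, and once via sum factorization with three one-dimensional contractions, costing O(p^4):

```python
import numpy as np

p = 7                                    # polynomial degree
n = p + 1                                # nodes per direction
rng = np.random.default_rng(0)

D = rng.standard_normal((n, n))          # generic 1D operator matrix
u = rng.standard_normal((n, n, n))       # data of one element, u[i, j, k]

# Naive variant: assemble the (n^3 x n^3) matrix D (x) D (x) D and
# multiply, requiring O(p^6) operations per element.
A = np.kron(np.kron(D, D), D)
v_naive = (A @ u.reshape(-1)).reshape(n, n, n)

# Sum factorization: contract one direction at a time with the small
# n x n matrix D, requiring only O(p^4) operations per element.
v = (D @ u.reshape(n, -1)).reshape(n, n, n)   # direction 1
v = np.einsum('bj,ajk->abk', D, v)            # direction 2
v = np.einsum('ck,abk->abc', D, v)            # direction 3

assert np.allclose(v_naive, v)
```

In the SEM, the same factorization applies to interpolation, mass, and stiffness operators on tensor-product elements, which is why high-order operators can be evaluated in O(p^4) rather than O(p^6) operations.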
The increased convergence rate of high-order methods is bought with more expensive op-
erator evaluations. While the increased costs are bearable for operations occurring once or
twice per time step, the iterative solution of elliptic equations requires tens if not hundreds
of iterations and therefore operator evaluations. Moreover, the iteration count increases with
the polynomial degree as well due to the condition of the system [23, 21]. The combina-
tion of higher operator costs and increasing iteration count leads to the solution process
of the Poisson equation for the pressure occupying up to 90 % of the runtime of a flow
solver [29] and constitutes a major roadblock on the path to high convergence rates.
For elliptic solvers, a constant iteration count when increasing the number of elements is
mandatory for large-scale simulations, and global coupling is required. A two-level method
which triangulates the high-order mesh using linear finite elements provides good precondi-
tioning, even for unstructured grids [34, 126]. For structured grids, combining multigrid with
overlapping Schwarz type smoothers lowers the iteration count, but it is still dependent
upon the polynomial degree [88]. Additional refinement by weight functions reduces the number
to three iterations [123, 122]. While the cost of a residual evaluation is similar to that
of a convection operator, the smoother requires an explicit inverse on a subdomain comprising
multiple elements. However, in three dimensions, all of these methods require O(p^4) operations for both residual evaluation and smoother.
A different avenue to faster solvers lies in lowering the number of degrees of freedom. For
the spectral-element method, the static condensation method eliminates the element-
interior degrees of freedom, leading to a closer coupling of the remaining ones and a better
conditioned system [21]. It was already utilized in the first publication on the SEM [105] and
the hybridizable discontinuous Galerkin method allows for similar gains for DG [76, 138].
The method, however, still requires global coupling. For a static condensed system, iterative
substructuring reduces the number of unknowns on the faces to one, and leads to an even
smaller system [107, 120, 117]. However, the required number of iterations stays high.
Coupling static condensation with p-multigrid allows for the same number of iterations used
for the full case while using fewer unknowns and allowing for faster operators [52, 51]. But
again the smoother requires O(p^4) operations.
So far, all solution methods for elliptic equations require O(p^4) operations per iteration.
Lowering the operation count to O(p^3) while attaining a constant iteration count promises
improvements of one or two orders of magnitude and removes a major obstacle to exploiting
high convergence orders.
1.2.2 High-performance computing
In the last decades, Moore's law, as proclaimed in 1965 [95], allowed the number of
transistors per chip to double every two years. In turn, the peak performance of the top 500 reported
high-performance computers (HPC) doubled every two years, as Figure 1.1 shows. These
gains mostly resulted from miniaturization: Reduction of the structure width of the transis-
tors in conjunction with increased doping allowed for more transistors per chip at the same
power demand [129]. This “free lunch” generated performance gains for programs without
requiring any changes in the code. After 2012, however, the structure width stagnated as
the physical limits were reached in the production of Complementary Metal–Oxide–Semi-
conductor (CMOS) [67]. As shown in Figure 1.1, the structure width of Intel processors
decreased every two years for more than two decades, but stalled in 2012, with the 10 nm
Figure 1.1: Development of processors and high-performance computers over time. Left: Development of peak performance of the top 500 reported high-performance computers in the world over time. Here, "#1" refers to the most performant supercomputer, "#500" to the 500th, and "sum" to the sum of all 500. The data was extracted from [94]. Right: Structure width of transistors in Intel CPUs over time, extracted from [132].
process being delayed until late 2018. As a result, the performance gains in HPCs declined
and, with the end of Moore's law in sight, new avenues to higher performance are being investigated.
One way to achieve higher performance lies in using specialized hardware instead of the
general-purpose CPUs utilized beforehand. For instance, accelerator hardware, such as Field
Programmable Gate Arrays (FPGAs) [121] and Graphics Processing Units (GPUs) [40],
generate the current performance increase in HPCs. At the moment of writing, 7 of the top
10 HPCs in the world incorporate accelerator hardware [94]. With the HPC environment
being utilized for tasks other than simulations, one prime example being machine learning,
of the future will be heterogeneous, consisting of different processing units specialized in
different tasks and CFD needs to adapt if it wants to keep up with the changing hardware.
For CFD, the increasing heterogeneity poses problems. Current codes are developed with-
out heterogeneity in mind. Plenty of codes are optimized for the CPU [105, 17], sometimes
even with scaling up to 300 000 cores [56], and GPU implementations of high-order codes
exist as well [77, 79]. But all of these stick to the model of completely homogeneous hardware.
This is in part due to the different sets of programming languages being utilized. For
programming multi-core CPUs, many options are applicable. Multi-process parallelization can be
facilitated with libraries such as MPI [128] or with partitioned global address space (PGAS)
languages, e.g. Coarray Fortran [111] or Unified Parallel C [20]. Furthermore,
directive-based shared-memory parallelization can be facilitated via OpenMP [102]. When
turning to GPUs, the programming landscape is fractured as well, with the programming
languages ranging from OpenCL [125] and CUDA [1] to directive-based languages such as
OpenACC [101] and meta-languages such as MAPS [6, 9]. These programming languages
are often not compatible with each other, leading to one program being capable of comput-
ing on one set of hardware, whereas a completely different implementation is required for
a different one. And while some languages, such as OpenCL [125] and OmpSs [15], and
libraries such as OP2 [39, 110] allow addressing multiple kinds of hardware, coupling these
can become a problem.
Multiple models already exist to compute on heterogeneous systems. From the computer
science side, task-based parallelism easily tackles heterogeneity by decomposing every op-
eration into small tasks which can then be sent to the hardware queues. Libraries such
as StarPU [4] and Charm++ [71, 113] implement it and the model is well-suited for
molecular dynamics simulations [103], for example. Similarly, decomposing the problem into
operators and distributing these to the hardware best suited for them can lead to perfor-
mance gains for databases [73]. However, these programming patterns do not match the
reality of CFD, where data parallelism is present in the operators but inter-dependencies
exist between consecutive operators and elements. Moreover, with current systems data
movement constitutes the main bottleneck for an algorithm and can render computing on
GPUs inefficient [44]. When using domain decomposition, creating programs addressing
multiple kinds of hardware with one source remains a problem. Using MPI in conjunction
with OpenMP and CUDA is one approach [137], but requires two programming paradigms
in one program. Simple expansions of the hybrid programming concept for shared-memory
programming [70] are needed in order to lower both the required maintenance and programming
effort.
After attaining a running heterogeneous program, the problem of load balancing remains,
as an unbalanced heterogeneous system can be slower than any of its components alone. In the simplest
case, a constant load ratio is established via heuristics, as done for aerodynamics in [137].
Dynamic load balancing is a further option [139, 53, 24, 25], but requires constant reevaluation
and costly redistribution. Furthermore, it remains unclear whether the optimum is
attained. Functional performance models, for instance assuming that the runtime scales linearly
with the number of elements, are employed in applications ranging from parallel matrix multiplication [140]
and lattice Boltzmann simulations [30] to finite volume codes [86] and spectral-element
simulations [59]. However, all these references take only the total runtime into account,
independent of the algorithm. When taking the growing complexity of CFD algorithms into
account, the approach seems too simplistic, and an evaluation of whether the assumptions hold
is required.
The heterogeneity of the hardware constitutes a major challenge, but other problems exist as
well: While the peak performance of the CPUs has still been increasing, the memory band-
width has not kept up, leading to the so-called memory gap [133]. For low-order discretization
schemes, such as Finite Difference, Finite Volume or low-order Finite Element schemes, very
few operations are required per degree of freedom, e.g. 7 for a stencil for the Laplacian on 3D
Cartesian grids. Current CPUs, however, require a factor of 40 in computational intensity
to attain peak performance [48], otherwise the memory remains the bottleneck [134]. As the
gap is widening [116], ever larger portions of the performance remain unutilized for these
algorithms, and improvements for high-order methods need to account for the memory gap
if they want to stay future-proof.
1.3 Goal and structure of the dissertation
For high polynomial degrees, the largest portion of the runtime of solvers for incompressible
fluid flow is spent in the pressure solver. The main goal of this dissertation lies in lowering
the time spent in the pressure solver to the one spent in convection operators. Not only
does this significantly lower the runtime of high-order solvers, but in addition allows for the
usage of high-order time-stepping schemes, which require at least one solution of the pressure
equation per convergence order. The second goal is to harden the resulting algorithms against
changes in the hardware – the increasing heterogeneity as well as the memory gap.
To lower the runtime of the pressure solver, the runtime of elliptic operators needs to be
factorized to linear complexity. However, this alone does not suffice: In the full system, an
operator scaling with n_DOF ≈ p^3 n_e incurs O(n_DOF) loads and stores, and the memory gap limits
the performance in the foreseeable future. Therefore, the static condensation technique is
employed, where only the boundaries of the elements remain in the equation system. As
the number of memory operations then scales with O(p^2 n_e) = O(n_DOF / p), a linearly scaling
operator allows to circumvent the memory gap. After attaining a linearly scaling operation
count for the operator, the overlapping Schwarz methods proposed in [122, 51] are em-
ployed to attain a constant iteration count. However, the smoother in these references scales
with O(p^4) and requires factorization to linear complexity. The combination of linear complexity
in operator and smoother and a constant iteration count results in a solver scaling
linearly with the number of degrees of freedom, independent of the polynomial degree.
After devising solvers which allow one to bridge the growing memory gap, the aspect of
heterogeneity is investigated on the most commonly encountered heterogeneous system: the
CPU-GPU coupling. First, a programming model allowing to compute on it utilizing one
single source is demonstrated. Thereafter, load balancing of the resulting heterogeneous
systems is investigated in order to extract the maximum attainable performance.
The layout of this work is as follows: Chapter 2 will introduce the notation and the spectral-
element method, then Chapter 3 investigates the attainable performance of tensor-product
operators. These serve as baseline to compare with for the remainder of the dissertation.
Chapter 4 investigates the static condensation method and derives a linearly scaling operator,
allowing to achieve a constant runtime per degree of freedom when increasing the polynomial
degree. Thereafter, Chapter 5 expands this concept to a full multigrid solver with linear
complexity in the number of degrees of freedom. Then Chapter 6 provides a method for
orchestrating heterogeneous systems, considering the programming side as well as the load
balancing side. Lastly, Chapter 7 proposes a flow solver incorporating all of these methods,
validates it, and compares the attained performance with that of other available high-order
codes.
Chapter 2
The spectral-element method
2.1 Introduction
This chapter introduces the nomenclature used throughout the remainder of the work by
recapitulating the main points of the spectral-element method. More thorough introductions
can be found in [23, 74].
2.2 A spectral-element method for the Helmholtz equation
2.2.1 Strong and weak form of the Helmholtz equation
For an open domain Ω, the Helmholtz equation reads
∀x ∈ Ω : λu(x)−∆u(x) = f(x) , (2.1)
with u being the solution variable, f the right-hand side and λ a parameter. For λ ≥ 0, the
equation becomes elliptic and constitutes the basic building block for time-stepping of diffu-
sion equations or for pressure treatment in the case of incompressible fluid flow. While the
equation was originally formulated for λ < 0, i.e. the hyperbolic case, the term Helmholtz
equation is still used for the elliptic case of λ ≥ 0 throughout this work. Equation (2.1) is
a partial differential equation (PDE) of second order and, therefore, one boundary condition
per boundary suffices. Both Neumann and Dirichlet boundary conditions can be
imposed, for instance
∀x ∈ ∂ΩD : u(x) = gD(x) (2.2)
for a Dirichlet and
∀x ∈ ∂ΩN : n · ∇u(x) = gN(x) (2.3)
for a Neumann condition. Here, n denotes the outward-pointing normal vector on the
boundary, ∂ΩD and ∂ΩN the respective boundaries and gD and gN the functions of boundary
values on them. To create an equation system, the weighted residual method introduces a
test function v leading to
∀x ∈ Ω : (vλu)(x) − (v∆u)(x) = (vf)(x) (2.4)

⇒ ∫_{x∈Ω} (vλu)(x) dx − ∫_{x∈Ω} (v∆u)(x) dx = ∫_{x∈Ω} (vf)(x) dx . (2.5)
The above is the strong form of the Helmholtz equation. Introducing function spaces for v
and u at this point leads to a minimization problem and, e.g., collocation methods. Integration
by parts, however, allows lowering the differentiability requirement for u beforehand
while raising the one for v:
⇒ ∫_{x∈Ω} (vλu)(x) dx + ∫_{x∈Ω} (∇^T v ∇u)(x) dx = ∫_{x∈Ω} (vf)(x) dx + ∫_{x∈∂Ω} (v n·∇u)(x) dx . (2.6)
When enforcing Dirichlet boundary conditions in a strong fashion, the corresponding test
function v is set to zero on the boundary. The last term, therefore, implements Neumann
boundary conditions:

∫_{x∈Ω} (vλu)(x) dx + ∫_{x∈Ω} (∇^T v ∇u)(x) dx = ∫_{x∈Ω} (vf)(x) dx + ∫_{x∈∂Ω_N} (v g_N)(x) dx . (2.7)
Compared to (2.5) three things changed. First and foremost, the differentiability require-
ment for u is now the same as for v, leading to the term weak form of the PDE. Second, using
a Galerkin formulation, i.e. the same function spaces for u and v, yields a symmetric op-
erator on the left-hand side. And third, the right-hand side incorporates the right-hand side
of the initial equation and the boundary conditions. The terms for the Dirichlet bound-
ary conditions are not present, as the test function is by construction zero on Dirichlet
boundaries.
2.2.2 Finite element approach
The previous section derived the weak form of the Helmholtz equation. Here, a one-
dimensional domain Ω is considered in order to derive the associated element matrices. The
domain is decomposed into ne non-overlapping subdomains Ωe called elements and on every
element Ωe, a polynomial ansatz of order p approximates the solution u with uh
∀x ∈ Ω_e : u_h(x) = Σ_{i=0}^{p} u_{i,e} φ_{i,e}(x) , (2.8)
where φi,e are the basis functions on the element and ui,e the respective coefficients. Typically,
these functions are constructed on the standard element ΩS = [−1, 1], and then mapped
linearly to Ωe, such that
∀x ∈ Ω_e : u_h(x) = Σ_{i=0}^{p} u_{i,e} φ_i(ξ(x)) . (2.9)
With interpolation polynomials, a set of collocation points {ξ_i}_{i=0}^{p} defines the basis functions
and leads to the interpolation property φ_i(ξ_j) = δ_{ij}, where δ_{ij} denotes the Kronecker delta.
Using the ansatz (2.9), the integrals from (2.7) can be evaluated on each element. For
instance, the mass term equates to

∫_{x∈Ω_e} (v_h u_h)(x) dx = ∫_{x∈Ω_e} Σ_{i=0}^{p} v_{i,e} φ_i(ξ(x)) Σ_{j=0}^{p} u_{j,e} φ_j(ξ(x)) dx = v_e^T M_e u_e , (2.10)
where Me is the element mass matrix and ve and ue the respective coefficient vectors for vh
and u_h in Ω_e. Similarly, the stiffness term yields

∫_{x∈Ω_e} (∂_x v_h ∂_x u_h)(x) dx = ∫_{x∈Ω_e} Σ_{i=0}^{p} v_{i,e} ∂_x φ_i(ξ(x)) Σ_{j=0}^{p} u_{j,e} ∂_x φ_j(ξ(x)) dx = v_e^T L_e u_e . (2.11)
Here, Le denotes the element stiffness matrix. On the standard element the components of
these matrices compute to
M_ij = ∫_{−1}^{1} (φ_i φ_j)(ξ) dξ (2.12)

L_ij = ∫_{−1}^{1} (∂_ξ φ_i ∂_ξ φ_j)(ξ) dξ , (2.13)
with the latter being evaluated using the differentiation matrix
Dij = ∂ξφj(ξi) (2.14)
⇒ L = DTMD . (2.15)
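As a minimal illustration of (2.15) (a NumPy sketch; the thesis implementation itself is written in Fortran), consider p = 1: the GLL nodes are ξ_0 = −1, ξ_1 = 1, the lumped mass matrix is the identity, and L = D^T M D reproduces the classic linear-element stiffness matrix, since the stiffness integrand is constant and the quadrature therefore exact:

```python
import numpy as np

# Linear GLL basis on [-1, 1]: phi_0 = (1 - xi)/2, phi_1 = (1 + xi)/2,
# so D_ij = d(phi_j)/d(xi) evaluated at node xi_i is constant
D = np.array([[-0.5, 0.5],
              [-0.5, 0.5]])
# Lumped GLL mass matrix for p = 1: quadrature weights w_0 = w_1 = 1
M = np.diag([1.0, 1.0])

# Stiffness matrix via L = D^T M D, cf. (2.15)
L = D.T @ M @ D
```

This yields L = [[0.5, −0.5], [−0.5, 0.5]], with the zero row sums expected of a stiffness matrix.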
The affine linear mapping from ΩS to Ωe introduces metric factors
M_e = (h_e / 2) M (2.16)

D_e = (2 / h_e) D (2.17)

L_e = (2 / h_e) L , (2.18)
and result in element-local Helmholtz operators
He = λMe + Le (2.19)
and discrete right-hand sides
Fe = Mefe , (2.20)
where the latter can additionally include the effects of Neumann boundary conditions.
Typically, more than one element is desired in the computation. In the continuous spectral-
element method, continuity of the variable uh facilitates coupling between the elements.
With an element-local storage scheme, an equation system of the form
v_L^T Q Q^T H_L u_L = v_L^T Q Q^T F_L (2.21)
results, where HL denotes the block-diagonal matrix of element-local Helmholtz operators
and uL, vL and FL the vectors of discrete solution, test function, and right-hand side,
respectively. Furthermore, Q^T gathers the contributions for the global degrees of freedom
and Q scatters these to the element-local ones. If the set of collocation nodes incorporates
a separated element boundary, the operation QQ^T simplifies to adding contributions from
adjoining elements, as shown in Figure 2.1, and can be implemented in the local system.
While requiring the storage of multiply occurring data points, the element-local storage
allows for faster operator evaluation [17].
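The QQ^T operation in the element-local storage scheme can be sketched in a few lines (a NumPy illustration with invented array names, not the thesis's Fortran implementation): shared boundary nodes of adjoining elements map to the same global index, are summed in the gather, and the sums are written back to both local copies in the scatter:

```python
import numpy as np

ne, p = 4, 4                       # elements and polynomial degree (example values)
n_loc, n_glob = p + 1, ne * p + 1
u_loc = np.arange(ne * n_loc, dtype=float).reshape(ne, n_loc)

# Map (element, local node) -> global degree of freedom; the shared boundary
# nodes of neighboring elements map to the same global index
gmap = np.array([[e * p + i for i in range(n_loc)] for e in range(ne)])

# Q^T: gather by summing all local copies into the global vector
u_glob = np.zeros(n_glob)
np.add.at(u_glob, gmap, u_loc)

# Q: scatter the global values back to the element-local storage
u_loc = u_glob[gmap]
```

Interior nodes pass through unchanged, while each element-boundary node now holds the sum of the two adjoining local contributions, exactly the behavior shown in Figure 2.1.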
Throughout this work, Gauß-Lobatto-Legendre polynomials, as shown in Figure 2.2,
serve as basis functions. They are interpolation polynomials defined by the Gauß-Lobatto
quadrature points, which include the element boundaries. In conjunction with the interpolation
property, the boundary is completely separated, facilitating an easier gather-scatter
operation. Moreover, the respective system matrices possess a low condition number and,
lastly, the quadrature rules inherent to the points can be employed such that
M_ij ≈ δ_{ij} w_i , (2.22)
Figure 2.1: Gather-scatter operation for an element-wise storage scheme in one dimension when using Gauß-Lobatto-Legendre basis functions for polynomial degree p = 4 and 4 elements. The boundary nodes of the elements are drawn larger and arrows denote communication between the elements. Top: First, the gather operation Q^T gathers contributions from neighboring elements, then the result gets scattered to the element-local storage via Q. Bottom: Implementation in a local-element system, foregoing the global system.
Figure 2.2: Gauß-Lobatto-Legendre basis functions for polynomial degree p = 4 on the standard element [−1, 1].
where w_i is the weight for the point ξ_i. While Gauß quadrature allows for exact integration
of order 2p + 1, only order 2p − 1 is attained on Gauß-Lobatto points. This introduces
a discretization error. Its impact, however, diminishes with increasing polynomial degree.
Furthermore, the convergence properties presented in the next section still hold and the
lowered implementation and computational effort outweigh the slightly lower accuracy.
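For illustration, the GLL points and weights can be computed from the Legendre polynomial P_p (a NumPy sketch, not the thesis implementation): the interior nodes are the roots of P_p', and the quadrature weights follow from the known formula w_i = 2 / (p (p+1) P_p(ξ_i)^2):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

p = 4
P = Legendre.basis(p)
# GLL nodes: the element boundaries plus the roots of P_p'
nodes = np.concatenate(([-1.0], np.sort(P.deriv().roots()), [1.0]))
# GLL quadrature weights: w_i = 2 / (p (p + 1) P_p(xi_i)^2)
weights = 2.0 / (p * (p + 1) * P(nodes) ** 2)
```

For p = 4 this gives the nodes 0, ±√(3/7), ±1 with boundary weights 1/10, and the weights sum to the length of the standard element, 2.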
2.2.3 Convergence properties
Let V denote the function space containing the exact solution uex to (2.1), Vh ⊂ V the
function space spanned by the basis functions of the spectral-element mesh using width h
and polynomial degree p, and uh the solution on the mesh. Further, let Ih denote the
interpolant from V to Vh. Then, the error estimate for the spectral element solution is [23]
∥u_ex − u_h∥_V ≤ C min_{u∈V_h} ∥u − u_ex∥_V ≤ C ∥I_h u_ex − u_ex∥_V , (2.23)
where C is a constant. In the above, the interpolation error generates an upper bound
for the discretization error ∥uex − uh∥V . It depends upon the polynomial degree p, element
width h, as well as the smoothness of the solution and the chosen norm. With sufficient
differentiability, the interpolation error in the maximum norm ∥·∥V,∞ approximates to
C ∥I_h u_ex − u_ex∥_{V,∞} ≤ C / (p+1)! · ∥u_ex^{(p+1)}∥_{V,∞} h^{p+1} Λ(p) , (2.24)
where Λ is the Lebesgue constant of the polynomial system. While Λ depends on the
polynomial order, the corresponding value does not change significantly when using GLL
polynomials, e.g. only an increase by a factor of 1.5 is present when the polynomial degree
increases from p = 5 to p = 20. Inserting the above into (2.23) leads to
⇒ ∥u_ex − u_h∥_{V,∞} ≤ C / (p+1)! · ∥u_ex^{(p+1)}∥_{V,∞} h^{p+1} Λ(p) . (2.25)
Equation (2.25) is the so-called spectral-convergence property: Any polynomial degree allows
for convergence. Asymptotically, however, lower polynomial degrees require more degrees of
freedom to attain the same accuracy as a high-order approximation.
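The spectral decay can be observed numerically. The sketch below uses Chebyshev nodes as a readily available stand-in for GLL points in NumPy (an illustration only; the constants differ, but the qualitative behavior matches): interpolating the smooth function e^ξ and raising the degree from 4 to 12 reduces the maximum error by many orders of magnitude.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

f = np.exp
xs = np.linspace(-1.0, 1.0, 2001)        # fine evaluation grid

def interp_error(deg):
    # Maximum interpolation error of f on [-1, 1] for the given degree
    return float(np.max(np.abs(Chebyshev.interpolate(f, deg)(xs) - f(xs))))

err4, err12 = interp_error(4), interp_error(12)
```

For a function with limited smoothness, in contrast, the decay would be only algebraic, which is why the estimate (2.25) requires sufficient differentiability.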
2.3 Tensor-product elements
2.3.1 Tensor-product matrices
Let C ∈ R^{n²×n²} and u, v ∈ R^{n²} with n ∈ N. The evaluation of the matrix-vector product
v = Cu (2.26)
requires n^4 multiplications when implemented as a triple sum. If, however, the matrix C
possesses a substructure, such that for matrices A, B ∈ R^{n×n}

C = ( A B_{1,1}   A B_{1,2}   …   A B_{1,n} )
    ( A B_{2,1}   A B_{2,2}   …   A B_{2,n} )
    (     ⋮           ⋮        ⋱      ⋮     )
    ( A B_{n,1}   A B_{n,2}   …   A B_{n,n} )  =: B ⊗ A , (2.27)
2.3 Tensor-product elements 15
the matrix can be decomposed
C = B⊗A = (B⊗ I) (I⊗A) = (I⊗A) (B⊗ I) . (2.28)
The above allows for
v = Cu = (B⊗ I) (I⊗A)u , (2.29)
which is the consecutive application of one-dimensional matrix products. First applying A,
then B, requires 2n^3 multiplications instead of the prior n^4. The decomposition C = B ⊗ A
denotes a so-called tensor-product matrix with the following properties:
(B ⊗ A)^T = B^T ⊗ A^T (2.30a)
(B ⊗ A)^{−1} = B^{−1} ⊗ A^{−1} (2.30b)
(B ⊗ A)(D ⊗ C) = (BD) ⊗ (AC) , (2.30c)
with further properties being presented in [89, 23]. While only square matrices were discussed
here, the extension to non-square matrices as well as to multiple dimensions is straightforward.
For the d-dimensional case, application of tensor-product matrices utilizes d n^{d+1} multiplications,
whereas the direct matrix multiplication requires n^{2d} multiplications. Hence, casting
matrix multiplications in the form of tensor products lowers the algorithmic complexity and
facilitates structure exploitation while utilization of (2.28) and (2.30) allows for factorization.
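The factorization (2.28)-(2.29) can be verified directly (a NumPy sketch with made-up matrices): applying B ⊗ A through two one-dimensional matrix products gives the same result as forming the Kronecker product explicitly, at 2n^3 instead of n^4 multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
u = rng.standard_normal(n * n)

# Direct evaluation: build C = B (x) A explicitly, n^4 multiplications
v_direct = np.kron(B, A) @ u

# Factorized evaluation: reshape u to an n x n array (first index belonging
# to B, second to A) and apply the 1D matrices, 2 n^3 multiplications
v_fact = (B @ u.reshape(n, n) @ A.T).reshape(-1)
```

Both evaluations agree to machine precision; only the operation count differs.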
2.3.2 Tensor-product bases
Consider the hexahedral standard element Ω_S in three dimensions, i.e. Ω_S = [−1, 1]^3. As in
the one-dimensional case, a function u can be approximated on Ω_S using n_DOF degrees of
freedom
u_h(ξ) = Σ_{m=1}^{n_DOF} u_m φ_m^{3D}(ξ) , (2.31)
with basis functions φ_m^{3D} : Ω_S → R. In general, any kind of ansatz function can be utilized in
multiple dimensions, generating the need to create sets of collocation points, differentiation
matrices, and integration weights associated with them. Tensor-product bases constitute a
general way to generate a particular set of these. A full polynomial ansatz serves as basis in
each direction, and a multi-index maps to a lexicographic indexing, as shown in Figure 2.3:
u_h(ξ) = Σ_{i=0}^{p_1} Σ_{j=0}^{p_2} Σ_{k=0}^{p_3} u_{ijk} φ_{ijk}^{3D}(ξ) , (2.32)
Figure 2.3: Left: Utilization of multi-index (i, j, k) in the plane k = 0 for GLL collocation nodes of polynomial degrees p1 = 3 and p2 = 4. Right: corresponding lexicographic indexing.
where p1, p2, and p3 are the polynomial degrees in the directions ξ1, ξ2, and ξ3, respectively.
Decomposing the basis functions in the respective directions leads to
φ_{ijk}^{3D}(ξ) = φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) (2.33)

⇒ u_h(ξ) = Σ_{i=0}^{p_1} Σ_{j=0}^{p_2} Σ_{k=0}^{p_3} u_{ijk} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) . (2.34)
While it is possible to use different polynomial degrees per direction, this work employs the
same polynomial degree in every direction, i.e. p = p1 = p2 = p3, simplifying the above to
⇒ u_h(ξ) = Σ_{i=0}^{p} Σ_{j=0}^{p} Σ_{k=0}^{p} u_{ijk} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) . (2.35)
For a hexahedral element, the domain Ωe is [a1, b1]× [a2, b2]× [a3, b3] and linear functions
facilitate the mappings:
∀x ∈ Ω_e : u_h(x) = Σ_{i=0}^{p} Σ_{j=0}^{p} Σ_{k=0}^{p} u_{ijk} φ_i(ξ_1(x_1)) φ_j(ξ_2(x_2)) φ_k(ξ_3(x_3)) . (2.36)
The above provides a regular structure consisting of the one-dimensional ansatz utilized in
all three directions. It allows decomposing many operators in the element directions and,
therefore, opens up possibilities for structure exploitation.
2.3.3 Tensor-product operators
Many operations can be decomposed into their action in different spatial directions. For
a tensor-product element, this leads to a decomposition into the actions in the respective
element directions. For instance, the integration of the two functions uh and vh on the
standard element ΩS = [−1, 1]3 can be written as
∫_{ξ∈Ω_S} (v_h u_h)(ξ) dξ = ∫_{−1}^{1} ∫_{−1}^{1} ∫_{−1}^{1} (v_h u_h)(ξ) dξ_1 dξ_2 dξ_3 (2.37)
with the integrand being
(v_h u_h)(ξ) = Σ_{0≤i,j,k≤p} v_{ijk} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) · Σ_{0≤l,m,n≤p} u_{lmn} φ_l(ξ_1) φ_m(ξ_2) φ_n(ξ_3) . (2.38)
Equation (2.37) can be cast into
v^T M^{3D} u = ∫_{ξ∈Ω_S} (v_h u_h)(ξ) dξ , (2.39)
where
M^{3D}_{ijk,lmn} = ∫_{−1}^{1} ∫_{−1}^{1} ∫_{−1}^{1} φ_i(ξ_1) φ_j(ξ_2) φ_k(ξ_3) φ_l(ξ_1) φ_m(ξ_2) φ_n(ξ_3) dξ_1 dξ_2 dξ_3 (2.40)

⇔ M^{3D}_{ijk,lmn} = ∫_{−1}^{1} φ_i(ξ_1) φ_l(ξ_1) dξ_1 · ∫_{−1}^{1} φ_j(ξ_2) φ_m(ξ_2) dξ_2 · ∫_{−1}^{1} φ_k(ξ_3) φ_n(ξ_3) dξ_3 = M_il M_jm M_kn (2.41)

⇔ M^{3D} = M ⊗ M ⊗ M . (2.42)
The tensor-product structure of the basis induces a tensor-product structure in the element
matrices. As in the one-dimensional case, the coordinate transformation introduces
metric coefficients into the matrices, e.g.
M_e^{3D} = (h_{1,e} h_{2,e} h_{3,e} / 8) M ⊗ M ⊗ M , (2.43)
where hi,e denotes the element width in direction i.
In a similar fashion, the tensor-product notation of many operators can be derived. For
instance a tensor-product version of the Helmholtz operator in three dimensions reads
H_e = d_{0,e} M ⊗ M ⊗ M + d_{1,e} M ⊗ M ⊗ L + d_{2,e} M ⊗ L ⊗ M + d_{3,e} L ⊗ M ⊗ M , (2.44)
where the coefficients di,e evaluate to
d_e = (h_{1,e} h_{2,e} h_{3,e} / 8) · ( λ, (2/h_{1,e})², (2/h_{2,e})², (2/h_{3,e})² )^T . (2.45)
While the evaluation of the Helmholtz operator in general requires O(p^6) operations, (2.44) allows for an evaluation in O(p^4) operations. The tensor-product structure of the operators constitutes one key element to well-performing spectral-element solvers and will be harnessed throughout
this work.
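A sketch of this O(p^4) evaluation (a NumPy illustration; the function and variable names are invented here): each Kronecker factor in (2.44) is applied as a one-dimensional matrix along one axis of the (p+1)^3 coefficient array, and the result is checked against the explicitly assembled operator.

```python
import numpy as np

def apply_dir(A, u, axis):
    # Apply the 1D matrix A along one axis of the 3D coefficient array u
    return np.moveaxis(np.tensordot(A, u, axes=(1, axis)), 0, axis)

def helmholtz_apply(u, M, L, d):
    # He u per (2.44); u[k, j, i] with i the fastest index, so L in the
    # last Kronecker slot acts along axis 2
    v = d[0] * apply_dir(M, apply_dir(M, apply_dir(M, u, 2), 1), 0)
    v += d[1] * apply_dir(M, apply_dir(M, apply_dir(L, u, 2), 1), 0)
    v += d[2] * apply_dir(M, apply_dir(L, apply_dir(M, u, 2), 1), 0)
    v += d[3] * apply_dir(L, apply_dir(M, apply_dir(M, u, 2), 1), 0)
    return v

# Verify against the assembled operator for a small random example
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)); M = M + M.T
L = rng.standard_normal((n, n)); L = L + L.T
d = [0.7, 1.0, 2.0, 3.0]
u = rng.standard_normal((n, n, n))
H = d[0] * np.kron(M, np.kron(M, M)) + d[1] * np.kron(M, np.kron(M, L)) \
  + d[2] * np.kron(M, np.kron(L, M)) + d[3] * np.kron(L, np.kron(M, M))
ok = np.allclose(H @ u.ravel(), helmholtz_apply(u, M, L, d).ravel())
```

Each directional application costs O(p^4), while assembling and applying H directly costs O(p^6) per element.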
2.3.4 Fast diagonalization method
The fast diagonalization method is a standard technique for fast application of inverse op-
erators [89, 23]. The method is founded on the generalized eigenvalue decomposition of
two matrices, for instance M and L. While typically utilized to facilitate an inverse for
the Helmholtz operator, every tensor-product operator using only these two matrices is
treatable. Here, the Helmholtz operator (2.44)
H_e = d_{0,e} M ⊗ M ⊗ M + d_{1,e} M ⊗ M ⊗ L + d_{2,e} M ⊗ L ⊗ M + d_{3,e} L ⊗ M ⊗ M
is considered. Due to M being symmetric positive definite, a generalized eigenvalue decom-
position of L with regard to M is possible. I.e. search for eigenvalues λ ∈ R such that
∃ v ∈ Rnp : Lv = λMv . (2.46)
As M is invertible, using the notation M^{1/2} with M^{1/2} M^{1/2} = M, the above can be cast
into

⇒ ∃ v ∈ R^{n_p} : M^{−1/2} L M^{−1/2} (M^{1/2} v) = λ (M^{1/2} v) (2.47)

⇔ ∃ v ∈ R^{n_p} : M^{−1/2} L M^{−1/2} v = λ v . (2.48)
The above is an eigenvalue decomposition that additionally transforms the mass matrix to
identity. When storing the scaled eigenvectors in a matrix S and the eigenvalues in a diagonal
matrix Λ, the above can be written as
S^T M S = I (2.49a)
S^T L S = Λ . (2.49b)
2.4 Performance of basic Helmholtz solvers 19
With L being symmetric and positive semi-definite, existence of a solution is guaranteed
and the eigenvalues are non-negative. However, the resulting transformation matrix is non-
orthogonal and possesses the properties
S S^T = M^{−1} (2.50a)
S^{−1} = S^T M . (2.50b)
Using these identities, the operator (2.44) can now be written as
H_e = (S^{−T} S^T) ⊗ (S^{−T} S^T) ⊗ (S^{−T} S^T) · H_e · (S S^{−1}) ⊗ (S S^{−1}) ⊗ (S S^{−1}) , with each factor S^{−T} S^T = S S^{−1} = I , (2.51)

⇒ H_e = (S^{−T} ⊗ S^{−T} ⊗ S^{−T}) D_e (S^{−1} ⊗ S^{−1} ⊗ S^{−1}) , (2.52)
where
D_e = (S^T ⊗ S^T ⊗ S^T) H_e (S ⊗ S ⊗ S) (2.53)

⇒ D_e = d_{0,e} I ⊗ I ⊗ I + d_{1,e} I ⊗ I ⊗ Λ + d_{2,e} I ⊗ Λ ⊗ I + d_{3,e} Λ ⊗ I ⊗ I . (2.54)
Here, De is the diagonal matrix comprising the generalized eigenvalues of the three-dimen-
sional element operator. If De is invertible, for instance due to λ being positive and there-
fore d0,e > 0 or when only the interior of the element is considered, (2.52) can be explicitly
inverted to
H_e^{−1} = (S ⊗ S ⊗ S) D_e^{−1} (S^T ⊗ S^T ⊗ S^T) . (2.55)
While the presence of De leads to the operator not being in tensor-product form anymore,
the method allows construction of explicit inverses for operators and, furthermore, evaluation
of these in O(p^4) operations.
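The construction can be sketched as follows (a NumPy illustration with invented names, for a single 1D factor): the generalized eigenproblem (2.46) is reduced to a standard symmetric one via M^{−1/2}, giving S with S^T M S = I and S^T L S = Λ, from which an explicit inverse of H = λ_0 M + L follows as in (2.55):

```python
import numpy as np

def fast_diag(M, L):
    # Generalized eigendecomposition L v = lambda M v via the symmetric
    # transform M^{-1/2} L M^{-1/2}, cf. (2.47)-(2.48)
    w, V = np.linalg.eigh(M)                 # M = V diag(w) V^T with w > 0
    Mih = (V / np.sqrt(w)) @ V.T             # M^{-1/2}
    lam, Z = np.linalg.eigh(Mih @ L @ Mih)
    S = Mih @ Z                              # S^T M S = I, S^T L S = diag(lam)
    return S, lam

# Small SPD mass matrix and symmetric positive semi-definite stiffness matrix
rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n)); M = M @ M.T + n * np.eye(n)
L = rng.standard_normal((n, n)); L = L @ L.T

S, lam = fast_diag(M, L)
lam0 = 0.3                                   # Helmholtz parameter (example value)
H = lam0 * M + L
H_inv = S @ np.diag(1.0 / (lam0 + lam)) @ S.T    # explicit inverse
ok = np.allclose(H_inv @ H, np.eye(n))
```

In three dimensions, S and the eigenvalues enter through the tensor-product factors of (2.55), so the inverse is applied in O(p^4) operations without ever assembling H_e^{−1}.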
2.4 Performance of basic Helmholtz solvers
2.4.1 Considered preconditioners and solvers
To show-case the behavior of the spectral-element method as well as validate the baseline
implementation used in this work, basic Helmholtz solvers are considered, employing the
preconditioned Conjugate Gradient (pCG) method [118]. The pCG method allows one to directly
link the required number of iterations to the condition number of the system matrix and therefore
enables qualitative and quantitative study of the operator and the effect of preconditioners
on it.
In the end, multigrid-based preconditioners will be utilized as they negate the increase of the
condition number with the number of elements. Similar to the pCG methods, multigrid re-
quires one residual evaluation as well as a smoothing operation on each level [12, 13]. Multiple
options exist for the smoothers. Block-inverses facilitated by the fast diagonalization method
are one choice [88], as are block-Jacobi or block-Gauss-Seidel smoothers [72]. For SEM,
these consist of tensor-product operations, and their behavior for a constant number of elements
can be mimicked by element-local preconditioners. Two main options exist for local
tensor-product preconditioning: A diagonal Jacobi-type preconditioner is efficient and
easily implemented, but limited in effectiveness.
Using the fast diagonalization operator from Subsection 2.3.4 allows for a more intricate pre-
conditioner. When using the generalized eigenvalue decomposition on the interior element
only, the operator maps into the eigenspace of the inner element, edges, and faces, where the
inverse eigenvalues are applied. On the vertices, the diagonal preconditioner remains. The
strategy results in a block-Jacobi preconditioner.
Three solvers result from the techniques described above: an unpreconditioned CG solver
working on the full set of data, called fCG; a diagonally preconditioned CG method,
called dfCG; and a block-preconditioned variant, called bfCG.
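A minimal pCG in the spirit of these solvers (a generic NumPy sketch, not the thesis's Fortran implementation; passing the identity as preconditioner recovers plain CG, while a diagonal inverse mimics the dfCG variant):

```python
import numpy as np

def pcg(apply_A, b, apply_Pinv, tol=1e-10, maxit=500):
    # Preconditioned conjugate gradients for a symmetric positive definite A
    x = np.zeros_like(b)
    r = b.copy()                     # residual for the zero initial guess
    z = apply_Pinv(r)
    p, rz = z.copy(), r @ z
    for it in range(1, maxit + 1):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_Pinv(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, it

# Example: diagonally dominant SPD system with a Jacobi preconditioner
rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
x, iters = pcg(lambda v: A @ v, b, lambda v: v / np.diag(A))
```

In the actual solvers, `apply_A` is the matrix-free tensor-product Helmholtz operator and `apply_Pinv` the diagonal or fast-diagonalization block preconditioner.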
2.4.2 Setup
In a domain Ω = (0, 2π)^3, the manufactured solution

u_ex(x) = cos(µ(x_1 − 3x_2 + 2x_3)) · sin(µ(1 + x_1)) · sin(µ(1 − x_2)) · sin(µ(2x_1 + x_2)) · sin(µ(3x_1 − 2x_2 + 2x_3)) , (2.56)
is considered, which generalizes the one from [52] to three dimensions. The right-hand side
of the Helmholtz equation is evaluated analytically as

f(x) = λ u_ex(x) − ∆u_ex(x) . (2.57)
Inhomogeneous Dirichlet conditions are imposed on the boundary and the stiffness parameter
µ is set to 5, leading to a heavily oscillating right-hand side. Furthermore, the Helmholtz
parameter λ is chosen as λ = 0, corresponding to the Laplace equation and, hence,
a larger condition number of the system matrix and therefore a harder test case. The initial
guess is set to pseudo-random number inside the domain, and to the respective boundary
conditions on the boundary. This specific initial guess prevents overresolution of the resid-
ual, which would lead to fewer required iterations than expected in practice. After attaining
the numerical solution uh, it is interpolated to a mesh using polynomial degree 31, which is
deemed sufficient to resolve the solution, and the maximum error computed on the colloca-
tion nodes.
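As a concrete illustration of this setup, the manufactured solution (2.56) and right-hand side (2.57) can be sketched in a few lines. The following Python snippet is a stand-in for the thesis's Fortran code: it evaluates u_ex and approximates f via second-order central differences instead of the analytical differentiation; the names `u_ex` and `rhs_fd` are illustrative, not from the original implementation.

```python
import math

MU, LAM = 5.0, 0.0  # stiffness and Helmholtz parameters from the setup above

def u_ex(x1, x2, x3):
    """Manufactured solution (2.56)."""
    return (math.cos(MU * (x1 - 3*x2 + 2*x3)) * math.sin(MU * (1 + x1))
            * math.sin(MU * (1 - x2)) * math.sin(MU * (2*x1 + x2))
            * math.sin(MU * (3*x1 - 2*x2 + 2*x3)))

def rhs_fd(x, h=1e-4):
    """f = lambda*u - Laplacian(u), the Laplacian approximated by central differences."""
    lap = 0.0
    for d in range(3):
        xp, xm = list(x), list(x)
        xp[d] += h
        xm[d] -= h
        lap += (u_ex(*xp) - 2.0 * u_ex(*x) + u_ex(*xm)) / h**2
    return LAM * u_ex(*x) - lap
```

Refining h and observing that the computed f stabilizes gives a quick consistency check of the analytic right-hand side used in the actual solver.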
[Figure 2.4: Errors when solving the Helmholtz equation with a spectral-element method and varying the polynomial degree p ∈ {4, 8, 12, 16}. Left: maximum error ‖u − u_ex‖∞ over the number of elements per direction k. Right: error over the number of degrees of freedom per direction.]
The solvers were implemented in Fortran 2008 using double-precision floating-point numbers and compiled with the Intel Fortran compiler, with MPI_Wtime serving for time measurements. The tests were conducted on one node of the HPC system Taurus at ZIH Dresden, consisting of two sockets, each containing an Intel Xeon E5-2680 v3 with twelve cores running at 2.5 GHz. Of these, only one core computed during the tests, so that the algorithms, not the parallelization efficiency, were measured.
Two test cases are considered: In the first, the domain is discretized with spectral elements of degree p ∈ {4, 8, 12, 16}, and the number of elements is scaled from n_e = 4³ to n_e = 256³. This allows investigating the discretization error against the number of degrees of freedom and, therefore, validating the implementation. In the second case, n_e = 8³ spectral elements of polynomial degrees p ∈ {2, …, 16} discretize the domain. With the constant number of elements, the lack of a multigrid preconditioner does not hinder the solvers, allowing investigation of solution times as well as condition numbers. Hence, the number of iterations required to reduce the residual by ten orders of magnitude, n₁₀, was measured. Furthermore, runtime tests were conducted. To attain reproducible runtimes, the solvers were called 11 times, with the runtimes of the last 10 calls being averaged. This precludes measurement of instantiation effects, e.g. library loading, that would only occur in the first time step of a real-world simulation.
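The measurement protocol can be sketched as follows. This is an illustrative Python harness, not the thesis's Fortran driver; `timed_mean` and `solve` are hypothetical names, and `time.perf_counter` plays the role of MPI_Wtime.

```python
import time

def timed_mean(solve, n_warmup=1, n_runs=10):
    """Average runtime over n_runs calls, after discarding n_warmup warm-up calls."""
    for _ in range(n_warmup):
        solve()                  # absorbs one-time effects, e.g. library loading
    runtimes = []
    for _ in range(n_runs):
        t0 = time.perf_counter() # stand-in for MPI_Wtime
        solve()
        runtimes.append(time.perf_counter() - t0)
    return sum(runtimes) / len(runtimes)
```

With n_warmup = 1 and n_runs = 10, this reproduces the 11-call protocol described above.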
2.4.3 Results
Figure 2.4 depicts the maximum error of the discretization over the number of elements and degrees of freedom. For small numbers of elements and p = 4, the error exhibits values two
[Figure 2.5: Number of iterations and runtime per iteration and degree of freedom when solving the Helmholtz equation with locally preconditioned conjugate gradient methods. Three solvers are compared: an unpreconditioned conjugate gradient method (fCG), a diagonally preconditioned one (dfCG), and a block-preconditioned one (bfCG). Left: number of iterations n₁₀ to reduce the residual by ten orders of magnitude. Right: runtime per iteration and number of degrees of freedom (DOF).]
orders of magnitude higher than the maximum value of the solution, which is probably due
to underresolution of the right-hand side leading to aliasing errors. With higher polynomial
degrees and, hence, more degrees of freedom to resolve the boundary and initial conditions,
the error decreases. Increasing the number of elements decreases the error as well, until machine precision is reached at 10⁻⁹. For each tested polynomial degree, the error scales with h^(p+1) after entering the asymptotic regime, replicating the spectral convergence property (2.25) and therefore validating the implementation. When plotting over the number of degrees of freedom per direction, the differences in accuracy become less pronounced. For fewer than 100 degrees of freedom, no significant difference in the errors is present. When increasing the number of degrees of freedom per direction beyond 100, the asymptotic behavior becomes dominant, with a slope of p.
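The h^(p+1) scaling can be verified with a standard two-mesh order estimate: the observed order is log(e_coarse/e_fine)/log(h_coarse/h_fine). The Python helper below is an illustrative sketch; `observed_order` is a made-up name, not part of the thesis code.

```python
import math

def observed_order(e_coarse, e_fine, refinement=2.0):
    """Observed convergence order from errors on two meshes with h_coarse/h_fine = refinement."""
    return math.log(e_coarse / e_fine) / math.log(refinement)

# Errors behaving like C*h^(p+1) with p = 4 should recover an order of 5
# when the mesh is refined by a factor of 2:
p = 4
e_coarse = 1e-3
e_fine = e_coarse / 2 ** (p + 1)
```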
Figure 2.5 compares the required number of iterations for the three solvers as well as their runtime per iteration and degree of freedom for a constant number of elements, n_e = 8³, and a variable polynomial degree. As expected, the unpreconditioned solver requires the largest number of iterations to reduce the residual by ten orders of magnitude. Diagonal preconditioning only slightly decreases this number, exhibiting the same nearly linear increase in the number of iterations. As the elements possess only one interior point at p = 2, diagonal and block preconditioning amount to the same operation and, hence, generate the same number of iterations. For higher polynomial degrees, block-wise preconditioning generates a lower slope, and only half as many iterations are necessary at p = 16. This, however, does not translate to a lower runtime: the block-preconditioned solver has the largest runtime until p = 15 and generates only slight savings thereafter.
For all three solvers, the runtime per degree of freedom exhibits a linear increase with the polynomial degree. In combination with the increasing number of iterations, this amounts to a nearly constant runtime per iteration and degree of freedom, which is in stark contrast to the expectation. While the number of multiplications for the Helmholtz operator and the preconditioner scales with O(p⁴), the asymptotic regime is not reached, indicating that the implementations are not compute-bound. Hence, optimization potential exists even for these baseline solvers, stemming from operators which do not harness the full potential of the hardware. This is a known issue in the spectral-element community: matrix-matrix multiplication often ends up faster than operators exploiting the tensor-product structure, as the former is optimized for the hardware [17]. To attain a baseline variant against which more efficient algorithms can be compared, the current implementations need to be streamlined.
2.5 Summary
This section briefly recapitulated the spectral-element method. Compared to low-order methods, the polynomial ansatz of order p allows the error to scale with h^(p+1) and therefore permits arbitrary convergence orders. However, the higher error reduction comes at a cost. With tensor-product bases, the operators boil down to the consecutive application of 1D matrices in the respective dimensions of the elements, lowering the multiplication count from O(p^(2d)) to O(d·p^(d+1)). While this constitutes a significant improvement over matrix-matrix implementations, the operators still scale super-linearly with the number of degrees of freedom when increasing the polynomial degree. However, the simple implementation of such operators does not follow this prediction due to inefficiencies.
Chapter 3
Performance optimization for tensor-product operators
3.1 Introduction
Explicit time stepping schemes and iterative solvers consist of the recurring application of operators, for instance the gradient, divergence, interpolation, and Laplacian operators, which occupy a large portion of the runtime. With spectral elements and tensor-product bases, the operation count of these scales with O(p^(d+1) n_e), where d is the number of dimensions. The operations decompose into one-dimensional matrix products. While these bear similarity to batched DGEMM [84, 91], they work on non-contiguous dimensions, barring the direct usage of matrix multiplication implementations. Despite the operator complexity scaling with O(p⁴) in 3D, a direct matrix-matrix implementation is often more efficient for reasonable polynomial degrees, as showcased for two dimensions in [17].
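The two cost models can be compared directly. The Python sketch below (with hypothetical helper names) evaluates the multiplications per element for the assembled operator, n_p^(2d), against sum factorization, d·n_p^(d+1), for d = 3.

```python
def mults_matrix(n_p, d=3):
    """Multiplications per element for the assembled element matrix: n_p^(2d)."""
    return n_p ** (2 * d)

def mults_tensor(n_p, d=3):
    """Multiplications per element for sum factorization: d * n_p^(d+1)."""
    return d * n_p ** (d + 1)
```

Already at n_p = 2 the tensor-product count (48) undercuts the assembled one (64), and the gap widens rapidly; the question addressed in this chapter is why the implementations do not reflect this.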
This chapter investigates the performance of tensor-product operations in three dimensions, providing a structured approach toward their optimization. Due to their high relevance for applications in numerous scientific domains, the interpolation operator, Helmholtz operator, and fast diagonalization operator serve as prototypical examples. For these, efficient implementations are proposed and possible performance gains are investigated, from the single operator itself to a full Helmholtz solver. To this end, Section 3.2 investigates the performance of a loop-based implementation of the interpolation operator compared to a library-based implementation of the matrix-matrix multiplication. In Section 3.3, optimization strategies are discussed and their impact on the operator runtime evaluated. The approach is afterwards extended to the Helmholtz and the fast diagonalization operators in Section 3.4. These are then utilized in the Helmholtz solvers from Chapter 2 to showcase the attainable performance gains for the basic building blocks of time-stepping schemes in CFD. This chapter summarizes the work presented in [64], where the approach was subsequently applied to a fully-fledged PDE solver for combustion problems.
3.2 Basic approach for interpolation operator
3.2.1 Baseline operators
Continuous and discontinuous spectral-element formulations typically employ tensor-product elements with n_p points per direction. This approach allows decomposing operations into smaller ones working separately on the directions and, hence, opens up possibilities for structure exploitation. On these elements, interpolation can be required, e.g. for visualization. In the one-dimensional case, applying a matrix mapping from the n_p basis functions to n_p⋆ new ones, A ∈ R^(n_p⋆ × n_p), to the coefficient vector implements the operation. For the multi-dimensional case, the matrices can differ, e.g. A is employed in the first, B in the second, and C in the third dimension. The most prominent use case is n_p ≠ n_p⋆: the number of unknowns is increased or decreased, implementing the prolongation or restriction operation for multigrid. The case n_p = n_p⋆ is relevant as well and can, for instance, be employed for creating a polynomial cutoff filter to stabilize flow simulations [33]. This section addresses the latter case, assuming a constant polynomial degree p = n_p − 1 in every element and the simpler choice A = B = C. The operator is applied to a vector u ∈ R^(n_p³, n_e) in all three dimensions, computing the result vector v ∈ R^(n_p³, n_e):

∀ Ω_e : v_e ← (A ⊗ A ⊗ A) u_e ,   (3.1)

which can be written as

v ← (A ⊗ A ⊗ A) u   (3.2)

when interpreting v and u as matrices. While small, the operator incorporates matrix products in every dimension and, hence, contains all features and components present in larger operators.
As baseline implementations, two variants are considered. The first one assembles the operator into a matrix A ⊗ A ⊗ A =: A_3D ∈ R^(n_p³ × n_p³) and applies it in every element. This allows leveraging highly optimized matrix multiplications from libraries such as BLAS, and the resulting algorithm can be written in one line, i.e. v ← A_3D u. It couples n_p³ values with n_p³ values, requiring a total of n_p⁶ n_e multiplications. As it employs matrix-matrix multiplication implemented via GEMM from BLAS, it is called MMG. The second variant implements the tensor-product decomposition (3.1), consecutively applying three one-dimensional matrix products in separate dimensions via Algorithm 3.1. With strided access in two dimensions, BLAS
Algorithm 3.1: Loop-based implementation of the sum factorization of the interpolation operator using temporary storage arrays.
1: for Ω_e do
2:   for 1 ≤ i, j, k ≤ n_p do
3:     ū_ijk ← Σ_{l=1..n_p} A_il u_{ljk,e}   ▷ ū = (I ⊗ I ⊗ A) u_e
4:   end for
5:   for 1 ≤ i, j, k ≤ n_p do
6:     û_ijk ← Σ_{l=1..n_p} A_jl ū_{ilk}   ▷ û = (I ⊗ A ⊗ A) u_e
7:   end for
8:   for 1 ≤ i, j, k ≤ n_p do
9:     v_{ijk,e} ← Σ_{l=1..n_p} A_kl û_{ijl}   ▷ v_e = (A ⊗ A ⊗ A) u_e
10:  end for
11: end for
is not directly applicable and loops implement the variant. The result is termed “tensor-product loop” (TPL). Due to the three one-dimensional matrix multiplications, 3 n_p⁴ n_e multiplications occur. Hence, it requires fewer multiplications than MMG starting at n_p = 2.
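The equivalence of the two baseline variants can be illustrated in a few lines. The Python sketch below is a conceptual stand-in for the Fortran implementations: it applies both the assembled operator A ⊗ A ⊗ A and the three one-dimensional sweeps of Algorithm 3.1 to the same element data and checks that they agree.

```python
import itertools
import random

n_p = 3
random.seed(1)
A = [[random.uniform(-1.0, 1.0) for _ in range(n_p)] for _ in range(n_p)]
u = {idx: random.uniform(-1.0, 1.0)
     for idx in itertools.product(range(n_p), repeat=3)}

def sweep(w, dim):
    """Apply A along one dimension of the n_p x n_p x n_p cube w."""
    return {idx: sum(A[idx[dim]][l]
                     * w[tuple(l if d == dim else idx[d] for d in range(3))]
                     for l in range(n_p))
            for idx in w}

# Sum factorization: three one-dimensional sweeps, 3*n_p^4 multiplications.
v_tpl = sweep(sweep(sweep(u, 0), 1), 2)

# Assembled operator A (x) A (x) A: n_p^6 multiplications per element.
v_mmg = {idx: sum(A[idx[0]][l] * A[idx[1]][m] * A[idx[2]][n] * u[(l, m, n)]
                  for l, m, n in itertools.product(range(n_p), repeat=3))
         for idx in itertools.product(range(n_p), repeat=3)}
```

Both paths produce identical results; only the operation count, and as the following sections show, the achieved hardware utilization, differ.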
3.2.2 Runtime tests
The two variants MMG and TPL were implemented in Fortran 2008 using double precision, with the Intel Fortran compiler v. 2018 serving as compiler and the corresponding Intel Math Kernel Library (MKL) as BLAS implementation. For time measurements, MPI_Wtime was employed, and the optimization level was set to O3.
To compare the performance of the operators, runtime tests were conducted on the high-performance computer Taurus at ZIH Dresden. One node containing two Intel Xeon E5-2680 CPUs running at a clock speed of 2.5 GHz served as the measuring platform. As this chapter aims to improve the single-core performance, only one core was utilized, lest effects of parallelization, not the performance, be measured. When using AVX2 vector instructions and counting the fused multiply-add instruction (FMA) as two floating-point operations, the maximum available performance of one core computes to 40 GFLOP/s. Furthermore, 64 kB of L1 cache, 256 kB of L2 cache, and 30 MB of L3 cache are available [48].
The operator size was varied from n_p = 2 to n_p = 17, lying in the range of polynomial degrees currently employed in simulations, and the number of elements was changed from n_e = 8³ up to n_e = 16³ = 4096. For these parameters, the operators were run 101 times, with the runtime of the last 100 runs being averaged. This prevents measurement of one-time effects, e.g. the initialization of libraries such as MKL.
Figure 3.1 depicts the computational throughput of the operators, corresponding to the
number of computed entries in the result vector per second, called (Mega) Lattice Updates
[Figure 3.1: Performance of the two implementation variants for the interpolation operator when varying the number of elements n_e for two 1D operator sizes n_p. Left: n_p = 2, right: n_p = 7. In both cases, the performance is measured in mega lattice updates per second (MLUP/s).]
Per Second (MLUP/s). At n_p = 2, a large overhead is present for both variants, leading to a low performance at low numbers of elements, with MMG achieving 100 MLUP/s. Afterwards, the performance increases, saturating at 500 elements with 1300 MLUP/s and experiencing a slight drop at 3000 elements. TPL, however, does not show such behavior and stays slower. The matrix-matrix based implementation is up to a factor of 30 faster than TPL, which results from the operator size: at n_p = 2, the matrix A_3D has size 8 × 8, which is near optimal for the microarchitecture. The instruction set AVX2 utilizes 256 bit = 32 B wide vector registers, allowing for simultaneous computation on four double-precision values [48]. With these, an 8 × 8 matrix product maps well to the architecture and incurs nearly no overhead from loops. After peaking at 1300 MLUP/s, a decline in performance is present for MMG, corresponding to the change from loading data from the L2 cache to loading data from the slower L3 cache. In [48], a read bandwidth of 32 GB/s was measured for the L3 cache. Assuming that two doubles are loaded and one is stored per point and, hence, that the computation of one point requires 24 B, the roofline model predicts 1300 MLUP/s [134]. Hence, the variant MMG attains 4/5 of the maximum possible performance for large numbers of elements. The TPL variant, however, does not, producing only 50 MLUP/s in both cases.
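The quoted roofline bound is a one-line calculation from the stated bandwidth and traffic per point; the sketch below spells it out (the constants are the measured values cited above).

```python
L3_BANDWIDTH = 32e9       # B/s, read bandwidth measured for the L3 cache in [48]
BYTES_PER_POINT = 3 * 8   # two 8-B doubles loaded + one stored per result point

# Bandwidth-limited throughput in mega lattice updates per second:
mlups = L3_BANDWIDTH / BYTES_PER_POINT / 1e6
```

This evaluates to roughly 1333 MLUP/s, i.e. the ≈1300 MLUP/s ceiling observed for MMG.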
In the regime 2 < np < 7, using the assembled element matrix is faster than employing
the loop-based version. For np ≥ 7 TPL becomes faster, which is in agreement with other
studies: Typically, tensor-product variants are efficient only for high polynomial degrees and
matrix-matrix based implementations are utilized for the lower ones [17].
The measurements for ne = 4096 are shown in Figure 3.2, using throughput as well as the
rate of floating point operations measured in GFLOP/s. At np = 6, MMG attains a rate
[Figure 3.2: Performance of the two operators when using n_e = 4096 elements and increasing the 1D operator size n_p. Top: performance measured in GFLOP/s, with the threshold of 40 GFLOP/s being the maximum rate of floating-point operations the CPU core can execute. Bottom: performance in mega lattice updates per second (MLUP/s), with the threshold line corresponding to 50 MLUP/s.]
of 35 GFLOP/s and from then on extracts 90 % of the maximum possible number of floating-point instructions per second. Still, a decreasing number of mesh updates results, as the number of operations required per mesh point increases with n_p³. The variant TPL, on the other hand, utilizes less than an eighth of the available performance. However, it attains an approximately constant throughput of 50 MLUP/s, with n_p = 4, 8, 12 performing better due to the operator width being a multiple of the instruction-set width. With the constant number of updates and low floating-point utilization, the variant is far from compute-bound. This is due to the compiler: as can be seen from the assembly code, many loop-control constructs are present and the reduction loops are vectorized. The compiler optimizes the code for very large loops and is misled by the large number of small, deeply nested ones. The result is code that does not perform well for any relevant polynomial degree, and this chapter illustrates ways to achieve efficient implementations.
3.3 Compiling information for the compiler
3.3.1 Enhancing the interpolation operator
The last section showed what is common knowledge in the SEM community: for low to medium polynomial degrees, loop-based implementations are often slower than versions utilizing optimized libraries [17, 98], by a factor of up to 30. The highly optimized libraries make complexity-wise inferior algorithms excel by exploiting the full potential of the hardware. To transplant the behavior of these libraries to tensor-product operators, one has to understand why they shine. Most BLAS implementations received intensive manual optimization, as documented for GotoBLAS [42]. These optimizations include using input-size-dependent code, cache blocking for large operators, and manual unrolling of loops for known loop bounds. In conjunction with architecture-dependent parameters, further optimization can be achieved [49]. In the following, these techniques are explored, once by manual optimization and once by exploiting the libraries.
Compiling the current implementation of the interpolation operator generates a binary incorporating a large number of loop-control constructs, diminishing the performance. These stem from the limited amount of information presented to the compiler, confining it to optimizing for very large numbers of loop iterations. As a first measure, denoting the data size directly lowers the required number of loop-control operations and, furthermore, enables the compiler to determine which kind of optimization is beneficial [23]. As a first step, Algorithm 3.1 was implemented with compile-time-known loop bounds and data sizes, i.e. once for n_p = 1, 2, …, with the relevant variant being selected at runtime. While leading to a better performance, this implementation requires either code replication or meta-programming, both lowering the maintainability and increasing the binary size. As a parametrization of the operator is performed, the resulting variant of TPL is called TPL-Par. To further enable compiler optimizations, all one-dimensional matrix products were extracted into separate compilation units. Due to the lowered scope, these are easier to analyze, and as they get inlined afterwards, no function-call overhead occurs at runtime. For better cache blocking, the matrix products over the second and third dimension in Algorithm 3.1 were fused into one loop over the last dimension and, hence, one compilation unit. As this code transformation is the opposite of inlining, it is termed outlining here and the variant called TPL-ParO.
Over the last decades, the performance of compiler-generated code has increased significantly. However, the optimizations applied by compilers are not necessarily those that attain the best result. To evaluate the effectiveness of compiler optimizations as well as to gauge possible performance gains, manual refinement was performed on the outlined variant, with the product being TPL-Hand. For instance, the operands in the matrix products occur multiple times, but are recomputed every single time. Reusing them leads to fewer required loads and stores and, in turn, better performance. Furthermore, unroll and jam [49] was applied with widths 2 and 4, with 4 being used for n_p = 4, 8, 16 and 2 for the remaining even polynomial degrees. For n_p = 12, the unroll width was set to two to ensure that at least one data point remains in each cache line when using the stride of 96 B. Furthermore, the compiler tried vectorizing the reduction loops, which proved detrimental to the overall performance. To circumvent this behavior, the innermost non-reduction loop was vectorized by hand. Lastly, for n_p = 16, blocking of the matrix access led to further performance gains. The mentioned optimizations are directly applicable for even operator sizes; the implementations for odd ones in addition treat the remainder of the unroll-and-jam operation, leading to a lower performance for these.
Algorithm 3.2: (“TPG”) Variant of the interpolation operator optimized for contiguous data access patterns in the matrix multiplications. The matrices u and v are interpreted as R^(n_p × n_p² n_e) matrices. Cyclic permutation allows evaluating the operator with matrix-matrix products. One temporary storage, w ∈ R^(n_p × n_p² n_e), is required.
1: w ← P(u)   ▷ w = P(u)
2: v ← A w   ▷ v = P(A ⊗ I ⊗ I u)
3: w ← P(v)   ▷ w = P²(A ⊗ I ⊗ I u)
4: v ← A w   ▷ v = P²(A ⊗ A ⊗ I u)
5: w ← P(v)   ▷ w = A ⊗ A ⊗ I u
6: v ← A w   ▷ v = A ⊗ A ⊗ A u

Algorithm 3.3: (“TPG-Bl”) Variant of Algorithm 3.2. The operator is used on subsets I that fit in the L2 or L3 cache.
1: for each subset I do   ▷ subset I fits into the L2 or L3 cache
2:   w ← P(u_I)   ▷ w = P(u_I)
3:   v_I ← A w   ▷ v_I = P(A ⊗ I ⊗ I u_I)
4:   w ← P(v_I)   ▷ w = P²(A ⊗ I ⊗ I u_I)
5:   v_I ← A w   ▷ v_I = P²(A ⊗ A ⊗ I u_I)
6:   w ← P(v_I)   ▷ w = A ⊗ A ⊗ I u_I
7:   v_I ← A w   ▷ v_I = A ⊗ A ⊗ A u_I
8: end for
The above approach mimics the painstaking optimizations performed to gain well-performing code for matrix multiplications. An orthogonal approach lies in casting the tensor-product operations in a way that makes already-optimized libraries applicable. In (3.1), one matrix product per dimension is applied to the input vector u. While in Fortran the first dimension lies contiguously in memory, the others do not. Working on these results in strided memory accesses, barring the usage of DGEMM. However, cyclic permutation poses a remedy and makes DGEMM applicable again [16]. Here, the permutation operator P(·) moves the last direction to the front:

v = P(u) ⇔ ∀ 1 ≤ i, j, k ≤ n_p : v_{kij,e} = u_{ijk,e}   (3.3)
v = P²(u) ⇔ ∀ 1 ≤ i, j, k ≤ n_p : v_{jki,e} = u_{ijk,e}   (3.4)
v = P³(u) ⇔ ∀ 1 ≤ i, j, k ≤ n_p : v_{ijk,e} = u_{ijk,e} .   (3.5)

Using P(·), the interpolation operator transforms into a cascade of permutations, each followed by a call to DGEMM. This results in a variant utilizing GEMM for tensor products, which is called TPG and shown in Algorithm 3.2.
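The permute-then-multiply cascade of Algorithm 3.2 can be sketched conceptually as follows. In this Python stand-in (not the Fortran/MKL implementation), `apply_first` plays the role of the DGEMM call on the now-contiguous first index, and after three cyclic permutations the indices realign.

```python
import itertools
import random

n_p = 3
random.seed(2)
A = [[random.uniform(-1.0, 1.0) for _ in range(n_p)] for _ in range(n_p)]
u = {idx: random.uniform(-1.0, 1.0)
     for idx in itertools.product(range(n_p), repeat=3)}

def perm(w):
    """Cyclic permutation P: the last direction becomes the first, v_kij = u_ijk."""
    return {(k, i, j): w[(i, j, k)] for (i, j, k) in w}

def apply_first(w):
    """Matrix product over the first (contiguous) index -- the DGEMM step."""
    return {(i, j, k): sum(A[i][l] * w[(l, j, k)] for l in range(n_p))
            for (i, j, k) in w}

v = u
for _ in range(3):            # three permute-then-multiply stages
    v = apply_first(perm(v))  # P^3 = identity, so the result is A (x) A (x) A u
```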
The loop-based variants work on one element after the other, which leads to inherent cache
blocking. The variant TPG, however, operates multiple times on the whole data set. If it
does not fit into the faster caches, it will be loaded multiple times from L3 cache, or even
[Figure 3.3: Performance of the interpolation operator with the new variants (MMG, TPL, TPL-Par, TPL-ParO, TPL-Hand, TPG, TPG-Bl), depending on the 1D operator size n_p ∈ {2, 4, 8, 12} and the number of elements n_e.]
from RAM. To exploit the higher bandwidth of the caches, the data set is split into subsets of elements that fit into the L2 cache or, if not applicable, at least into the core's portion of the L3 cache, resulting in Algorithm 3.3. As a level of cache blocking was added, the variant is called TPG-Bl.
3.3.2 Results
The runtime tests from Section 3.2.2 were repeated including the new variants, resulting in
the data shown in Figure 3.3. While the matrix-matrix based variant remains the fastest
for np = 2, the parametrized variant at least attains a third of the performance. Using
[Figure 3.4: Performance of the variants of the interpolation operator with n_e = 4096 elements and increasing 1D operator size n_p. Left: performance measured in gigaflops (GFLOP/s). Right: performance measured in mega lattice updates per second (MLUP/s).]
smaller compilation units only leads to a slight further increase. However, additionally enforcing vectorization allows TPL-Hand to reach half the performance of MMG. The tensor-product variants employing GEMM use more loads and stores and, hence, do not reach the same levels, but are faster than TPL, with gains of up to a factor of ten. At n_p = 4, cache blocking retains for TPG-Bl the performance that TPG reaches at low numbers of elements. Both are faster than MMG and TPL-Par, with only the outlined variant outperforming them slightly. Further adding hand optimization allows for a factor of three in performance gains compared to TPL-ParO, and nine compared to TPL. For n_p = 8, the operator width fits the architecture very well, allowing the variants using GEMM to attain twice the performance of the parametrized variant. Cache blocking is required to retain a significant performance even for small numbers of elements. While for n_p = 12 the benefit of hand optimization diminishes to one quarter, it is required to attain good performance at n_p = 16.
Figure 3.4 depicts throughput as well as floating-point performance with the new operator variants. As before, the matrix-matrix variant is the fastest for n_p = 2, however only for that one operator size. From n_p = 3 on, the hand-optimized variant extracts the most performance, with distinct peaks present at n_p = 4, 8, 12, 16. There, nearly half of the maximum performance is reached, as no treatment of the remainder from unrolling or of shorter SIMD operations occurs. Using only parametrization and outlining does not attain the same performance: only half the performance of TPL-Hand results, a flop rate between 5 and 10 GFLOP/s, with a peak at n_p = 12. Even with smaller compilation units and known data and loop sizes, the compiler does not generate a well-performing binary, necessitating manual optimization. Compared to the loop-based variants, TPG-Bl quickly
[Figure 3.5: Roofline model for the interpolation operator (MMG, TPL, TPL-Hand), with “vec” indicating vectorization and “FMA” the fused multiply-add capability. The horizontal lines indicate the peak performance with the respective capabilities; the sloped line corresponds to the L3 bandwidth.]
reaches a stable 10 GFLOP/s and becomes faster than the non-optimized loop variants with a far lower optimization effort.
Figure 3.5 shows a roofline analysis of the operators when using n_e = 4096 elements, i.e. a diagram comparing the attainable performance against the achieved one [134]. The rate of floating-point operations is plotted over the computational intensity, the amount of work performed per loaded byte. Two limits bound the floating-point rate: the maximum number of operations the core can perform, marked by the top line, and the bandwidth for loading and storing data, resulting in the sloped line on the left. For the analysis, loading and storing from the L3 cache is assumed, as well as that the result is written only once. While this will over-estimate the computational intensity of the implementations, it shows the optimization potential. The variant MMG is memory-bound only for n_p = 2 and then quickly reaches peak performance. All features of the CPU are utilized, including vectorization and FMA. The hand-optimized variant is memory-bound for n_p ≤ 4 and attains half of the possible performance there. Beyond n_p = 4, the operation is compute-bound, and half of the executed operations contribute to the result. While this is small compared to the performance of the matrix product, operations are required to load the data for matrices and operands into the registers, limiting the potential of further optimization beyond 50 % of peak performance.
3.4 Extension to Helmholtz solver
3.4.1 Required operators
The last section investigated the attainable performance of tensor-product operations using the interpolation operator, as it constitutes the basic building block for larger operators. This section expands from this building block to the main component of incompressible fluid flow solvers, where the solution of the Helmholtz equations consumes a significant part of the runtime. Hence, the main ingredients of such solvers are considered here: operator and preconditioner.
As shown in Subsection 2.3.3, the element Helmholtz operator for hexahedral elements is

H_e = d₀ M ⊗ M ⊗ M + d₁ M ⊗ M ⊗ L + d₂ M ⊗ L ⊗ M + d₃ L ⊗ M ⊗ M .   (3.6)

Extracting the diagonal mass matrix M allows writing the application to a vector u as

v_e = d₀ ū_e + d₁ (I ⊗ I ⊗ L̄) ū_e + d₂ (I ⊗ L̄ ⊗ I) ū_e + d₃ (L̄ ⊗ I ⊗ I) ū_e ,   (3.7)

where

ū_e = (M ⊗ M ⊗ M) u_e   (3.8)
L̄ := L M⁻¹ .   (3.9)

With the diagonal mass matrix, application of the above formulation requires 6 n_p⁴ n_e + 5 n_p³ n_e operations; a loop-based implementation is shown in Algorithm 3.4.
With the operator in place, a preconditioner is still missing. For the sake of simplicity, the local preconditioning strategies from Chapter 2 are considered, one being diagonal and the other relying on the fast diagonalization. In three dimensions, the application of the associated operator (2.55) reads

v_e = (S ⊗ S ⊗ S) D_e (Sᵀ ⊗ Sᵀ ⊗ Sᵀ) u_e   (3.10)

and consists of the consecutive application of an interpolation – or, more precisely, a transformation to the generalized eigenspace of the element – the application of the eigenvalues, and the backward transformation. Hence, the algorithms for the interpolation operator are expanded by the application of a diagonal matrix and the backward interpolation, leading to 12 n_p⁴ n_e + n_p³ n_e operations per application with tensor products and 4 n_p⁶ n_e + n_p³ n_e for a matrix-matrix variant.
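The application (3.10) can be sketched with the same sweep machinery used for the interpolation operator. In the Python stand-in below, S and D are random placeholders rather than actual generalized eigenvectors and eigenvalues; the point is only the structure: forward transform via Sᵀ per direction, diagonal scaling, backward transform via S.

```python
import itertools
import random

n_p = 3
random.seed(4)
S = [[random.uniform(-1.0, 1.0) for _ in range(n_p)] for _ in range(n_p)]
St = [[S[l][i] for l in range(n_p)] for i in range(n_p)]  # S transposed
D = {idx: random.uniform(0.5, 2.0)                        # diagonal D_e, placeholder
     for idx in itertools.product(range(n_p), repeat=3)}
u = {idx: random.uniform(-1.0, 1.0)
     for idx in itertools.product(range(n_p), repeat=3)}

def sweep(w, B, dim):
    """Apply matrix B along one dimension of the cube w."""
    return {idx: sum(B[idx[dim]][l]
                     * w[tuple(l if d == dim else idx[d] for d in range(3))]
                     for l in range(n_p))
            for idx in w}

def tensor3(w, B):
    """(B (x) B (x) B) w via three one-dimensional sweeps."""
    for dim in range(3):
        w = sweep(w, B, dim)
    return w

w = tensor3(u, St)                        # forward transform into the eigenspace
w = {idx: D[idx] * w[idx] for idx in w}   # apply the (inverse) eigenvalues
v = tensor3(w, S)                         # backward transform
```

The two tensor3 calls account for the 12 n_p⁴ term in the operation count; the diagonal scaling contributes the n_p³ term.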
36 3 Performance optimization for tensor-product operators
Algorithm 3.4: Loop-based variant of the Helmholtz operator (3.7) using one temporary storage, ū ∈ R^{n_p^3}.
1: for Ω_e do
2:   for 1 ≤ i, j, k ≤ n_p do
3:     ū_ijk ← M_ii M_jj M_kk u_ijk,e
4:     v_ijk,e ← d_0 ū_ijk
5:   end for
6:   for 1 ≤ i, j, k ≤ n_p do
7:     v_ijk,e ← v_ijk,e + d_1 Σ_{l=1}^{n_p} L̄_il ū_ljk
8:   end for
9:   for 1 ≤ i, j, k ≤ n_p do
10:    v_ijk,e ← v_ijk,e + d_2 Σ_{l=1}^{n_p} L̄_jl ū_ilk
11:  end for
12:  for 1 ≤ i, j, k ≤ n_p do
13:    v_ijk,e ← v_ijk,e + d_3 Σ_{l=1}^{n_p} L̄_kl ū_ijl
14:  end for
15: end for
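The loops of Algorithm 3.4 can be sketched in a few lines of NumPy. The block below is an illustrative stand-in with random matrices (not actual spectral-element data) that checks the factorized form (3.7) against the assembled operator (3.6); note that NumPy's row-major flattening reverses the Kronecker factor order relative to the thesis's column-major notation.

```python
import numpy as np

def helmholtz_apply(d, M, Lbar, u):
    """Apply the factorized element Helmholtz operator (3.7).

    d    -- coefficients d_0..d_3
    M    -- diagonal of the 1D mass matrix (length n_p)
    Lbar -- Lbar = L M^{-1}, dense n_p x n_p
    u    -- element values, shape (n_p, n_p, n_p)
    Cost: three 1D contractions (O(n_p^4)) plus pointwise work (O(n_p^3))."""
    ubar = (M[:, None, None] * M[None, :, None] * M[None, None, :]) * u
    v = d[0] * ubar
    v = v + d[1] * np.einsum('il,ljk->ijk', Lbar, ubar)   # line 7 of Alg. 3.4
    v = v + d[2] * np.einsum('jl,ilk->ijk', Lbar, ubar)   # line 10
    v = v + d[3] * np.einsum('kl,ijl->ijk', Lbar, ubar)   # line 13
    return v

rng = np.random.default_rng(1)
n = 5
M = rng.random(n) + 1.0                       # diagonal GLL mass matrix stand-in
L = rng.standard_normal((n, n)); L = L + L.T  # symmetric stiffness stand-in
u = rng.standard_normal((n, n, n))
d = np.array([0.3, 1.0, 1.0, 1.0])

# reference: assembled operator (3.6); the C-order kron reverses the
# factor order of the thesis's column-major notation
Md = np.diag(M)
He = (d[0] * np.kron(Md, np.kron(Md, Md)) + d[1] * np.kron(L, np.kron(Md, Md))
      + d[2] * np.kron(Md, np.kron(L, Md)) + d[3] * np.kron(Md, np.kron(Md, L)))
v = helmholtz_apply(d, M, L / M[None, :], u)  # Lbar = L M^{-1} for diagonal M
assert np.allclose(v, (He @ u.ravel()).reshape(n, n, n))
```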
3.4.2 Operator runtimes
The tests performed on the interpolation operator were repeated for the Helmholtz and fast
diagonalization operators, with Figure 3.6 depicting the achieved performance for the Helmholtz
operator. As for the interpolation operator, the matrix-matrix variant is faster
for n_p = 2. However, starting from n_p = 4, the optimized loop variants outperform it.
Of these, the parametrized version TPL-Par extracts up to a quarter of the possible performance,
with outlining reliably increasing the performance. Further hand-tuning increases
the performance to 17.5 GFLOP/s, with the deterioration compared to the interpolation
stemming from using one diagonal operation first. While slow at first, TPG-Bl can match
hand-optimized routines for odd polynomial degrees and harnesses a quarter of the floating
point performance. However, its overall performance is lower than for the interpolation
operator. As the Helmholtz operator is additive, the result vector needs to be permuted in
addition to the input vector, requiring further temporary storage and bandwidth. The added
storage only allows for cache-blocking to the L2 cache until n_p = 8, and a performance degradation
is visible in the plot. However, the variant still attains a better performance than
the parametrized version. When comparing the maximal performance of the operators to
the one attained for interpolation, the peak is lower. This is due to the added application
of the mass matrix, which requires just one multiplication per point and is, hence, memory-bound.
Figure 3.7 depicts the attained performance for the fast diagonalization operator. In accordance
with the two other cases, parametrization generates a speedup of two to three
over TPL. Further hand-optimization increases the speedup to seven for operator sizes that
are a multiple of four and allows harvesting half of the peak performance. The more general
variant TPG-Bl, however, offers a fourth of the possible performance without manual optimization.

Figure 3.6: Performance of the implementation variants MMG, TPL, TPL-Par, TPG-Bl, TPL-Hand, and TPL-ParO for the Helmholtz operator (3.7) when using n_e = 4096 elements. Left: performance in GFLOP/s over n_p, right: throughput in MLUP/s over n_p.

Figure 3.7: Performance of the implementation variants MMG, TPL, TPL-Par, TPG-Bl, and TPL-Hand for the fast diagonalization operator (2.52) when using n_e = 4096 elements. Left: performance in GFLOP/s over n_p, right: throughput in MLUP/s over n_p.
The roofline analysis for the Helmholtz and fast diagonalization operators is shown in Figure 3.8.
For all relevant polynomial degrees, the operations are not memory-bound. But
only the matrix-matrix version attains peak performance, whereas the hand-optimized version
only extracts 50 %. As evident from the generated assembly code, the latter solely
utilizes vector instructions and the fused-multiply instruction. However, not all of these
serve the computation: loading the operands and matrices into the respective registers also
occupies instructions that are executed and, hence, cost runtime. With blocking to increase
register reusage, further performance gains are to be expected. These will, however, be
limited to a factor of two at most, not the factor of ten presented here.

Figure 3.8: Roofline models with “vec” indicating vectorization and “FMA” the fused multiply-add capability, showing performance in GFLOP/s over the computational intensity in FLOP/B, with rooflines for the peak with vec+FMA, vec without FMA, FMA without vec, neither, and the L3 bandwidth. Left: Helmholtz operator, right: fast diagonalization operator.

Table 3.1: Number of iterations to reduce the initial residual by ten orders, runtime per degree of freedom and iteration, and achieved speedup when using optimized operators for the solvers dfCG and bfCG.

                            Time per iteration and DOF [ns]
        Iterations          TPL              TPL-Hand           Speedup
 p      dfCG   bfCG         dfCG   bfCG      dfCG   bfCG        dfCG   bfCG
 3       117     92         58.8  117.8      25.2   28.8         2.3    4.1
 7       297    178         41.0   80.9      15.9   22.2         2.6    3.6
11       495    257         49.8   94.9      22.5   31.6         2.2    3.0
15       688    337         49.6   94.9      25.6   36.3         1.9    2.6
3.4.3 Performance gains for solvers
To evaluate the performance gains for solvers, the tests from Section 2.4 were repeated, with
the solvers dfCG and bfCG each implemented twice, once with the operator variant TPL
and once with TPL-Hand, leading to four solvers to compare. Here, n_e = 8^3 = 512 spectral
elements discretized the domain Ω = (0, 2π)^3, with the polynomial degree being varied between
p = 2 and p = 16. The solvers were run 11 times, reducing the initial residual by ten
orders of magnitude, with the last ten runs generating an average runtime; the required
number of iterations was measured as well.
Figure 3.9: Solver runtimes for diagonal and block-Jacobi preconditioning when utilizing the operator variants TPL and TPL-Hand. Left: required number of iterations n_10 to reduce the residual by ten orders of magnitude over the polynomial degree p, right: runtime per degree of freedom in µs over p.
Figure 3.9 depicts the number of iterations to reduce the residual by ten orders, n10, as
well as the runtime per degree of freedom for the two solvers and operator implementation
variants. For a polynomial degree of p = 2, the interior element has no separate degree of
freedom. Hence, diagonal and block preconditioning result in the same operation and, in
turn, the same number of iterations. For low polynomial degrees, the difference in iterations
is small, but continuously grows to a factor of two at p = 16. However, for TPL, this lower
number of iterations does not lead to a lower runtime. At p = 2, the runtime of bfCG is twice
that of dfCG, with the gap slowly shrinking and equal runtimes being present at p = 11.
Afterwards, the difference in runtime is minute. Introducing the optimized operators reduces
the runtime by a factor ranging between two and four, where four occurs for easily
optimizable polynomial degrees and two for the others. Furthermore, block preconditioning leads to
similar runtimes as diagonal preconditioning starting at p = 2, and retains a lower runtime
for p > 2. The optimization results in higher operator performance for operator sizes that
are a multiple of four, which directly translates to the runtimes of the solvers; for instance,
solving for p = 7 leads to a runtime per degree of freedom lower by a factor of two than
for p ∈ {4, . . . , 6}. This performance benefit makes the polynomial degrees 3, 7, 11, and 15
preferable in simulations: the throughput increases at the same number of degrees of freedom
with no incurred penalty.
Table 3.1 lists the test results for operator sizes that are a multiple of four. For both solvers,
the runtime per iteration depends on the preconditioner choice: for TPL, using the block
preconditioner increases the runtime two-fold. Furthermore, the runtime per iteration stagnates,
with the same time being used at p = 11 and p = 15. As mentioned beforehand, this
constitutes evidence that the implementation is memory-bound. For diagonal preconditioning,
TPL-Hand lowers the runtime by a factor of two, with a larger speedup present at low
polynomial degrees. The decline in speedup stems from the time per iteration increasing,
which is due to the constant performance of the operator. Lastly, the speedup when using
block preconditioning ranges from 2.5 to 4, allowing to significantly lower the runtime of
simulations.
3.5 Conclusions
This chapter evaluated the performance of tensor-product spectral-element operators, using
the interpolation, Helmholtz and fast diagonalization operators as examples. For simple
loop-based variants, the findings from [17] hold: A matrix-matrix multiplication based on the
assembled three-dimensional element operator can outperform sum factorization for small
polynomial degrees, with the factorization becoming viable after p = 6. However, a roofline
analysis showed that the operator does not harness the full potential of the hardware and
can be improved significantly.
Optimizations were performed for the loop-based variants. These included using constant
bounds, unroll and jam, blocking, and vectorization, and resulted in improved variants achieving
approximately 50 % of the peak performance. While the assembled matrix-matrix multiplication
remains faster for a polynomial degree of p = 2, this case is even more efficiently
implemented with sparse matrix-vector multiplications using the global degrees of freedom
instead of the element-local ones. The optimization approach was thereafter applied to the fast
diagonalization and Helmholtz operators, with 40 % and 50 % of peak performance being
attained, respectively.
After applying the optimization techniques to the main components of a Helmholtz solver,
the performance gains resulting for the solver itself were investigated. The augmented operators
reduced the overall computation time by a factor of 2 for diagonal preconditioning and
by a factor of up to 4 for tensor-product preconditioning. As the majority of the runtime
for incompressible fluid flow is spent in Helmholtz solvers, using these optimization techniques
leads to a significant reduction in the turnover time of computations. Here, only the case
of a Helmholtz solver was investigated; a study of the impact of the techniques on a
fully-fledged PDE solver can be found in [64].
While the performance gains lead to a significant reduction in the computation time, they
come at a cost: using one variant per polynomial degree leads to code replication, the
optimizations reduce the readability, and the combination of both degrades the maintainability.
To regain readability and maintainability while retaining the speed, automated knowledge-based
systems that generate the operators are required. Hence, further work will focus
on automating the generation of tensor-product operators using a domain-specific language.
First results are available in [127, 112].
Chapter 4
Fast Static Condensation – Achieving
a linear operation count
4.1 Introduction
The last chapter investigated the performance attainable with Helmholtz solvers based
on tensor-product operations. Significant performance gains over the baseline version were
attained. However, the number of operations increases with O(p · n_DOF), rendering the solution
expensive and even infeasible for large polynomial degrees. A solver scaling with O(n_DOF)
could easily reduce the runtime by one order of magnitude. When taking the memory gap
into account, the number of loads and stores is required to decrease as well, as otherwise the
algorithm would be memory-bound.
For the spectral-element method, static condensation allows eschewing the interior degrees
of freedom and provides a standard method to decrease both the number of unknowns and
the condition number of the system matrix. Static condensation is widely employed; for
instance, the first work on SEM capitalizes on it [105], as do more recent ones [21, 138].
In the references above, static condensation significantly increased the performance.
However, the number of operations still scales with O(p^4), as the main operator is not
matrix-free. To remain efficient at high polynomial degrees, linear complexity is required
throughout the entire solver, from the operator execution to the preconditioner to the remaining
operations inside an iteration. Moreover, the resulting element matrix differs between the
elements, necessitating O(p^4 n_e) loads and stores per operator evaluation. With the increasing
gap between memory bandwidth and compute performance, only a matrix-free evaluation
technique secures the future performance of a method.
This chapter develops a linearly scaling, matrix-free variant of the statically condensed Helmholtz
operator, lowering the runtime of Helmholtz solvers tenfold and, furthermore,
addressing the growing memory gap. The work is inspired by the three-dimensional version
of the matrix-matrix-based operator implementation from [52], which was later published
in [51]. The first factorization thereof was published in [57], with solvers being presented
in [60]. While these variants resulted in linear execution times of the iterations, they outperformed
unfactorized versions implemented via dense matrix-matrix multiplications only
for polynomial degrees p > 10. Current simulations, however, tend to use lower polynomial
degrees [8, 138, 87], so that a gain is often not achieved. Further factorizations, presented
in [62], allowed outpacing matrix-matrix variants down to a polynomial degree of p = 2.
The chapter is structured as follows: First, Section 4.2 introduces the main concept and
equations of static condensation for the general and specifically the three-dimensional
case. Then, the resulting operator is factorized into a linearly scaling version that is capable
of outpacing matrix-matrix multiplications in Section 4.3. Lastly, Section 4.4 proposes solvers
founded on these operators and investigates their viability in terms of the resulting condition
number of the system matrix as well as runtime.
4.2 Static condensation
4.2.1 Principal idea of static condensation
Typically, the solution process for a linear equation system involves every degree of freedom.
However, for elliptic equations, the values in the interior solely depend upon the boundary values
of the domain and the right-hand side. Hence, one can choose suitable subdomains and
algebraically eliminate their interior degrees of freedom from the equation system. This
technique is called static condensation, or Schur complement, and leads to a better condition number,
fewer algebraic unknowns, and, hence, a faster solution process.
Let Hu = F denote the Helmholtz problem, with u as solution variable and F as discrete
right-hand side. The values are divided into interior degrees of freedom, uI, and boundary
degrees of freedom, uB, as shown in Figure 4.1. Similarly, the matrix H is decomposed into
interaction between boundary and inner part,

( H_BB  H_BI ) ( u_B )   ( F_B )
( H_IB  H_II ) ( u_I ) = ( F_I ) ,    (4.1)

with the symmetry of the operator allowing for

H_IB = H_BI^T    (4.2)

⇒  ( H_BB    H_BI ) ( u_B )   ( F_B )
   ( H_BI^T  H_II ) ( u_I ) = ( F_I ) .    (4.3)
Figure 4.1: Division of degrees of freedom into inner degrees of freedom (subscript I) and boundary degrees (subscript B) for a two-dimensional element. Left: full tensor-product element of degree p = 5. Middle: only inner degrees of freedom. Right: only boundary degrees of freedom.
As the inner element operator H_II is invertible,

u_I = H_II^{-1} (F_I − H_IB u_B) ,    (4.4)

leading to

H_BB u_B = F_B − H_BI u_I
⇒ H_BB u_B = F_B − H_BI H_II^{-1} (F_I − H_IB u_B)    (4.5)
⇒ (H_BB − H_BI H_II^{-1} H_IB) u_B = F_B − H_BI H_II^{-1} F_I ,    (4.6)

with the condensed operator H̄ := H_BB − H_BI H_II^{-1} H_IB and the condensed right-hand side F̄ := F_B − H_BI H_II^{-1} F_I.
The above is a reduced equation system. It only incorporates the values on the domain
boundary, not the inner ones, and, hence, works on fewer degrees of freedom. However,
the resulting operator, H̄, is more complex. It consists of two parts: the primary part,
H_Prim = H_BB, is the restriction of the Helmholtz operator to the boundary nodes, whereas
the condensed part, H_Cond = H_BI H_II^{-1} H_IB, incorporates the interaction of the boundary
nodes with the inner element, and vice versa.
The method can be utilized in multiple ways. While it is traditionally used on a per-element
basis, e.g. [135], lowering the overall complexity of the algorithm, it can also serve
as a solver for the whole grid, as in [83], be employed for whole subdomains of the problem,
e.g. [117, 45], serve as the basis for p-multigrid techniques [52], or act as a preconditioner for
a DG scheme [50].
4.2.2 Static condensation in three dimensions
In this section, the static condensation method is used on a three-dimensional tensor-product
element utilizing Gauß-Lobatto-Legendre points, leading to the residual evaluation
algorithm utilized in the three-dimensional version of the multigrid solver from [52]. As
investigating one element suffices, the element index e is dropped in favor of readability.
Algorithm 4.1: Solution algorithm with static condensation.
1: for Ω_e do                      ▷ condense right-hand side
2:   F̄_e ← F_B,e − H_BI,e H_II,e^{-1} F_I,e
3:   ū_e ← u_B,e
4: end for
5:
6: ū ← Solution(H̄ ū = F̄)          ▷ solve condensed system
7:
8: for Ω_e do                      ▷ regain inner nodes
9:   u_B,e ← ū_e
10:  u_I,e ← H_II,e^{-1} (F_I,e − H_IB,e u_B,e)
11: end for
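The condensation steps of Algorithm 4.1 can be illustrated on a single random symmetric positive definite system. The NumPy sketch below (a stand-in, with explicit inverses that a production code would avoid) verifies that the condensed solve (4.6) plus the recovery step (4.4) reproduces the monolithic solution:

```python
import numpy as np

rng = np.random.default_rng(2)
nB, nI = 6, 4                               # boundary and interior unknowns
A = rng.standard_normal((nB + nI, nB + nI))
H = A @ A.T + (nB + nI) * np.eye(nB + nI)   # SPD stand-in for the operator
F = rng.standard_normal(nB + nI)

HBB, HBI = H[:nB, :nB], H[:nB, nB:]
HIB, HII = H[nB:, :nB], H[nB:, nB:]
FB, FI = F[:nB], F[nB:]

# condensed system (4.6): (HBB - HBI HII^-1 HIB) uB = FB - HBI HII^-1 FI
HII_inv = np.linalg.inv(HII)                # explicit inverse for illustration
Hbar = HBB - HBI @ HII_inv @ HIB
Fbar = FB - HBI @ HII_inv @ FI
uB = np.linalg.solve(Hbar, Fbar)
uI = HII_inv @ (FI - HIB @ uB)              # recovery of the inner values (4.4)

# the condensed solve reproduces the monolithic solution
u = np.linalg.solve(H, F)
assert np.allclose(np.concatenate([uB, uI]), u)
```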
Figure 4.2: Explosion view of a three-dimensional spectral element using a tensor-product basis with Gauß-Lobatto-Legendre points. Left: three-dimensional tensor-product element using Gauß-Lobatto-Legendre points of polynomial degree p = 3. Right: explosion view of the element with compass notation for the faces (for clarity of presentation, only the element faces are shown).
Due to the tensor-product basis using GLL points, the element boundary is directly separable.
It can be decomposed into three non-overlapping entities: element vertices, edges, and faces.
There are eight vertices in every element, leading to eight data points. Similarly, there are
twelve edges with n_I = p − 1 points per edge and, hence, 12 n_I data points being associated
with the edges. Lastly, there are n_I^2 points per face and thus 6 n_I^2 facial degrees of freedom.
The statically condensed Helmholtz operator consists of a primary and a condensed part. The
primary part is the restriction of the Helmholtz operator to the boundary nodes, i.e. faces,
edges, and vertices. To obtain a specific suboperator, one can multiply the respective
restriction operators onto the Helmholtz operator. For instance, for the faces east and
west, when using the compass notation from Figure 4.2, the respective degrees of freedom u_Fe
and u_Fw correspond to u_ijk with i = p, j ∈ I, k ∈ I and u_ijk with i = 0, j ∈ I, k ∈ I,
respectively, with the index set I = 1 . . . n_I. Hence, the restriction operators for the faces are

R_Fe = (0 I 0) ⊗ (0 I 0) ⊗ (0 . . . 0 1)    (4.7)
R_Fw = (0 I 0) ⊗ (0 I 0) ⊗ (1 0 . . . 0) ,    (4.8)

where I is the identity matrix of appropriate size and 0 denotes row and column vectors
containing only zeroes. For example, the restriction operator for p = 4 reads

       ( 0 1 0 0 0 )   ( 0 1 0 0 0 )
R_Fw = ( 0 0 1 0 0 ) ⊗ ( 0 0 1 0 0 ) ⊗ ( 1 0 0 0 0 ) .    (4.9)
       ( 0 0 0 1 0 )   ( 0 0 0 1 0 )
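The restriction operators (4.7)–(4.9) can be built directly with Kronecker products. The NumPy sketch below (illustrative only) constructs R_Fw for p = 4 and checks that it selects the inner face values; note that NumPy's row-major flattening makes the last Kronecker factor act on the last array index, the reverse of the thesis's column-major convention.

```python
import numpy as np

def inner(n):
    """(0 I 0): restriction to the n - 2 interior 1D nodes."""
    E = np.zeros((n - 2, n))
    E[:, 1:-1] = np.eye(n - 2)
    return E

n = 5                                    # p = 4, as in example (4.9)
e0 = np.zeros((1, n)); e0[0, 0] = 1.0    # (1 0 ... 0): pick boundary node 0

E = inner(n)
RFw = np.kron(E, np.kron(E, e0))         # face restriction as in (4.8)
assert RFw.shape == ((n - 2) ** 2, n ** 3)

u = np.arange(n ** 3, dtype=float)
U = u.reshape(n, n, n)                   # C ordering: the last kron factor
                                         # acts on the last array index
assert np.allclose(RFw @ u, U[1:-1, 1:-1, 0].ravel())
```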
The restrictions to the east and west face, respectively, lead to the operator

H_FwFe = R_Fw H R_Fe^T ,    (4.10)

which equates to

H_FwFe = d_0 M_II ⊗ M_II ⊗ M_0p + d_1 M_II ⊗ M_II ⊗ L_0p
       + d_2 M_II ⊗ L_II ⊗ M_0p + d_3 L_II ⊗ M_II ⊗ M_0p ,    (4.11)

where I is utilized as a short-hand for the inner part of the corresponding matrix or vector,
e.g.

M_Ip = (M_1p . . . M_{n_I p})^T .    (4.12)

The diagonal mass matrix simplifies the above to

H_FwFe = d_1 M_II ⊗ M_II ⊗ L_0p .    (4.13)
Each face-to-face, face-to-edge, edge-to-face, edge-to-edge, edge-to-vertex, vertex-to-edge, and
vertex-to-vertex operator can be derived in a similar fashion. Of these, the face-to-face
operators have the highest operational complexity and need to be investigated. Table 4.1
lists their tensor-product forms. As they utilize at most two-dimensional tensor
products working on n_I^2 data points per face, the primary part H_Prim can be implemented
in O(n_I^3) operations.
Where the primary part inherits the tensor-product structure of the full operator, the condensed
part H_Cond is more convoluted. It incorporates not only the interaction between the
boundary nodes and the inner element, but the inverse Helmholtz operator in the inner
element, H_II^{-1}, as well. The inner element is associated with the basis functions φ_ijk
with i, j, k ∈ I, and the operator can be retrieved via restriction. The boundary, however,
is more complicated. For dense matrices M and L, the tensor-product structure of
the Helmholtz operator (2.44) couples every data point with every other data point in
the element. But the approximated GLL mass matrix M is diagonal. Hence, the mass
term d_0 M ⊗ M ⊗ M only maps one point onto itself, whereas the other tensor products,
e.g. L ⊗ M ⊗ M, only work along mesh lines in the element, in this example in the ξ_3 direction.
The diagonal mass matrix leads to vertices only mapping to vertices and edges, edges
only mapping to vertices, edges, and faces, and faces only mapping to edges, faces, and the
inner element. Hence, the operators H_BI and H_IB consist of the interaction between the
interior of the element and the element faces.

The inner element Helmholtz operator H_II can be written in tensor-product form. However,
the inverse of a tensor-product operator is not necessarily a tensor-product as well.
In general, it takes O(n_I^6) operations to apply the inverse, and even the fast diagonalization
technique from Subsection 2.3.4 only reduces this to O(n_I^4). Compared to a matrix-matrix
implementation of the condensed part, no complexity gain is present when expressing the
inverse with fast diagonalization. Hence, implementations typically realize the face-to-face
interaction with matrix multiplication, as e.g. done in [52, 51, 21]. The suboperators from
face to face are precomputed, stored, and reutilized in every application of the operator. The
resulting algorithm for the evaluation of the condensed part is shown in Algorithm 4.2. It
utilizes 36 matrix multiplications to map from each face to each face, leading to 36 n_I^4
required multiplications. When storing the face values in a single array, one sole call to a
well-optimized matrix-matrix multiplication, for instance DGEMM from BLAS, suffices as
implementation.

Algorithm 4.2: Evaluation of the condensed part using matrix-matrix products with I = {e, w, n, s, t, b}.
1: for i ∈ I do
2:   v_Fi ← Σ_{j∈I} H_Cond_FiFj u_Fj
3: end for
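The remark on DGEMM can be illustrated as follows: stacking the six face-value vectors turns the 36 block products of Algorithm 4.2 into one large matrix product, i.e. a single optimized GEMM/GEMV call. The blocks below are random stand-ins for the precomputed condensed suboperators.

```python
import numpy as np

rng = np.random.default_rng(5)
n_i = 4
nf = n_i ** 2                              # n_I^2 points per face
# random stand-ins for the precomputed condensed blocks H_Cond_FiFj
blocks = rng.standard_normal((6, 6, nf, nf))
u_faces = rng.standard_normal((6, nf))

# Algorithm 4.2 as written: 36 separate block products
v_loop = np.zeros((6, nf))
for i in range(6):
    for j in range(6):
        v_loop[i] += blocks[i, j] @ u_faces[j]

# stacking the face values turns the loop into one large product,
# i.e. a single optimized GEMM/GEMV call on a 6 nf x 6 nf matrix
Hcond = blocks.transpose(0, 2, 1, 3).reshape(6 * nf, 6 * nf)
v_stacked = (Hcond @ u_faces.ravel()).reshape(6, nf)
assert np.allclose(v_loop, v_stacked)
```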
4.3 Factorization of the statically condensed Helmholtz operator
The operator evaluation technique from Algorithm 4.2 possesses two main drawbacks: First,
it scales super-linearly with the number of degrees of freedom, i.e. with O(n_p^4). Hence,
static condensation only leads to a better conditioned, reduced equation system [21]. For
a system consisting of O(n_I^2) values instead of O(n_I^3), fewer operations are to be expected.
Second, the matrices are precomputed and need to be stored. For non-homogeneous meshes,
the element widths and, hence, the coefficients d_i,e vary, requiring one set of matrices per
element. At a polynomial degree of p = 15, these occupy 2.1 GB of memory at double
precision – per element. While the operator complexity is an expensive inconvenience, the
memory requirement makes computations for inhomogeneous meshes next to impossible.
This section will first lift the latter restriction by introducing a matrix-free, tensor-product
formulation of the operator, then the former by factorizing it to linear complexity, and
lastly demonstrate the attained linearity via runtime tests.

Table 4.1: Suboperators of the primary part, face-to-face interaction, when assuming Gauß-Lobatto-Legendre points. Only non-zero entries are listed. Due to the symmetry of the GLL collocation nodes, the identities M_00 = M_pp, L_0p = L_p0, and L_00 = L_pp were applicable.

i  j   H_FiFj
w  w   d_0 M_II ⊗ M_II ⊗ M_00 + d_1 M_II ⊗ M_II ⊗ L_00 + d_2 M_II ⊗ L_II ⊗ M_00 + d_3 L_II ⊗ M_II ⊗ M_00
e  w   d_1 M_II ⊗ M_II ⊗ L_p0
e  e   H_FwFw
w  e   H_FeFw^T = H_FeFw
s  s   d_0 M_II ⊗ M_00 ⊗ M_II + d_1 M_II ⊗ M_00 ⊗ L_II + d_2 M_II ⊗ L_00 ⊗ M_II + d_3 L_II ⊗ M_00 ⊗ M_II
n  s   d_2 M_II ⊗ L_p0 ⊗ M_II
n  n   H_FsFs
s  n   H_FnFs^T = H_FnFs
b  b   d_0 M_00 ⊗ M_II ⊗ M_II + d_1 M_00 ⊗ M_II ⊗ L_II + d_2 M_00 ⊗ L_II ⊗ M_II + d_3 L_00 ⊗ M_II ⊗ M_II
t  b   d_3 L_p0 ⊗ M_II ⊗ M_II
t  t   H_FbFb
b  t   H_FtFb^T = H_FtFb
4.3.1 Tensor-product decomposition of the operator
The condensed part H_Cond consists of three suboperators: first, the boundary-to-inner
part H_IB; second, the inverse of the inner element Helmholtz operator, H_II^{-1}; third, the
inner-to-boundary interaction H_BI. For the operator to be cast into a tensor-product form,
and hence allow for a better complexity, each sub-part requires a tensor-product notation.
The operator from the boundary to the inner element, H_IB, is constructed by restricting
the Helmholtz operator to the boundary and the inner element, respectively. The inner
element is associated with the basis functions φ_ijk with i, j, k ∈ I. Hence, the restriction
operator for it is

R_I = (0 I 0) ⊗ (0 I 0) ⊗ (0 I 0) .    (4.14)

As explained above, only the faces map to the interior of the element. The operators for all
six faces are similar. Without loss of generality, the east face is utilized for the derivation of the
suboperator. The other five variants can be derived in the same fashion.
Multiplying the restriction operators (4.14) and (4.7) onto the Helmholtz operator leads
to

H_IFe = R_I H R_Fe^T    (4.15)

⇒ H_IFe = [(0 I 0) ⊗ (0 I 0) ⊗ (0 I 0)] H [(0 I 0)^T ⊗ (0 I 0)^T ⊗ (0 . . . 0 1)^T] .    (4.16)

When using the tensor-product representation of the Helmholtz operator (2.44), the above
equates to

H_IFe = d_0 M_II ⊗ M_II ⊗ M_Ip + d_1 M_II ⊗ M_II ⊗ L_Ip
      + d_2 M_II ⊗ L_II ⊗ M_Ip + d_3 L_II ⊗ M_II ⊗ M_Ip ,    (4.17)

where I is utilized as a short-hand for the inner part of the corresponding matrix or vector,
e.g.

M_Ip = (M_1p . . . M_{n_I p})^T .    (4.18)

As the mass matrix associated with the GLL points is diagonal, M_Ip = 0 and the operator
further simplifies to

H_IFe = d_1 M_II ⊗ M_II ⊗ L_Ip .    (4.19)
The above can be applied to a vector u_Fe in n_I^3 + O(n_I^2) multiplications when using the
following evaluation order:

H_IFe u_Fe = d_1 (M_II ⊗ M_II ⊗ L_Ip) u_Fe = d_1 (I ⊗ I ⊗ L_Ip)(M_II ⊗ M_II) u_Fe .    (4.20)

Due to the diagonal mass matrix, the right tensor product is a diagonal matrix applied
to the face using n_I^2 operations. The second operator, I ⊗ I ⊗ L_Ip, expands from the face into
the element, i.e. from n_I^2 to n_I^3 points, and as it uses one multiplication per generated data
point, n_I^3 multiplications are required.
The Helmholtz operator is symmetric. Hence, the operator from the inner element back
to the face, H_FeI, is the transpose of the above, i.e.

H_FeI = d_1 M_II ⊗ M_II ⊗ L_pI ,    (4.21)

and can be evaluated in O(n_I^3) multiplications as well when reversing the evaluation order:

H_FeI u_I = d_1 (M_II ⊗ M_II ⊗ L_pI) u_I = d_1 (M_II ⊗ M_II)(I ⊗ I ⊗ L_pI) u_I .    (4.22)

First, the dimension reduction on n_I^3 values requires n_I^3 multiplications, then the diagonal
tensor product utilizes n_I^2 multiplications, so that with this reverse evaluation order – first
restrict to the face, then apply the diagonal – the operator can be evaluated in n_I^3 + O(n_I^2)
multiplications. The other operators can be derived in a similar fashion and are listed in Table 4.2.

Table 4.2: Operators from the element faces to the inner element and vice versa, with the face index i corresponding to the compass notation shown in Figure 4.2.

i   H_FiI                          H_IFi
w   d_1 M_II ⊗ M_II ⊗ L_0I        d_1 M_II ⊗ M_II ⊗ L_I0
e   d_1 M_II ⊗ M_II ⊗ L_pI        d_1 M_II ⊗ M_II ⊗ L_Ip
s   d_2 M_II ⊗ L_0I ⊗ M_II        d_2 M_II ⊗ L_I0 ⊗ M_II
n   d_2 M_II ⊗ L_pI ⊗ M_II        d_2 M_II ⊗ L_Ip ⊗ M_II
b   d_3 L_0I ⊗ M_II ⊗ M_II        d_3 L_I0 ⊗ M_II ⊗ M_II
t   d_3 L_pI ⊗ M_II ⊗ M_II        d_3 L_Ip ⊗ M_II ⊗ M_II
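The two-step evaluation of a face-to-interior operator can be sketched in NumPy. The matrices below are random stand-ins for M_II and L_Ip, and the C-order Kronecker convention is used (last factor acting on the last index), so only the structure, not the thesis's index ordering, is reproduced:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                      # n_I interior points per direction
M = rng.random(n) + 1.0                    # diagonal of M_II (stand-in)
L_Ip = rng.standard_normal(n)              # column L_Ip coupling face to interior
d1 = 0.7
u_face = rng.standard_normal((n, n))       # n_I^2 values on the east face

# factorized evaluation as in (4.20): diagonal scaling on the face
# (n_I^2 multiplies), then expansion into the element (n_I^3 multiplies)
scaled = d1 * (M[:, None] * M[None, :]) * u_face
v = np.einsum('c,ab->abc', L_Ip, scaled)

# reference: assembled H_IFe = d1 M_II (x) M_II (x) L_Ip
H_IFe = d1 * np.kron(np.diag(M), np.kron(np.diag(M), L_Ip[:, None]))
assert np.allclose(v.ravel(), H_IFe @ u_face.ravel())
```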
To gain a tensor-product formulation for the inverse of the inner element Helmholtz operator,
a tensor-product formulation of the operator itself is required. The inner element
operator is obtained by restricting the Helmholtz operator to the inner element, i.e. applying
the restriction operator R_I from (4.14) from the left and right:

H_II = R_I H R_I^T .    (4.23)

Substituting the tensor-product formulation of the operator (2.44) into the above leads to

H_II = d_0 M_II ⊗ M_II ⊗ M_II + d_1 M_II ⊗ M_II ⊗ L_II
     + d_2 M_II ⊗ L_II ⊗ M_II + d_3 L_II ⊗ M_II ⊗ M_II .    (4.24)

Using the fast diagonalization from Subsection 2.3.4, the inner element Helmholtz operator
can be expressed using (2.52) as

H_II = [(M_II S_II) ⊗ (M_II S_II) ⊗ (M_II S_II)] D_II [(S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T M_II)] ,    (4.25)

where S_II is the transformation matrix for the inner element and D_II the diagonal matrix
containing the eigenvalues of the inner element Helmholtz operator computed by (2.54):

D_II = d_0 I ⊗ I ⊗ I + d_1 I ⊗ I ⊗ Λ + d_2 I ⊗ Λ ⊗ I + d_3 Λ ⊗ I ⊗ I .    (4.26)
As the homogeneous Dirichlet problem is solvable, the operator is invertible, leading to
the inverse (2.55)

H_II^{-1} = (S_II ⊗ S_II ⊗ S_II) D_II^{-1} (S_II^T ⊗ S_II^T ⊗ S_II^T) .    (4.27)

While (4.25) is a tensor-product operator, (4.27) is not. The inverse of the diagonal is not a
tensor-product anymore, but can still be applied in n_I^3 multiplications. However, the three-dimensional
tensor-product S_II ⊗ S_II ⊗ S_II requires 3 n_I^4 multiplications, as does its transpose.
While no order reduction is present, the explicit formulation for the inverse allows a
factorization of the operator.
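The fast diagonalization underlying (4.24)–(4.27) can be checked numerically: with a diagonal mass matrix, the generalized eigenvectors S with S^T M S = I follow from a symmetric standard eigenproblem for M^{-1/2} L M^{-1/2}. The 1D matrices below are random stand-ins, and the C-order Kronecker convention is used:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3                                    # n_I interior points per direction
M = rng.random(n) + 1.0                  # diagonal 1D mass matrix (stand-in)
A = rng.standard_normal((n, n))
L = A @ A.T + n * np.eye(n)              # SPD stand-in for the 1D stiffness
d = np.array([0.5, 1.0, 1.0, 1.0])

# generalized eigenproblem L S = M S Lam with S^T M S = I, solved via the
# symmetric standard problem for M^{-1/2} L M^{-1/2} (M is diagonal)
lam, Q = np.linalg.eigh(L / np.sqrt(M)[:, None] / np.sqrt(M)[None, :])
S = Q / np.sqrt(M)[:, None]

Md, I = np.diag(M), np.eye(n)
kron3 = lambda X, Y, Z: np.kron(X, np.kron(Y, Z))
HII = (d[0] * kron3(Md, Md, Md) + d[1] * kron3(Md, Md, L)
       + d[2] * kron3(Md, L, Md) + d[3] * kron3(L, Md, Md))
# eigenvalues of HII in the tensor-product eigenspace, cf. (4.26)
D = (d[0] * kron3(I, I, I) + d[1] * kron3(I, I, np.diag(lam))
     + d[2] * kron3(I, np.diag(lam), I) + d[3] * kron3(np.diag(lam), I, I))

St = kron3(S, S, S)
HII_inv = St @ np.diag(1.0 / np.diag(D)) @ St.T   # the inverse as in (4.27)
assert np.allclose(HII_inv @ HII, np.eye(n ** 3))
```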
4.3.2 Sum-factorization of the operator
With the suboperators mapping from the faces to the inner element from Table 4.2 and the
inverse of the inner element Helmholtz operator (4.27), explicit formulations with tensor
products are present for all operators required for the condensed part H_Cond. However,
evaluating just one face-to-face operator utilized in Algorithm 4.2, H_Cond_FiFj, in the form

H_Cond_FiFj = H_FiI H_II^{-1} H_IFj    (4.28)

requires n_I^3 multiplications for H_IFj, 3 n_I^4 for the transformation to the inner element
eigenspace, S_II^T ⊗ S_II^T ⊗ S_II^T, n_I^3 for the application of the eigenvalues, a further 3 n_I^4
for the backward transformation, and n_I^3 for the inner-to-face operator. Hence, directly using
the derived tensor-product operators in Algorithm 4.2 leads to 6 · 6 · (6 n_I^4 + 3 n_I^3 + O(n_I^2))
multiplications. Instead of 36 n_I^4 for the matrix-matrix implementation, 216 n_I^4 multiplications
are now required, not only rendering the algorithm itself inferior, but additionally losing the
benefit of using optimized matrix-matrix multiplications from libraries. However, all of these
operators share one similarity: They first map to the inner element, then transform into the
inner element eigenspace, apply the diagonal, transform back, and map back to a face. As
all suboperators, except the application of the diagonal, are present in tensor-product form,
factorization of the operator is possible. For instance, the operator from face east to face
west is

H_Cond_FwFe = d_1 (M_II ⊗ M_II ⊗ L_0I) H_II^{-1} d_1 (M_II ⊗ M_II ⊗ L_Ip) ,    (4.29)

with the first factor being H_FwI and the last H_IFe.
Expanding the inverse of the inner element Helmholtz operator via (4.27) yields

H_Cond_FwFe = d_1 (M_II ⊗ M_II ⊗ L_0I)(S_II ⊗ S_II ⊗ S_II) D_II^{-1} (S_II^T ⊗ S_II^T ⊗ S_II^T) d_1 (M_II ⊗ M_II ⊗ L_Ip)
⇒ H_Cond_FwFe = d_1 [(M_II S_II) ⊗ (M_II S_II) ⊗ (L_0I S_II)] D_II^{-1} d_1 [(S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T L_Ip)] ,    (4.30)

where the first bracket is the eigenspace-to-face operator H_FwE and the last the face-to-eigenspace operator H_EFe.
Table 4.3: Operators from the element faces to the inner element eigenspace and vice versa, with the face index i corresponding to the compass notation shown in Figure 4.2.

i   H_FiE                                          H_EFi
w   d_1 (M_II S_II) ⊗ (M_II S_II) ⊗ (L_0I S_II)    d_1 (S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T L_I0)
e   d_1 (M_II S_II) ⊗ (M_II S_II) ⊗ (L_pI S_II)    d_1 (S_II^T M_II) ⊗ (S_II^T M_II) ⊗ (S_II^T L_Ip)
s   d_2 (M_II S_II) ⊗ (L_0I S_II) ⊗ (M_II S_II)    d_2 (S_II^T M_II) ⊗ (S_II^T L_I0) ⊗ (S_II^T M_II)
n   d_2 (M_II S_II) ⊗ (L_pI S_II) ⊗ (M_II S_II)    d_2 (S_II^T M_II) ⊗ (S_II^T L_Ip) ⊗ (S_II^T M_II)
b   d_3 (L_0I S_II) ⊗ (M_II S_II) ⊗ (M_II S_II)    d_3 (S_II^T L_I0) ⊗ (S_II^T M_II) ⊗ (S_II^T M_II)
t   d_3 (L_pI S_II) ⊗ (M_II S_II) ⊗ (M_II S_II)    d_3 (S_II^T L_Ip) ⊗ (S_II^T M_II) ⊗ (S_II^T M_II)
The above directly maps from the face to the inner element eigenspace, denoted by the subscript E, via H_EFw, applies the inverse eigenvalues, and maps back with H_FwE. As the utilized matrices are now not diagonal anymore, the application of H_EFw and H_FwE each uses 3n_I^3 multiplications, and the diagonal n_I^3, leading to 7n_I^3 multiplications per face-to-face operator. This factorization lowers the number of multiplications for the condensed part when using Algorithm 4.2 from 216n_I^4 + 108n_I^3 + O(n_I^2) down to 252n_I^3 + O(n_I^2), i.e. to linear complexity. The operators H_FiE and H_EFi required for the evaluation are listed in Table 4.3.
While the new operator formulation achieves linear complexity, the leading factor for the
algorithm remains prohibitively high: The matrix-matrix variant still uses fewer multiplica-
tions until p = 8, and has the benefit of optimized libraries. Further factorization is required
to attain a competitive operator for polynomial degrees utilized in practice.
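The crossover claimed above can be checked with a short count; the following sketch assumes only the leading terms from the text (36n_I^4 for the matrix-matrix variant, 252n_I^3 for the factorized one, with n_I = p − 1):

```python
# Leading-order multiplication counts from the text (lower-order terms dropped):
# matrix-matrix variant: 36 n_I^4, factorized tensor-product variant: 252 n_I^3,
# with n_I = p - 1 inner nodes per direction.
def mm_count(p):
    return 36 * (p - 1) ** 4

def factorized_count(p):
    return 252 * (p - 1) ** 3

# Both counts coincide at p = 8; the factorization uses strictly fewer
# multiplications from p = 9 onwards.
crossover = next(p for p in range(2, 64) if factorized_count(p) < mm_count(p))
print(crossover)  # → 9
```

The counts are equal at p = 8, matching the statement that the matrix-matrix variant uses fewer multiplications until that degree.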
Hitherto, factorization was only performed on single suboperators. The next step consists of factorizing out common factors in the condensed part. The result of applying it to a vector u_F is

v_Fi = Σ_{j∈I} H^Cond_FiFj u_Fj (4.31)

⇒ v_Fi = Σ_{j∈I} H_FiE D_II^{-1} H_EFj u_Fj . (4.32)

As the left part does not depend upon j, a sum factorization yields

v_Fi = H_FiE D_II^{-1} Σ_{j∈I} H_EFj u_Fj . (4.33)
The first term, Σ_{j∈I} H_EFj u_Fj, corresponds to the residual in the inner element eigenspace induced by the boundary nodes and is, hence, named v_E. The application of the diagonal transforms it into the corresponding degrees of freedom in the inner element and is named u_E. The third term maps these back onto the face. As the calculation of u_E is the same for every face, it can be stored and reused, culminating in Algorithm 4.3.

52 4 Fast Static Condensation – Achieving a linear operation count

Algorithm 4.3: Evaluation of the condensed part using a sum factorization of the tensor-product operators with I = {e, w, n, s, t, b}.
1: v_E ← Σ_{j∈I} H_EFj u_Fj
2: u_E ← D_II^{-1} v_E
3: for i ∈ I do
4:     v_Fi ← H_FiE u_E
5: end for

The algorithm
maps from all six faces into the inner element eigenspace, using 3n_I^3 multiplications per face to compute the residual v_E induced by the boundary nodes. Then the diagonal is applied to compute the resulting solution in the inner element, u_E, requiring a further n_I^3 multiplications, and u_E is stored. The impact of the solution in the inner element eigenspace results from mapping back with a further 3n_I^3 multiplications per face. In total, 37n_I^3 multiplications occur and the algorithm, hence, is of lower complexity than the matrix-matrix multiplication variant of Algorithm 4.2 starting from p = n_I + 1 = 3.
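Algorithm 4.3 can be sketched in a few lines of NumPy. The suboperator factors below are random stand-ins for the actual matrices (M_II S_II, S_II^T L_I0, and so on), and d_inv plays the role of D_II^{-1}; each face-to-eigenspace map costs 3n_I^3 multiplications, two tangential matrix products plus one outer product in the normal direction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # n_I: inner nodes per direction

# Hypothetical stand-ins for the suboperator factors of one face pair
# (in the text these would be M_II S_II, S_II^T L_I0, ...; random here):
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
col = rng.standard_normal(n)            # column factor of H_EFj
row = rng.standard_normal(n)            # row factor of H_FiE
d_inv = 1.0 / (1.0 + rng.random((n, n, n)))  # inverse eigenvalues D_II^{-1}

def to_eigenspace(u_face):
    """Apply A (x) B (x) col: face (n x n) -> eigenspace (n x n x n)."""
    return np.multiply.outer(A @ u_face @ B.T, col)

def to_face(u_e):
    """Apply A^T (x) B^T (x) row: eigenspace (n x n x n) -> face (n x n)."""
    return A.T @ (u_e @ row) @ B

# Algorithm 4.3: one accumulation, one diagonal scaling, one map per face.
u_faces = [rng.standard_normal((n, n)) for _ in range(6)]
v_e = sum(to_eigenspace(u) for u in u_faces)   # step 1
u_e = d_inv * v_e                              # step 2
v_faces = [to_face(u_e) for _ in range(6)]     # step 3
```

The factorized maps agree with the dense Kronecker operators they replace, at O(n_I^3) instead of O(n_I^4) cost per application.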
4.3.3 Product-factorization of the operator
The operator evaluation method derived in Subsection 4.3.2 already attains two of the main goals: It possesses a linear operator complexity, i.e. it scales with O(p^3), and requires a linearly scaling amount of memory. However, it only uses fewer multiplications than matrix-matrix based variants starting from p = 3, and cannot easily harness optimized libraries. Thus, it is to be expected that it will only be faster in practice when using very high polynomial degrees. And, as shown in [60], that is the case. Further factorization is required to become competitive for polynomial degrees relevant in practice.
Algorithm 4.3 utilizes each of the operators H_FiE and H_EFi listed in Table 4.3. All of these suboperators share a similar structure: In two directions the matrix S_II^T M_II, or the transpose of it, is applied, while in the direction orthogonal to the face either S_II^T L_Ip or S_II^T L_I0 prolong into the eigenspace, or L_pI S_II or L_0I S_II restrict to the face. For an evaluation of the condensed part, 24n_I^3 multiplications out of 37n_I^3 are spent applying S_II^T M_II or its transpose. To further improve the efficiency of the condensed part, these matrices need to be removed from the operator.
One way to eliminate the extraneous matrix products from the operator is a coordinate transformation. The transformation is chosen such that S_II^T M_II S_II equates to the identity, eliminating these matrices from the tensor-product operators, while not transforming the vertices. To this end, the matrix

    S = ( 1   0     0
          0   S_II  0
          0   0     1 )                            (4.34)
is applied in all three directions to the Helmholtz operator (2.44), leading to the transformed Helmholtz operator H̃:

H̃ := (S^T ⊗ S^T ⊗ S^T) H (S ⊗ S ⊗ S) (4.35)

H̃ = d0 (S^T M S ⊗ S^T M S ⊗ S^T M S) + d1 (S^T M S ⊗ S^T M S ⊗ S^T L S)
  + d2 (S^T M S ⊗ S^T L S ⊗ S^T M S) + d3 (S^T L S ⊗ S^T M S ⊗ S^T M S) . (4.36)

The transformed mass and stiffness matrices are defined as

M̃ := S^T M S (4.37)
L̃ := S^T L S . (4.38)
Using (2.49), they compute to

M̃ = ( M_00   0                  0
      0      S_II^T M_II S_II   0
      0      0                  M_pp )
   = ( M_00   0   0
      0      I   0
      0      0   M_pp )   (4.39)

L̃ = ( L_00          L_0I S_II          L_0p
      S_II^T L_I0   S_II^T L_II S_II   S_II^T L_Ip
      L_p0          L_pI S_II          L_pp )
   = ( L_00          L_0I S_II   L_0p
      S_II^T L_I0   Λ            S_II^T L_Ip
      L_p0          L_pI S_II    L_pp ) . (4.40)
Thus, the inner element mass and stiffness matrices in the new coordinate system are M̃_II = I and L̃_II = Λ. They are diagonal and reduce the face-to-eigenspace operators from Table 4.3 to those in Table 4.4. Where previously every tensor product consisted of one row or column matrix in addition to two dense matrices, now only the row and column matrices remain. Each suboperator now requires n_I^3 instead of 3n_I^3 multiplications, reducing the number of multiplications in the condensed part further, from 37n_I^3 down to 13n_I^3.
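In the transformed system, the only remaining non-identity factor is the row or column coupling the face to the interior, so one application collapses to a single outer product. A small sketch with a hypothetical column factor standing in for the transformed L_I0:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
d1 = 0.5                               # metric coefficient
col = rng.standard_normal(n)           # hypothetical transformed column factor
u_face = rng.standard_normal((n, n))

# H_EFw u_Fw = d1 (I (x) I (x) col) u_Fw: one outer product, n^3 multiplications
v_e = d1 * np.multiply.outer(u_face, col)

# The dense equivalent, built only to illustrate what the shortcut replaces:
dense = d1 * np.kron(np.eye(n * n), col.reshape(n, 1))
```

The identity tangential factors cost nothing, which is exactly the reduction from 3n_I^3 to n_I^3 multiplications per suboperator.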
Table 4.4: Operators from the element faces to the inner element eigenspace and vice versa with the face index i corresponding to the compass notation as shown in Figure 4.2.

i   H_FiE                H_EFi
w   d1 I ⊗ I ⊗ L_0I      d1 I ⊗ I ⊗ L_I0
e   d1 I ⊗ I ⊗ L_pI      d1 I ⊗ I ⊗ L_Ip
s   d2 I ⊗ L_0I ⊗ I      d2 I ⊗ L_I0 ⊗ I
n   d2 I ⊗ L_pI ⊗ I      d2 I ⊗ L_Ip ⊗ I
b   d3 L_0I ⊗ I ⊗ I      d3 L_I0 ⊗ I ⊗ I
t   d3 L_pI ⊗ I ⊗ I      d3 L_Ip ⊗ I ⊗ I

While the operation reduction in the condensed part is significant, the primary part is simplified as well. In the condensed system it includes face-to-face interaction, and the operators are shown in Table 4.1. For these, two types exist. The first maps from one
face onto itself, e.g. for face east

H_FeFe = d0 M_II ⊗ M_II ⊗ M_II + d1 M_II ⊗ M_II ⊗ L_00
       + d2 M_II ⊗ L_II ⊗ M_00 + d3 L_II ⊗ M_II ⊗ M_00 . (4.41)

In the transformed system these read

H_FeFe = d0 I ⊗ I ⊗ I + d1 I ⊗ I ⊗ L_00 + d2 I ⊗ Λ ⊗ M_00 + d3 Λ ⊗ I ⊗ M_00 . (4.42)
While the application of (4.41) requires two tensor products containing the densely populated stiffness matrix and, hence, uses 2n_I^3 + O(n_I^2) multiplications, the application of (4.42) only requires O(n_I^2) multiplications. In the second case, the faces map to opposing faces. These operators are diagonal to begin with and the transformed system retains this, resulting in a complexity of O(n_I^2). Therefore, the transformed system not only reduces the number of multiplications in the condensed part, it also reduces the primary part from 12n_I^3 + O(n_I^2) to O(n_I^2) multiplications. The primary part face-to-face operators are listed in Table 4.5. Lastly, the transformation can be applied on each face and edge after condensing the right-hand side in Algorithm 4.1 and before recomputing the solution. As the transformation is a tensor-product operation on the faces, only O(p^3) operations are added to the complexity of pre- and post-processing, adding a runtime similar to that of an operator evaluation.
Table 4.5: Suboperators of the primary part, face-to-face interaction in the transformed system. Only non-zero entries are listed.

i  j   H_FiFj
w  w   d0 I ⊗ I ⊗ I + d1 I ⊗ I ⊗ L_00 + d2 I ⊗ Λ ⊗ M_00 + d3 Λ ⊗ I ⊗ M_00
e  w   d1 I ⊗ I ⊗ L_p0
e  e   H_FwFw
w  e   H^T_FeFw = H_FeFw
s  s   d0 I ⊗ I ⊗ I + d1 I ⊗ M_00 ⊗ Λ + d2 I ⊗ L_00 ⊗ I + d3 Λ ⊗ M_00 ⊗ I
n  s   d2 I ⊗ L_p0 ⊗ I
n  n   H_FsFs
s  n   H^T_FnFs
b  b   d0 I ⊗ I ⊗ I + d1 M_00 ⊗ I ⊗ Λ + d2 M_00 ⊗ Λ ⊗ I + d3 L_00 ⊗ I ⊗ I
t  b   d3 L_p0 ⊗ I ⊗ I
t  t   H_FbFb
b  t   H^T_FtFb

4.3.4 Extensions to variable diffusivity

In Section 4.3 a new variant of the statically condensed Helmholtz operator was proposed. It is capable of applying the operator with linear complexity. However, this is only the
case for a constant Helmholtz parameter λ, i.e. a constant diffusivity µ. For non-constant diffusivities the Helmholtz equation can be written as

λu − ∇ · (µ∇u) = f , (4.43)

which corresponds, when disregarding the boundary terms, to the weak form

∫_{x∈Ω} (λuv)(x) dx + ∫_{x∈Ω} ((∇v)^T µ∇u)(x) dx = ∫_{x∈Ω} (vf)(x) dx . (4.44)
With the diffusivity being approximated with a polynomial of degree p, the GLL quadrature of degree p is not sufficient to fully evaluate the term, as it only integrates exactly up to polynomial order 2p − 1. Still, in the literature the GLL quadrature of degree p is employed, hence committing a “variational crime”. With it, the element Helmholtz operator of the above is
H_e = d0,e M ⊗ M ⊗ M
    + d1,e (M^{T/2} ⊗ M^{T/2} ⊗ D^T M^{T/2}) diag(µ_e) (M^{1/2} ⊗ M^{1/2} ⊗ M^{1/2} D)
    + d2,e (M^{T/2} ⊗ D^T M^{T/2} ⊗ M^{T/2}) diag(µ_e) (M^{1/2} ⊗ M^{1/2} D ⊗ M^{1/2})
    + d3,e (D^T M^{T/2} ⊗ M^{T/2} ⊗ M^{T/2}) diag(µ_e) (M^{1/2} D ⊗ M^{1/2} ⊗ M^{1/2}) , (4.45)
where diag(µ_e) is the diagonal matrix consisting of the diffusivity vector µ_e. Mind that for µ_e = 1 the diffusion operators yield the expected stiffness matrix, as, due to the symmetric mass matrix,

D^T M^{T/2} M^{1/2} D = D^T M D = L . (4.46)
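The quadrature exactness invoked above can be verified numerically. The sketch below builds the GLL rule on p + 1 nodes from the roots of P_p′ (a standard construction; the weight formula 2/(p(p+1) P_p(x_i)^2) is assumed) and shows that degree 2p − 1 is integrated exactly while degree 2p is not:

```python
import numpy as np
from numpy.polynomial import legendre

def gll(p):
    """Nodes and weights of the Gauss-Lobatto-Legendre rule on p + 1 points."""
    c = np.zeros(p + 1)
    c[p] = 1.0                                     # coefficients of P_p
    inner = legendre.legroots(legendre.legder(c))  # interior nodes: roots of P_p'
    x = np.concatenate(([-1.0], inner, [1.0]))
    w = 2.0 / (p * (p + 1) * legendre.legval(x, c) ** 2)
    return x, w

x, w = gll(4)                 # 5 nodes: exact up to degree 2*4 - 1 = 7
q6 = np.dot(w, x**6)          # degree 6 <= 7: exact, equals 2/7
q8 = np.dot(w, x**8)          # degree 8 > 7: inexact (exact value is 2/9)
```

For p = 4 the rule reproduces ∫x^6 dx = 2/7 to machine precision but misses ∫x^8 dx = 2/9 by roughly one percent, illustrating the variational crime.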
The factorization of the statically condensed Helmholtz operator is not applicable to (4.45), as the diagonal matrix diag(µ_e) disrupts the tensor-product structure of the operator. While the whole operator cannot easily be condensed, a reduced version thereof can. Application of the fast diagonalization requires a tensor-product structure for the operator. This structure can be recovered by approximating the diffusivity with a tensor-product decomposition, i.e.

diag(µ_e) ≈ diag(µ_{3,e}) ⊗ diag(µ_{2,e}) ⊗ diag(µ_{1,e}) . (4.47)

The above only generates a crude approximation, but suffices to capture the main features inside an element. These can be treated implicitly, whereas high-frequency components can be treated explicitly in a time-marching scheme.
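A tensor-product approximation of the form (4.47) can be obtained, for instance, from the leading singular vectors of the three unfoldings of the diffusivity array. The following is only an illustrative sketch with a random smooth field, not the procedure used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
mu = 1.0 + 0.1 * rng.random((n, n, n))   # positive diffusivity samples

# Rank-1 factors from the leading left singular vector of each unfolding:
factors = []
for axis in range(3):
    unfolding = np.moveaxis(mu, axis, 0).reshape(n, -1)
    lead = np.linalg.svd(unfolding, full_matrices=False)[0][:, 0]
    factors.append(np.abs(lead))         # fix the sign for a positive field
mu3, mu2, mu1 = factors

# Scale the rank-1 tensor to fit mu in the least-squares sense:
approx = np.einsum('i,j,k->ijk', mu3, mu2, mu1)
approx *= np.vdot(approx, mu) / np.vdot(approx, approx)
```

For a field dominated by its mean, as above, the rank-1 tensor already captures the bulk of the variation; the remainder would be handled explicitly.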
After introducing the tensor-product structure, fast diagonalization and, in turn, the operator evaluation technique from Subsection 4.3.2 are applicable. The main difference stems from one generalized eigenvalue decomposition being required per direction, such that, e.g. for the first direction,

S^T (D^T M^{T/2} diag(µ_{1,e}) M^{1/2} D) S = Λ (4.48a)
S^T M S = I . (4.48b)
After exchanging the sole transformation matrix with the appropriate transformation matrices per direction, the operators from Table 4.3 can be evaluated at the same cost. This allows applying the condensed part using Algorithm 4.3 with the same number of multiplications. Hence, the above technique allows for static condensation with treatment of varying diffusivities, albeit at a reduced resolution.
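The per-direction decomposition (4.48) is a generalized symmetric eigenvalue problem. It can be computed, for example, via a Cholesky reduction of the mass matrix; the matrices below are random symmetric positive definite stand-ins, not actual spectral-element matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
X = rng.standard_normal((n, n))
K = X @ X.T + n * np.eye(n)        # stand-in for D^T M^(T/2) diag(mu) M^(1/2) D
Y = rng.standard_normal((n, n))
M = Y @ Y.T + n * np.eye(n)        # stand-in for the mass matrix

# Reduce K S = M S Lambda to a standard problem with M = C C^T:
C = np.linalg.cholesky(M)
Ci = np.linalg.inv(C)
lam, Q = np.linalg.eigh(Ci @ K @ Ci.T)
S = Ci.T @ Q                       # then S^T K S = diag(lam), S^T M S = I
```

The resulting S simultaneously diagonalizes the stiffness-like matrix and reduces the mass matrix to the identity, which is precisely the property (4.48) requires.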
4.3.5 Runtime comparison of operators
Several variants for the application of the condensed Helmholtz operator were derived in the previous sections. The first one maps from each face of the element to every other, leading to the 36 face-to-face interactions in Algorithm 4.2. One large matrix multiplication constitutes the most efficient way to implement this algorithm. The variant implementing it is labeled MMC1; it requires 36n_I^4 multiplications during the application and incorporates the primary part for the face-to-face interaction as well.
Table 4.6: Number of multiplications of leading terms of application and precomputation steps for the different variants of the condensed Helmholtz operator.

Variant   Precomputation   Primary part   Condensed part
TPF       O(n_I^3 n_e)     3n_I^4 n_e     –
MMC1      O(n_I^5 n_e)     56n_I^2 n_e    36n_I^4 n_e
MMC2      O(n_I^5 n_e)     12n_I^3 n_e    12n_I^5 n_e
TPC       O(n_I^3 n_e)     12n_I^3 n_e    37n_I^3 n_e
TPT       O(n_I^3 n_e)     68n_I^2 n_e    13n_I^3 n_e
Algorithm 4.3 can be evaluated in multiple fashions. The first one is a tensor product in the condensed system, requiring 37n_I^3 multiplications for the condensed part and 12n_I^3 for the primary part. The associated variant is called TPC. The second variant employs the same algorithm, but with a coordinate transformation that streamlines the number of multiplications to 13n_I^3, and is called TPT. The last variant, MMC2, uses a matrix multiplication to map from the faces of the element into the inner element eigenspace instead of a tensor product. The operator requires 12n_I^5 + O(n_I^3) multiplications, leading to a higher complexity than the matrix-matrix variant MMC1. However, the memory requirement does not scale with the number of elements anymore, enabling it to be used for non-homogeneous grids. The last considered operator is called TPF and implements a tensor-product version of the Helmholtz operator in the full, uncondensed system. It is enhanced with the techniques described in Chapter 3 to constitute an efficient implementation of the full operator and, hence, a measure of whether using static condensation is beneficial runtime-wise.
Table 4.6 summarizes the different operator complexities as well as the required precomputation costs. While the transformed variant exhibits the lowest multiplication count starting from p = 1, the runtimes of the implementations can and will differ in practice. Multiplication counts do not directly transfer to runtime, as, e.g., loading, storing, execution, and cache effects have an influence as well. Hence, tests were conducted to compare the efficiency of the five operator variants.

The operators were implemented using Fortran 2008, with the matrix multiplications being delegated to DGEMM from BLAS. The Intel Fortran compiler v. 2018 compiled the program using the Intel Math Kernel Library (MKL) as BLAS implementation.

On one CPU core of an Intel Xeon E5-2680 v3 running at 2.5 GHz, the operators, as well as the required precomputations, were repeated 100 times and the runtimes, measured via MPI_Wtime, averaged. The polynomial degree was varied between 2 ≤ p ≤ 32 for a constant number of elements n_e = 512 and a Helmholtz parameter λ = π, allowing for insights on the runtime for technically relevant polynomial degrees, as well as the asymptotic behaviour of the operator variants.
Figure 4.3: Runtimes for the different operator variants. Top left: precomputation times for the operator, top right: runtimes of the operators, bottom left: runtime per equivalent degree of freedom, computed as (p + 1)^3 n_e, bottom right: achieved performance in Giga Floating Point Operations per Second (GFLOP/s), computed using the leading terms from Table 4.6.
Figure 4.3 depicts the resulting runtimes and precomputation times for the different operator variants. The precomputation times of the operators scale with the expected order: For MMC2, O(p^5) results, leading to the largest total precomputation time starting from p = 4, and for MMC1 O(p^5) as well, but with a lower coefficient and, hence, a slightly lower precomputation time. The tensor-product variants only require the generalized eigenvalue decomposition, scaling with O(p^3), and setting the elementwise inverse eigenvalues. Thus, they have the lowest precomputation time of the condensed operators, though the precomputation time of TPF is two orders of magnitude lower still, as it does not require the generalized eigenvalue transformation and the calculation of D_II^{-1}. One has to bear in mind that, due to memory restrictions, the precomputation of MMC1 is done for only one element, whereas every other condensed variant computed the matrix D_II^{-1}. For non-homogeneous meshes the precomputation time of MMC1 would increase n_e-fold.
The runtimes of the operators also exhibit the expected order: MMC2 scales with O(p^5), TPF and MMC1 with O(p^4), and the factorized variants with O(p^3). At p = 2, the tensor-product variant TPF has the lowest runtime, with TPT and MMC1 sharing a slightly larger one and TPC being the slowest variant after MMC2. The matrix-matrix variant MMC1 stays faster than TPC until p > 9, and the full variant TPF generates runtimes comparable to TPC until p = 20. Overall, the linearly scaling operator in the condensed system grants a linearly scaling runtime, but the operator is slower than the one in the full system until p = 17. Computing in the transformed system, however, remedies the situation: With the operator being faster than MMC1 starting from p = 2 and a constant factor of three compared to evaluation in the condensed system, the operator is capable of being faster than the optimized TPF for p > 7. The low performance of the new variants stems from two factors. First and foremost, the primary part comprises a multitude of smaller operators, e.g. mapping from vertices to edges. Most of these work on the whole dataset, are memory-bound, and therefore have a low optimization potential. As their number of loads, stores, and operations scales with O(n_I^2) for the transformed case, the operator can attain a better performance. And second, while the condensed part is a monolithic operator, it is memory-bound as well due to loading the eigenvalues of the inner element operator. The combination of these two effects leads to a low performance for low polynomial degrees while limiting the maximum performance for high ones. Nevertheless, the transformed variant outpaces the hand-optimized version for the full system for every relevant polynomial degree but p = 7, by a factor of two for p > 11 and by a factor of six for p = 31, polynomial degrees the latter is optimized for. Compared to the condensed operators, TPT has the largest throughput starting at p = 2, rendering it preferable for all polynomial degrees.
In Figure 4.3, the performance in GFLOP/s is shown as well. For low polynomial degrees, only TPF is capable of extracting a significant amount of the maximum performance. This stems from the many suboperators present in the primary part, which limit the performance. After p = 9, however, the matrix-matrix based variants nearly reach peak performance, as the condensed part dominates their runtime. The loop-based implementations for the condensed and transformed system do not even extract a third of that performance. However, for large polynomial degrees they reach 10 GFLOP/s whenever the operator width is divisible by the SIMD width of 4. The achieved performance corresponds to half of the expected optimum with loop-based tensor-product implementations, as TPF shows. Hence, the current implementations still have optimization potential; however, due to the plethora of small operators, it is not as exploited as with the monolithic tensor-product operators investigated in Chapter 3.
4.4 Efficiency of pCG solvers employing fast static condensation
In the previous sections, linearly scaling operators were derived as a prerequisite for linearly scaling solvers. However, a solver comprises more operations, and good iteration schemes and preconditioners are required for the fast solution of the equation set. This section utilizes the conjugate gradient (CG) method [54, 118] to investigate the impact of the polynomial degree on the condition of the system matrices and, hence, the solution process. As linearly scaling preconditioners are required, these are derived first. Then, solvers are proposed and their efficiency compared.
4.4.1 Element-local preconditioning strategies
In a preconditioned conjugate gradient solver, the preconditioner is called once in every iteration, just as the operator. Hence, it induces an overhead into the solver that cannot be neglected. If the preconditioner were to scale super-linearly with the number of degrees of freedom, for instance with O(p^4 n_e), even the best operator evaluation technique would not fix the super-linear iteration time. With static condensation, this requirement limits the preconditioners to being either diagonal in nature or exhibiting a tensor-product form.
In [21], three cases of preconditioning were investigated for the static condensation method in two dimensions. The first one did not apply preconditioning at all, the second utilized a diagonal preconditioner, and the third utilized the block-wise exact inverse on the faces, edges, and vertices. It will be called block preconditioner in this work. The condition κ of the unpreconditioned case scaled with κ = O(p^2), the diagonally preconditioned version with κ = O(p), and for the block preconditioner κ = O(1) was reported. The same three preconditioners will be utilized here: the identity, i.e. no preconditioning, a diagonal preconditioner, and a block-wise preconditioner. The application of the former two trivially scales with the number of degrees of freedom. However, the calculation of the diagonal preconditioner can be expensive. And the last preconditioner incorporates the exact inverse of the operator from a face onto itself. Hence, these need to be factorized to linear complexity prior to their applicability.
The diagonal preconditioner for the condensed case consists of the inverse main diagonal of the operator. It can be directly computed by restricting the primary and condensed parts to a specific point on a face, an edge, or to a vertex. As the degrees of freedom are present on multiple elements, the contributions from adjoining elements are summed and then inverted. Due to the diagonal mass matrix, the condensed part works exclusively on the faces. Hence, the main diagonal is the same as in the uncondensed case for edges and vertices. For the faces, however, additional terms arise from the condensed part. For instance, for face east to face east, these are
H^Cond_FeFe = H_FeI H_II^{-1} H_IFe , (4.49)

which can be expressed as

H^Cond_FeFe = H_FeE D_II^{-1} H_EFe (4.50)

and expands with Table 4.3 to

H^Cond_FeFe = d1^2 (M_II S_II ⊗ M_II S_II ⊗ L_pI S_II) D_II^{-1} (S_II^T M_II ⊗ S_II^T M_II ⊗ S_II^T L_Ip) . (4.51)

With the definitions

M̄_II := S_II^T M_II (4.52)
L̄_Ip := S_II^T L_Ip , (4.53)

the above simplifies to

H^Cond_FeFe = d1^2 (M̄_II^T ⊗ M̄_II^T ⊗ L̄_Ip^T) D_II^{-1} (M̄_II ⊗ M̄_II ⊗ L̄_Ip) (4.54)

⇒ H^Cond_FeFe = d1^2 (M̄_II ⊗ M̄_II)^T (I ⊗ I ⊗ L̄_Ip^T) D_II^{-1} (I ⊗ I ⊗ L̄_Ip) (M̄_II ⊗ M̄_II) . (4.55)
The outer two tensor products map the face onto itself and do not interfere with the computation of the interior operator. The middle three terms are a restriction of the diagonal matrix D_II^{-1} to two dimensions. As it is constant in an element, it needs to be computed once per face, not once per point, leading to O(p^3) multiplications and, hence, a linearly scaling preconditioner initialization. The diagonal for the other faces can be derived in the same fashion, leading to a preconditioner that can be computed in linear runtime and evaluated in O(n_I^2 n_e) operations. The inverse of the operator itself can be computed similarly, only requiring the further computation of M̄^{-1}. Moreover, in the transformed system M̄ = I holds, leading to a diagonal preconditioner.
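That the diagonal of a tensor-product operator can be assembled from the factor diagonals alone, without ever forming the full matrix, is a one-line identity and the reason the preconditioner setup stays linear. A quick check with random factors:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

# diag(A (x) B (x) C) = diag(A) (x) diag(B) (x) diag(C):
d_full = np.diag(np.kron(np.kron(A, B), C))                    # needs the n^3 x n^3 matrix
d_fact = np.kron(np.kron(np.diag(A), np.diag(B)), np.diag(C))  # needs only three n-vectors
```

Summing such factorized diagonals over the tensor-product terms of the operator yields the main diagonal at linear cost.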
4.4.2 Considered solvers and test conditions
Five solvers are considered for testing. The block-preconditioned solver in the full system, bfCG, from Section 2.4 serves as baseline and includes the faster operators from Chapter 3. The second one, cCG, solves in the condensed system, with dcCG adding diagonal and bcCG block preconditioning. Lastly, dtCG applies diagonal preconditioning in the transformed system, lowering the amount of work for the operator as well as for the preconditioner.
For these five solvers, the tests from Section 2.4 are repeated, with the domain being discretized with n_e = 8 × 8 × 8 elements of polynomial degrees p ∈ {2, …, 32}. From the utilized number of iterations n_10, the minimum possible condition number is calculated via

κ ≥ κ* = ( 2n_10 / ln(2/ε) )^2 , (4.56)

where ε is the achieved residual reduction, approximated by 10^{-10}. To attain reproducible runtime results, the solvers are called 11 times, with the runtimes of the last 10 solver calls being averaged. This precludes measurement of instantiation effects, e.g. library loading, that would not be present in a real-world simulation.
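Equation (4.56) follows from the standard CG error bound; a minimal helper evaluating it (reading (4.56) as κ* = (2n_10 / ln(2/ε))^2) could look as follows:

```python
import math

def kappa_star(n_iter, eps=1e-10):
    """Lower bound on the condition number implied by the CG iteration count."""
    return (2.0 * n_iter / math.log(2.0 / eps)) ** 2

# e.g. the bound implied by 100 iterations for a residual reduction of 1e-10:
k = kappa_star(100)
```

More iterations for the same residual reduction imply a larger lower bound on the condition number, which is how the iteration counts below are converted to κ*.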
As done for the operators, the tests were conducted on one node of the HPC system Taurus at ZIH Dresden. It consisted of two sockets, each containing an Intel Xeon E5-2680 v3 with twelve cores running at 2.5 GHz. Only one of the cores was utilized during the tests, leading to the algorithms, not the parallelization efficiency, being measured. The solvers were implemented in Fortran 2008, compiled with the Intel Fortran compiler, and MPI_Wtime was used for time measurements.
4.4.3 Solver runtimes for homogeneous grids
Figure 4.4 depicts the number of iterations and the upper limit of the condition numbers computed thereof. The solvers can be classified into three distinct categories: The first one consists of the uncondensed block-preconditioned solver bfCG. It exhibits the highest iteration count, with a slope slightly lower than 1. The unpreconditioned and diagonally preconditioned condensed solvers are the second category, generating less than half the number of iterations at p = 32. Both start at the same number, with a lower slope for the diagonal preconditioner, albeit not significantly, as even at the highest polynomial degree the iteration count is only reduced by one fifth. The third category consists of the block-preconditioned condensed solver bcCG and the diagonally preconditioned solver in the transformed system dtCG. They use significantly fewer iterations than dcCG, less than half at p = 32, and have virtually the same number of iterations. Moreover, differing from the other solvers, they do not have a constant slope after p = 8. Rather, the slope slightly decreases when increasing the polynomial degree.
The number of iterations directly translates into the condition number, also shown in Figure 4.4. With the condition number, the categories become more distinct: The block-preconditioned uncondensed system exhibits a slope of O(p^{7/4}) and the highest condition number. The unpreconditioned condensed system performs, as does the diagonally preconditioned one, with slopes between p^{3/2} and p after p = 8. Lastly, the block-preconditioned version does not have a constant gradient, but a decreasing one matching
the poly-logarithmic bound [61]. While the resulting condition numbers do not match those reported in [21], one has to keep in mind that condition numbers change when changing the number of dimensions, e.g. [118].

Figure 4.4: Number of iterations, approximated condition number κ*, and runtimes for the five solvers when varying the polynomial degree p at a constant number of elements n_e = 8 × 8 × 8. Top left: number of iterations, top right: resulting condition number, bottom left: runtimes, bottom right: runtimes per degree of freedom (DOF).
The number of iterations does not directly translate to the runtime. For the uncondensed solver, the number of iterations scales linearly with the polynomial degree and the operator complexity with O(p^4), leading to the runtime scaling with O(p^5). Different from the full system, the condensed and transformed solvers seem to possess a linear runtime, i.e. O(p^3), with the unpreconditioned condensed version having the highest, followed by the diagonally preconditioned version, and then the block-preconditioned one. However, their runtimes do not differ substantially. Using the block preconditioning only gains a factor of 1.5 in the
Table 4.7: Approximated runtime per iteration and runtime per degree of freedom (DOF) of the condensed solvers for certain polynomial degrees p.

      Runtime per iteration and DOF [ns]      Runtime per DOF [µs]
p     cCG     dcCG    bcCG    dtCG            cCG    dcCG   bcCG   dtCG
5     32.60   33.27   41.44   21.04           3.49   3.43   3.11   1.58
9     18.58   19.13   23.75   11.70           2.77   2.58   2.16   1.06
13    17.45   18.00   22.73   10.72           3.21   2.88   2.27   1.09
17    16.10   16.43   20.79   10.48           3.44   3.04   2.27   1.15
21    14.94   15.21   19.01    9.27           3.66   3.16   2.17   1.08
25    14.35   14.62   18.62    8.69           3.89   3.36   2.23   1.05
29    13.59   14.03   18.47    7.86           3.98   3.40   2.29   0.99
runtime compared to diagonal preconditioning. This is due to the high preconditioning cost compared to the operator evaluation: the preconditioner takes nearly as many operations as the operator itself. Even a factor of two in the number of iterations does not offset this runtime disadvantage. The transformed solver, on the other hand, is capable of employing a diagonal preconditioner in addition to a faster operator, resulting in a factor of three compared to the runtime of the block-preconditioned version.
The runtime per degree of freedom stagnates for the last three solvers in the range of 1 µs to 5 µs. The diagonally preconditioned and unpreconditioned versions exhibit a minute increase from p = 8 to p = 32, and for the block-preconditioned and transformed versions none is present. Slight kinks occur when p + 1 is a multiple of four, i.e. when n_I is a multiple of four, matching the vector instruction width of the architecture [48]. The stagnation of the runtime can be explained by the increase in the number of iterations being small, and CG-type solvers incorporating a large number of array instructions. As the data set scales with O(p^2) for the static condensation method, their relative runtime decreases, offsetting the increase in iterations. Furthermore, the runtime of the operator does not directly scale with p^3 but exhibits a slightly lower slope, as shown in Subsection 4.3.5.
Table 4.7 lists the runtimes per iteration in combination with the runtimes per degree of freedom for the condensed solvers. As expected, the unpreconditioned solver has the lowest runtime per iteration of the three condensed system solvers. The addition of a diagonal preconditioner increases the runtime per iteration by approximately 2 %, making the increased effort worthwhile: The number of iterations decreases significantly and, as a result, the runtime by a quarter. The block preconditioner takes up to a third of the runtime of the solver. For it, the large increase in runtime per iteration counters the decrease in iterations, diminishing the potential runtime savings to only a third. Due to the cheaper operator, the transformed solver possesses the lowest runtime per iteration and, as it shares the number of iterations with the block-preconditioned version, also the lowest runtime. One iteration of dtCG only costs
bfCG cCG dcCG bcCG dtCG
23 43 83 163
Number of elements ne
101
102
103
Number
ofiterationsn10
3
1
1
3
23 43 83 163
Number of elements ne
10−2
10−1
100
101
102
103
Runtime[s]
3
4
4
3
Figure 4.5: Number of iterations and corresponding runtimes for the five solvers when varyingthe number of elements ne at a constant polynomial degree p = 16. Top left: numberof iterations, top right: resulting condition number, bottom left: runtimes, bottomright: runtimes per degree of freedom (DOF).
a third of one iteration for bcCG, rendering it the fastest solvers. While these savings seem
small compared to the savings from the operator, the remainder of the solver has to be taken
into account: Solvers based on the conjugate gradient method heavily depend upon array
operations and scalar products. With ever more efficient operators, these occupy a significant
amount of the runtime. For instance, for the transformed system solver, the largest portion
of the runtime is spent in just these operations, not the operator itself. This constitutes a
hard limit on the runtime per iteration and limits the potential for further factorizations.
Lastly, when comparing to the data from Table 3.1, solving with dtCG at p = 29 requires a
third of the runtime per degree of freedom and iteration compared to solving with bfCG at p = 7,
where bfCG was most efficient. Therefore, the main goal was accomplished: The performance
of the full system was extended to high polynomial orders. Moreover, the current solvers are
a factor of three faster per iteration and degree of freedom.
To investigate the robustness of the solvers against the number of elements, the tests were
repeated using a constant polynomial degree p = 16 and increasing the number of elements
from n_e = 2^3 to n_e = 16^3. Figure 4.5 depicts the required number of iterations and the
resulting runtimes per degree of freedom. In three dimensions, the runtime of CG-based
finite element solvers without global coupling scales with n_e^{4/3} [118]. Here, however, the
number of iterations exhibits a slightly lower slope than 1/3. The effect probably stems from
not computing in the asymptotic regime. The main conclusion, however, is that the solvers
are not robust against increases in the number of elements. This is to be expected, as global
preconditioning is required to achieve this feat, e.g. with low-order finite elements [90, 50]
or even multigrid [130].
Figure 4.6: Cut through the x3 = 0 plane for the three meshes with constant expansion factor α. Left: α = 1, middle: α = 1.5, right: α = 2.
4.4.4 Solver runtimes for inhomogeneous grids
In real-life simulations, homogeneous grids are seldom applicable. To capture all relevant
features of a flow, the grid needs to adapt to the solution, the most common case being
refinement near the wall.
To investigate the behavior of the solvers for inhomogeneous meshes, the test case from Section 2.4 was adapted. Instead of a homogeneous grid, grids generated using a constant
expansion factor α are employed. Three cases are considered: α = 1, α = 1.5, and α = 2,
resulting in the meshes depicted in Figure 4.6. While an expansion factor of α = 1 yields a
homogeneous grid, α = 1.5 stretches every cell in the grid and results in a maximum aspect
ratio of AR_max = 17. Applying α = 2 further magnifies this effect and leads to AR_max = 128.
The grids are populated by a large variety of elements, with shapes ranging from cubes
through pancakes to needles. With non-matrix-free solution techniques, these meshes are not
treatable at high polynomial degrees due to stifling memory requirements. The solvers
presented in this section, however, are matrix-free, leading to comparably small memory
requirements and allowing computations with high polynomial degrees such as 32.
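The quoted aspect ratios follow directly from the geometric stretching: with element widths growing by a constant factor α, the widest and narrowest elements differ by α raised to the number of elements per direction minus one. A minimal sketch, assuming eight elements per direction (a hypothetical value, chosen only because it reproduces the quoted numbers):

```python
def max_aspect_ratio(alpha: float, n_elem: int = 8) -> float:
    """Maximum aspect ratio of a 1D geometric grid: widths w_k = w_0 * alpha**k,
    so the widest and narrowest elements differ by alpha**(n_elem - 1)."""
    return alpha ** (n_elem - 1)

# alpha = 1.5 gives ARmax of roughly 17, alpha = 2 gives ARmax = 128
```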
Figure 4.7 compares the runtimes and number of iterations for the three grids. Independent
of the expansion factor, the runtime of bfCG scales with O(p^5), and that of the condensed
solvers with O(p^4). However, the number of iterations increases with the expansion factor. When
raising it from α = 1 to α = 1.5, the number of iterations of cCG increases by a factor of
four, whereas the preconditioned solvers only incur an increase of one quarter. The situation
gets more pronounced when increasing the maximum aspect ratio to 128 at α = 2. There,
using cCG is not feasible anymore: it already requires 650 iterations at p = 2, and 6000
are needed at p = 30. With diagonal preconditioning, these numbers lower to 85 and 360,
respectively, an overall increase of only 50 %, and the block-preconditioned variants exhibit
similarly high robustness. While not attaining the full robustness permitted by iterative
substructuring methods [107], the solvers are near impervious to the aspect ratio of the
mesh.
Figure 4.7: Number of iterations and runtime per degree of freedom for the five solvers when using stretched meshes with constant expansion factor α. Top: α = 1, middle: α = 1.5, bottom: α = 2.
4.5 Summary
This chapter investigated the static condensation method as a way to lower both the operation
count and the memory bandwidth requirements of elliptic solvers for the spectral-element method.
To lower the operation count, a tensor-product formulation of the static condensed Helm-
holtz operator was derived and factorized to linear complexity, with a further coordinate
transformation streamlining the multiplication count. The resulting operators are not only
matrix-free, circumventing the growing memory bandwidth gap, but their runtime
also scales linearly with the number of degrees of freedom in the grid. The new evaluation
technique outpaces variants employing highly optimized libraries for matrix-matrix multipli-
cations as well as the optimized tensor-product variants for the full system from Chapter 3
by a factor of 20 and 5, respectively. Moreover, the linear scaling unlocks computations
at high polynomial degrees which were previously infeasible due to the staggering operator
costs for linear solvers, removing the barrier between spectral and spectral-element methods.
After comparing the efficiency of the operators, solvers based on the two fastest evalua-
tion techniques were investigated, with the solvers from Chapter 3 serving as baseline. As
in Chapter 3, block-Jacobi type preconditioners significantly lower the iteration count com-
pared to pure diagonal preconditioning. Moreover, they lead to an astounding robustness
with the condition number being nearly independent of the aspect ratio of the elements. For
instance, raising the maximum aspect ratio from AR_max = 1 to AR_max = 128 only leads to
an increase of 50 % in the iteration count. While the block-Jacobi preconditioning is as
expensive as the condensed Helmholtz operator itself, a coordinate transformation stream-
lines both and results in a diagonalization of the preconditioner. The combination of linear
scaling operators and low increase in the condition number leads to a linearly scaling runtime
for the solver with respect to the number of degrees of freedom. This allows the solvers
presented in Chapter 3 to be outperformed by a factor of 50 at p = 32, with a runtime per degree of
freedom just over 1 µs. Lastly, the effect is not only due to a decreased number of iterations:
when comparing a condensed solver at p = 29 with the full system at p = 7, the runtime of one
iteration is a third of that of the full case. This extends the performance of the high-order
methods towards very high polynomial degrees and, therefore, convergence rates.
While the resulting solvers scale close to optimally with respect to the polynomial degree, as
investigated in [61], the performance degrades with the number of elements. Global informa-
tion exchange is required to gain robustness with respect to the number of elements, which
the block-Jacobi preconditioners considered here do not provide. This will be addressed in
the next chapter.
Chapter 5
Scaling to the stars – Linearly scaling
spectral-element multigrid
5.1 Introduction
In the previous chapter, a linearly scaling operator for the Helmholtz equation was derived.
It allows for linearly scaling iteration times and in conjunction with a pCG method attains
solution times per unknown near 1 µs. But while the solver incurs only minute increases in
the number of iterations while raising the polynomial degree, it is not robust with regard
to the number of elements. Long-range coupling is required to attain a constant number of
iterations and to allow the solvers to stay competitive.
For low-order schemes, h-multigrid has been established as an efficient solution technique,
increasing and decreasing the element width h between the levels [13, 12, 47]. With high-
order methods, p-multigrid allows to lower and raise the polynomial degree instead of the
element width h. Both kinds of multigrid require a smoother to eliminate high-frequency
components which cannot be represented on the coarser grids, and overlapping Schwarz
smoothers have proven to be a very effective choice [88, 52, 122], lowering the iteration count
below 10.
With an overlapping Schwarz method, the domain is decomposed into small overlapping
subdomains, typically blocks of multiple elements. On these, the inverse operator is applied
and the results combined into a new solution. However, while the approach generates
exceptional convergence rates, application of the smoother is expensive. In the full equation
system, the inverse is often facilitated using fast diagonalization [88]. For the condensed
case, in contrast, no matrix-free inverse is known, and using a precomputed inverse leads to a
linearly scaling method in two dimensions, but not in three [52, 51]. In the full as well as in the
condensed case, the operator scales super-linearly when increasing the polynomial degree. The increasing
smoother costs inherent to current Schwarz methods limit the efficiency of high-order
methods at high polynomial orders and therefore convergence rates. Again, linearization of
the operation count would allow for significant performance gains.
The goal of this chapter lies in attaining a linearly scaling multigrid solver, combining the
convergence properties from [123, 51] with the operator derived in the last chapter. To-
wards this end, Section 5.2 recapitulates the main ideas of Schwarz methods, extracts
the main kernel and then factorizes it to linear complexity. Then, Section 5.3 discusses
multigrid solvers founded on the static condensation operators proposed in Chapter 4 and
the Schwarz smoother. Lastly, the runtime tests conducted in Section 5.4 prove that the
convergence properties from [123, 51] are retained while the runtime scales linearly with the
number of degrees of freedom.
The results presented in this chapter are available in [63] and are inspired by the work
presented in [52, 51].
5.2 Linearly scaling additive Schwarz methods
5.2.1 Additive Schwarz methods
Overlapping Schwarz decompositions are a standard solution technique in continuous as
well as discontinuous Galerkin spectral-element methods [31, 88, 27, 117]. Instead of
solving the whole equation system, a Schwarz method determines a correction of the current
approximation by combining local results obtained from overlapping subdomains. Repeating
the process leads to convergence.
If the current approximation u is not the exact solution, it leaves a residual r:

r = F − H u .    (5.1)
The goal lies in lowering the residual below a certain tolerance. Towards this end,
corrections ∆u to u are sought with H∆u ≈ r. Gaining the exact solution in one step
requires H∆u = r and, hence, a full solve, which is global and expensive. On small subdomains,
however, solution is relatively cheap. Such local corrections ∆u_i are computed on multiple
overlapping subdomains Ω_i and afterwards combined into a global solution.
To attain the operator on subdomain Ω_i, the Helmholtz operator H is restricted to the
subdomain using the Boolean restriction R_i:

H_i = R_i H R_i^T    (5.2)
Figure 5.1: Block utilized for the star smoother. Left: subdomain consisting of 2^3 elements, middle: full system including Dirichlet nodes, right: collocation nodes corresponding to the star smoother setup.
and a solution to the residual computed:

H_i ∆u_i = R_i r .    (5.3)

These corrections are local to the subdomain and need to be combined into a global correction.
With an additive Schwarz method, all corrections are simultaneously computed,
weighted, and then added, resulting in a global correction ∆u satisfying

∆u = Σ_i R_i^T W_i H_i^{-1} R_i r ,  with  ∆u_i = H_i^{-1} R_i r .    (5.4)
In the above, W_i is the weight matrix on the respective subdomain. Multiple options exist
for the weighting: Traditional additive Schwarz methods employ the identity matrix as
weight, and a relaxation factor is required to ensure convergence [37]. For spectral-element
methods, using the inverse multiplicity instead, i.e. weighting each grid point with the inverse
of the number of subdomains it occurs in, removes this restriction [88], and further refinement with
distance-based weight functions yields convergence rates where only two or three iterations
are required with multigrid [123].
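The update (5.4) can be illustrated with a small self-contained sketch: a one-dimensional Laplace problem, two overlapping subdomains, local direct solves, and inverse-multiplicity weighting. All sizes and index ranges are illustrative choices, not values from the thesis.

```python
import numpy as np

n = 31
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian, Dirichlet BCs
subs = [np.arange(0, 20), np.arange(11, 31)]           # two overlapping subdomains

mult = np.zeros(n)                                     # multiplicity of each node
for s in subs:
    mult[s] += 1
W = 1.0 / mult                                         # inverse-multiplicity weights

def schwarz_correction(r):
    """Additive Schwarz correction, cf. (5.4): sum of weighted local solves."""
    du = np.zeros(n)
    for s in subs:
        du[s] += W[s] * np.linalg.solve(A[np.ix_(s, s)], r[s])
    return du

# Repeated application as a fixpoint (Richardson) iteration drives the residual down.
b = np.ones(n)
u = np.zeros(n)
for _ in range(400):
    u += schwarz_correction(b - A @ u)
```

Without a coarse grid the iteration count grows with the number of subdomains, which is exactly the robustness problem multigrid addresses later in this chapter.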
For Schwarz methods, the choice of subdomain dictates both the time to solution on the
subdomain and the convergence rate of the whole algorithm. While smaller subdomains
are preferable for the former, larger subdomains and, correspondingly, large overlaps are of
paramount importance for the latter [123]. Typically, element-centered subdomains are utilized,
which overlap into every neighboring element. However, with static condensation
the number of remaining degrees of freedom is large. In [52, 51] a vertex-based Schwarz
smoother was constructed with a 2^d element block as subdomain in R^d. Using static
condensation on this block leaves only the three planes interconnecting the elements remaining.
Compared to an element-centered subdomain with half an element overlap, fewer degrees of
freedom are present, rendering the method preferable; hence, it is used here. As it
resembles a star in the two-dimensional case, it is called star smoother.
In Figure 5.1 the 2^3 element block is depicted. With a residual-based formulation, the Dirichlet
problem is homogeneous, allowing the boundary nodes to be eschewed, thus resulting in n_S =
2p − 1 points per dimension. For the condensed system only the three planes connecting
the elements remain, each with n_S^2 degrees of freedom. The resulting star operator,
H_i, can be expressed with tensor-products when using the technique from Chapter 4, leading
to an operation count scaling with O(n_S^3). However, the operator consists of a primary
and a condensed part and, furthermore, is assembled over multiple elements. This intricate
operator structure renders solution hard, and a matrix-free, linearly scaling inverse is yet
unknown. While explicit matrix inversion scales linearly with the number of unknowns
in two dimensions [52], the three-dimensional version scales with O(n_S^4) [51]. This section
will develop a linearly scaling inverse for the three-dimensional case.
5.2.2 Embedding the condensed system into the full system
Execution of the Schwarz method in the condensed system requires solution on subdo-
mains Ωi as key component. However, even with tensor-product operators, finding a matrix-
free inverse has so far eluded the community, leaving only matrix inversion on the table.
The application scales with O(n_S^4) = O(p^4) and is, by definition, not matrix-free. This not
only increases the operation count super-linearly when going to high polynomial degrees, but
also results in overwhelming memory requirements. While the inverse for one star at p = 16
requires 63 MB, over 1 GB is utilized at p = 32, rendering the method infeasible for
inhomogeneous meshes. Both drawbacks, runtime and memory requirement, constitute a hard
limit on the polynomial degree used for the method and need to be circumvented. This
section lifts both restrictions by first deriving a matrix-free inverse and then linearizing the
operation count. The linear operation count directly extends to a linearly scaling smoother
which, in turn, allows for a linearly scaling multigrid cycle. To this end, the condensed
system is embedded into the full equation system. Then, a solution technique from the full
case is exploited to attain a matrix-free inverse, which is afterwards factorized.
To investigate embedding the star subdomains into their respective full 2 × 2 × 2 systems,
considering one block suffices, allowing the subscript i to be dropped in favor of readability.
The 2^3 element block is reordered into degrees of freedom remaining on the star,
u_S, and element-interior ones, u_I, leading to an operator structure similar to the condensed
case (4.6):

( H_SS  H_SI ) ( u_S )   ( F_S )
( H_IS  H_II ) ( u_I ) = ( F_I )    (5.5)

⇒  H̄ u_S = F_S − H_SI H_II^{-1} F_I ,  with  H̄ = H_SS − H_SI H_II^{-1} H_IS .    (5.6)
For the condensed system to be embedded in the full system, the right-hand side of the
condensed system needs to generate the same solution when used in the full system. Here,
the modified right-hand side

F̃ = ( F̃_S , F̃_I )^T = ( F_S − H_SI H_II^{-1} F_I , 0 )^T    (5.7)

is considered. It leads to a solution ũ with

( H_SS  H_SI ) ( ũ_S )   ( F̃_S )
( H_IS  H_II ) ( ũ_I ) = (  0  )    (5.8)

⇒  ( H_SS − H_SI H_II^{-1} H_IS ) ũ_S = F̃_S = F_S − H_SI H_II^{-1} F_I .    (5.9)
Due to H being positive definite, the system possesses a unique solution and ũ_S = u_S. However,
F and F̃ differ and, hence, the overall solution differs as well, generating a difference
in the interior points, i.e. ũ_I ≠ u_I. As the solution on the interior is not required for
the Schwarz method in the condensed system, the inverse full operator applied to F̃ generates the
desired solution. This allows solution methods from the full system to be carried over to the
condensed system.
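The identity ũ_S = u_S from (5.7)–(5.9) is easy to check numerically. The following sketch uses a random symmetric positive definite matrix partitioned into star and interior blocks; the block sizes are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nI = 5, 7                        # star and interior block sizes (illustrative)
n = nS + nI
B = rng.standard_normal((n, n))
H = B @ B.T + n * np.eye(n)          # symmetric positive definite full operator
HSS, HSI = H[:nS, :nS], H[:nS, nS:]
HIS, HII = H[nS:, :nS], H[nS:, nS:]
F = rng.standard_normal(n)
FS, FI = F[:nS], F[nS:]

# condensed system (5.6)
FS_mod = FS - HSI @ np.linalg.solve(HII, FI)
Hbar = HSS - HSI @ np.linalg.solve(HII, HIS)
uS = np.linalg.solve(Hbar, FS_mod)

# full solve with the modified right-hand side (5.7): interior part set to zero
u_tilde = np.linalg.solve(H, np.concatenate([FS_mod, np.zeros(nI)]))
# the star components coincide, while the interior components generally differ
```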
5.2.3 Tailoring fast diagonalization for static condensation
To solve the condensed system, a method for solving in the full system can be used. Hence,
investigating the operator on the full star block is required. The collocation nodes on the
full star Ωi exhibit a tensor-product structure, allowing for the operator Hi to be written in
a similar fashion as (2.44):
H_i = d_0 M_i ⊗ M_i ⊗ M_i + d_1 M_i ⊗ M_i ⊗ L_i + d_2 M_i ⊗ L_i ⊗ M_i + d_3 L_i ⊗ M_i ⊗ M_i .    (5.10)
Here, the matrices Mi and Li are the one-dimensional mass and stiffness matrices restricted
to the full star Ωi, which correspond to the assembly of the respective one-dimensional
matrices from the elements, as shown in Figure 5.1. In practice these will differ per direction
to allow for varying element widths inside one block. However, to improve readability, the
same stiffness and mass matrices are utilized in all three dimensions.
As M_i and L_i are symmetric positive definite, the fast diagonalization technique from
Subsection 2.3.4 is applicable. The inverse can be expressed with tensor products

H_i^{-1} = ( S_i ⊗ S_i ⊗ S_i ) D_i^{-1} ( S_i^T ⊗ S_i^T ⊗ S_i^T ) ,    (5.11)
where S_i is a non-orthogonal transformation matrix and D_i is the diagonal matrix
comprising the eigenvalues of the block operator. The tensor-product application of the above
requires 12 n_S^4 operations, which still scales super-linearly with the number of degrees of
freedom when increasing p. However, using the reduced right-hand side from (5.7) is sufficient
to attain the solution on the condensed star. The application of (5.11) consists of three steps:
mapping into the three-dimensional eigenspace, applying the inverse eigenvalues, and
mapping back. The right-hand side F̃_i is zero in element-interior regions, allowing factorization
of the operator, and, furthermore, the values in the interior are not sought, further increasing
the potential. The forward operation, computing F_E = ( S_i^T ⊗ S_i^T ⊗ S_i^T ) F̃_i, now only works
on three planes rather than the whole three-dimensional data. Hence, only two-dimensional
tensor products remain when expanding to the star eigenspace last, which require O(n_S^3)
operations. This leads to the forward operation scaling linearly. The application of the inverse
eigenvalues consists of one multiplication with a diagonal matrix and scales linearly as well.
Lastly, mapping back is only required for the faces of the star, not for interior degrees. When
not computing these interior degrees, the mapping from block eigenspace to the faces is the
transpose of the forward operation and uses O(n_S^3) operations as well. The combination of
all three operations yields an inverse that can be applied with linear complexity.
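A dense sketch of the fast diagonalization (5.11) for the separable operator (5.10): the generalized eigenproblem L s = λ M s is normalized so that S^T M S = I, after which the operator becomes diagonal in the tensor-product eigenbasis. Matrix sizes and coefficients below are illustrative, and a production version would replace the einsum calls with blocked matrix-matrix products and exploit the zero interior of the right-hand side.

```python
import numpy as np

def gen_eigh(L, M):
    """Generalized eigenproblem L s = lam M s with S^T M S = I (Cholesky route)."""
    R = np.linalg.cholesky(M)                      # M = R R^T
    Rinv = np.linalg.inv(R)
    lam, Q = np.linalg.eigh(Rinv @ L @ Rinv.T)
    return lam, Rinv.T @ Q

def fast_diag_solve(M, L, d, F):
    """Solve (d0 MxMxM + d1 MxMxL + d2 MxLxM + d3 LxMxM) u = F
    (x denoting the Kronecker product) via fast diagonalization;
    F has shape (n, n, n)."""
    lam, S = gen_eigh(L, M)
    one = np.ones(M.shape[0])
    # eigenvalues of the operator in the tensor-product eigenbasis
    D = (d[0] * np.einsum('i,j,k->ijk', one, one, one)
         + d[1] * np.einsum('i,j,k->ijk', one, one, lam)
         + d[2] * np.einsum('i,j,k->ijk', one, lam, one)
         + d[3] * np.einsum('i,j,k->ijk', lam, one, one))
    FE = np.einsum('ia,jb,kc,ijk->abc', S, S, S, F)   # (S^T x S^T x S^T) F
    return np.einsum('ia,jb,kc,abc->ijk', S, S, S, FE / D)
```

Applying the three transposed transformations, the diagonal scaling, and the three forward transformations reproduces the structure of (5.11); the factorized star version additionally exploits that the right-hand side vanishes in the element interiors.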
The algorithm applying the linearly scaling inverse is summarized in Algorithm 5.1, where for
clarity of presentation the star index i is dropped. The first step consists of extracting the
right-hand side for the subdomains and storing it on the three faces, which are perpendicular to
the x1, x2, and x3 directions and named F_1, F_2, and F_3, respectively. To retain the tensor-product
structure, the three faces are stored separately as matrices of extent n_S × n_S, with an
index range of I = −p . . . p, where index 0 corresponds to the 1D index of the face in the
full system. However, this leads to the edges and the center vertex being stored multiple
times. Hence, as a first action, the inverse multiplicity M^{-1} is applied to account for these
multiply stored values. Then, the two-dimensional transformation matrix S^T ⊗ S^T is used
on the three faces. Using permutations of I ⊗ I ⊗ S_{I0}^T, the results are mapped into the star
eigenspace, where the inverse eigenvalues are applied. Lastly, the reverse order of operations
maps back from the eigenspace to the star faces. In Algorithm 5.1, 18 one-dimensional
matrix products are utilized: 9 to map into the eigenspace, 9 to map back from it, plus one
further multiplication in the eigenspace. In total, 37 n_S^3 operations are utilized.
The additive Schwarz-type smoother resulting from using the above algorithm is shown
in Algorithm 5.2. First, the data is gathered for every subdomain. On these subdomains, the inverse is
applied via Algorithm 5.1, leading to local corrections ∆u_i, which are weighted and prolonged
to the global domain.

Algorithm 5.1: Inverse of the star operator using the right-hand side on the three faces of the star, F_1, F_2, and F_3. For clarity of presentation, the star index i is dropped.

    F_F ← M^{-1} F_F                        ▷ account for multiply occurring points
    F_E ← ( S^T ⊗ S^T ⊗ S_{I0}^T ) F_{F1}   ▷ contribution from face perpendicular to x1
        + ( S^T ⊗ S_{I0}^T ⊗ S^T ) F_{F2}   ▷ contribution from face perpendicular to x2
        + ( S_{I0}^T ⊗ S^T ⊗ S^T ) F_{F3}   ▷ contribution from face perpendicular to x3
    u_E ← D^{-1} F_E                        ▷ application of the inverse in the eigenspace
    u_{F1} ← ( S ⊗ S ⊗ S_{0I} ) u_E         ▷ map solution from eigenspace to x1 face
    u_{F2} ← ( S ⊗ S_{0I} ⊗ S ) u_E         ▷ map solution from eigenspace to x2 face
    u_{F3} ← ( S_{0I} ⊗ S ⊗ S ) u_E         ▷ map solution from eigenspace to x3 face

Algorithm 5.2: Schwarz-type smoother using star subdomains corresponding to the n_V vertices.

    function Smoother(r)
        for i = 1, n_V do
            F_i ← M_i^{-1} R_i r            ▷ extraction of data
            ∆u_i ← H_i^{-1} F_i             ▷ inverse on stars
        end for
        ∆u ← Σ_{i=1}^{n_V} R_i^T W_i ∆u_i   ▷ contributions from 8 vertices per element
        return ∆u
    end function

The weight matrix W_i is inferred by restricting the tensor-product
of one-dimensional diagonal weight matrices W = W_{1D} ⊗ W_{1D} ⊗ W_{1D} to the condensed
system. As in [123], diagonal matrices populated by smooth polynomials of degree p_W
are used. These are constructed to be one on the vertex of the star and zero on all other
vertices. For polynomial degrees larger than one, higher derivatives are set to zero at the
vertices, smoothing the transition, as shown in Figure 5.2. In preliminary studies, p_W = 7
produced the best results; hence, this value is utilized here as well.
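The construction can be mimicked with the classical degree-7 smoothstep polynomial, which is one candidate for p_W = 7; whether this is the exact polynomial of [123] is an assumption of this sketch. Its first three derivatives vanish at both ends of the transition, and shifted copies form a partition of one.

```python
import numpy as np

def smoothstep7(t):
    """Degree-7 smoothstep: s(0) = 0, s(1) = 1, and the first three
    derivatives vanish at t = 0 and t = 1, since s'(t) = 140 t^3 (1-t)^3."""
    t = np.clip(t, 0.0, 1.0)
    return t**4 * (35.0 - 84.0 * t + 70.0 * t**2 - 20.0 * t**3)

def weight(x):
    """1D weight for the vertex at x = 0 on the subdomain [-1, 1]:
    one at the own vertex, zero at the neighboring vertices."""
    return smoothstep7(1.0 - np.abs(x))
```

Since smoothstep7(t) + smoothstep7(1 − t) = 1, the weights of neighboring vertices sum to one everywhere, i.e. they form a partition of one.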
In Algorithm 5.2 the inverse is applied in the condensed system and then the results are
weighted. In nodal space, i.e. when using the condensed system, the weight matrix is diagonal
and cheap to apply. In modal space, however, the weight matrix is dense, limiting
the applicability of the transformed operator and, hence, of a faster residual evaluation. To
circumvent the increased operator cost, the tensor-product structure of transformation and
weight matrix can be exploited. As shown in Subsection 4.3.3, the transformation works
separately on faces, edges, and vertices, and therefore also separately on the planes of the
stars. This allows the transformation to be interchanged with the application of the multi-
plicity, and be merged with the forward application of mapping into the eigenspace. Vice
versa, the backward operator is merged with applying the weights in nodal space and the
tensor product of mapping to transformed space. As a result, the operator in Algorithm 5.1
Figure 5.2: One-dimensional subdomain corresponding to one center vertex at x = 0, ranging from x = −1 to x = 1. The weight functions corresponding to the left (dark grey), center (black), and right (light grey) vertex are shown. As they are one on their respective vertex and zero on the other vertices, the weight functions are a partition of one.
Figure 5.3: Implementation of boundary conditions for the one-dimensional case on the right domain boundary. Utilized data points are drawn in black, non-utilized ones in white. Left: used smoother block consisting of two elements of polynomial degree p = 4. Middle: a Neumann boundary condition decouples the right element. Right: homogeneous Dirichlet boundary condition decouples the middle vertex as well.
can be utilized for the transformed system when using separate sets of matrices for forward
and backward operation. Furthermore, these matrices can include the weighting, lowering
the amount of loads and stores and leading to performance gains in the smoother.
5.2.4 Implementation of boundary conditions
The additive Schwarz smoother outlined above is vertex-based and uses the 2^3 element
block surrounding a vertex as subdomain. This approach, however, does not work at the
boundaries. There, the subdomain corresponding to a vertex reduces in size, in the worst
case to one element, and the boundary condition implementation needs to account for this.
When implementing every possible combination of boundary conditions, 5^3 = 125 variants
are required. Here, an approach is presented which utilizes the same operator for interior as
well as boundary vertices, hence lowering the required implementation effort.
When parallelizing overlapping Schwarz methods with domain decomposition, data from
partitions on other processes is required. A typical implementation is adding a layer of ghost
elements to the partition, generating an overlap with the surrounding partitions. With ghost
elements, changing the matrices for the stars instead of the structure of the subdomains is
possible. As structured meshes are assumed, the subdomains are always a tensor-product of
one-dimensional domains, rendering investigation of the one-dimensional case sufficient.
5.2 Linearly scaling additive Schwarz methods 77
Figure 5.3 depicts a boundary vertex in combination with the boundary element and the
ghost element. For the periodic case, the ghost element is part of the domain and no change
in the operator is required. But when applying Neumann boundary conditions, the ghost
element is not part of the domain anymore, and the subdomain for the vertex reduces to
one element. To regain the operator size, the corresponding transformation matrices and
eigenvalues are padded with identity so that the same operator size as for the two-element
subdomain results. Furthermore, the right-hand side is set to zero outside of the domain, such
that the correction computes to zero. In addition to the treatment of the Neumann case, the
boundary vertex is decoupled for Dirichlet boundaries, the corresponding matrices padded,
and the right-hand side zeroed out. This leads to a correction of zero on the Dirichlet
point and, hence, retains the initial value for inhomogeneous boundary conditions.
The treatment expands from one dimension to multiple via the tensor-product structure of
the operator. Due to the transformation matrices working in their respective directions,
applying the treatment for the one-dimensional case in each direction is sufficient. Hence,
either no change is required as no boundary condition is present, or only the matrices of
one dimension, two dimensions, or three dimensions are changed according to the method
outlined above. As the transformation matrices are padded with identity, decoupled parts
only map onto themselves, computing a correction of zero.
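The identity-padding trick can be sketched in one dimension: the transformation matrix and eigenvalues of the reduced operator are embedded into the full operator size with identity and ones, and the right-hand side is zeroed outside the domain, so decoupled points receive a correction of exactly zero. As a simplification, the sketch uses an orthogonal eigendecomposition of a standard symmetric operator instead of the generalized one; all sizes are illustrative.

```python
import numpy as np

def padded_inverse(S, lam, active, n):
    """Return a function applying the subdomain inverse with identity padding.
    S, lam: eigenvectors/eigenvalues of the reduced operator on the `active`
    indices; all other indices are padded and receive a zero correction."""
    Spad = np.eye(n)
    Spad[np.ix_(active, active)] = S
    lam_pad = np.ones(n)                     # pad eigenvalues with ones
    lam_pad[active] = lam
    padded = np.setdiff1d(np.arange(n), active)

    def apply(F):
        F = F.copy()
        F[padded] = 0.0                      # zero right-hand side outside the domain
        return Spad @ ((Spad.T @ F) / lam_pad)
    return apply
```

Because the padded rows and columns of the transformation are identity and the padded right-hand side is zero, the decoupled degrees of freedom map onto themselves and receive no correction, while the active block reproduces the exact reduced inverse.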
5.2.5 Extension to element-centered block smoothers
In the above, a vertex-based smoother was considered due to the lower number of degrees
of freedom remaining on the star in the condensed case. However, many algorithms in the
literature utilize element-centered smoothers [88, 123]. This section derives a linearly scaling,
matrix-free inverse for element-centered subdomains.
Figure 5.4 shows an element-centered subdomain overlapping into the neighboring elements
in every direction. Here, an overlap of one whole element per side is assumed. As with the
star block, the collocation points exhibit a tensor-product structure and the operator can be
written in tensor-product form (2.44), with the matrices being replaced by those restricted
to the subdomain. The condensed block system can be embedded into the full block with
the same arguments used for the star block in Subsection 5.2.3. Hence, as in the former case,
fast diagonalization on the full 3^3 element block can be restricted to gain a solution method
in the condensed system. As the right-hand side is still zero in the element interiors and six
faces are present, the number of operations is 73 n_S^3, where n_S = 3p − 1. Compared to fast
diagonalization in the full system, with 12 n_S^4 operations, the new algorithm is more efficient
starting from p = 4.
Figure 5.4: Block utilized for the element-centered smoothers. Left: full system including Dirichlet nodes. Right: collocation nodes corresponding to the condensed setup.
5.3 Multigrid solver
5.3.1 Multigrid algorithm
For finite volume and finite difference methods, h-multigrid is one of the most prominent
solution techniques. The element width h is changed on each level to exploit faster con-
vergence on these [14, 12, 47]. With spectral-element methods, changing the polynomial
degree instead is an option, leading to so-called p-multigrid, a standard building block for
higher-order methods [114, 88, 104]. Levels L to 0 are introduced, with their polynomial
degrees being defined as
∀ 0 ≤ l < L :  p_l = p_0 · 2^l ,    (5.12a)
p_L = p ,    (5.12b)

where p_0 is the polynomial degree utilized on the coarse mesh.
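A small helper illustrating (5.12); how the level count is chosen when p is not a power-of-two multiple of p_0 is an assumption of this sketch:

```python
def level_degrees(p: int, p0: int = 2) -> list:
    """Polynomial degrees from coarsest to finest level: p_l = p0 * 2**l
    below the finest level, and p_L = p on the finest one, cf. (5.12)."""
    degrees = []
    l = 0
    while p0 * 2**l < p:
        degrees.append(p0 * 2**l)
        l += 1
    degrees.append(p)          # finest level uses the target degree
    return degrees
```

For example, p = 16 yields the level degrees 2, 4, 8, 16.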
A multigrid method requires four main components. The first one is the operator for residual
evaluation. Then, a smoother is needed to smooth out high-frequency residual components, which do
not converge as fast on coarser levels. A grid transfer operator restricts the residual to the
coarser grid and prolongs it back, and, lastly, a solver solves on the coarsest grid. Here, the
condensed system is utilized, with the operator derived in Chapter 4 as residual evaluation
technique. The grid transfer operator from level l − 1 to level l, J_l, is implemented with the
embedded interpolation, restricted from the tensor-product of one-dimensional operators to
the condensed system, and the restriction from level l to l − 1, J_{l−1}^T, is its transpose.
Algorithm 5.3: Multigrid V-cycle for the condensed system using ν_pre pre- and ν_post post-smoothing steps.

    function MultigridCycle(u, F)
        u_L ← u
        F_L ← F
        for l = L, 1, −1 do
            if l ≠ L then
                u_l ← 0
            end if
            for i = 1, ν_pre,l do
                u_l ← u_l + Smoother(F_l − H_l u_l)   ▷ presmoothing
            end for
            F_{l−1} ← J_{l−1}^T ( F_l − H_l u_l )     ▷ restriction of residual
        end for
        Solve(H_0 u_0 = F_0)                          ▷ coarse grid solve
        for l = 1, L do
            u_l ← u_l + J_l u_{l−1}                   ▷ prolongation of correction
            for i = 1, ν_post,l do
                u_l ← u_l + Smoother(F_l − H_l u_l)   ▷ postsmoothing
            end for
        end for
        return u ← u_L
    end function
Lastly, a conjugate gradient method for the condensed system at p0 = 2 serves as coarse grid
solver.
The components suffice to construct a V-cycle, as shown in Algorithm 5.3, with a varying
number of pre- and post-smoothing steps ν_pre,l and ν_post,l on level l. Two variants are
considered for these. The first utilizes one pre- and one post-smoothing step, leading to a
traditional V-cycle, whereas the second doubles both with each level, i.e. ν_pre,l = ν_post,l
= 2^(L−l). The increasing number of smoothing steps can stabilize the method
for stretched and non-uniform meshes [124].
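The recursive structure of Algorithm 5.3 can be made concrete with a toy example. The sketch below is an illustrative stand-in, not the thesis implementation: it uses a damped-Jacobi smoother and a 1D Poisson hierarchy in place of the Schwarz smoother and the condensed SEM operators, but follows the same pre-smooth / restrict / coarse-solve / prolong / post-smooth pattern.

```python
import numpy as np

def v_cycle(A_levels, P_levels, F, u, l, nu_pre, nu_post, omega=2.0 / 3.0):
    """Recursive V-cycle with the structure of Algorithm 5.3 (toy version:
    damped Jacobi instead of the Schwarz smoother of the thesis)."""
    A = A_levels[l]
    if l == 0:
        return np.linalg.solve(A, F)                  # coarse grid solve
    D = np.diag(A)
    for _ in range(nu_pre[l]):                        # presmoothing
        u = u + omega * (F - A @ u) / D
    P = P_levels[l]                                   # prolongation l-1 -> l
    Fc = P.T @ (F - A @ u)                            # restriction of residual
    uc = v_cycle(A_levels, P_levels, Fc,
                 np.zeros(P.shape[1]), l - 1, nu_pre, nu_post, omega)
    u = u + P @ uc                                    # prolongation of correction
    for _ in range(nu_post[l]):                       # postsmoothing
        u = u + omega * (F - A @ u) / D
    return u

# toy hierarchy: 1D Poisson with 7 fine and 3 coarse interior points
def poisson(n):
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

n, nc = 7, 3
P = np.zeros((n, nc))                                 # linear interpolation
for j in range(nc):
    P[2 * j + 1, j] = 1.0
    P[2 * j, j] += 0.5
    P[2 * j + 2, j] += 0.5
A_fine = poisson(n)
A_coarse = P.T @ A_fine @ P                           # Galerkin coarse operator

F = np.ones(n)
u = v_cycle([A_coarse, A_fine], [None, P], F, np.zeros(n),
            l=1, nu_pre=[0, 1], nu_post=[0, 1])
# one cycle already reduces the residual substantially
assert np.linalg.norm(F - A_fine @ u) < 0.5 * np.linalg.norm(F)
```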
A simple multigrid method constructed from the V-cycle is shown in Algorithm 5.4. To
switch from the full equation system to the condensed one, the initial guess is restricted
and the right-hand side condensed. Then, V-cycles are performed until convergence is
reached, after which the interior degrees of freedom are recomputed. As the performance
of multigrid can deteriorate on anisotropic meshes, a second algorithm implements Krylov
acceleration [100, 136], which uses the V-cycle as a preconditioner instead of an iterator.
Traditionally, preconditioned conjugate gradient (pCG) methods are utilized to this end.
However, the weighting in the smoother does not lead to a symmetric preconditioner, and
standard pCG is not guaranteed to converge [54]. The inexact preconditioner CG method (ipCG) extends
Algorithm 5.4: Multigrid algorithm for the condensed system.

function MultigridSolver(u = (u_B, u_I)^T, F = (F_B, F_I)^T)
    u ← u_B                                  ▷ restrict to element boundaries
    F ← F_B − H_BI H_II^(−1) F_I             ▷ restrict and condense inner element RHS
    while √(r^T r) > ε do
        u ← MultigridCycle(u, F)             ▷ fixpoint iteration with multigrid cycle
    end while
    u ← (u, H_II^(−1) (F_I − H_IB u))^T      ▷ regain interior degrees of freedom
    return u
end function
the pCG method towards non-symmetric preconditioners [41] and was employed for multigrid
before [122]. Algorithm 5.5 shows the Krylov-accelerated multigrid for the condensed
system.
5.3.2 Complexity of the resulting algorithms
The solvers presented in the last section have three phases. In the first one, the initial
guess and right-hand side are condensed, using W_pre operations. Then, the solution process
takes place with W_sol operations. The third phase computes the interior degrees of freedom
with W_post operations. Pre- and post-processing utilize three-dimensional tensor products
to map to the eigenspace of the element interior and back, leading to W_pre + W_post =
O(p^4 n_e). The solution process, on the other hand, consists of applying the multigrid
cycle. On each level, smoother and operator are the main contributors to the operation
count, with both scaling with O(p_l^3). On the finest level this results in O(p^3). Each
coarsening step halves the polynomial degree, so that with a constant number of smoothing
steps the cost per level drops by a factor of (1/2)^3 = 1/8. A geometric series results,
limiting the operation count of the cycle to the operation count on the fine level times
a factor of 8/7. The second variant uses ν_pre,l = ν_post,l = 2^(L−l). Here, the effort
per level drops by 2 · (1/2)^3 = 1/4, resulting in a factor of 4/3. In both cases, the
amount of work for the branches of the V-cycle is a constant factor times the work on the
fine grid. The last portion of the solution time stems from the coarse grid solver. Its
cost scales with O(p_0^3 n_e^α) = O(n_e^α), where α depends on the solution method. In
the present case with a CG solver, α is 4/3, but α = 1 is possible with appropriate
low-order multigrid solvers [14]. Hence, in practice the cost per cycle is O(p^3 n_e),
i.e. linear in the number of degrees of freedom, and the required work for solving the
Helmholtz equation computes to
W_total = W_pre + N_cycle W_cycle + W_post (5.13)
⇒ W_total = O(p^4 n_e) + N_cycle O(p^3 n_e) . (5.14)
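The geometric-series argument above can be checked numerically. The small helper below sums the per-level work relative to one sweep on the finest level; it is an illustrative sketch of the reasoning, not thesis code.

```python
def cycle_work_factor(n_levels, smoothing_growth=1.0):
    """Work of one V-cycle branch relative to a single sweep on the finest
    level. Each coarsening halves p, so per-level cost drops by
    (1/2)^3 = 1/8; doubling the smoothing steps per level
    (smoothing_growth = 2) changes the drop to 2/8 = 1/4."""
    return sum(smoothing_growth ** l * (1.0 / 8.0) ** l
               for l in range(n_levels))

# constant smoothing converges to 8/7, doubled smoothing to 4/3
assert abs(cycle_work_factor(40) - 8.0 / 7.0) < 1e-9
assert abs(cycle_work_factor(40, smoothing_growth=2.0) - 4.0 / 3.0) < 1e-9
```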
Algorithm 5.5: Krylov-accelerated multigrid algorithm for the condensed system.

function ipCGMultigridSolver(u = (u_B, u_I)^T, F = (F_B, F_I)^T)
    u ← u_B                                  ▷ restrict to element boundaries
    F ← F_B − H_BI H_II^(−1) F_I             ▷ restrict and condense inner element RHS
    r ← F − H u                              ▷ initial residual
    s ← r                                    ▷ ensures β = 0 on first iteration
    p ← 0                                    ▷ initialization
    δ = 1                                    ▷ initialization
    while √(r^T r) > ε do
        z ← MultigridCycle(0, r)             ▷ preconditioner
        γ ← z^T r
        γ_0 ← z^T s
        β = (γ − γ_0)/δ
        δ = γ
        p ← β p + z                          ▷ update search vector
        q ← H p                              ▷ compute effect of p
        α = γ/(q^T p)                        ▷ compute step width
        s ← r                                ▷ save old residual
        u ← u + α p                          ▷ update solution
        r ← r − α q                          ▷ update residual
    end while
    u ← (u, H_II^(−1) (F_I − H_IB u))^T      ▷ regain interior degrees of freedom
    return u
end function
The main contribution to the work stems from smoothing on the fine grid, not from pre-
or post-processing. When assuming just one V-cycle, the respective work on the fine grid
is 2 · 37 n_S^3 ≈ 2 · 37 (2p)^3 = 592 p^3, whereas pre- and post-processing require
12p^4 + 25p^3 each. Hence, for only one V-cycle, the cost of the smoother dominates the
runtime until p = 48. Thus, for all practical purposes, the multigrid algorithm scales
linearly with the degrees of freedom, when increasing the number of elements as well as
when increasing the polynomial degree.
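The crossover degree follows directly from the two cost models. The snippet below treats the operation counts from the text as hypothetical cost functions and searches for the first degree at which pre-/post-processing outweighs the fine-grid smoothing.

```python
# cost models taken from the operation counts quoted in the text
def smoother_fine(p):
    return 592 * p ** 3                  # 2 * 37 * (2p)^3, fine-level smoothing

def pre_or_post(p):
    return 12 * p ** 4 + 25 * p ** 3     # condensation / recovery, each

p = 2
while smoother_fine(p) > pre_or_post(p):
    p += 1
# p == 48: the first degree at which pre-/post-processing dominates
assert p == 48
```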
5.4 Runtime tests
5.4.1 Runtimes for the star inverse
To evaluate the efficiency of the star inverse, runtime tests were conducted with three
implementations. The first is a fast diagonalization variant working on the full 2^3
element block. As it was implemented with tensor products and works in the full system,
it is called “TPF”. The second is a matrix-matrix multiplication in the condensed system,
applying a precomputed inverse for the star operator, called “MMC”. Lastly, the new variant
implementing Algorithm 5.1 was considered. As it computes in the condensed system with
tensor products, it is named “TPC”. All three variants were implemented in Fortran
using double precision. As the data size is n_S = 2p − 1 and thus odd by definition,
operators and data were padded by one to attain an even operator width, allowing for
better code optimization. One call to DGEMM from BLAS implemented variant “MMC”, whereas
the tensor-product variants were realized with loops. The outermost loop corresponded to
a subdomain, leading to inherent cache blocking, whereas the innermost non-reduction loop
was treated with the Intel-specific single-instruction multiple-data (SIMD) compiler
directive !dir$ simd, enforcing vectorization of the loops. Furthermore, the variant “TPC”
was refined with the techniques from Chapter 3, using parametrization, unroll and jam, and
blocking for matrix accesses.
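The tensor-product evaluation underlying “TPF” and “TPC” can be illustrated compactly. The sketch below is illustrative only (NumPy, not the Fortran kernels of the thesis): it applies a one-dimensional operator A along each direction of a data block, which costs O(n^4) per block instead of the O(n^6) of an assembled matrix, and checks the result against the Kronecker product A ⊗ A ⊗ A.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                    # 1D operator width (n_S padded to even)
A = rng.standard_normal((n, n))          # generic 1D operator factor
u = rng.standard_normal((n, n, n))       # data on one subdomain block

# sum-factorization: apply A along each of the three directions in turn
v = np.einsum('ai,ijk->ajk', A, u)
v = np.einsum('bj,ajk->abk', A, v)
v = np.einsum('ck,abk->abc', A, v)

# reference: the assembled Kronecker product applied as one big matrix
K = np.kron(np.kron(A, A), A)
v_ref = (K @ u.reshape(-1)).reshape(n, n, n)
assert np.allclose(v, v_ref)
```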
The tests were performed on one node of the high-performance computer Taurus at ZIH
Dresden, which consisted of two Xeon E5-2590 v3 processors with twelve cores each, running
at 2.5 GHz. For testing purposes, only one core was utilized, allowing for a maximum
floating point performance of 40 GFLOP/s [48]. Furthermore, the code was compiled with the
Intel Fortran Compiler v. 2018, with the corresponding Intel Math Kernel Library (MKL)
serving as BLAS implementation for “MMC”.
As test case, the inversion on 500 star subdomains was considered, corresponding to 500
vertices being present on the processor. Each subdomain was set to Ω_i = (0, π) × (0, π/4) × (0, π/3), on which the exact solution

u_ex(x) = sin(µ_1 x_1) sin(µ_2 x_2) sin(µ_3 x_3) , (5.15)

with parameters µ_1 = 3/2, µ_2 = 10, and µ_3 = 9/2 is employed. From the exact solution
on the collocation points, the Helmholtz residual with λ = π served as right-hand side.
The three operators were applied, the runtime was measured, and the maximum error compared
to (5.15) was computed on the collocation nodes. The polynomial degree was varied from p = 3
to p = 32, and each application was repeated 101 times, with only the last 100 repetitions
being measured via MPI_Wtime, excluding instantiation effects from the measurement.
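The measurement protocol (discard warm-up calls, average the rest) can be sketched generically. The helper below is an assumption-laden Python analogue of the Fortran/MPI_Wtime harness, not the actual test code.

```python
import time

def time_operator(apply_op, data, n_warm=1, n_meas=100):
    """Average runtime of an operator application, discarding warm-up
    calls, mirroring the 101-call / last-100-measured protocol."""
    for _ in range(n_warm):
        apply_op(data)                       # instantiation / cache warm-up
    t0 = time.perf_counter()
    for _ in range(n_meas):
        apply_op(data)
    return (time.perf_counter() - t0) / n_meas

# usage with a placeholder operator
t = time_operator(lambda d: sum(d), list(range(1000)))
```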
Figure 5.5 shows the runtime, runtime per degree of freedom, rate of floating point
operations, and maximum error for the three operators. The errors are below 10^−12 and
differ little between the variants, but a slow increase from 10^−15 to 10^−13 occurs.
This is an artifact of the test: first, the input to the operators is calculated using
the Helmholtz operator, then the inverse is applied. Bounding the Euclidean norm of this
concatenation leads to the definition of the condition number. Hence, the increased error
stems from an increase in the condition of the system, and the variants achieving the same
error validates that they are correctly implemented.
Figure 5.5: Results for the application of the inverse star operator. Top left: operator runtimes when using the fast diagonalization in the full system (TPF), applying the inverse via a matrix-matrix product in the condensed system (MMC), and using the inverse via tensor-product factorization (TPC). Top right: runtimes per equivalent number of degrees of freedom (DOF), computed as p^3 per block. Bottom left: rate of floating point operations, measured in GFLOP/s. Bottom right: maximum error of the solution compared to (5.15).
The runtimes of “TPF” as well as “MMC” exhibit slope four, whereas “TPC” achieves a
slope of three, i.e. linear scaling with the number of degrees of freedom. This translates
to a constant runtime per degree of freedom for “TPC”, and increasing ones for “TPF”
and “MMC”. For every tested polynomial degree, the matrix-matrix based variant is faster
than the tensor-product variant in the full system. Furthermore, it is the fastest
implementation for p < 5. However, starting from p = 6, “TPC” becomes faster, reaching a
factor of three at p = 8 and a factor of 20 at p = 32. As to be expected, the matrix-matrix
based implementation nearly attains peak performance with 35 GFLOP/s, whereas the
loop-based variants range between 5 and 20 GFLOP/s. However, where “MMC” has a constant
performance, the performance of the loop-based variants heavily depends upon the operator
size. For even p the operator widths are a multiple of four and, hence, a multiple of the
SIMD width. There, double the performance of odd p is achieved, an artifact of compiler
optimization, where the treatment of remainder loops and smaller SIMD operations is
detrimental to the performance.
5.4.2 Solver runtimes for homogeneous meshes
To verify that the new multigrid solvers scale linearly with the number of degrees of
freedom, and that they can be more efficient than the previously developed solvers, the
tests from Section 2.4 were repeated using the multigrid solvers. Again, the domain is set
to Ω = (0, 2π)^3 with inhomogeneous Dirichlet boundary conditions and a Helmholtz
parameter λ = 0, leading to the harder to solve Poisson equation.
Four solvers are considered for testing. The baseline solver is dtCG, presented in
Section 4.4, a conjugate gradient (CG) solver using diagonal preconditioning in the
transformed system. The second is a multigrid solver implementing Algorithm 5.4, called
tMG, using one pre- and one post-smoothing step; the third a Krylov-accelerated version
thereof based on Algorithm 5.5, called ktMG; and, lastly, a Krylov-accelerated version
with a varying number of smoothing steps, called ktvMG. All multigrid solvers utilize the
residual evaluation technique in the transformed system from Section 4.3.
For n_e = 8^3, the polynomial degree was varied between p = 2 and p = 32. To preclude
measurement of instantiation effects, the solvers were run 11 times, and for the last 10
runs the number of iterations required to reduce the residual by a factor of 10^−10,
called n_10, as well as the runtime were measured. From the number of iterations, the
initial residual ‖r_0‖, and the reached residual ‖r_n10‖, the convergence rate

ρ = (‖r_n10‖ / ‖r_0‖)^(1/n_10) (5.16)
is computed, which reflects the residual reduction achieved per iteration.
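Definition (5.16) translates directly into code; the helper below is an illustrative sketch of that formula.

```python
def convergence_rate(r0_norm, rn_norm, n_iter):
    """Mean residual reduction per iteration,
    rho = (||r_n10|| / ||r_0||) ** (1 / n_10), following (5.16)."""
    return (rn_norm / r0_norm) ** (1.0 / n_iter)

# reducing the residual by ten orders of magnitude in 10 iterations
# corresponds to rho = 0.1
assert abs(convergence_rate(1.0, 1e-10, 10) - 0.1) < 1e-12
# three iterations for the same reduction give rho near 5e-4, matching
# the rates near 1e-4..1e-3 reported for the multigrid solvers
assert 1e-4 < convergence_rate(1.0, 1e-10, 3) < 1e-3
```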
Figure 5.6 shows the results of the tests. For dtCG the number of iterations increases
with the polynomial degree, which is countered by more efficient operators, leading to a
near-constant runtime per unknown of about 1 µs for p > 8. For the multigrid solvers,
however, the number of iterations is mostly constant, starting at four iterations and
decreasing with the polynomial degree. Every time a new multigrid level is introduced,
e.g. at p = 5 and p = 17, the convergence rate decreases, leading to faster convergence
and, possibly, fewer required iterations. The solvers tMG and ktMG attain convergence
rates near 10^−4 for polynomial degrees lower than 16, and introducing varying smoothing
significantly lowers the convergence rate to 10^−5, and later below 10^−6. This leads to
the two former solvers using three iterations for p < 17 and ktvMG using only two, with
ktMG also using only two iterations for p > 17. The attained convergence rate matches the
one for solvers with similar overlap, e.g. [123].
Figure 5.6: Results for homogeneous meshes of n_e = 8^3 elements when varying the polynomial degree p. Top left: runtime of the solvers. Top right: number of iterations required to reduce the residual by 10 orders of magnitude. Bottom left: runtimes per degree of freedom (DOF). Bottom right: convergence rates of the solvers.
When comparing the runtime, all multigrid solvers are slower than the CG-based solver
until p = 10. This is a result of the rather large overhead inherent to the multigrid
algorithm as well as the low number of elements used in the test, favoring the CG solver.
From p = 10 on, the multigrid solvers become more efficient and are faster than dtCG,
with p = 17 being an exception, as it introduces a new level and, hence, creates further
overhead. Every one of the multigrid solvers is capable of breaching the 1 µs mark per
unknown. For mid-range polynomial degrees, the Krylov-accelerated solver with varying
smoothing uses only two iterations, requiring 0.6 µs per unknown at p = 16. Afterwards,
the plain Krylov-accelerated version uses two iterations as well and incurs less overhead,
allowing it to be slightly faster and attaining 0.5 µs per unknown at p = 32.
As verification that the solvers attain a constant number of iterations when varying the
number of elements, the tests were repeated at p = 16 with the number of elements per
direction k being varied from k = 4 to k = 28. Figure 5.7 shows the resulting convergence
Figure 5.7: Convergence rates and runtimes per degree of freedom for the four solvers for homogeneous meshes of n_e = k^3 elements of polynomial degree p = 16.
rate and runtime per degree of freedom. As expected, the solver dtCG has an increasing
runtime per degree of freedom with slope n_e^(1/3), stemming from an increase in the number
of iterations. The multigrid solvers, on the other hand, exhibit a slight increase in the
convergence rate, i.e. they converge more slowly, due to the boundary conditions having a
decreased impact on the domain. However, they reach a constant convergence rate from k = 8
onwards. While the convergence rate stagnates, the runtime per degree of freedom does not:
it decreases for the multigrid solvers. This is an artifact of the vertex-based smoother:
while the number of elements is k^3, the number of vertices computes to (k + 1)^3. Hence,
increasing k lowers the ratio of vertices to elements, and the smoother becomes more
efficient when increasing the number of elements. The increase in runtime at k = 28,
however, is due to occupying the whole RAM of one socket and using non-uniform memory
access afterwards, with lower bandwidth and, hence, a larger overall runtime.
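The shrinking share of vertex work can be quantified with a one-liner; this is an illustrative sketch of the (k + 1)^3 / k^3 argument above, not thesis code.

```python
def vertex_element_ratio(k):
    """Vertices per element, (k + 1)**3 / k**3, for a partition of k^3
    elements; the vertex-based smoother's relative cost follows this ratio."""
    return (k + 1) ** 3 / k ** 3

# the ratio decreases monotonically towards 1 as partitions grow
assert abs(vertex_element_ratio(4) - 125 / 64) < 1e-12
assert vertex_element_ratio(4) > vertex_element_ratio(8) > vertex_element_ratio(28)
```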
5.4.3 Solver runtimes for anisotropic meshes
In the above section, only homogeneous meshes were considered. In simulations, however,
the resolution often needs to be adapted to fully resolve the solution at reasonable
computational cost, e.g. to capture high gradients in the boundary layer near a wall.
This leads to anisotropic or even stretched meshes with considerably higher condition
numbers and, in turn, decreased performance of the solvers. To investigate the influence
of high aspect ratios, the test case from [124] was combined with the test from
Section 4.4. For a given aspect ratio AR, the domain is set to

Ω = (0, 2π · AR) × (0, π ⌈AR/2⌉) × (0, 2π) , (5.17)
Figure 5.8: Number of iterations and runtimes per degree of freedom for the four solvers for anisotropic meshes of n_e = 8^3 elements of polynomial degree p = 16.
allowing the use of homogeneous meshes consisting of anisotropic brick-shaped elements.
Here, n_e = 8^3 elements of polynomial degree p = 16 discretized the domain, and the
aspect ratio was varied from AR = 1 to AR = 48.
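Expression (5.17) is easy to evaluate; the sketch below (illustrative, not thesis code) returns the domain extents for a given AR and checks that, with equal element counts per direction, the ratio of x_1 to x_3 element widths equals AR.

```python
import math

def domain_extents(AR):
    """Extents of the domain (5.17) for a given nominal aspect ratio AR."""
    return (2 * math.pi * AR, math.pi * math.ceil(AR / 2), 2 * math.pi)

# AR = 1 recovers a (2 pi) x (pi) x (2 pi) box
assert domain_extents(1) == (2 * math.pi, math.pi, 2 * math.pi)
# with 8 elements per direction, x1/x3 element widths scale exactly with AR
assert abs(domain_extents(48)[0] / domain_extents(48)[2] - 48) < 1e-12
```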
In Figure 5.8, the number of iterations and runtimes per degree of freedom of the solvers
are depicted. The multigrid solvers show a large increase in the number of iterations,
leading to a higher runtime in turn: for tMG the number of iterations increases from three
to sixty. The introduction of Krylov acceleration mitigates the effect and stabilizes the
number of iterations until AR = 4. Additionally varying the number of smoothing steps
stabilizes the iteration count until AR = 8, but the deterioration is still present. While
the solvers are very capable for homogeneous meshes, their efficiency rapidly deteriorates
for high aspect ratios. None of the multigrid solvers attains the high robustness of
the locally-preconditioned solver dtCG, which only takes twice as long for an aspect ratio
of AR = 48. With the Schwarz smoothers, only a higher spatial overlap guarantees efficient
solution [124], requiring an increase in the amount of overlap between elements, which
is not considered here.
5.4.4 Solver runtimes for stretched meshes
Anisotropic meshes already allow lowering the number of degrees of freedom in one or two
directions. However, even in simple geometries, such as a plane channel flow, local mesh
refinement can significantly lower the number of degrees of freedom. Typically, stretched
meshes are utilized to refine near walls while keeping the element width in the center
constant. The test case from Section 4.4.4 allows studying the robustness against varying
aspect ratios in one grid and is, hence, utilized here.
Table 5.1: Number of iterations required for reducing the residual by 10 orders of magnitude for the stretched grids using a constant expansion factor α.

                              p
  α      Solver      4      8     16     32
  1      dtCG       71     87    108    129
  1      tMG         5      3      3      3
  1      ktMG        4      3      3      2
  1      ktvMG       4      3      2      2
  1.5    dtCG       98    117    126    144
  1.5    tMG        21     11      7      5
  1.5    ktMG       11      8      6      4
  1.5    ktvMG      11      8      5      3
  2      dtCG      105    133    158    180
  2      tMG        36     26     18     12
  2      ktMG       15     13     10      8
  2      ktvMG      15     13     10      8
The domain Ω = (0, 2π)^3 is considered and discretized with three grids consisting of
8^3 elements. These are constructed using a constant expansion factor α ∈ {1, 1.5, 2} in
all three directions and are shown in Figure 4.6. Setting α = 1 results in a homogeneous
mesh, as utilized in the previous sections, and serves as baseline. Using α = 1.5 stretches
the grid in all three directions, leading to small cells near one end of the domain and
very large cells on the opposite end; as a result, the maximum aspect ratio is AR_max = 17.
The third grid is constructed with α = 2, further magnifying the effect and leading to
AR_max = 128. The latter two grids are populated by a large variety of elements, typical
of large-scale simulations in computational fluid dynamics, with their shapes ranging from
boxfish, over plaice, to eels. Most high-order solvers are not capable of computing on such
meshes at all: as the operators differ in every element, matrix-free methods are required
for the preconditioner as well as the operator. Furthermore, the performance of most
solvers degrades when increasing the aspect ratios in the grid.
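The quoted maximum aspect ratios follow from the geometric expansion of the element widths. The sketch below (illustrative, assuming widths w_i = w_0 · α^i over the 8 elements per direction) reproduces AR_max = 17 and AR_max = 128.

```python
def max_width_ratio(alpha, k=8):
    """Ratio of the largest to the smallest element width for k elements
    per direction with a constant expansion factor alpha: w_i = w_0 * alpha**i."""
    return alpha ** (k - 1)

# alpha = 2 gives 2**7 = 128; alpha = 1.5 gives 1.5**7, about 17
assert max_width_ratio(2) == 128
assert round(max_width_ratio(1.5)) == 17
assert max_width_ratio(1) == 1          # homogeneous baseline
```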
The numbers of iterations required to lower the residual by ten orders of magnitude are
presented in Table 5.1. The most robust solver is the locally preconditioned dtCG, for
which the number of iterations increases only by 50 % when raising the expansion factor
to 2. The multigrid algorithms do not fare as well. For tMG, the number of iterations
increases seven-fold for p = 4 and four-fold for p = 32. Using Krylov acceleration
significantly lowers the factor, with only half the number of iterations being required
at p = 4 and a third less at p = 32. The intended decrease in condition from introducing
varying smoothing has a limited effect at low polynomial degrees for α = 1 and α = 1.5,
and none at α = 2. While the convergence rate gets lowered, the effect is not large enough
to decrease the number of iterations, so that increasing the number of smoothing steps is,
again, not beneficial. This indicates, as in the previous section, that the geometrical
overlap, not the number of smoothing steps, is the main ingredient for convergence.
5.4.5 Parallelization
In the previous sections, every test was performed on one core of one node. However,
real-life simulations run on hundreds, if not thousands, of cores. The decomposition of
the domain and efficient sharing of relevant information are of paramount importance in
the parallel case, lest the parallelization diminish the performance.
The presented solver utilizes a vertex-based smoother, which requires data from the
surrounding elements. The amount of work and, hence, the runtime scales with the number
of vertices per partition, which in turn scales with the number of elements per direction k
as n_v = (k + 1)^3. Moreover, data from neighboring processes is required. In finite
difference and finite volume methods, ghost cells facilitate the sharing at the partition
boundaries: on each process, the domain is padded by one layer of grid cells, which contain
the data of other processes. As a result, small partitions of the domain are not beneficial
for the parallel efficiency, as the amount of work as well as the required amount of
communication increases. For finite difference or finite volume methods, the difference is
negligible, as the number of data points is large compared to the number of added cells.
With the SEM, however, it is not: at k = 4 and p = 16, still 64 degrees of freedom are
present per direction, but there are three times more vertices than elements in a
subdomain and, thus, also a factor of three more work.
To attain a good parallel efficiency, a simple domain decomposition approach does not
suffice. A hybrid parallelization is required to retain large partition sizes on multi-core
systems [49, 70]. The coarse level implements domain decomposition via MPI, whereas OpenMP
facilitates the second layer, which exploits data parallelism.
To verify the efficiency of the OpenMP parallelization, the tests from Section 5.4.2 were
repeated on one node using two processes. The domain was Ω = (0, 2 · 2π) × (0, 2π)^2, with
the number of elements varying from n_e = 2 · 8 × 8 × 8 to n_e = 2 · 16 × 16 × 16 at a
polynomial degree of p = 16. This setup allows for isotropic elements when decomposing in
the x_1-direction, ensuring a comparable convergence rate. In addition to the number of
elements, the number of threads and, hence, the number of cores per process was varied
from 1 to the number of cores per processor, 12, and the parallel speedup over using only
one core as well as the parallel efficiency were measured.
Table 5.2: Parallel speedup and efficiency for the four solvers defined in Section 5.4.2 when increasing the number of threads per process and varying the number of elements per direction k.
                       Speedup                 Efficiency [%]
                 Number of threads           Number of threads
  k     Solver      4      8     12             4      8     12
  8     dtCG     3.51   5.73   6.13            88     72     51
  8     tMG      3.48   5.80   6.48            87     73     54
  8     ktMG     3.49   6.06   7.06            87     76     59
  8     ktvMG    3.54   5.75   7.04            89     72     59
  12    dtCG     3.57   5.69   6.35            89     71     53
  12    tMG      3.62   6.31   7.95            91     79     66
  12    ktMG     3.61   6.28   7.82            90     78     65
  12    ktvMG    3.59   6.21   7.74            90     78     65
  16    dtCG     3.63   5.02   5.42            91     63     45
  16    tMG      3.66   6.24   7.68            91     78     64
  16    ktMG     3.61   6.07   7.46            90     76     62
  16    ktvMG    3.60   6.10   7.43            90     76     62
Table 5.2 lists the measured speedups and efficiencies. At k = 8, the parallel efficiency
for four threads is at 80 %, but quickly deteriorates to only 40 % at twelve cores for
dtCG and 50 % for the multigrid solvers. Slight differences are present: the
Krylov-accelerated versions have a higher efficiency. At k = 12, the efficiency is higher
still, with 90 % when using four threads and 65 % for twelve threads for the multigrid
solvers, and 44 % for dtCG. When increasing the number of elements per direction in a
subdomain to k = 16, no further gains in parallel efficiency are present. This indicates
that the low efficiency stems neither from latency nor from communication.
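The efficiency column of Table 5.2 is simply the measured speedup relative to ideal scaling; the one-line helper below makes that relation explicit (illustrative sketch, not thesis code).

```python
def parallel_efficiency(speedup, n_cores):
    """Parallel efficiency in percent: measured speedup over ideal speedup."""
    return 100.0 * speedup / n_cores

# dtCG at k = 8: speedup 3.51 on 4 threads is roughly the 88 % of Table 5.2
assert abs(parallel_efficiency(3.51, 4) - 87.75) < 1e-6
# and speedup 6.13 on 12 threads gives about 51 %
assert abs(parallel_efficiency(6.13, 12) - 51.0833) < 1e-3
```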
The efficiency of the on-node parallelization is acceptable, but not extraordinary.
However, this mainly stems from the architecture: for the utilized node configuration,
the L3-cache bandwidth does not scale linearly with the number of utilized cores, but
saturates at ten times the bandwidth of a single core [48]. Moreover, the memory bandwidth
for twelve cores is only five times that of a single core, limiting the performance of
operators and scalar products in both conjugate gradient and multigrid methods.
The previous test measured the speedup on one node by increasing the number of cores. It
showed that utilizing 12 elements per direction for each partition yields an efficient
setup. To investigate the scaling beyond one node, the scaleup of the code is measured,
using 12^3 elements of polynomial degree p = 16 per partition and scaling the domain size
Table 5.3: Parallel efficiencies for the three solvers defined in Section 5.4.2 when increasing the number of processes with a constant number of elements per direction k = 12 on each process and 12 threads per process. Here, n_N refers to the number of compute nodes, n_P to the number of processes, and n_C to the number of cores employed in the simulation. The parallel efficiency is computed based on the run using one whole node, i.e. n_N = 1, n_P = 2, and n_C = 24.
                     With coarse grid solver      Without coarse grid solver
  nN    nP     nC    dtCG    tMG    ktMG          dtCG    tMG    ktMG
  1     1      12    1.40    1.05   1.07          1.40    1.02   1.04
  1     2      24    1.00    1.00   1.00          1.00    1.00   1.00
  2     4      48    0.86    0.87   0.90          0.86    0.96   0.97
  4     8      96    0.78    0.74   0.82          0.78    0.91   0.92
  8     16     192   0.63    0.63   0.75          0.63    0.87   0.90
  16    32     384   0.50    0.49   0.65          0.50    0.83   0.86
  32    64     768   0.48    0.42   0.55          0.48    0.79   0.80
accordingly. The number of nodes is varied from 1 to 32, with twelve threads per process.
As the domain shape varies, the convergence rate varies as well. This leads to ktvMG
exhibiting a convergence rate slightly over or under 10^−5 and, hence, either two or three
iterations being used. It is therefore excluded from this test.
Table 5.3 lists the parallel efficiency computed from the scaleup over using one full node,
i.e. the run using n_N = 1, n_P = 2, and n_C = 24. The numbers are presented twice, once
based on the total runtime, and once excluding the coarse grid solver. When increasing the
number of utilized nodes from one to four, the parallel efficiency drops from one to 90 %
and 87 % for tMG and ktMG, respectively. The deterioration continues, to only 63 % and
55 % efficiency at 32 nodes. When excluding the runtime of the coarse grid solver, the
efficiency increases significantly, to 80 %. This shows that two separate effects are
present: first, the coarse grid solver is not efficient in the parallel case; second, the
remaining solver components scale, but not to multiple thousands of cores. The former
stems from using a CG solver, leading to an increase in the number of iterations. In
addition, the coarse grid incorporates far fewer degrees of freedom, and the optimal
distribution among the nodes is not necessarily the node count used for the fine grid.
This can be remedied, for instance, by redistributing the grid on coarser levels to a
lower number of processes. The decline in the efficiency when excluding the coarse grid
solver can be attributed to the blocking nature of the implementation of the communication:
most of the communication stems from boundary data exchange, not scalar products.
Therefore, the communication pattern can be improved by overlapping communication and
computation, hiding latency and transfer time. When implemented correctly, this allows
for ideal speedups even when using one element per core [56]. However, the implementation
effort is not negligible. While the parallel scalability achieved here is not stellar, it
is sufficient as a proof of concept and can be improved with the methods outlined above.
5.5 Summary
This chapter investigated p-multigrid for the statically condensed Helmholtz equation in
order to attain a constant iteration count for the solvers presented in Chapter 4.
Schwarz-type smoothers were considered due to their excellent convergence properties.
However, the main component of the smoother, the inverse of the 2 × 2 × 2 element block,
scales with O(p^4) and is not matrix-free in the condensed case. To generate a matrix-free
solution technique, the statically condensed system was embedded into the full one,
allowing the application of fast diagonalization as solution technique. Subsequently, the
operator was factorized to a linear complexity, i.e. to scale with O(p^3). The linearly
scaling inverse allowed for a linearly scaling Schwarz-type smoother, which, in turn, led
to a completely linearly scaling multigrid cycle.
Multigrid solvers building on the linearly scaling multigrid cycle were proposed and their
efficiency evaluated using the test case from Chapter 2. With homogeneous meshes, at most
three iterations are required to reduce the residual by ten orders of magnitude, matching
solvers with a similar overlap [123], and for p ≥ 16 the number of iterations is constant,
independent of the number of elements. Combined with the linearly scaling smoother and
residual evaluation, this results in a high efficiency when computing at high polynomial
degrees: over wide ranges of polynomial degrees, the solver requires less than one
microsecond per degree of freedom, using only 0.6 µs at p = 16 and 0.5 µs at p = 32. This
throughput is a factor of three higher than in the Schwarz method which stimulated this
research [51] and a factor of four higher than in highly optimized multigrid methods,
such as [82], when comparing the method proposed in this chapter at p = 32 to the
throughput from [82] at p = 8. Hence, the multigrid method developed in this chapter
extends the applicability of high-order methods from p = 8 to p = 32 without any loss
in performance.
After investigating the performance for homogeneous meshes, inhomogeneous meshes were
considered. While the solvers perform very well in the homogeneous case, the performance
degrades for inhomogeneous ones. Krylov acceleration mitigates the deterioration, but
still a factor of four in the number of iterations occurs when increasing the maximum
aspect ratio in the grid from AR_max = 1 to AR_max = 128, leading to a worse performance
than attained with the solvers from Chapter 4. A larger overlap is required to maintain
the efficiency of Schwarz methods for high aspect ratios, and iterative substructuring
promises a different avenue to attain robustness against the aspect ratio [27, 106].
To investigate the parallel scalability, runtime measurements were conducted, on one node as well as across
several nodes. On one node, memory bandwidth limits the scalability of the algorithm;
nevertheless, a parallel efficiency of 65 % is achieved when increasing the number of threads
per process to 12. When parallelizing over multiple nodes, acceptable efficiencies are attained
until 32 nodes and, hence, 768 cores are utilized. The coarse grid solver limits the efficiency,
as no redistribution of the domain is performed on coarser levels.
Here, only the continuous spectral-element method was considered. For flow simulations,
the discontinuous Galerkin method is often preferred for its inherent diffusivity and, therefore,
stabilization. In order to raise the applicability of the solver, future work will expand
the multigrid solver towards the discontinuous Galerkin method, exploiting the similar
structure provided by the hybridizable discontinuous Galerkin method [76, 138]. Furthermore,
efficiency gains are to be expected from fusing operators and using cache-blocking to
decrease the impact of the limited memory bandwidth.
Chapter 6
Computing on wildly heterogeneous systems
6.1 Introduction
The last chapters developed algorithmic improvements for high-order methods using the homogeneous
hardware found in workstations and many HPC systems. However, many of the latter already
incorporate accelerators [94] and future HPC systems will be even more heterogeneous [26]. To
future-proof the developed algorithms, this chapter investigates orchestration of the heterogeneous
system most widely deployed in current high-performance computers: one node comprising
CPUs and GPUs [94].
Many programming concepts can be applied to program heterogeneous systems. Decomposing
the application into many small subtasks, which can then be served to the compute
units, leads to task-based parallelism. It is, for instance, implemented in StarPU [4] and
allows harnessing large portions of the hardware without load balancing concerns. Using a
decomposition based on the operators and applying these on the hardware most suited for
the task leads to intra-operator parallelism and can yield significant performance gains for
databases [73]. However, CFD applications fit neither of these programming models, as
data parallelism, not task parallelism, is present.
In CFD, domain decomposition is the main method to attain parallelism. Inside the resulting
partitions, data parallelism can still be exploited, e.g. via the shared-memory capabilities of
OpenMP [70]. Similarly, GPU applications facilitate a two-level parallelization, combining
domain decomposition with data parallelization inside a domain. This chapter will expand
the concept to address multiple types of hardware using one single source code.
Programming the heterogeneous system is only one side of the problem. After attaining computation,
the issue of load balancing remains. The compute units need to be orchestrated to
compute in concert, not in disarray. While a perfect load balancing can harness the full performance,
a bad one can lower the performance below the one attained with only parts of the
hardware. Here, only static load balancing is considered. It allows for insights into the static
state for dynamic load balancing as well as the same performance after all information is
reviewed. Proportional load balancing models allow for simple hardware models parametrized
with heuristics [137] or by auto-tuning [25], whereas functional performance models aim at
decomposing the runtime into contributors and even allow for incorporation of the communication
time [30]. The above models are applicable to programs ranging from small matrix
products [140] over small CFD codes [59] to whole CFD solvers [86]. In the above references,
the runtime of the application was treated as a black box, averaging over all iterations or
time steps. While this simplified model can work well for some applications, it can prove
too simple in others and degrade the performance obtained in high-performance computations.
Hence, after proposing a programming model, load balancing between different
compute units is investigated using the solvers from Chapter 2 as examples.
This chapter summarizes the work presented in [59] and [58] and is structured as follows:
First, Section 6.2 discusses the model problem and develops a programming model for
heterogeneous hardware. Then, Section 6.3 investigates load balancing to harvest the full
potential of the hardware.
6.2 Programming wildly heterogeneous systems
6.2.1 Model problem
To investigate programming and the prospective performance gains unlocked through computing
on wildly heterogeneous systems, the heart of an incompressible flow solver is examined:
the Helmholtz solver. It occupies the largest portion of the runtime, solving the
pressure and diffusion equations occurring in the time stepping scheme. Here, the solvers
from Chapter 2 and Chapter 3 are revisited to allow for an easier analysis of the load balancing.
The solver bfCG from Section 2.4 implements a preconditioned conjugate gradient (pCG)
method [118], with the preconditioner using the fast-diagonalization technique to apply
the exact inverse in the interior of the elements. Algorithm 6.1 summarizes the solution
method. Each iteration starts with a search vector p that is H-orthogonal to the space spanned
by previous search vectors. The first step computes the effect of p, called q, by applying
the element-wise Helmholtz operator (2.44) and assembling the result over the elements.
The optimal step width α is computed, requiring a scalar product, and afterwards the solution
as well as the residual are updated. Then, the preconditioner applies an approximate inverse to
the residual, and the next search vector results via orthogonalization. The communication
Algorithm 6.1: Preconditioned conjugate gradient method for an element-wise formulated
spectral-element method solving the Helmholtz equation Hu = F. Here, M⁻¹ denotes the
inverse multiplicity of the data points.

 1: for all Ωe : re ← Fe − He ue           ▷ local initial residual
 2: r ← QQᵀ r                              ▷ scatter-gather operation
 3: for all Ωe : ze ← Pe⁻¹ re              ▷ preconditioner application
 4: for all Ωe : pe ← ze                   ▷ set initial search vector
 5: ρ ← Σe zeᵀ Me⁻¹ re                     ▷ scalar product
 6: ε ← Σe reᵀ Me⁻¹ re                     ▷ scalar product
 7: while ε > εtarget do
 8:     for all Ωe : qe ← He pe            ▷ local Helmholtz operator
 9:     q ← QQᵀ q                          ▷ scatter-gather operation
10:     α ← ρ / (Σe qeᵀ Me⁻¹ pe)           ▷ scalar product
11:     for all Ωe : ue ← ue + α pe        ▷ update solution vector
12:     for all Ωe : re ← re − α qe        ▷ update residual vector
13:     ε ← Σe reᵀ Me⁻¹ re                 ▷ scalar product
14:     for all Ωe : ze ← Pe⁻¹ re          ▷ preconditioner application
15:     ρ0 ← ρ
16:     ρ ← Σe zeᵀ Me⁻¹ re                 ▷ scalar product
17:     for all Ωe : pe ← pe ρ/ρ0 + ze     ▷ update search vector
18: end while
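The control flow of Algorithm 6.1 can be condensed into a short sketch. The following pure-Python code is an illustration, not the thesis implementation: a Jacobi diagonal stands in for the block inverse Pe⁻¹, and the element loops, gather-scatter operation, and inverse multiplicity M⁻¹ collapse into plain vector operations on one small assembled system.

```python
# Minimal pCG sketch mirroring Algorithm 6.1 for a small SPD system H u = F.
# A diagonal (Jacobi) preconditioner stands in for the fast-diagonalization
# block inverse; gather-scatter and element loops are collapsed.

def pcg(H, F, tol=1e-10, max_iter=200):
    n = len(F)
    matvec = lambda v: [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))     # scalar product (a reduction)
    precond = lambda v: [v[i] / H[i][i] for i in range(n)]  # stand-in for Pe^-1

    u = [0.0] * n
    r = list(F)                      # initial residual for u = 0
    z = precond(r)                   # preconditioner application
    p = list(z)                      # initial search vector
    rho = dot(z, r)
    eps = dot(r, r)
    it = 0
    while eps > tol and it < max_iter:
        q = matvec(p)                # local Helmholtz operator + assembly
        alpha = rho / dot(q, p)      # optimal step width
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        eps = dot(r, r)
        z = precond(r)
        rho, rho0 = dot(z, r), rho
        p = [pi * rho / rho0 + zi for pi, zi in zip(p, z)]  # orthogonalization
        it += 1
    return u, it
```

For H = [[4, 1], [1, 3]] and F = [1, 2], the sketch converges to (1/11, 7/11) within two iterations, as expected for conjugate gradients on a 2 × 2 system.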
structure of the solver resembles that of a typical linear solver: Operator evaluation is
followed by boundary data exchange and control variables, such as the current norm of the
residual, are computed via reductions over all elements and, hence, processes. Therefore,
the performance gains obtained for bfCG can be seen as representative for a larger class of
solvers.
6.2.2 Two-level parallelization of the model problem
Domain decomposition is a standard technique in computational fluid dynamics, where locality
inherent to the physical problem transfers to locality of the operators. The locality of the
operator, in turn, allows for decomposition of the domain into non-overlapping partitions
and working on these in parallel. With a spectral-element method, the domain is already
decomposed into elements. At this stage, domain decomposition amounts to defining partitions
of the set of elements and decomposing the operations onto these so that they can
run in parallel. In Algorithm 6.1 most operations are element-wise operations, such as the
local Helmholtz operator in line 8, updates of solution and residual in lines 11 and 12, or
the application of the preconditioner in line 14. While these operations remain unchanged,
the gather-scatter operation in line 9 and the scalar products (lines 10, 13 and 16) require
communication, as either boundary data exchange or reduction operations occur. The added
communication between partitions constitutes the key change.
Typically, MPI serves as message-passing layer [128], as it is widely employed and allows for
fine-tuning of communication patterns [49]. For instance, latency hiding can be implemented,
overlapping communication and computation as done in [56], leading to near-linear scaling up
to hundreds of thousands of cores. As similar patterns often occur, libraries such as PETSc
and OP2 implement domain decomposition layers that can be directly utilized [2, 110]. Furthermore,
partitioned global address space languages, such as Coarray Fortran and Unified
Parallel C, allow for message passing with constructs inherent to the language [111, 20]. As
language constructs, instead of library calls, implement the message passing, the compiler
can optimize the communication structure. While traditionally utilized for exploitation of
shared-memory capabilities, OpenMP can attain speedups similar to those of MPI-based
domain decomposition layers [10]. Of these methods, MPI was chosen as implementation of
the message-passing layer, as availability and compiler support were superior to the other
variants.
Domain decomposition implements a rather coarse level of parallelization, which furthermore
incurs the need for boundary data exchange. When increasing the number of utilized cores,
the domains get ever smaller and the ratio of communication to computation increases,
leading to parallel inefficiencies. All operations in Algorithm 6.1 work on every element in
the partition, albeit some with communication involved, opening up further possibilities for
acceleration. To lower the amount of communication, larger partitions are required. Hence,
parallelism inside the partitions is exploited, using the shared-memory capabilities of multi-core
components or many-core devices, leading to a second, fine-grained, hardware-specific
parallelization.
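The deteriorating ratio can be quantified with a simple estimate, an illustration rather than a measurement from the thesis: a cubic partition of m³ elements performs work proportional to m³ but exchanges boundary data over at most 6m² element faces, so the communication-to-computation ratio grows as 6/m when partitions shrink.

```python
def comm_to_comp(m):
    """Ratio of boundary faces to elements for a cubic partition of m^3 elements."""
    faces = 6 * m ** 2   # at most six sides of the partition cube
    work = m ** 3        # element-wise operations scale with the volume
    return faces / work  # = 6 / m

# halving the partition edge length doubles the relative communication cost
print([comm_to_comp(m) for m in (16, 8, 4)])
```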
One possibility to gain acceleration inside a subdomain lies in the exploitation of multi-core
hardware, for which OpenMP provides a thread-based parallelization [102]. The programmer
states how the work should be shared between the threads using compiler directives, also
referred to as pragmas. For Algorithm 6.1 this results in encapsulation of each occurring
element-wise operation in pragmas which designate the iterations to be shared among the
executing threads. To minimize initialization overhead, the threads are generated once at
the start of the program and kept alive through the whole computation, i.e. every solver call
uses the same threads. To prevent non-uniform memory accesses (NUMA), such as reading
data from the memory of a different socket, the data is initialized using the “first touch
principle”, enforcing that data lies in the memory of the core computing on it [49]. The
approach described above results in a hybrid parallelization consisting of a coarse-grained
MPI layer handling data distribution and communication, and a fine-grained OpenMP layer
exploiting the shared-memory capabilities of the hardware as, e.g., done in [70].
Similarly to multi-core CPUs, GPUs can be harnessed to exploit data parallelism. While traditionally
programmed via OpenCL or CUDA, the pragma-based language extension OpenACC
allows for a programming style similar to OpenMP [101]. The extension was based
on OpenMP and therefore shares a lot of the syntax as well as the process of code
Left (implementation for single cores):

do e = 1, n_e
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      M_u(i,j,k) = M_123(i,j,k) * u(i,j,k,e)
      r(i,j,k,e) = M_u(i,j,k) * d(0,e)
   end do; end do; end do

   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,i) * M_u(l,j,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(1,e)
   end do; end do; end do

   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,j) * M_u(i,l,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(2,e)
   end do; end do; end do

   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,k) * M_u(i,j,l)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(3,e)
   end do; end do; end do
end do

Right (augmented with OpenMP and OpenACC directives):

!$omp do
!$acc parallel present(u, r) &
!$acc&   num_workers(1) async(1)
!$acc loop gang worker private(M_u)
do e = 1, n_e
   !$acc cache(M_u)

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      M_u(i,j,k) = M_123(i,j,k) * u(i,j,k,e)
      r(i,j,k,e) = M_u(i,j,k) * d(0,e)
   end do; end do; end do

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      !$acc loop seq
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,i) * M_u(l,j,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(1,e)
   end do; end do; end do

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      !$acc loop seq
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,j) * M_u(i,l,k)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(2,e)
   end do; end do; end do

   !$acc loop collapse(3) independent vector
   do k = 1, N_P; do j = 1, N_P; do i = 1, N_P
      tmp = 0
      !$acc loop seq
      do l = 1, N_P
         tmp = tmp + L_tilde_T(l,k) * M_u(i,j,l)
      end do
      r(i,j,k,e) = r(i,j,k,e) + tmp * d(3,e)
   end do; end do; end do
end do
!$acc end parallel
!$omp end do
Figure 6.1: Fortran implementation of the element Helmholtz operator with the factorization
from (3.7). In the code, n_e is the number of elements, N_P the one-dimensional
operator size np, constant during compilation, M_123 the diagonal of the three-dimensional
mass matrix, L_tilde_T the transpose of the matrix in (3.9), d the metric
factors, u the input vector, and r the output vector. Left: Implementation for single
cores. Right: Single-core implementation augmented with directives for OpenMP
and OpenACC.
extension: Directives encapsulate the operations, designating them to be offloaded to the
accelerator. However, where CPUs typically incorporate tens of cores, GPUs are built with
thousands of them. Hence, exploitation of parallelism inside the elements is required. This
can be facilitated by encapsulating loops inside the elements in pragmas. Contrary to
OpenMP, where initialization of threads and false sharing of data leads to performance
degradation, the low-bandwidth interconnect to the GPU mandates minimization of memory
transfers [44], lest the performance ends up being limited by the transfer bandwidth and not
the compute performance. Hence, the data is created once at the start of the algorithm, and kept
on the GPU for the whole duration, only updating boundary data. Finally, asynchronous
execution of GPU kernels further enhances the performance.
Figure 6.1 shows a loop-based implementation of the element Helmholtz operator in the
factorized form (3.9). The unaugmented variant consists of a loop over all elements. First,
the precomputed three-dimensional mass matrix weights the input vector and the diagonal
contribution from the Helmholtz parameter is saved. Then, the stiffness matrix is applied
in all three coordinate directions, with one loop nest each. Overall, 27 lines of code are
used for the implementation. For OpenMP, adding one “!$omp do” statement before and
an “!$omp end do” statement behind the loop suffices for acceleration. As the threads are
created outside of the operator, every variable created inside is located on the stack of the
respective thread and is therefore local to the thread. No statements for privatization are
needed. For OpenACC, further code is required. First, a “!$acc parallel” region instantiates
a kernel and states whether the data already resides on the GPU or needs to be copied.
Then, “!$acc loop” enclosing the element loop instantiates a GPU kernel, designates it for
the “gang” and “worker” execution units, and states that the temporary M_u shall not be
shared among these. Every interior loop nest is collapsed to allow for parallelism for the
“vector” compute unit. Lastly, the “!$acc loop seq” directive prevents the compiler
from parallelizing the reduction loops. In total, 14 lines of code are required to accelerate
the kernel.
Both OpenMP and OpenACC implement data parallelism using compiler directives.
These are comments and, hence, the program can be compiled without them. This, in turn,
allows handling both types of fine-level parallelization pragmas inside one code, compiling
only for the hardware at hand. Furthermore, with both variants inside one single source code,
code maintenance is easier than when supporting two completely separate implementations
of the same problem. Where the implementation of key operators, as described in Chapter 3,
might differ, the remainder of the code stays the same, with slight additions (“glue code”)
for data handling and thread instantiation.
The matching MPI communication patterns generate a second benefit: The multiple-program
multiple-data capabilities of MPI allow coupling different versions of the program, each
compiled for different hardware. For instance, an OpenMP and an OpenACC version of
the program can be used in a single simulation, generating a heterogeneous parallel program
working on different types of hardware. In this way, a heterogeneous runtime environment
utilizing all available processing units can be generated directly from a single source without
any further work, eventually enabling a better speedup of the application. Moreover, other
compute units, for instance FPGAs, can be incorporated in a similar fashion, if pragma-based
language extensions allow programming them.
6.2.3 Performance gains for homogeneous systems
To gauge the performance of the parallelization, the solver bfCG was enhanced to be able to
compute on nodes comprising multi-core CPUs and GPUs: Each operator was augmented
Table 6.1: Speedup over using a single core when computing on either multiple cores with OpenMP
or GK210 GPU chips on nS sockets. The corresponding runtimes per iteration
are located in Table A.1 in the appendix.

                      p = 7               p = 11              p = 15
nS   Setup       ne = 8³  12³  16³   ne = 8³  12³  16³   ne = 8³  12³  16³
1 4 cores 3.8 3.9 3.4 3.9 3.8 3.8 3.9 3.8 3.8
1 8 cores 7.3 7.2 6.0 5.4 6.9 6.8 7.0 6.9 6.9
1 12 cores 10.1 9.9 8.9 10.3 9.3 9.2 9.5 9.1 9.3
1 1 GK210 13.0 20.3 24.7 17.9 22.4 24.1 16.7 18.7 19.5
1 2 GK210 15.9 31.8 42.2 28.4 39.0 44.1 29.6 34.5 36.6
2 4 cores 7.2 8.0 7.8 8.0 7.8 7.7 7.9 7.6 7.6
2 8 cores 13.2 15.1 14.3 15.4 14.3 13.8 14.6 13.7 11.9
2 12 cores 17.3 21.3 17.3 21.5 19.5 18.6 19.7 18.5 18.0
2 1 GK210 14.8 31.2 41.9 28.2 39.2 45.1 29.6 34.6 37.2
2 2 GK210 15.9 43.9 66.5 44.6 64.5 81.0 51.2 62.8 70.5
with OpenACC and OpenMP directives, and conditional compilation leads to a process
either computing on one GPU via OpenACC, or on a set of CPU cores addressed with
OpenMP. To couple the processes, domain decomposition using static, block-structured
grids was implemented, with the inter-process communication realized with MPI.
Runtime tests were conducted for the homogeneous case of using either one CPU or one
GPU. As in previous chapters, one node of the HPC system Taurus served as measurement
platform. The node utilized in this chapter incorporates two NVIDIA K80 cards, consisting
of two Kepler GK210 GPU chips each, in addition to the two Intel Xeon E5-2680 processors.
To obtain programs for multi- and many-cores, the PGI compiler v. 17.7 compiled the program,
once with OpenMP and once with OpenACC. The runtime and performance data
presented here differ from those presented in Section 3.4 as a different compiler is utilized.
The periodic domain Ω = (−1, 1)³ was discretized with ne = 8³, ne = 12³, and ne = 16³ elements,
with bfCG solving the Helmholtz equation. The process was repeated ten times, first
computing on one socket with 1, 4, 8, or 12 cores or 1 or 2 GPUs, and then on two sockets
to evaluate the MPI implementation. As runtime libraries, such as the implementation of
OpenACC, can autotune, the solver was called 50 times, with the time per iteration of the
last 20 solver calls being averaged, leading to a reproducible runtime.
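The measurement protocol can be sketched as follows. This is a minimal illustration whose function and parameter names are invented, not taken from the thesis code: the first calls are discarded so that autotuning runtime libraries have settled before the average is formed.

```python
import time

def time_per_call(solver, n_calls=50, n_avg=20):
    """Call the solver n_calls times and average the wall time of the
    last n_avg calls, discarding the warm-up phase."""
    times = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        solver()                     # one full solver invocation
        times.append(time.perf_counter() - t0)
    return sum(times[-n_avg:]) / n_avg
```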
Table 6.1 compiles the speedups over one core, with the corresponding runtimes residing
in Table A.1 in the appendix. For p = 7, an increase of the core count to four yields a near-linear
speedup, which slightly degrades at ne = 16³. This is to be expected: the data
set for the CG solver fits into the L3 cache, whose bandwidth scales near linearly with
the number of utilized threads [48]. The RAM bandwidth does, however, not scale linearly
with the number of utilized cores, leading to lower speedups at larger data sizes, where the
performance degradation becomes more apparent with higher thread counts. For 8 cores,
the speedup is lowered from 7 to 6 and for 12 cores from 10 to 9. That the same amount of
performance is lost stems from the CPU layout [48]: The first eight cores are located on a ring
interconnect with its own RAM interface, as are the last four. However, the latter are only
added when switching from 8 to 12 threads, leading to near double the performance in
some cases.
For p = 7, computing on one GK210 chip generates a slightly better performance than one
whole CPU at ne = 8³, with the speedup increasing with larger data sizes to 24, i.e. the
compute power of two whole CPUs. One part of this higher speedup stems from the CPU
not computing in the L3 cache, but rather loading data from RAM, leading to a larger
runtime for the CPU. However, adding a second GPU chip does not significantly increase
the performance at ne = 8³, only raising the speedup from 13 to 16, but pays off at ne = 16³,
where the 24 get complemented by a further 18. While for the two GPUs the memory bandwidth
scales linearly, the overhead of computing on GPUs is significant, leading to an offset in the
runtime.
When increasing the polynomial degree, the speedup obtained with OpenMP stays constant.
For the GPU it does not, as the larger data sizes allow circumventing the offset in the runtime.
This becomes even more pronounced when using the two available GK210 chips, with the
speedup at ne = 8³ increasing from 16 at p = 7 to 30 at p = 15.
When computing on both sockets, a near-linear increase in speedup is present for the CPU
runs. With the utilized architecture, the memory bandwidth and compute power double,
allowing for the linear increase while incurring very slight inefficiencies due to communication
of boundary data. The GPU computations, however, show little performance gain for p = 7.
Using one GK210 chip per socket nets the performance of two on one socket, and increasing
to four chips only gains 50 % more performance for high element counts, but near none at
low numbers of elements. This, again, indicates that large offsets in the runtime of the GPU
are present and large data sizes are required for the implementation to become efficient.
While not generating a linear speedup, the implementation still lies inside the expected
ranges, both for CPUs as well as for GPUs. Furthermore, the adopted concept allows
addressing different kinds of hardware with the same code base, attaining speedups in the
expected range. However, when the GPUs compute, the CPUs stay underutilized and vice
versa. This will be addressed in the next section.
Figure 6.2: Socket layout and utilized mapping of CPU and GPU resources to MPI, OpenMP,
and OpenACC. Left: Two MPI processes, one steering 10 cores using OpenMP (light
grey), with the first core steering the first GK210 chip (grey). Right: Whole CPU
gets utilized by adding a further OpenACC process (dark grey).
6.3 Load balancing model for wildly heterogeneous systems
6.3.1 Single-step load balancing for heterogeneous systems
To attain effective load balancing of the computation, runtime models and, therefore, runtime
data are required. As static decompositions are sought here, the information is required prior
to heterogeneous computation. Hence, runtime tests were conducted for the homogeneous
case of using either one CPU or one GPU. Figure 6.2 depicts the setup of one socket, as
described in [48]. As the two GPU chips per GPU card are addressed separately, one process
per chip is utilized, allocating two of the twelve available cores to harness the GPUs. The
remaining ten cores were fused into one process via OpenMP.
Runtime tests were conducted using either ten consecutive cores or one GK210 chip. For a
polynomial degree of p = 7, the number of elements varied between ne = 100 and ne = 1000,
with the solver being called 50 times and the last 20 resulting runtimes averaged. Block-
structured grids restrict the possible mesh decompositions, leading to a low granularity
available for load balancing. Hence, a two-dimensional problem was considered with only
one element discretizing the third direction to increase the granularity available for load
balancing. The employed grids and their decompositions reside in Table A.2 in the appendix,
alongside the measured runtimes.
Figure 6.3: Computation using either one GPU or ten cores of a CPU. Left: Times per iteration,
t_Iter, for the homogeneous setup. Right: Speedups over one GK210 chip.
Figure 6.3 depicts the runtimes in combination with the speedups. As the goal lies in adding
the compute power of the CPU to that of the GPU, the speedup is computed based on the
performance of one GK210 chip. For the CPU, the iteration time is 0.5 ms at ne = 100 and,
except for some small spikes at ne = 500 and ne = 800, increases linearly to 4 ms. The GPU
chip is slower at first, using twice the time per iteration at ne = 100,
but attains equality at ne = 300 and is near twice as fast for large numbers of elements.
For both CPU and GPU processes, the iteration times t_Iter behave mostly linearly with regard
to ne and can be approximated using a slope and an offset, similar to the one-dimensional
case discussed in [59]. For compute unit m, the runtime is approximated with constants
C_0^(m) and C_1^(m) such that

    t_Iter^(m) = C_0^(m) + C_1^(m) n_e^(m).                                (6.1)
The approximation holds as long as the compute unit stays in the same state, e.g. loads data
from the same cache level. Furthermore, side effects such as memory contention are not
accounted for.
To gain a decomposition of the mesh, the load balancing model from [30, 59, 140] is applied.
The respective iteration times of the processes are approximated using (6.1), with the
maximum iteration time of both units dominating the iteration time as

    t_Iter = max_m ( t_Iter^(m) ).                                         (6.2)

Minimization yields the optimal runtime

    t_Iter* = max_m ( t_Iter^(m) ) → min   with   n_e = Σ_m n_e^(m).       (6.3)
Figure 6.4: Iteration times and speedups for computations using one GPU and ten cores.
Left: Times per iteration, t_Iter, for the heterogeneous setup. Right: Speedups over
one GK210 chip.
The above results in both units having approximately the same time for one iteration.
For load balancing, the runtime data from Figure 6.3 was approximated with (6.1) and
a decomposition in the x1 direction computed via (6.3). As the utilized solver requires
block-structured grids, only certain decompositions are possible and the realized element
distribution deviates slightly from the optimum. For these, heterogeneous computations
were carried out, using the ten cores in conjunction with the GPU chip. The employed
grid decompositions are shown in Table A.3 in the appendix, along with the modeled and
measured runtimes.
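The single-step procedure can be condensed into a few lines. The numbers below are illustrative, chosen to roughly match Figure 6.3, and are not the measured data: the constants of (6.1) are fitted per compute unit by least squares, and (6.3) is solved for two units by equalizing the predicted iteration times.

```python
def fit_linear(ns, ts):
    """Least-squares fit of t_Iter = C0 + C1 * n_e to measured (n_e, t) pairs."""
    k = len(ns)
    mn, mt = sum(ns) / k, sum(ts) / k
    c1 = sum((n - mn) * (t - mt) for n, t in zip(ns, ts)) \
        / sum((n - mn) ** 2 for n in ns)
    return mt - c1 * mn, c1  # (C0, C1)

def optimal_split(ne, cpu, gpu):
    """Solve (6.3) for two units: equal predicted times
    C0_c + C1_c * n_c = C0_g + C1_g * (ne - n_c), clamped to [0, ne]."""
    (c0c, c1c), (c0g, c1g) = cpu, gpu
    n_c = (c0g - c0c + c1g * ne) / (c1c + c1g)
    return max(0.0, min(ne, n_c))

# hypothetical measurements: CPU 0.5 ms at 100 elements, 4.0 ms at 1000;
# GPU 1.0 ms at 100 elements, 2.2 ms at 1000
cpu = fit_linear([100, 1000], [0.5, 4.0])
gpu = fit_linear([100, 1000], [1.0, 2.2])
print(optimal_split(1000, cpu, gpu))  # CPU share of the 1000 elements
```

With these illustrative numbers, the model assigns roughly 400 elements to the CPU and 600 to the GPU; the realized decomposition then has to be rounded to an admissible block-structured split.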
The runtimes and speedups are shown in Figure 6.4. The runtime model predicts that heterogeneous
computation is inefficient at ne = 100 and, hence, no data is presented. At ne = 200,
the load balancing model deems heterogeneous computing efficient, with a predicted factor
of 1.7 over using just one GK210, remaining faster for all numbers of elements and predicting
a speedup of over 1.5 beyond ne = 500. However, a stark difference is present between
model and measurement. Where a speedup of over 1.5 was expected, the heterogeneous
computation hardly outperforms a single GPU chip. As Table A.3 summarizes, the relative
error between model and measurement is forty percent for all tests. While the approach is
well suited for homogeneous systems and simple algorithms for heterogeneous systems, e.g.
explicit time stepping or lattice Boltzmann codes [30, 59], it evidently does not work for
the present problem.
6.3.2 Problem analysis
The last section showcased that a single-step model does not suffice to solve the load balancing
problem for the Helmholtz solver. To glean some insight into why it does not work, a closer
Figure 6.5: Influence of communication barriers on load balancing for one CPU (C) collaborating
with one GPU (G) for the present solver with ne = 1000. Left: Load balancing when
disregarding communication barriers using the single-step model (n_e^(C) = 400,
n_e^(G) = 600). Middle: Same case with communication barriers retained. Right:
Runtimes when accounting for barriers (n_e^(C) = 225, n_e^(G) = 775).
look onto Algorithm 6.1 is required. In the iteration process, most operations work on an
element-by-element basis. They work on local data, do not generate side effects, require no
communication and are easy to parallelize. For a load balancing model they only result in
a larger runtime to account for and, therefore, different model parameters. The operations
not adhering to this pattern are the scalar products and the gather-scatter operation. The
scalar products in lines 10, 13, and 16 implement reduction operations over all data points
and, hence, processes. As they compute, e.g., step widths that are required for further
steps in the algorithm, their evaluation constitutes a synchronization barrier. Furthermore,
the gather-scatter operation in line 9 adds the residual on adjoining elements and, hence,
requires boundary data exchange with neighboring subdomains. For computing the scalar
product in line 10, the result is required, necessitating a fourth communication barrier.
When modelling the communications steps as barriers, Algorithm 6.1 decomposes into four
distinct substeps: The first starts after the scalar product in line 16 and ends before the
boundary data exchange in line 9. It incorporates the update of the search vector p, ap-
plication of the element-wise Helmholtz operator, the local part of the gather-scatter
operation, and collection and communication of boundary data. After the boundary data
exchange, the second substep begins at line 9 and ends after the local part of the scalar
product in line 10. It inserts the received boundary data into the subdomain and evaluates
the local part of the scalar product. The third substep, line 10 to line 13, consists of the
update of solution and residual vector. The last substep incorporates application of the
preconditioner and another scalar product. It extends from line 13 to 16.
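The substep structure described above can be mirrored in a compact sketch. The following minimal, diagonally preconditioned CG is illustrative only: the small dense "matrix" stands in for the element-wise Helmholtz operator plus gather-scatter, a Jacobi preconditioner stands in for the fast-diagonalization preconditioner, and the barrier comments mark where the global synchronization points of Algorithm 6.1 would fall.

```python
# Minimal preconditioned CG illustrating the four substeps between
# communication barriers. All names are illustrative, not solver code.

def dot(x, y):
    # local scalar product; globally this is an all-reduce, i.e. a barrier
    return sum(a * b for a, b in zip(x, y))

def pcg(A, b, tol=1e-20, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual for x = 0
    M_inv = [1.0 / A[i][i] for i in range(n)]    # Jacobi preconditioner
    z = [mi * ri for mi, ri in zip(M_inv, r)]
    p = list(z)
    rz = dot(r, z)                               # barrier: scalar product
    for _ in range(max_iter):
        # substep 1: operator application, then boundary data exchange
        Ap = [dot(row, p) for row in A]
        # substep 2: insert boundary data, local part of the scalar product
        pAp = dot(p, Ap)                         # barrier: scalar product
        alpha = rz / pAp
        # substep 3: update solution and residual
        for i in range(n):
            x[i] += alpha * p[i]
            r[i] -= alpha * Ap[i]
        # substep 4: preconditioner and another scalar product
        z = [mi * ri for mi, ri in zip(M_inv, r)]
        rz_new = dot(r, z)                       # barrier: scalar product
        if rz_new < tol:
            break
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Every `dot` call corresponds to a synchronization point, so load imbalance in any one substep accumulates directly into the iteration time.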
To investigate the effect of these barriers, bfCG was augmented with runtime measurements
for the four substeps. One time stamp is taken before and one after each communication
and the accumulated runtime averaged over the number of iterations. The runtime measurements for the
homogeneous case were repeated, this time recording the runtimes of the substeps and approximating
these using (6.1). For ne = 1000, Figure 6.5 demonstrates the effect of the communication
barriers. The single-step load balancing does not take the barriers into account and tries
to optimize for the left case. In practice, the communication barriers lead to the middle
case, where a severe load imbalance occurs in each substep. This increases the runtime and
leads to a deviation from the prediction. As shown on the right, decreasing the number of
elements for the CPU can mitigate parts of the imbalances and decrease the overall runtime.
Obviously, load balancing with regard to solver communication barriers is required to achieve
optimal results.
6.3.3 Multi-step load balancing for heterogeneous systems
Using the insights from the last section, the load balancing model from Subsection 6.3.1 is
readily extended: For each substep of the solution process, a linear fit approximates the
runtime, i.e. for compute unit m and substep i

t_i^{(m)} = C_{0,i}^{(m)} + C_{1,i}^{(m)} n_e^{(m)} ,   (6.4)

where C_{0,i}^{(m)} is a non-negative constant, C_{1,i}^{(m)} a positive slope, and n_e^{(m)} the number of elements assigned to the compute unit. Similarly to the single-step model, the approximation
requires the compute unit to stay in the same state, e.g. using the same cache hierarchy,
and side effects such as memory saturation are excluded. The end of each substep invokes
synchronization, e.g. collectively calculating a scalar product or exchanging boundary data.
With communication in CFD typically being latency-bound [49], differences in communication time are neglected here.
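Fitting the constants of the per-substep runtime model (6.4) is a small least-squares problem per compute unit and substep. A self-contained sketch, with synthetic measurements in place of the real time stamps:

```python
# Simple linear least squares for the runtime model t = C0 + C1 * n_e.
# The measurement values below are synthetic and for illustration only.

def fit_linear(n_elems, runtimes):
    """Return (C0, C1) minimizing the squared misfit of C0 + C1 * n."""
    m = len(n_elems)
    sx, sy = sum(n_elems), sum(runtimes)
    sxx = sum(n * n for n in n_elems)
    sxy = sum(n * t for n, t in zip(n_elems, runtimes))
    c1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    c0 = (sy - c1 * sx) / m
    return c0, c1

# synthetic data: 0.2 ms offset plus 0.003 ms per element
ns = [100, 200, 400, 800]
ts = [0.2 + 0.003 * n for n in ns]
C0, C1 = fit_linear(ns, ts)
```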
As before, the slowest process dominates the runtime of a substep t_i:

t_i = max_m ( t_i^{(m)} ) = max_m ( C_{0,i}^{(m)} + C_{1,i}^{(m)} n_e^{(m)} ) ,   (6.5)

and with multiple steps to consider, the optimum runtime now computes to

t_Iter^* = Σ_i max_m ( t_i^{(m)} ) → min   with   n_e = Σ_m n_e^{(m)} .   (6.6)
Figure 6.6: Runtimes predicted by the single-step and multi-step load balancing model for ne = 1000 compared to measurements. Here, n_e^{(C)} denotes the number of elements assigned to the CPU. Left: modeled runtimes for every possible distribution. Right: closeup of the transition.
Where the single-step model from Subsection 6.3.1 utilized one linear system, here each
substep replicates one. The resulting coupled system is, however, non-linear and not easily
solved. Introducing auxiliary variables d_i for each substep with

∀m : d_i ≥ t_i^{(m)} ≥ 0 ,   (6.7)

casts it into a linear one,

t_Iter^* = Σ_i d_i → min   with   n_e = Σ_m n_e^{(m)} ,   (6.8)

which standard linear programming techniques solve, allowing for a short time to solution.
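For the two compute units considered here, the optimum of the multi-step objective can also be found by a direct search over all integer element distributions, which makes the structure of (6.6) explicit without a linear-programming solver. The substep coefficients below are invented for illustration; in practice they are fitted per compute unit and substep via (6.4).

```python
# Multi-step load balancing for one CPU (C) and one GPU (G) by direct
# search: minimize sum_i max(t_i^C(n_C), t_i^G(ne - n_C)) over n_C.
# Each substep is described by an (offset, slope) pair, cf. (6.4).

def iter_time(n_c, ne, coeffs_cpu, coeffs_gpu):
    n_g = ne - n_c
    return sum(max(c0c + c1c * n_c, c0g + c1g * n_g)
               for (c0c, c1c), (c0g, c1g) in zip(coeffs_cpu, coeffs_gpu))

def balance(ne, coeffs_cpu, coeffs_gpu):
    """Return the (n_C, t_iter) pair with minimal summed substep maxima."""
    return min(((n, iter_time(n, ne, coeffs_cpu, coeffs_gpu))
                for n in range(ne + 1)), key=lambda pair: pair[1])

# invented coefficients: four substeps, GPU with larger offsets but
# smaller per-element cost
cpu = [(0.02, 0.004), (0.01, 0.001), (0.01, 0.001), (0.02, 0.004)]
gpu = [(0.10, 0.001), (0.05, 0.0003), (0.05, 0.0003), (0.10, 0.001)]
n_c, t_opt = balance(1000, cpu, gpu)
```

Because every substep contributes its own maximum, the optimizer trades a small imbalance in one substep against a larger one in another, exactly the effect visible in Figure 6.5.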
To investigate whether the derived model more accurately replicates reality, computations
for every decomposition of the ne = 1000 mesh were conducted. Figure 6.6 depicts the result
in combination with predictions from the single-step and multi-step models. With the single-step
model, the runtime decreases when increasing the number of elements on the CPU,
n_e^{(C)}, as the CPU takes work from the GPU. After reaching equilibrium between CPU and
GPU at n_e^{(C)} = 400, the iteration time increases, with the CPU now being slower. The
measurements, however, exhibit a different behaviour: While in agreement with the single-step
model at first, the runtime starts increasing at n_e^{(C)} = 225. A large discrepancy between
single-step model and measurement results, especially at the equilibrium point predicted by
the single-step model. With the extended model, the equilibria of the substeps divide the
plot into five distinct zones, each exhibiting a different slope. Overall, the modeled runtime
is higher, stemming from more restrictions in the optimization, but so is the accuracy:
Figure 6.7: Runtimes for one iteration and speedups over one GPU with the new load balancing model. Left: iteration times. Right: speedups.
near-perfect agreement with the measurement is present until n_e^{(C)} = 350. The multi-step
model yields a more accurate reproduction of the measurements than the single-step version.
Furthermore, the minimum iteration time at n_e^{(C)} = 225 constitutes the most relevant feature
of the runtime, and the multi-step model is capable of capturing it, while the single-step model
is not.
6.3.4 Performance with new load balancing model
The runtime tests from Subsection 6.3.1 were repeated including heterogeneous computation
using the multi-step load balancing model. Figure 6.7 depicts the resulting runtimes and
speedups. As the multi-step model introduces more restrictions, the predicted speedup is
lower compared to the single-step model: Where previously a speedup of up to 1.5 was
computed, only 1.3 is estimated now. However, the predictions match the experiments. The
model is four times as accurate and the error is less than ten percent. Furthermore, the
attained speedup is higher for all utilized number of elements. Hence, the multi-step model
is preferable for this configuration.
To ensure that the behaviour of the two load balancing schemes is not limited to the specific
test case and data size, the test was repeated for p = 11 and p = 15 in addition to the already
computed case of p = 7. Figure 6.8 depicts the resulting speedups over using one GPU and
the relative errors of the load balancing models versus their respective measured runtimes.
In all three cases, the GPU does not compute significantly faster than the CPU for low
numbers of elements, but attains speedups near two for large numbers of elements. For p = 11,
a spike is present at ne = 300, stemming from a relatively high runtime for the CPU. For
all polynomial degrees, the single-step model predicts that the whole computational power
Figure 6.8: Performance gains with heterogeneous computing. Left: speedups over using only 10 cores. Right: relative errors of the two considered load balancing models. Top: p = 7, middle: p = 11, bottom: p = 15.
Figure 6.9: Speedups over two GK210 chips when additionally using 10 cores of each of the 2 CPUs, when using four GPUs, and when using all compute resources of a node.
of the CPU can be added to the one of the GPU, netting a speedup of 1.5 for all polynomial
degrees. However, this is not achieved in practice. For low numbers of elements the error
is near 50 %. It decreases with the number of elements and saturates at 40 % for p = 7,
20 % for p = 11, and 10 % for p = 15. This decrease stems from the scaling of the operator
costs: Where the element Helmholtz operator and preconditioner scale with O(p^4), the
remaining operations scale with O(p^3), increasing the relevance of substeps 1 and 4 and
diminishing the impact of substeps 2 and 3. In effect, only two substeps remain relevant
for load balancing, decreasing the difference between the models. Moreover, Helmholtz
and fast diagonalization operator exhibit a similar runtime behaviour, further decreasing the
error. The multi-step model predicts more modest performance gains. For p = 7, the speedup
of the GPU is increased to 1.3, with the model closely fitting experiments after ne = 400 and
achieving an error lower than 5 percent. For lower numbers of elements, the elements allocated
to the CPU are cached in L3 and different constants are required for accurate predictions
in this regime. In all three cases, the multi-step model outperforms the single-step model,
both in terms of prediction accuracy and attained speedup, adding 30 %, 50 % and 80 % of
the CPU performance to the one of the GPU for p = 7, 11, 15, respectively. Furthermore, no
performance degradation is encountered when using heterogeneous computing.
Until now, only parts of the node were utilized: only 11 of 24 cores and only one of four
GPU chips. To test the model in a real-life scenario, a simulation using 40 × 40 × 1 = 1600
elements of polynomial degree p = 11 was performed on one node, leading to approximately
2.1M degrees of freedom. Four setups were investigated: The reference setup uses
two of the four GK210 chips, i. e. one per socket. The second one adds ten cores per socket
to the two chips. The third setup utilizes all four GK210 chips present, and the last one
leverages the whole node to solve the problem (see Figure 6.2). The time per iteration was
measured, averaged over 20 solver runs, and the resulting speedup over two GK210 chips
computed.
Figure 6.9 depicts the speedups over two GPUs. The setup employing all four GK210 chips
achieves a speedup of 1.8, which can be explained by the large offsets present in the runtimes
of the substeps on the GPU. Adding the compute power of twenty cores to the two GPUs,
the computation is 25% faster, reasonably matching the prediction. Similarly, the same
absolute speedup gain of 0.25 is achieved when using the 20 cores in conjunction with the
four GPUs.
6.4 Conclusions
In this chapter, the orchestration of heterogeneous systems was investigated on the main
ingredient of a flow solver, the Helmholtz solver. Following the traditional parallelization
concepts in CFD, a two-level parallelization was introduced. MPI realized domain decom-
position, while pragma-based language extensions exploited the data parallelism inside the
partitions. The model allowed utilizing one single source to generate code running
on a heterogeneous system, lowering the maintenance effort.
Thereafter, load balancing for the resulting heterogeneous system consisting of one CPU
and one GPU was investigated. Many references employ a single-step model to compute the
resulting runtime. For the considered pCG solver, the model led to a relative error of 40 % in
the runtime. The synchronization barriers resulting from scalar products and boundary data
exchange introduced further constraints that the simplified model did not account for and,
in turn, led to large load imbalances. An improved model accounting for these barriers was
derived. With the new model, runtime and prediction matched closely, with the error lying
below 5 % for most cases. While the barriers restricted the model and, therefore, lowered the
attainable speedup, the heterogeneous computation was 30 % faster than solely computing
on the GPU for every tested polynomial degree.
Afterwards, computations harnessing the whole computational node were conducted. For
these, the runtime prediction matched the measurement well and the heterogeneous system
generated 25 % more performance than only using the GPUs would have provided. While
the attained speedups seem small, one has to keep in mind that while current hardware only
exhibits mild heterogeneity, future HPCs will be more heterogeneous [26].
Here, only the runtime of a pCG-based Helmholtz solver with local preconditioning was
considered. With multigrid, a multi-level approach to load balancing is required. Hence,
further work will consist of expanding the new load balancing model to multigrid, possibly
including redistribution between the levels.
Chapter 7
Specht FS – A flow solver computing
on heterogeneous hardware
7.1 Introduction
In the previous chapters, large increases in the performance of operators and consecutively
elliptic solvers were demonstrated. These consisted of algorithms lowering the operations
count, optimizations of the operators themselves, and exploitation of heterogeneous hard-
ware. However, it is yet unclear whether these improvements carry over to a full flow solver.
Hence, this chapter will combine these techniques into a flow solver, validate it, and showcase
the attainable performance for flow simulations. The structure of this chapter follows this
order, with Section 7.2 introducing the notation and presenting the flow solver, Section 7.3
validating the implementation, and Section 7.4 investigating the performance of the solver.
A preliminary validation was performed in [109] and excerpts of the performance tests are
available in [63].
7.2 Flow solver
7.2.1 Incompressible fluid flow
Incompressible fluid flow is governed by the Navier-Stokes equations, see e.g. [81, 115]. In
non-dimensional form these can be written as

∂_t u + ∇^T(u u^T) = −∇P + (1/Re) Δu + f   (7.1a)

∇ · u = 0   (7.1b)
with u being the velocity vector, P the pressure, Re the Reynolds number, t the time vari-
able and f the body force. Equation (7.1a) constitutes the momentum balance of the fluid,
whereas the continuity equation (7.1b) restricts the velocity field to be free of divergence.
Multiple problems arise when discretizing the Navier-Stokes equations: First, the con-
vection terms are non-linear and typically lead to non-symmetrical equation systems, raising
the costs of implicit treatment. Second, the pressure possesses no evolution equation, barring
the usage of traditional discretization methods. And third, the continuity equation (7.1b) is
just a constraint on the velocity field, rather than a separate equation to solve.
7.2.2 Spectral Element Cartesian HeTerogeneous Flow Solver
This section presents the flow solver Spectral Element Cartesian HeTerogeneous Flow Solver,
in short Specht FS. The solver is the culmination of the previous chapters: It incorporates
the algorithmic advances for static condensation from Chapter 4 and the linearly scaling
multigrid solver from Chapter 5, as well as the hardware-optimized operators from Chapter 3, and
is parallelized with the concept from Chapter 6. The goal of this code is not to generate
another general-purpose code, but rather to evaluate the performance attainable with high-
order methods. Therefore, the code serves as testbed for parallelization techniques, operator
optimization and algorithmic factorization techniques.
Specht FS solves the incompressible Navier-Stokes equations, using a spectral-element
method in space while using projection schemes in time. Multiple time stepping schemes
are implemented. These include velocity correction [75], consistent splitting schemes and
rotational incremental pressure-correction schemes [46]. While the former two are only im-
plemented using the same polynomial degree in pressure and velocity, the implementation
of the latter allows to use a lower polynomial order in the pressure. Furthermore, additional
variables, such as passive scalars, can be added.
The spatial discretization of Specht FS is in accordance with the previous chapters: A
spectral-element discretization implements the weak forms resulting from the time-stepping
schemes. The discretization uses structured Cartesian grids composed of tensor-product ele-
ments using GLL points. On these, solution of Helmholtz equations is facilitated by any
of the solvers from Chapter 2, Chapter 4, and Chapter 5. The associated operators, rang-
ing from Helmholtz operator to preconditioner, were implemented using the techniques
described in Chapter 3, leading to hardware-optimized operators which extract at least a
quarter of the available performance. Furthermore, four different versions of the transport
operator are implemented: The convective, quasi-linear form, the conservative form, the
skew-symmetric form, and the overintegration form [23]. Lastly, Specht FS allows stabiliza-
tion of the time-stepping scheme by either polynomial filtering [33] or the spectral vanishing
viscosity (SVV) model, which can be included in the diffusion solvers without any overhead.
The solver is written in modern Fortran using the parallelization concept from Chapter 6:
MPI implements domain decomposition with OpenMP and OpenACC providing a fine-
grain parallelization inside the partitions to exploit data parallelism. Therefore, conditional
compilation can be utilized to generate a single-core runtime computing solely with MPI,
one exploiting shared-memory via OpenMP, GPUs via OpenACC, or both to create a
heterogeneous runtime environment. For instance, the runtime measurements in Chapter 6
were conducted in a flow testcase in Specht FS.
7.2.3 Pressure correction scheme in Specht FS
Algorithm 7.1 summarizes the rotational incremental pressure-correction scheme [46]. The
scheme utilizes backward-differencing formulas of order k (BDFk) as basis, treating convection
terms and the body forces explicitly in time, whereas the diffusion terms are treated
implicitly. In each time step, the pressure, convection terms, and body forces get extrapo-
lated from the previous time steps, with the body force being extrapolated using order k and
the pressure being extrapolated with order k−1 for stability reasons. Therefore, the scheme
requires the information from k previous time levels and is only self-starting for BDF1. For
BDF2, one prior step of BDF1 is required.
The first operation of the time step consists of extrapolating the pressure and velocities to
the end of the time step. Then, the pressure derivative, convection terms, and body forces
are treated explicitly by serving as right-hand side for the implicit treatment of the diffu-
sion terms. During this treatment, the boundary conditions for the velocities are enforced.
The resulting velocity u ⋆ incorporates the correct boundary conditions but is not free of di-
vergence. Therefore, a potential ∆ϕ ⋆ is computed with homogeneous Neumann boundary
conditions and the velocity is corrected to be divergence-free. Lastly, the pressure is updated
using the potential.
With the BDF1 scheme treating diffusion implicitly, only the time step restriction for the
convective term remains. Therefore, the time step is limited by the CFL criterion [74]

Δt = (C_CFL / p²) · min_e |h_e| / max_Ω |u| ,   (7.2)
where h_e denotes the extent of the element Ω_e, and C_CFL the CFL number. For the BDF2
scheme and higher orders of BDFk, an in-depth stability analysis can be found in [66].
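A minimal helper evaluating the CFL criterion (7.2); the uniform element extent and the maximum velocity magnitude are assumed values for illustration:

```python
# Time step from the CFL criterion (7.2): dt = C_CFL / p^2 * min|h_e| / max|u|.

def cfl_time_step(c_cfl, p, h_min, u_max):
    return c_cfl / p**2 * h_min / u_max

# illustrative values: CFL number 0.125, degree 16, element extent 2/12,
# and an assumed maximum velocity magnitude of 1.2
dt = cfl_time_step(c_cfl=0.125, p=16, h_min=2.0 / 12, u_max=1.2)
```

The p² in the denominator reflects the clustering of the GLL points towards the element boundaries, which shrinks the smallest point spacing quadratically with the degree.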
Algorithm 7.1: Pressure correction method of kth order in rotational form. For sake of readability, the convection operator was short-handed to N(u) = ∇^T(u u^T).

for n = 1, nt do
    P⋆ ← Σ_{i=1}^{k−1} β_i^{k−1} P^{n−i}   ▷ pressure extrapolation of order k − 1
    F ← −∇P⋆ − (1/Δt) Σ_{i=1}^{k} α_i^k u^{n−i} + Σ_{i=1}^{k} β_i^k ( f^{n−i} − N(u^{n−i}) )
    Solve: (α_0^k Re / Δt) u⋆ − Δu⋆ = Re · F   ▷ Helmholtz equations for velocities
    Solve: −Δϕ⋆ = −(1/Δt) ∇ · u⋆   ▷ Laplace equation for correction
    u^n ← u⋆ − Δt ∇ϕ⋆   ▷ correct velocities
    P^n ← P⋆ + ϕ⋆ − (1/Re) ∇ · u⋆   ▷ update pressure
end for
Table 7.1: Utilized coefficients α_i^k for backward-differencing formulas of order k and weights β_i^k for extrapolation of order k.

k    α_0^k   α_1^k   α_2^k   β_1^k   β_2^k
1    1       −1      —       1       —
2    3/2     −4/2    1/2     2       −1
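The coefficients of Table 7.1 satisfy the usual consistency conditions: the BDFk weights α sum to zero and reproduce the first derivative, while the extrapolation weights β sum to one. A sketch storing them with exact rational arithmetic; the data layout is illustrative, not solver code:

```python
from fractions import Fraction

# Coefficients of Table 7.1; index 0 of alpha[k] is alpha_0^k (the weight
# of the new time level), entry i of beta[k] is beta_{i+1}^k.
alpha = {1: [Fraction(1), Fraction(-1)],
         2: [Fraction(3, 2), Fraction(-4, 2), Fraction(1, 2)]}
beta = {1: [Fraction(1)],
        2: [Fraction(2), Fraction(-1)]}

def extrapolate(values, k):
    """Extrapolate to the new time level from the k previous values,
    values[0] being the most recent one."""
    return sum(b * v for b, v in zip(beta[k], values))
```

For linear-in-time data the order-2 extrapolation is exact, e.g. previous values 3 and 1 yield 2·3 − 1·1 = 5 at the new time level.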
7.3 Validation
7.3.1 Test regime
To validate the flow solver Specht FS, three tests are conducted. The first one investigates
the behavior for periodic domains using the Taylor-Green vortex. Due to the periodicity,
the implementation of convection terms and coupling between velocity and pressure can be
evaluated separately from the implementation of boundary conditions and the corresponding
splitting error. The second one adds boundary conditions and tests these. After validating
for small analytical test cases, the turbulent plane channel flow at Reτ = 180 serves to
validate simulation of turbulent flows.
Preliminary versions of the tests mentioned above were carried out in [109] for the velocity
correction scheme from [75]. Here, only the incremental pressure-correction scheme is tested,
with the same polynomial order for pressure and velocity. Furthermore, exact integration of
the convection terms is performed.
Figure 7.1: Taylor-Green vortex for a vanishing transport velocity and a spatial frequency of ω = π at t = 0. Left: streamlines. Right: contour plot of the pressure.
7.3.2 Taylor-Green vortex in a periodic domain
In a periodic domain Ω = (−1, 1)³ with a body force of f(x, t) = 0,

u_ex,1(x, t) = u_1,0 + F(t) cos(ω(x_1 − t·u_1,0)) · sin(ω(x_2 − t·u_2,0))   (7.3a)

u_ex,2(x, t) = u_2,0 − F(t) sin(ω(x_1 − t·u_1,0)) · cos(ω(x_2 − t·u_2,0))   (7.3b)

u_ex,3(x, t) = u_3,0   (7.3c)

P_ex(x, t) = −(F(t))²/4 · (cos(2ω(x_1 − t·u_1,0)) + cos(2ω(x_2 − t·u_2,0)))   (7.3d)

with

F(t) = exp(−2ω²t / Re) ,   (7.3e)

is a solution of the Navier-Stokes equations of the class called Taylor-Green vortices [43]. In the above, ω is the spatial frequency of the problem, with ω ∈ π·ℤ, and (u_1,0, u_2,0, u_3,0)^T the transport velocity of the vortices. Figure 7.1 depicts the streamlines of the flow and the pressure field for ω = π, t = 0, and a transport velocity of u_0 = 0.
The test case is suited to investigate the correctness of the implementation of the time-
stepping scheme regarding transport and diffusion terms, as well as the pressure coupling,
whereas the implementation of the body force and boundary conditions can not be tested.
In the following tests the solution parameters are set to ω = π and (u_1,0, u_2,0, u_3,0)^T = (2, 3, 4)^T.
The Reynolds number is Re = 10 and lies in the stable regime of the flow as Re/ω < 50 [43].
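The exact solution (7.3) translates directly into code. A sketch for the parameter values used in this test; the function names are illustrative, not solver code:

```python
import math

# Taylor-Green vortex (7.3) with omega = pi, transport velocity (2, 3, 4),
# and Re = 10, as used in the convergence test.
OMEGA, U0, RE = math.pi, (2.0, 3.0, 4.0), 10.0

def decay(t):
    """Viscous amplitude decay F(t) = exp(-2 omega^2 t / Re), cf. (7.3e)."""
    return math.exp(-2.0 * OMEGA**2 * t / RE)

def velocity(x1, x2, t):
    a = OMEGA * (x1 - t * U0[0])
    b = OMEGA * (x2 - t * U0[1])
    u1 = U0[0] + decay(t) * math.cos(a) * math.sin(b)
    u2 = U0[1] - decay(t) * math.sin(a) * math.cos(b)
    return u1, u2, U0[2]

def pressure(x1, x2, t):
    a = OMEGA * (x1 - t * U0[0])
    b = OMEGA * (x2 - t * U0[1])
    return -decay(t)**2 / 4.0 * (math.cos(2 * a) + math.cos(2 * b))
```

The field is divergence-free by construction: ∂u_1/∂x_1 = −Fω sin(a) sin(b) cancels ∂u_2/∂x_2 = Fω sin(a) sin(b), and u_3 is constant.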
Figure 7.2: Error of the two time-stepping schemes BDF1 and BDF2 for the Taylor-Green vortex in a periodic domain when using ne = 8 × 8 × 2 spectral elements of degree p = 8. Left: L2 error for u1 and the pressure P. Right: errors in the H1 semi-norm.
Two computations are performed, one using the scheme of first order, one with the second
order scheme. For these, the initial condition is set at t = 0 and the numerical solution
at t = 0.01 computed using ne = 8× 8× 2 spectral elements of polynomial degree p = 8. The
initial CFL number was scaled from 0.5 · 10^0 down to 0.5 · 10^−3, leading to smaller and smaller time
step widths and allowing to investigate the convergence rates of the time stepping scheme.
After attaining a solution on the mesh, the L2 and H1 errors for the velocities

‖ε‖²_{L²} = ∫_{x∈Ω} (u − u_ex)²(x) dx   (7.4)

|ε|²_{H¹} = ∫_{x∈Ω} (∇(u − u_ex) · ∇(u − u_ex))(x) dx   (7.5)

and similarly for the pressure are computed on the same grid after interpolating to p = 31,
which is deemed sufficient for resolving the exact solution.
Figure 7.2 depicts the L2 and H1 errors of the first velocity component as well as of
the pressure. For the velocity as well as the pressure, the BDF1 method exhibits a slope of
one, lowering the error by three orders. The BDF2 scheme attains second order and starts
with a lower error and is, furthermore, able to lower it to 10−9, where rounding errors stall
further convergence. Thus, both time-stepping schemes are implemented correctly regarding
pressure coupling, transport terms, and diffusion. While only the results for a homogeneous
x3 direction are shown, computations with permutations of the coordinate directions were
performed with similar results.
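The slopes annotated in such convergence plots correspond to the observed order of convergence, computable from two (Δt, error) pairs; the error values below are synthetic:

```python
import math

# Observed order of convergence from two step sizes and their errors.
def observed_order(dt1, e1, dt2, e2):
    return math.log(e1 / e2) / math.log(dt1 / dt2)

# a second-order scheme: halving dt reduces the error by a factor of four
order = observed_order(1e-4, 4.0e-6, 5e-5, 1.0e-6)
```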
Figure 7.3: Error of the two time-stepping schemes BDF1 and BDF2 for the Taylor-Green vortex with Dirichlet boundary conditions when using ne = 8 × 8 × 2 spectral elements of degree p = 8. Left: L2 error for u1 and the pressure P. Right: errors in the H1 semi-norm.
7.3.3 Taylor-Green vortex with Dirichlet boundary conditions
To validate the boundary condition implementation, the tests from the last section were
repeated, but imposing no-slip walls in the first and second direction, whereas the third
direction stayed periodic. Figure 7.3 depicts the L2 and H1 errors resulting from the com-
putations. As in the periodic case, the velocity converges with the expected order for BDF1
and BDF2 and attains similar error margins. The pressure, however, does not converge: The
error stagnates at 10−5 in the L2 norm, with the error being localized in edge modes. These
can be eliminated by further post processing of the solution after the projection step [108].
However, as the convergence rate of the velocities is not impeded by these modes, the issue
is disregarded here.
7.3.4 Turbulent plane channel flow
Until now, analytical test cases were utilized to validate the time-stepping scheme and the
spatial discretization thereof. These, however, do not fully capture the behavior in large-scale
flow simulations, where both low errors and low computing time are key. To validate
that Specht FS is capable of resolving turbulent flows, both temporally as well as
spatially, this section considers the turbulent channel flow at Reτ = 180 [96]. The flow is
computed in a domain Ω = (0, 4π) × (0, 2) × (0, 2π), with periodicity in x1 and x3 direction. The fluid flows
in positive x1 direction and is bounded by walls at x2 = 0 and x2 = 2, with the resulting
boundary layer necessitating a body force to drive the flow. This body force is implemented
with a PI controller fixing the mean velocity to 1. In combination with a bulk Reynolds
number of 5600, and therefore Re = 2800, a turbulent channel flow develops with Reτ = 180.
Here, the linear profile from [99] serves as initial condition:
u1(x2) = 2 (1− |1− x2|) (7.6a)
u2(x2) = 0 (7.6b)
u3(x2) = 0 . (7.6c)
This distribution is perturbed using element-wise pseudo-random numbers with a maximum
value of 0.2, with one pseudo-random number being used per direction, allowing for a maximum
variation of 0.4 between the elements.
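The perturbed initial condition can be sketched as follows; the helper names and the way the per-element offsets are drawn are illustrative, not solver code:

```python
import random

# Triangular streamwise profile (7.6) plus one bounded pseudo-random
# offset per velocity direction and element.

def initial_u1(x2):
    return 2.0 * (1.0 - abs(1.0 - x2))

def perturbed_velocity(x2, rng, max_pert=0.2):
    """Initial velocity in one element: base profile plus random offsets."""
    return tuple(base + rng.uniform(-max_pert, max_pert)
                 for base in (initial_u1(x2), 0.0, 0.0))

rng = random.Random(42)              # fixed seed for reproducibility
u = perturbed_velocity(1.0, rng)     # channel centre, base value 2
```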
A grid consisting of 32 × 12 × 12 spectral elements of degree p = 16 discretized the channel,
leading to the first mesh point lying at y_1^+ ≈ 0.4 and the first seven points below y^+ < 10, which is a
bit coarser than advocated [74], but still lies in the regime of direct numerical simulation [36].
The resulting grid contains approximately 19 million grid points, and therefore 75 million
degrees of freedom.
Here, the time-stepping scheme of second order is utilized, with the CFL number chosen
as CCFL = 0.125 and consistent integration of the convection terms ensured via overintegra-
tion. The Helmholtz equations were solved to a residual of 10−10 in every time step, with
the multigrid solver ktvMG solving the pressure equation. The diffusion equations occurring
in the time stepping scheme are, however, well conditioned and in addition have a good
initial guess. As a result dtCG is faster in solving these equations than the multigrid solver
and therefore used here. Furthermore, the spectral vanishing viscosity model (SVV) was
employed to stabilize the simulation. Here, the power kernel by Moura [97] is utilized, with
the parameters set to the proposed values p_SVV = p/2 and ε_SVV = 0.01.
Figure 7.4 depicts the temporal evolution of kinetic energy and dissipation rate. The pertur-
bation of the initial condition leads to a higher mean energy, represented in high frequency
modes which dissipate until t = 20. Afterwards, turbulence slowly develops, leading to an
increase in the kinetic energy, which stagnates after t = 70. The dissipation rate exhibits
two peaks. At t = 0, the element-wise perturbation leads to a high gradient at the element
boundaries, which quickly gets smoothed out. Thereafter, a drop occurs until the triangu-
lar initial distribution starts transitioning into the velocity profile of a turbulent channel
flow, leading to the peak near t = 10. At t = 70, a mostly constant dissipation rate results.
However, small changes remain, for instance near t = 100 and t = 150. These indicate
that while the domain is large, it is not sufficient to contain a statistically homogeneous
flow at Reτ = 180. Furthermore, the mean dissipation rate does not completely match the
target value of 4.14 · 10−3 but lies below it, resulting in Reτ = 177 instead of the target value
of Reτ = 180. Figure 7.5 depicts the vortex structures occurring in the flow as well as the
instantaneous velocity field at t = 200. Vortices are being generated in the boundary layer.
These lead to variations in the velocity field seen in the background, the boundary layer
Figure 7.4: Temporal evolution of kinetic energy and dissipation rate in the turbulent channel flow in combination with the time intervals utilized for averaging. Left: mean kinetic energy over time, with the large rectangle marking the interval of temporal averaging for the velocities and the pressure and the inner rectangle marking the interval of averaging for the Reynolds stresses. Right: mean dissipation rate, with the line denoting the mean energy input required for attaining Reτ = 180 in this setup, ε = 4.14 · 10^−3.
Figure 7.5: Vortex structures in the turbulent plane channel at t = 200. The fluid flows from the front left to the back right. In the foreground, instantaneous isosurfaces of the λ2 vortex criterion at λ2 = −1.5 are shown, whereas the background planes of the channel are coloured with the magnitude of the velocity vector. Red corresponds to the maximum velocity and blue to zero.
breaking up, and low-velocity regions reaching towards the middle of the channel. The flow
is turbulent, which was the goal of the simulation.
After reaching a statistically steady state at t = 100, averages for the velocities and the
pressure were computed in the time interval t ∈ [100, 200]. For averaging of the Reynolds
stresses, high quality averages of the velocities are required to accurately compute the fluc-
tuations. Therefore, averaging of the Reynolds stresses began later, at t = 150. The
averaging intervals are shown in Figure 7.4.
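The stresses follow from the identity ⟨u′iu′j⟩ = ⟨uiuj⟩ − ⟨ui⟩⟨uj⟩, which can be accumulated without storing the time series. A minimal single-point sketch of this accumulation (function names are illustrative, not taken from Specht FS, which computes fluctuations against a previously converged mean — hence the later start of the stress averaging):

```python
# Reynolds stresses from accumulated moments: <u_i'u_j'> = <u_i u_j> - <u_i><u_j>.
# Illustrative sketch for a single grid point; not the Specht FS routine.

def accumulate(samples):
    """First and second moments of (u1, u2) samples at one grid point."""
    n = len(samples)
    m1 = sum(u1 for u1, _ in samples) / n
    m2 = sum(u2 for _, u2 in samples) / n
    m11 = sum(u1 * u1 for u1, _ in samples) / n
    m12 = sum(u1 * u2 for u1, u2 in samples) / n
    return m1, m2, m11, m12

def reynolds_stresses(samples):
    """Return (<u1'u1'>, <u1'u2'>) for one grid point."""
    m1, m2, m11, m12 = accumulate(samples)
    return m11 - m1 * m1, m12 - m1 * m2
```

The identity form is exact only if the means are taken over the same window; averaging the stresses over a later, shorter window against a high-quality mean, as done here, avoids contaminating the fluctuations with an unconverged mean.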
Figure 7.6 depicts the temporal averages, which were further reduced by averaging in the
statistically homogeneous x1 and x3 direction, and compares these to the reference data
from [96]. Both u1 and the pressure show a good agreement with the reference, with the
velocity being slightly higher in the lower regions of the channel. The Reynolds stresses
Figure 7.6: Averages for the plane channel flow at Reτ = 180 computed with Specht FS, denoted by the markers, compared to the data from Kim, Moin, and Moser (KMM, denoted by the lines) [96]. Here ⟨·⟩ denotes temporal averaging and ·′ the corresponding fluctuation u′i = ui − ⟨ui⟩. The velocity average was normalized with uτ, and the pressure and Reynolds stresses with uτ². Left: mean values for the x1-velocity and the pressure. As the spatial mean value of the pressure is arbitrary, the pressure from Specht FS at x2 = 0 was corrected to the value from [96]. Right: Reynolds stresses.
show deviations from the reference: While ⟨u′1u′2⟩ and ⟨u′2u′2⟩ fit the data perfectly, ⟨u′1u′1⟩
exhibits a deviation in the middle of the channel and ⟨u′3u′3⟩ is slightly underpredicted.
However, the overall agreement is good, validating Specht FS for simulation of turbulent
flows.
7.4 Performance evaluation
7.4.1 Turbulent plane channel flow
So far, only the solution error from Specht FS was inspected. While low errors are key to
capturing the main features of the flow, in practice the time to solution is just as important. This
section inspects the efficiency of the components of Specht FS using the turbulent plane
channel flow at Reτ = 180 from Subsection 7.3.4 as test case. The measuring platform was
one node of the high-performance computer Taurus, consisting of two Intel Xeon E5-2680
CPUs with twelve cores each, running at a fixed frequency of 2.5 GHz. To enable runtime
measurements inside the code, Specht FS was instrumented with Score-P [78], allowing
for gathering of in-depth data for the runtime contribution of the different operators in the
time-stepping scheme. After instrumentation, the Intel Fortran compiler v. 2018 compiled
the source with OpenMP enabled.
Table 7.2: Speedup for the plane channel flow test case. Setup data and total runtime obtained with the flow solver Specht FS when using the new multigrid solver for computing 0.1 dimensionless units in time. The number of unknowns is computed as nDOF = 4p³ne.
Number of threads 1 12
Number of time steps ntime 429 429
Number of degrees of freedom nDOF 18 874 368 18 874 368
Number of processes 2 2
Number of cores ncores 2 24
Runtime tWall [s] 3336 491
(nDOF · ntime)/(tWall · ncores) [1/s] 1 213 600 687 126
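The throughput figures in the last row of Table 7.2 can be reproduced directly from the table entries; a quick sanity check using the values from the table:

```python
# Throughput metric from Table 7.2: degrees of freedom times time steps
# per core-second, (n_DOF * n_time) / (t_wall * n_cores).
def throughput(n_dof, n_time, t_wall, n_cores):
    return n_dof * n_time / (t_wall * n_cores)

# One thread per process: 2 processes * 1 core each.
t1 = throughput(18_874_368, 429, 3336.0, 2)    # ~1.21e6 per core and second
# Twelve threads per process: 2 processes * 12 cores each.
t12 = throughput(18_874_368, 429, 491.0, 24)   # ~6.9e5 per core and second
```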
Table 7.3: Accumulated runtimes for the time-stepping procedure of Specht FS when using the new multigrid solver for computing a time interval of 0.1 for the channel flow over the number of threads per process. Two MPI processes were used and only components directly in the time-stepping procedure were profiled with Score-P.
1 thread 12 threads
Component [s] [%] [s] [%]
Convection terms 2138 34.6 2294 22.0
Diffusion solver 1454 23.6 3160 30.3
Poisson solver 2407 39.0 4551 43.6
Other 175 2.8 395 4.1
Total 6174 100.0 10 400 100.0
Runtime data was gathered using a run on a smaller domain of Ω = (0, 2π)× (0, 2)× (0, π)
which was discretized using 16× 12× 6 elements and therefore comprises a quarter of the
grid points and degrees of freedom of the grid from Subsection 7.3.4. After attaining a
statistically steady state, 0.1 dimensionless time units were computed, requiring nt = 429
time steps. Two processes decomposed the grid in the x1-direction and, for these, two runs
were performed. The first used only one core per socket, such that the maximum memory
bandwidth per core was available, whereas the second used all twelve cores of each socket
to harness the full available compute power.
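The unknown counts in Table 7.2 follow from the formula in its caption, nDOF = 4p³ne: three velocity components plus the pressure on p³ unique points per element. The entries are consistent with polynomial degree p = 16 for this grid (the degree is set in Subsection 7.3.4 and not restated in this excerpt); a quick check:

```python
# Degrees of freedom of the spectral-element discretization (Table 7.2
# caption): n_DOF = 4 * p^3 * n_e, i.e. three velocities plus the pressure
# on p^3 unique points per element in the continuous formulation.
def n_dof(p, ne):
    return 4 * p**3 * ne

ne_channel = 16 * 12 * 6      # reduced channel grid from this section
print(n_dof(16, ne_channel))  # 18874368, matching Table 7.2
```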
Table 7.2 summarizes the wall clock time tWall as well as the runtime per degree of freedom,
whereas Table 7.3 lists the contributions of the different components of the flow solver.
When computing with one thread per process, the runtime decomposes into three main
contributions: The diffusion step requires one quarter of the runtime, the calculation of
the convection terms 35 %, and the pressure treatment the remaining 40 % of the runtime.
For this test case, the main goal of the thesis has been accomplished: Treating the pressure
becomes as cheap as the explicit treatment of the convection terms. When using one thread
per process, Specht FS attains a throughput of over 1 200 000 time steps times degrees of
Figure 7.7: Isosurfaces of the λ2 vortex criterion for the Taylor-Green vortex at Re = 1600 with λ2 = −1.5. The isosurfaces are colored with the magnitude of the velocity vector, where blue corresponds to zero and red to a magnitude of U0. The data was taken from a simulation using ne = 16³ spectral elements of polynomial degree p = 16. Left: t = 5T0, middle: t = 7T0, right: t = 9T0.
freedom per core and second. However, the number halves when increasing the core count
to twelve. While the convection terms require few memory accesses and parallelize well,
the Helmholtz solvers do not, as shown in Subsection 5.4.5. The low parallel efficiency
results in an increase of the CPU time spent in the pressure solver. In consequence, the
percentage of the runtime for the pressure treatment increases to 44 %, twice the value for
the convection terms, and the throughput per core drops to 680 000 degrees of freedom per
core and second.
7.4.2 Turbulent Taylor-Green vortex benchmark
The previous benchmark investigated the runtime contributions of the different components
of the flow solver. Here, the overall performance of the flow solver is compared to other flow
solvers using the under-resolved turbulent Taylor-Green vortex benchmark [38, 29]. For
a length scale L, a reference velocity U0, and a periodic domain Ω = (−Lπ, Lπ)³, the initial
velocity components are

u1(x) = +U0 sin(x1/L) cos(x2/L) cos(x3/L) ,   (7.7a)
u2(x) = −U0 cos(x1/L) sin(x2/L) cos(x3/L) ,   (7.7b)
u3(x) = 0 .                                   (7.7c)
Here, L = 1 and U0 = 1. Furthermore, the Reynolds number is set to Re = 1600, resulting
in an unstable flow that quickly transitions to turbulence, as shown in Figure 7.7.
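The initial condition (7.7) is divergence-free, a property any implementation should reproduce pointwise; a minimal evaluation sketch with L = U0 = 1 as in the text (not the Specht FS initialization routine):

```python
import math

# Taylor-Green initial velocity, Eq. (7.7), with L = U0 = 1 as in the text.
def tg_velocity(x1, x2, x3, L=1.0, U0=1.0):
    u1 = +U0 * math.sin(x1 / L) * math.cos(x2 / L) * math.cos(x3 / L)
    u2 = -U0 * math.cos(x1 / L) * math.sin(x2 / L) * math.cos(x3 / L)
    u3 = 0.0
    return u1, u2, u3

def divergence(x, h=1e-6):
    """Central-difference divergence; analytically zero for this field."""
    x1, x2, x3 = x
    d1 = (tg_velocity(x1 + h, x2, x3)[0] - tg_velocity(x1 - h, x2, x3)[0]) / (2 * h)
    d2 = (tg_velocity(x1, x2 + h, x3)[1] - tg_velocity(x1, x2 - h, x3)[1]) / (2 * h)
    d3 = (tg_velocity(x1, x2, x3 + h)[2] - tg_velocity(x1, x2, x3 - h)[2]) / (2 * h)
    return d1 + d2 + d3
```

The two nonzero components cancel in the divergence, ∂1u1 + ∂2u2 = 0, so the numerical divergence is zero up to finite-difference error.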
For solution, four meshes are considered. The first one discretizes the domain with ne = 16³
elements of polynomial degree p = 8, with the second one refining to ne = 32³. The third
and fourth grid use p = 16 with ne = 8³ and 16³ elements, respectively, leading to the same
Figure 7.8: Results for the turbulent Taylor-Green vortex. Left: time derivative of the mean kinetic energy, middle: mean dissipation rate captured by the grid, right: numerical dissipation. Reference data courtesy of M. Kronbichler [29].
number of degrees of freedom utilized for p = 8. For all four meshes, simulations were car-
ried out until a simulation time T = 20L/U0 = 20T0 was reached. The time-stepping scheme
of second order was utilized with consistent integration of the convection terms and the
time step width imposed by the CFL condition using CCFL = 0.125. As the grids for this
benchmark are deliberately chosen to be very coarse, not every feature of the flow can be
captured by them. Hence, a subgrid-scale (SGS) model is required. For the discontinuous
Galerkin methods utilized in [38, 29], the flux formulation induces an implicit SGS
model which generates the required dissipation and stabilization. The continuous Galerkin
formulation in Specht FS, however, does not. As a remedy, the SVV model was employed,
using the Moura kernel with the parameters set to pSVV = p/2 and εSVV = 0.01.
Figure 7.8 compares the derivative of the mean kinetic energy and dissipation to DNS data
from [29]. The coarsest grid with ne = 16³ and p = 8 resolves the flow until t = 4T0, where
it starts deviating from the reference solution. It is not capable of resolving the peak in the time
derivative and resolves only two thirds of the occurring dissipation, with the remaining third
stemming from numerical dissipation which, hence, attributes dissipation to the wrong modes.
With a larger number of elements, more dissipation is resolved, lowering the maximum
numerical dissipation by half. However, raising the polynomial degree at a constant number
of degrees of freedom leads to similar accuracy gains, and the grid with p = 16 and 16³
elements has the highest accuracy in the dissipation rate and, in turn, the lowest numerical
dissipation.
Table 7.4: Grids utilized for the turbulent Taylor-Green benchmark in conjunction with the respective number of degrees of freedom nDOF, number of data points n⋆DOF = 4(p + 1)³ne, number of time steps nt, wall clock time tWall, number of cores nC, CPU time, and computational throughput (nt · n⋆DOF)/(tWall · nC).
p = 8 p = 16
ne 16³ 32³ 4³ 8³ 16³
nDOF 8 388 608 67 108 864 1 048 576 8 388 608 67 108 864
n⋆DOF 11 943 936 95 551 488 1 257 728 11 943 936 80 494 592
nt 26 076 52 152 26 076 52 152 104 304
e²Ek · 10³ 3.57 0.487 17.8 1.37 0.229
Runtime tWall [s] 13 964 74 104 2297 25 855 110 058
nP 2 8 2 2 8
nC 24 96 24 24 96
CPU time [CPUh] 93 1976 15 172 2935
Throughput [1/s] 929 356 700 480 594 827 845 664 794 650
The relative L2-error of the time derivative of the kinetic energy Ek serves as quantification
of the accuracy. It computes to

e²Ek = ∫₀ᵀ (∂tEk(τ) − ∂tEk,ref(τ))² dτ / ∫₀ᵀ (∂tEk,ref(τ))² dτ ,   (7.8)
where Ek,ref is the kinetic energy from the reference data. Table 7.4 lists the meshes, the
respective accuracy attained by simulations on them as well as the computational through-
put. For comparability with [29], the throughput is computed based on the number of
element-local grid points, (p + 1)³ne. Both the continuous and the discontinuous formulation
converge towards the same smooth result. The extra degrees of freedom allowed for in the
discontinuous case vanish and, hence, using the same basis for comparison is reasonable.
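On sampled curves, (7.8) reduces to two quadratures; a sketch using the trapezoidal rule (the thesis does not state which quadrature was used, and the sample arrays are placeholders):

```python
# Relative L2 error of dE_k/dt, Eq. (7.8), approximated with the
# trapezoidal rule on sampled curves.
def trapz(f, t):
    return sum(0.5 * (f[i] + f[i + 1]) * (t[i + 1] - t[i])
               for i in range(len(f) - 1))

def rel_l2_error(dEdt, dEdt_ref, t):
    num = trapz([(a - b) ** 2 for a, b in zip(dEdt, dEdt_ref)], t)
    den = trapz([b ** 2 for b in dEdt_ref], t)
    return num / den    # the squared relative error e^2_Ek from Eq. (7.8)
```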
For a constant number of time steps and, hence, the same time step width, using more
degrees of freedom leads to a smaller error. For instance, using p = 8 with ne = 32³ leads to
a factor of three lower error than p = 16 and ne = 8³. This indicates that the test case
is spatially under-resolved, as designed, since the error does not solely depend on the time
step width. Except for the coarsest mesh using 4³ elements, the computational throughput
is near 800 000 time steps times data points per second and core, with a slightly
higher value for the coarse mesh with p = 8 and a slightly lower one for the same polynomial
degree but 32³ elements. The throughput is nearly constant for both p = 8 and p =
16. With the multigrid solver using only two iterations for the pressure equation in all
simulations, the constant throughput is a result of the higher computational cost for the
convection terms at higher polynomial degrees offsetting the more efficient multigrid cycle.
Furthermore, slight parallelization losses are present when increasing the number of nodes
from one to four, leading to a higher throughput for computations with fewer
elements. Due to the homogeneous meshes utilized, the computational throughput is
higher than in the previous section.
In [29], the test case was run on comparable hardware using deal.II. The computational
throughput attained by Specht FS is a factor of two higher. To the knowledge of the
author, this makes Specht FS the fastest solver for incompressible flow employing high
polynomial degrees at the time of writing.
7.4.3 Parallelization study
To investigate the efficiency of the parallelization of Specht FS, the turbulent Taylor-
Green vortex from Subsection 7.4.2 is revisited. With the pressure solver relying on ghost
elements for boundary data exchange, the attainable strong-scaling speedup is limited.
Therefore, the weak scalability is investigated, increasing the domain size with the number
of processes. The number of nodes is scaled from 1 to 128, using two processes per node, each processing 8³
spectral elements of degree p = 16 using twelve threads. To keep the problem size constant
per process, the domain size is scaled as
Ω = (−nP,1Lπ, nP,1Lπ)× (−nP,2Lπ, nP,2Lπ)× (−nP,3Lπ, nP,3Lπ) , (7.9)
where nP,i refers to the number of processes decomposing the domain in direction xi. Due
to the 2π periodicity of the problem, the choice of domain suffices to ensure that the same
computations are carried out. The other parameters are kept as in the previous section. The
runtime for computing until t = T0 is measured, and the scaleup compared to one node
as well as the resulting parallel efficiency are computed [49].
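For weak scaling with fixed work per node, the ideal runtime is flat, so scaleup and efficiency follow directly from the measured wall clock times; a sketch reproducing the Table 7.5 entries (times copied from the table):

```python
# Weak-scaling metrics: with constant work per node, the ideal runtime is
# flat, so scaleup S = n_N * t(1) / t(n_N) and efficiency E = t(1) / t(n_N).
def weak_scaling(n_nodes, t_wall, t_ref):
    scaleup = n_nodes * t_ref / t_wall
    efficiency = 100.0 * t_ref / t_wall
    return scaleup, efficiency

t1 = 2231.0                        # one-node reference run from Table 7.5
s, e = weak_scaling(2, 2345.0, t1)
print(round(s, 1), round(e))       # 1.9 95, matching the two-node row
```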
Table 7.5 lists the grid decompositions and the measured runtimes, whereas Figure 7.9 depicts
the scaleup and parallel efficiency compared to the optimal linear case. The parallel efficiency
stays above 80% until using eight nodes. Due to the choice of grid decomposition, this is the
point where every process communicates with at least one neighbour in a different node and
the communication in that direction exhibits a higher latency as well as a lower bandwidth.
Afterwards, the parallel efficiency decreases, to 70% when using 32 nodes, 48% at 64 and
then 37% at 128. While the amount of runtime spent in the coarse grid solver increases
from 1% on one node to 3% at 32 and 7% at 128 nodes, it is not large enough to explain
the low efficiency. The most probable cause for the low scalability is the communication
structure in the multigrid solver. While the halo data is only sent to neighbouring processes,
Table 7.5: Runtime data for the turbulent Taylor-Green benchmark using 8³ elements per process. Here nN refers to the number of nodes, nP,i to the number of processes in each coordinate direction, nC to the total number of cores utilized, tWall to the elapsed wall clock time during the computation, and tCoarse to the amount of wall clock time spent in the coarse grid solver. Both scaleup and parallel efficiency are based on the run using one node.
nN nP,1 nP,2 nP,3 nC nt tWall [s] Scaleup Efficiency [%] tCoarse/tWall [%]
1 2 1 1 24 2608 2231 1.0 100 0.7
2 2 2 1 48 2608 2345 1.9 95 0.9
4 2 2 2 96 2608 2475 3.6 90 1.1
8 4 2 2 192 2608 2681 6.7 83 1.5
16 4 4 2 384 2608 2985 12.0 75 2.1
32 4 4 4 768 2608 3207 22.3 70 2.9
64 8 4 4 1536 2608 4623 30.9 48 3.7
128 8 8 4 3072 2608 6095 46.8 37 6.3
Figure 7.9: Scaleup and resulting parallel efficiency for the turbulent Taylor-Green benchmark using 8³ elements per process.
there are 28 neighbours in a structured grid. Sending one halo layer to a facial neighbour leads
to 80 MB of data being transferred at p = 16 and 8³ elements on the partition. With one pre-
and one postsmoothing step and two iterations, every partition sends more than 2 GB of data
per solver call on the finest grid level alone. At the time of writing, this communication
is implemented in an explicit fashion, encapsulating the communication in a subroutine
to allow for verification of the communication structure. Overlapping of computation and
communication, i.e. overlapping the smoother with the halo transfer, is expected to increase
the scalability significantly and will be investigated in future work.
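The quoted volume can be checked with a back-of-envelope estimate from the figures given in the text; the sketch below assumes six facial neighbours and one halo exchange per smoother application (both assumptions, not stated above), which already yields close to the quoted 2 GB from facial transfers alone:

```python
# Back-of-envelope halo volume on the finest multigrid level, using the
# figure quoted in the text: 80 MB per facial neighbour at p = 16 with
# 8^3 elements per partition. Assumes six facial neighbours and one halo
# exchange per smoother application; edge and corner messages come on top.
mb_per_face = 80
faces = 6
smoother_applications = (1 + 1) * 2        # (pre + post) * two iterations
total_gb = mb_per_face * faces * smoother_applications / 1024
print(f"{total_gb:.1f} GB")                # 1.9 GB from faces alone
```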
7.5 Conclusions
This chapter presented the flow solver Specht FS. It combines the algorithmic improve-
ments for the spectral-element method developed in Chapter 4 and Chapter 5 with the
optimized operators from Chapter 3 and the parallelization scheme proposed in Chapter 6.
Therefore, it can solve elliptic equations with the runtime scaling linearly in
the number of degrees of freedom, independent of the polynomial degree, exploit hardware-
optimized operators, and harness heterogeneous hardware. Furthermore, it is written in an
extensible fashion, allowing further variables such as passive and active scalars to be included.
Using multiple test cases, Specht FS was validated. First, analytical test cases served to
validate the coupling between velocity and pressure, the convection terms, and then the boundary
conditions and the implementation of right-hand sides. In each of these, the time-stepping
scheme converged with the expected order for the velocities, whereas the pressure had a
higher remaining error due to the boundary conditions. This, however, does not impede
the convergence of the velocities. Afterwards, the turbulent plane channel flow at Reτ = 180
served to validate Specht FS for the simulation of turbulent flows. While minor differences
were present between reference and computed Reynolds stresses, the overall agreement
is good, and the remaining differences can probably be decreased by higher resolution and longer averaging times.
Therefore, Specht FS can be seen as validated for simulations of turbulent flows.
After the validation, the runtime behaviour of Specht FS was analyzed. For a polynomial
degree of p = 16 and the maximum RAM bandwidth per core, the explicitly treated con-
vection terms take as much time as the solution of the pressure equation does. This parity
in runtime achieves a goal of this dissertation: The large portion of the runtime spent in
elliptic solvers typically renders time-stepping schemes with more substeps too expensive.
The usage of the multigrid solvers presented in this work allows for the efficient solution of elliptic
equations and reduces this portion to that of the explicit terms.
The turbulent Taylor-Green vortex benchmark was then utilized to evaluate the efficiency.
For this benchmark, Specht FS achieves twice the throughput of the well-tuned deal.II
code on similar hardware. The performance difference stems mainly from the linear scaling
of the multigrid solvers and makes Specht FS, to the best knowledge of the author, the
fastest high-order flow solver at the time of writing. Lastly, a scalability study showed
that Specht FS scales reasonably well up to 1000 cores, albeit only for weak scalability.
Overlapping of communication and computation is required to improve the scalability and
will be implemented in further work.
Chapter 8
Conclusions and outlook
This work contributes to high-order methods. It aimed at improving the algorithms as well
as future-proofing these with regard to the growing memory gap and the expected het-
erogeneity in upcoming high-performance computers. To this end, the central component for
computing incompressible fluid flow, the Helmholtz solver, was investigated as it occupies
up to 95% of the runtime [29]. First, regarding the attainable performance using standard
tensor-product operators, then considering static condensation to lower the operation count,
multigrid to lower the number of iterations, and, lastly, exploiting heterogeneity. There-
after, the improvements were implemented in a fully-fledged flow solver for high order and
the attained performance measured.
Chapter 3 investigated the attainable performance with tensor-product operators in three
dimensions. While sum factorization allows evaluating these with O(p⁴) operations, matrix
multiplication often ends up being faster [17], and straightforward implementations
exhibited the same behavior for the interpolation operator. Adding parametrization
and hardware-adapted optimizations improved these, leading to 20 GFLOP s⁻¹, with half of
the performance available per core being extracted and the remainder being spent on
loading and storing data. The methods were thereafter applied to larger operators, such
as Helmholtz and fast diagonalization operators. The resulting operators showed a high
degree of efficiency and served as baseline for the further chapters.
Chapter 4 investigated static condensation as a means to bridge the growing memory gap
and to lower the number of operations. With static condensation, the equation system is reduced to
the boundary nodes of the elements. This decreases the number of unknowns in the equation
system to O(p²). However, these are coupled more tightly, resulting in the operators often
being implemented using matrix-matrix computation and therefore scaling with O(p⁴) [21, 51]. Moreover, the result is not matrix-free by definition and even raises the required memory
bandwidth for inhomogeneous meshes. A tensor-product decomposition of the operator led
to a matrix-free variant of the Helmholtz operator and applying sum factorization as well
as product factorization reduced it to linear complexity, i.e. O(p³). The implementation
thereof outpaced highly optimized matrix-matrix products for any polynomial degree and a
tensor-product operator for the full case for p > 7.
To investigate the prospective performance gains obtained by the linearly scaling operator,
solvers using the preconditioned conjugate gradient method were devised. For these,
an increasing operator efficiency offsets the growing number of iterations, leading to a linearly
scaling runtime when increasing the polynomial degree. The solvers scratched the 1 µs mark
per degree of freedom when using only one CPU core and standard programming
techniques. However, while the solvers were faster than expected and exhibited an exceptional
robustness against the aspect ratio, the number of iterations still increased with the number
of elements, rendering them infeasible as pressure solvers for large-scale computations.
To attain robustness with regard to the number of elements, Chapter 5 investigated p-
multigrid. Previous studies showed that overlapping Schwarz methods can generate a
constant iteration count [88, 52]. Moreover, using only six smoothing steps can suffice to
lower the residual by ten orders of magnitude [123] and this holds for the condensed case [51].
However, the overlapping Schwarz smoothers require inversion on small subdomains. Fa-
cilitating these with the fast diagonalization method in the full case, or with matrix inversion
in the condensed one, results in O(p⁴) operations. Embedding the statically condensed system into
the full one allowed capitalizing on fast diagonalization as a tensor-product inverse and gen-
erated a matrix-free inverse, albeit with super-linear scaling. Further factorization led to
a linearly scaling inverse on the Schwarz domains and, in turn, a smoother achieving a
constant runtime per degree of freedom.
The combination of linearly scaling operator and smoother allowed for a multigrid cycle with
a constant runtime per degree of freedom. Furthermore, the resulting multigrid method in-
herited the convergence properties from [122, 51] and therefore attained a constant number
of iterations. This, in turn, led to a constant runtime per degree of freedom, regardless of
the polynomial degree or the number of elements. The method improved upon the condensed
solvers and attained 0.5 µs as runtime per unknown when computing on a single core of a
standard CPU. Moreover, the solver outperformed the deal.II library by a factor of four
when comparing their p = 8 to p = 32 with the proposed multigrid solver [82]. There-
fore, the proposed method unlocked computation with extremely high orders. Moreover, it
constituted a three-fold improvement over the work that stimulated this research [51].
To address the increasing heterogeneity in the high-performance computers, Chapter 6
investigated heterogeneous computing on the CPU-GPU coupling. The hybrid program-
ming model [49] was expanded to a two-level parallelization with the coarse layer decom-
posing the mesh and the fine layer harnessing the capabilities of the specific hardware.
When using pragma-based language extensions, such as OpenMP [102], OpenACC [101],
or OmpSs [15], the model keeps variants for every kind of hardware inside a single source.
Not only does this increase the maintainability, the different variants can also share
the same communication pattern. Usage of the same communication pattern allows the
domain decomposition layer to fuse the different variants into a single heterogeneous
system.
With a heterogeneous system available, load balancing was investigated on the Helmholtz
solvers. While using a linear fit as functional performance model and treating the runtime
as a black box suffices for many applications [30, 140, 85], this was not the case for the pCG
solver. Modelled and resulting runtimes exhibited large differences, with the prediction
error ranging up to 40%. These errors were induced by disregarding the communication
pattern: Exchange of boundary data and scalar products incurred synchronization, splitting
the computation into multiple parts and diverging from the assumption of a monolithic
runtime. A load balancing model accounting for these additional synchronizations was devised.
While leading to further restrictions and therefore a lower predicted speedup, the runtime
expectation and result were in agreement: Where formerly the error ranged up to 40%, it was
now 5-10%, allowing accurate predictions for the heterogeneous system and outperforming
both homogeneous configurations.
Chapter 7 combined the previous advances into the flow solver Specht FS: The static
condensed Helmholtz solvers from Chapter 4, the multigrid solver from Chapter 5, and
the programming model from Chapter 6. Therefore, the resulting solver is capable of solv-
ing Helmholtz equations in linear runtime and can compute on heterogeneous systems.
Analytical test cases validated components of Specht FS, whereas the turbulent channel
flow served as validation for time-resolved simulation of turbulent flows. Thereafter, the
runtime distribution inside the solver was investigated. Treating the pressure can account
for 90% of the runtime; in Specht FS, however, the implicit treatment of the pressure
required only as much runtime as the explicit treatment of the convection terms. Lastly, the under-
resolved turbulent Taylor-Green benchmark investigated the performance of the proposed
flow solver, where Specht FS achieved twice the throughput on similar hardware compared
to deal.II [29].
The algorithmic improvements presented in this work allow for solutions of the Helmholtz
equation in linear runtime. Not only does this allow for the implicit treatment of the pressure
equation to take as long as the explicit treatment of the convection terms, it also allows the
use of far higher polynomial degrees and therefore higher convergence rates. Furthermore, the
algorithms bridge the growing memory gap, as loading and storing in the statically condensed,
matrix-free solvers scale with O(p²), whereas the number of operations scales with O(p³). However, the methods are currently restricted to structured Cartesian meshes and the con-
tinuous spectral-element method. Therefore, future work will expand these to discontinuous
spectral-element methods and investigate similar structure exploitation for curvilinear ele-
ments; first results for the former are available in [65].
Appendix A
Further results for wildly heterogeneous
systems
Table A.1: Iteration time per degree of freedom measured in ns when computing on either multiple cores with OpenMP or GK210 GPU chips on ns sockets.
p = 7 p = 11 p = 15
ne ne ne
ns Setup 83 123 163 83 123 163 83 123 163
1 1 core 97.1 102.8 106.5 119.5 122.7 123.8 113.1 114.9 114.8
1 4 cores 25.4 26.6 31.4 30.6 32.1 32.4 29.4 30.1 29.9
1 8 cores 13.4 14.3 17.8 22.2 17.8 18.1 16.1 16.6 16.6
1 12 cores 9.7 10.4 11.9 11.6 13.1 13.5 11.9 12.6 12.4
1 1 GK210 7.5 5.1 4.3 6.7 5.5 5.1 6.8 6.1 5.9
1 2 GK210 6.1 3.2 2.5 4.2 3.1 2.8 3.8 3.3 3.1
2 1 core 49.3 49.1 52.0 57.7 60.7 61.9 55.8 57.3 57.4
2 4 cores 13.4 12.9 13.6 14.9 15.8 16.2 14.4 15.1 15.1
2 8 cores 7.3 6.8 7.4 7.8 8.6 9.0 7.8 8.4 9.7
2 12 cores 5.6 4.8 6.1 5.6 6.3 6.6 5.7 6.2 6.4
2 1 GK210 6.6 3.3 2.5 4.2 3.1 2.7 3.8 3.3 3.1
2 2 GK210 6.1 2.3 1.6 2.7 1.9 1.5 2.2 1.8 1.6
Table A.2: Grids utilized for the homogeneous computations and resulting runtimes per iteration tIter when using either ten cores of the CPU (CPU) or one GK210 chip (GPU).
tIter [ms]
ne Decomposition CPU GPU
100 10× 10× 1 0.49 1.02
200 20× 10× 1 0.88 1.20
300 20× 15× 1 1.26 1.32
400 20× 20× 1 1.63 1.47
500 25× 20× 1 2.24 1.58
600 30× 20× 1 2.39 1.73
700 35× 20× 1 2.79 1.93
800 40× 20× 1 3.35 2.11
900 36× 25× 1 3.62 2.29
1000 40× 25× 1 4.11 2.47
Table A.3: Element distribution for the heterogeneous case using the single-step model for load balancing and ten cores of the CPU (C) as well as one GK210 chip (G), with modelled and measured times per iteration tIter as well as the resulting relative error.
Distribution tIter [ms]
ne n(C)e n(G)e Measurement Model Relative error [%]
200 160 40 1.01 0.71 43.7
300 180 120 1.17 0.83 40.8
400 220 180 1.35 0.95 42.0
500 240 260 1.53 1.08 41.8
600 280 320 1.72 1.20 43.2
700 300 400 1.83 1.33 37.4
800 340 460 2.00 1.45 37.8
900 350 550 2.23 1.60 39.6
1000 400 600 2.34 1.70 38.0
Bibliography
[1] NVIDIA CUDA programming guide (version 1.0). NVIDIA: Santa Clara, CA, 2007.
[2] S. Abhyankar, J. Brown, E. M. Constantinescu, D. Ghosh, B. F. Smith, and
H. Zhang. PETSc/TS: A modern scalable ODE/DAE solver library. arXiv preprint
arXiv:1806.01437, 2018.
[3] M. Atak, A. Beck, T. Bolemann, D. Flad, H. Frank, and C.-D. Munz. High fi-
delity scale-resolving computational fluid dynamics using the high order discontinuous
Galerkin spectral element method. In High Performance Computing in Science and
Engineering '15, pages 511–530. Springer, 2016.
[4] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: a unified plat-
form for task scheduling on heterogeneous multicore architectures. Concurrency and
Computation: Practice and Experience, 23(2):187–198, 2011.
[5] W. Bangerth, R. Hartmann, and G. Kanschat. deal.II — a general-purpose object-
oriented finite element library. ACM Transactions on Mathematical Software (TOMS),
33(4):24, 2007.
[6] A. Barak, T. Ben-Nun, E. Levy, and A. Shiloh. A package for OpenCL based het-
erogeneous computing on clusters with many GPU devices. In Cluster Computing
Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Con-
ference on, pages 1–7. IEEE, 2010.
[7] P. Bastian, C. Engwer, D. Göddeke, O. Iliev, O. Ippisch, M. Ohlberger, S. Turek,
J. Fahlke, S. Kaulmann, S. Müthing, and D. Ribbrock. EXA-DUNE: Flexible PDE
solvers, numerical methods and applications. In L. Lopes, J. Zilinskas, A. Costan,
R. G. Cascella, G. Kecskemeti, E. Jeannot, M. Cannataro, L. Ricci, S. Benkner, S. Pe-
tit, V. Scarano, J. Gracia, S. Hunold, S. L. Scott, S. Lankes, C. Lengauer, J. Car-
retero, J. Breitbart, and M. Alexander, editors, Euro-Par 2014: Parallel Processing
Workshops, pages 530–541, Cham, 2014. Springer International Publishing.
[8] A. D. Beck, T. Bolemann, D. Flad, H. Frank, G. J. Gassner, F. Hindenlang, and C.-
D. Munz. High-order discontinuous Galerkin spectral element methods for transitional
and turbulent flow simulations. International Journal for Numerical Methods in Fluids,
76(8):522–548, 2014.
[9] T. Ben-Nun, E. Levy, A. Barak, and E. Rubin. Memory access patterns: the missing
piece of the multi-GPU puzzle. In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, page 19. ACM, 2015.
[10] M. Berger, M. Aftosmis, D. Marshall, and S. Murman. Performance of a new CFD
flow solver using a hybrid programming paradigm. Journal of Parallel and Distributed
Computing, 65:414–423, 2005.
[11] H. M. Blackburn and S. Sherwin. Formulation of a Galerkin spectral element–Fourier
method for three-dimensional incompressible flows in cylindrical geometries. Journal
of Computational Physics, 197(2):759–778, 2004.
[12] J. Bramble. Multigrid methods. Pitman Res. Notes Math. Ser. 294. Longman Scientific
& Technical, Harlow, UK, 1995.
[13] A. Brandt. Multi-level adaptive technique (MLAT) for fast numerical solution to
boundary value problems. In H. Cabannes and R. Temam, editors, Proceedings of
the 3rd International Conference on Numerical Methods in Fluid Dynamics (Berlin),
pages 82–89. Springer-Verlag, 1973.
[14] A. Brandt. Guide to multigrid development. In Multigrid Methods, volume 960 of
Lecture Notes in Mathematics, pages 220–312. Springer Berlin/Heidelberg, 1982.
[15] J. Bueno, J. Planas, A. Duran, R. Badia, X. Martorell, E. Ayguadé, and J. Labarta.
Productive programming of GPU clusters with OmpSs. In Parallel Distributed Process-
ing Symposium (IPDPS), 2012 IEEE 26th International, pages 557–568, May 2012.
[16] P. E. Buis and W. R. Dyksen. Efficient vector and parallel manipulation of tensor
products. ACM Transactions on Mathematical Software (TOMS), 22(1):18–23, 1996.
[17] C. Cantwell, S. Sherwin, R. Kirby, and P. Kelly. From h to p efficiently: Strategy
selection for operator evaluation on hexahedral and tetrahedral elements. Computers
& Fluids, 43(1):23–28, 2011.
[18] C. D. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo,
D. De Grazia, S. Yakovlev, J.-E. Lombard, D. Ekelschot, et al. Nektar++: An open-
source spectral/hp element framework. Computer Physics Communications, 192:205–
219, 2015.
[19] J. Castrillon, M. Lieber, S. Klüppelholz, M. Völp, N. Asmussen, U. Assmann,
F. Baader, C. Baier, G. Fettweis, J. Fröhlich, A. Goens, S. Haas, D. Habich, H. Härtig,
M. Hasler, I. Huismann, T. Karnagel, S. Karol, A. Kumar, W. Lehner, L. Leuschner,
S. Ling, S. Märcker, C. Menard, J. Mey, W. Nagel, B. Nöthen, R. Peñaloza, M. Raitza,
J. Stiller, A. Ungethüm, A. Voigt, and S. Wunderlich. A hardware/software stack
for heterogeneous systems. IEEE Transactions on Multi-Scale Computing Systems,
4(3):243–259, July 2018.
[20] C. Coarfa, Y. Dotsenko, J. Mellor-Crummey, F. Cantonnet, T. El-Ghazawi,
A. Mohanti, Y. Yao, and D. Chavarría-Miranda. An evaluation of global address space
languages: Co-array Fortran and Unified Parallel C. In Proceedings of the tenth ACM
SIGPLAN symposium on Principles and practice of parallel programming, pages 36–47.
ACM, 2005.
[21] W. Couzy and M. Deville. A fast Schur complement method for the spectral element
discretization of the incompressible Navier-Stokes equations. Journal of Computational
Physics, 116(1):135–142, 1995.
[22] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design
of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of
Solid-State Circuits, 9(5):256–268, 1974.
[23] M. Deville, P. Fischer, and E. Mund. High-Order Methods for Incompressible Fluid
Flow. Cambridge University Press, 2002.
[24] T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, and J. Dongarra. Hydrodynamic
computation with hybrid programming on CPU-GPU clusters. Innovative Computing
Laboratory, University of Tennessee, 2013. Online available at http://citeseerx.
ist.psu.edu/viewdoc/download?doi=10.1.1.423.4016&rep=rep1&type=pdf.
[25] T. Dong et al. A step towards energy efficient computing: Redesigning a hydrodynamic
application on CPU-GPU. In Parallel and Distributed Processing Symposium, 2014
IEEE 28th International, pages 972–981. IEEE, 2014.
[26] J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J.-C. André, D. Barkai,
J.-Y. Berthou, T. Boku, B. Braunschweig, et al. The international exascale software
project roadmap. International Journal of High Performance Computing Applications,
25(1):3–60, 2011.
[27] M. Dryja, B. F. Smith, and O. B. Widlund. Schwarz analysis of iterative
substructuring algorithms for elliptic problems in three dimensions. SIAM Journal on
Numerical Analysis, 31(6):1662–1694, 1994.
[28] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark
silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th
Annual International Symposium on, pages 365–376. IEEE, 2011.
[29] N. Fehn, W. A. Wall, and M. Kronbichler. Efficiency of high-performance discontinuous
Galerkin spectral element methods for under-resolved turbulent incompressible flows.
International Journal for Numerical Methods in Fluids, 2018.
[30] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, and T. Aoki. Performance modeling
and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters.
Parallel Computing, 46:1–13, 2015.
[31] X. Feng and O. A. Karakashian. Two-level additive Schwarz methods for a discon-
tinuous Galerkin approximation of second order elliptic problems. SIAM Journal on
Numerical Analysis, 39(4):1343–1365, 2001.
[32] J. H. Ferziger and M. Perić. Computational Methods for Fluid Dynamics.
Springer-Verlag, 2002.
[33] P. Fischer and J. Mullen. Filter-based stabilization of spectral element methods.
Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 332(3):265–270,
2001.
[34] P. F. Fischer. An overlapping Schwarz method for spectral element solution of the
incompressible Navier–Stokes equations. Journal of Computational Physics, 133(1):84–
101, 1997.
[35] P. F. Fischer, J. W. Lottes, and S. G. Kerkemeier. Nek5000 web page, 2008.
[36] J. Fröhlich. Large Eddy Simulation turbulenter Strömungen. Teubner, Wiesbaden,
2006. In German.
[37] M. J. Gander et al. Schwarz methods over the course of time. Electronic Transactions
on Numerical Analysis, 31:228–255, 2008.
[38] G. J. Gassner and A. D. Beck. On the accuracy of high-order discretizations for
underresolved turbulence simulations. Theoretical and Computational Fluid Dynamics,
27(3-4):221–237, 2013.
[39] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. Kelly. Performance
analysis of the OP2 framework on many-core architectures. ACM SIGMETRICS Per-
formance Evaluation Review, 38(4):9–15, 2011.
[40] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski,
and S. Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced
cluster. Parallel Computing, 33(10-11):685–699, 2007.
[41] G. H. Golub and Q. Ye. Inexact preconditioned conjugate gradient method with inner-
outer iteration. SIAM Journal on Scientific Computing, 21(4):1305–1320, 1999.
[42] K. Goto and R. Van De Geijn. High-performance implementation of the level-3 BLAS.
ACM Transactions on Mathematical Software, 35(1), July 2008.
[43] A. E. Green and G. I. Taylor. Mechanism of the production of small eddies from larger
ones. Proceedings of the Royal Society of London. Series A, 158, 1937.
[44] C. Gregg and K. Hazelwood. Where is the data? Why you cannot debate CPU
vs. GPU performance without the answer. In Proceedings of the IEEE International
Symposium on Performance Analysis of Systems and Software, ISPASS ’11, pages
134–144, Washington, DC, USA, 2011. IEEE Computer Society.
[45] L. Grinberg, D. Pekurovsky, S. Sherwin, and G. E. Karniadakis. Parallel performance of
the coarse space linear vertex solver and low energy basis preconditioner for spectral/hp
elements. Parallel Computing, 35(5):284–304, 2009.
[46] J.-L. Guermond, P. Minev, and J. Shen. An overview of projection methods for
incompressible flows. Computer methods in applied mechanics and engineering,
195(44):6011–6045, 2006.
[47] W. Hackbusch. Multigrid Methods and Applications, volume 4 of Computational Math-
ematics. Springer, 1985.
[48] D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer. An
energy efficiency feature survey of the Intel Haswell processor. In Parallel Distributed
Processing Symposium Workshops (IPDPSW), 2015 IEEE International, 2015.
[49] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists
and Engineers. CRC Press, Boca Raton, FL, USA, 1st edition, July 2010.
[50] R. Hartmann, M. Lukáčová-Medviďová, and F. Prill. Efficient preconditioning for the
discontinuous Galerkin finite element method by low-order elements. Applied
Numerical Mathematics, 59(8):1737–1753, 2009.
[51] L. Haupt. Erweiterte mathematische Methoden zur Simulation von turbulenten
Strömungsvorgängen auf parallelen Rechnern. PhD thesis, Centre for Information
Services and High Performance Computing (ZIH), TU Dresden, Dresden, 2017.
[52] L. Haupt, J. Stiller, and W. E. Nagel. A fast spectral element solver combining static
condensation and multigrid techniques. Journal of Computational Physics, 255:384–395,
2013.
[53] E. Hermann et al. Multi-GPU and multi-CPU parallelization for interactive physics
simulations. Euro-Par 2010-Parallel Processing, pages 235–246, 2010.
[54] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear
systems. Journal of Research of the National Bureau of Standards, 49(6):409–436,
1952.
[55] J. S. Hesthaven and T. Warburton. Nodal discontinuous Galerkin methods: algorithms,
analysis, and applications. Springer Science & Business Media, 2007.
[56] F. Hindenlang, G. J. Gassner, C. Altmann, A. Beck, M. Staudenmaier, and C.-D.
Munz. Explicit discontinuous Galerkin methods for unsteady problems. Computers &
Fluids, 61:86–93, 2012.
[57] I. Huismann, L. Haupt, J. Stiller, and J. Fröhlich. Sum factorization of the static
condensed Helmholtz equation in a three-dimensional spectral element discretization.
PAMM, 14(1):969–970, 2014.
[58] I. Huismann, M. Lieber, J. Stiller, and J. Fröhlich. Load balancing for CPU-GPU
coupling in computational fluid dynamics. In Parallel Processing and Applied
Mathematics, pages 371–380. Springer, 2017.
[59] I. Huismann, J. Stiller, and J. Fröhlich. Two-level parallelization of a fluid mechanics
algorithm exploiting hardware heterogeneity. Computers & Fluids, 117:114–124, 2015.
[60] I. Huismann, J. Stiller, and J. Fröhlich. Fast static condensation for the Helmholtz
equation in a spectral-element discretization. In Parallel Processing and Applied
Mathematics, pages 371–380. Springer, 2016.
[61] I. Huismann, J. Stiller, and J. Fröhlich. Building blocks for a leading edge high-order
flow solver. PAMM, 17(1), 2017.
[62] I. Huismann, J. Stiller, and J. Fröhlich. Factorizing the factorization – a spectral-
element solver for elliptic equations with linear operation count. Journal of
Computational Physics, 346:437–448, 2017.
[63] I. Huismann, J. Stiller, and J. Fröhlich. Scaling to the stars – a linearly scaling elliptic
solver for p-multigrid. Journal of Computational Physics, 398:108868, 2019.
[64] I. Huismann, J. Stiller, and J. Fröhlich. Efficient high-order spectral element
discretizations for building block operators of CFD. Computers & Fluids, 197:104386,
2020.
[65] I. Huismann, J. Stiller, and J. Fröhlich. Linearizing the hybridizable discontinuous
Galerkin method: a linearly scaling operator. arXiv preprint arXiv:2007.11891, 2020.
Submitted.
[66] W. Hundsdorfer and S. J. Ruuth. IMEX extensions of linear multistep methods with
general monotonicity and boundedness properties. Journal of Computational Physics,
225(2):2016–2042, 2007.
[67] H. Iwai. Future of nano CMOS technology. Solid-State Electronics, 112:56–67, 2015.
[68] D. Jacob, J. Petersen, B. Eggert, A. Alias, O. B. Christensen, L. M. Bouwer, A. Braun,
A. Colette, M. Déqué, G. Georgievski, et al. EURO-CORDEX: new high-resolution
climate change projections for European impact research. Regional Environmental
Change, 14(2):563–578, 2014.
[69] A. Jameson. Time dependent calculations using multigrid, with applications to un-
steady flows past airfoils and wings. In 10th Computational Fluid Dynamics Confer-
ence, page 1596, 1991.
[70] H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, and B. Chapman. High per-
formance computing using MPI and OpenMP on multi-core parallel systems. Parallel
Computing, 37:562–575, 2011.
[71] L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system
based on C++. In ACM Sigplan Notices, volume 28, pages 91–108. ACM, 1993.
[72] G. Kanschat. Robust smoothers for high-order discontinuous Galerkin discretizations
of advection–diffusion problems. Journal of Computational and Applied Mathematics,
218(1):53–60, 2008.
[73] T. Karnagel, D. Habich, and W. Lehner. Limitations of intra-operator parallelism
using heterogeneous computing resources. In ADBIS 2016, pages 291–305, 2016.
[74] G. Karniadakis and S. Sherwin. Spectral/hp Element Methods for CFD. Oxford Uni-
versity Press, 1999.
[75] G. E. Karniadakis, M. Israeli, and S. A. Orszag. High-order splitting methods for the
incompressible Navier-Stokes equations. Journal of Computational Physics, 97(2):414–
443, 1991.
[76] R. M. Kirby, S. J. Sherwin, and B. Cockburn. To CG or to HDG: a comparative study.
Journal of Scientific Computing, 51(1):183–212, 2012.
[77] A. Klöckner, T. Warburton, J. Bridge, and J. Hesthaven. Nodal discontinuous Galerkin
methods on graphics processors. Journal of Computational Physics, 228(21):7863–7882,
2009.
[78] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer,
M. Gerndt, D. Lorenz, A. Malony, W. E. Nagel, Y. Oleynik, P. Philippen, P. Saviankou,
D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf. Score-P: A
joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU,
and Vampir. In H. Brunst, M. S. Müller, W. E. Nagel, and M. M. Resch, editors,
Tools for High Performance Computing 2011, pages 79–91, Berlin, Heidelberg, 2012.
Springer Berlin Heidelberg.
[79] D. Komatitsch, G. Erlebacher, D. Göddeke, and D. Michéa. High-order finite-element
seismic wave propagation modeling with MPI on a large GPU cluster. Journal of
Computational Physics, 229(20):7692 – 7714, 2010.
[80] D. Koschichow, J. Fröhlich, R. Ciorciari, and R. Niehuis. Analysis of the influence
of periodic passing wakes on the secondary flow near the endwall of a linear LPT
cascade using DNS and U-RANS. In ETC2015-151, Proceedings of the 11th European
Conference on Turbomachinery Fluid Dynamics and Thermodynamics, 2015.
[81] E. Krause. Fluid mechanics. Springer, 2005.
[82] M. Kronbichler and W. A. Wall. A performance comparison of continuous and dis-
continuous Galerkin methods with fast multigrid solvers. SIAM Journal on Scientific
Computing, 40(5):A3423–A3448, 2018.
[83] Y.-Y. Kwan and J. Shen. An efficient direct parallel spectral-element solver for sepa-
rable elliptic problems. Journal of Computational Physics, 225(2):1721 – 1735, 2007.
[84] F. Lemaitre and L. Lacassagne. Batched Cholesky factorization for tiny matrices. In
Design and Architectures for Signal and Image Processing (DASIP), 2016 Conference
on, pages 130–137. IEEE, 2016.
[85] C. Liu and J. Shen. A phase field model for the mixture of two incompressible fluids
and its approximation by a Fourier-spectral method. Physica D: Nonlinear Phenomena,
179(3–4):211–228, 2003.
[86] X. Liu et al. A hybrid solution method for CFD applications on GPU-accelerated
hybrid HPC platforms. Future Generation Computer Systems, 56:759–765, 2016.
[87] J.-E. W. Lombard, D. Moxey, S. J. Sherwin, J. F. A. Hoessler, S. Dhandapani, and
M. J. Taylor. Implicit large-eddy simulation of a wingtip vortex. AIAA Journal, pages
1–13, Nov. 2015.
[88] J. W. Lottes and P. F. Fischer. Hybrid multigrid/Schwarz algorithms for the spectral
element method. Journal of Scientific Computing, 24(1):45–78, 2005.
[89] R. Lynch, J. Rice, and D. Thomas. Direct solution of partial difference equations by
tensor product methods. Numerische Mathematik, 6(1):185–199, 1964.
[90] M. Manna, A. Vacca, and M. O. Deville. Preconditioned spectral multi-domain dis-
cretization of the incompressible Navier–Stokes equations. Journal of Computational
Physics, 201(1):204 – 223, 2004.
[91] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Don-
garra. High-performance matrix-matrix multiplications of very small matrices. In
European Conference on Parallel Processing, pages 659–671. Springer, 2016.
[92] C. P. Mellen, J. Fröhlich, and W. Rodi. Lessons from LESFOIL project on large-eddy
simulation of flow around an airfoil. AIAA journal, 41(4):573–581, 2003.
[93] E. Merzari, W. Pointer, and P. Fischer. Numerical simulation and proper orthogonal
decomposition of the flow in a counter-flow T-junction. Journal of Fluids Engineering,
135(9):091304, 2013.
[94] H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon. Top500 list – June 2018, 2018.
Available online at www.top500.org.
[95] G. E. Moore. Cramming more components onto integrated circuits. Electronics,
38(8):114–117, 1965.
[96] R. D. Moser, J. Kim, and N. N. Mansour. Direct numerical simulation of turbulent
channel flow up to Reτ = 590. Physics of Fluids, 11(4):943–945, 1999.
[97] R. Moura, S. Sherwin, and J. Peiró. Eigensolution analysis of spectral/hp continuous
Galerkin approximations to advection–diffusion problems: Insights into spectral
vanishing viscosity. Journal of Computational Physics, 307:401–422, 2016.
[98] D. Moxey, C. Cantwell, R. Kirby, and S. Sherwin. Optimising the performance of
the spectral/hp element method with collective linear algebra operations. Computer
Methods in Applied Mechanics and Engineering, 310:628–645, 2016.
[99] K. Nelson and O. Fringer. Reducing spin-up time for simulations of turbulent channel
flow. Physics of Fluids, 29(10):105101, 2017.
[100] C. W. Oosterlee and T. Washio. An evaluation of parallel multigrid as a solver and a
preconditioner for singularly perturbed problems. SIAM Journal on Scientific Com-
puting, 19(1):87–110, 1998.
[101] OpenACC-Standard.org. The OpenACC application programming interface version
2.6, 2017. Specification available online.
[102] OpenMP Architecture Review Board. OpenMP application program interface version
4.5, 2015. Specification available online.
[103] S. Páll et al. Tackling exascale software challenges in molecular dynamics simulations
with GROMACS. In EASC 2014, pages 3–27. 2015.
[104] R. Pasquetti and F. Rapetti. p-multigrid method for Fekete-Gauss spectral element
approximations of elliptic problems. Communications in Computational Physics, 5(2-
4):667–682, Feb 2009.
[105] A. T. Patera. A spectral element method for fluid dynamics: laminar flow in a channel
expansion. Journal of Computational Physics, 54(3):468 – 488, 1984.
[106] L. F. Pavarino and O. B. Widlund. A polylogarithmic bound for an iterative
substructuring method for spectral elements in three dimensions. SIAM Journal on
Numerical Analysis, 33(4):1303–1335, 1996.
[107] L. F. Pavarino and O. B. Widlund. Iterative substructuring methods for spectral
elements: Problems in three dimensions based on numerical quadrature. Computers
& Mathematics with Applications, 33(1):193–209, 1997.
[108] R. Peyret. Spectral methods for incompressible viscous flow, volume 148. Springer
Science & Business Media, 2013.
[109] C. Rahm. Validierung und Erweiterung eines Strömungslösers hoher Ordnung für
heterogene HPC-Systeme. Master's thesis, Institut für Strömungsmechanik, TU Dresden,
Dresden, Germany, 2017. In German.
[110] I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. Kelly, and
D. Radford. Acceleration of a full-scale industrial CFD application with OP2. IEEE
Transactions on Parallel and Distributed Systems, 27(5):1265–1278, 2016.
[111] J. Reid. Coarrays in Fortran 2008. In Proceedings of the Third Conference on Parti-
tioned Global Address Space Programing Models, PGAS ’09, pages 4:1–4:1, New York,
NY, USA, 2009. ACM.
[112] N. A. Rink, I. Huismann, A. Susungi, J. Castrillon, J. Stiller, J. Fröhlich, and
C. Tadonki. CFDlang: High-level code generation for high-order methods in fluid
dynamics. In Proceedings of the Real World Domain Specific Languages Workshop 2018,
RWDSL2018, pages 5:1–5:10, New York, NY, USA, 2018. ACM.
[113] M. P. Robson, R. Buch, and L. V. Kale. Runtime coordinated heterogeneous tasks in
Charm++. In Extreme Scale Programming Models and Middleware (ESPM2),
International Workshop on, pages 40–43. IEEE, 2016.
[114] E. M. Rønquist and A. T. Patera. Spectral element multigrid. I. Formulation and
numerical results. Journal of Scientific Computing, 2(4):389–406, 1987.
[115] H. Schlichting and K. Gersten. Boundary-layer theory. Springer, 9th edition, 2017.
[116] J. Schmidt. NAM – network attached memory. In Doctoral Showcase summary
and poster at International Conference for High Performance Computing, Network-
ing, Storage and Analysis, SC, 2016.
[117] S. J. Sherwin and M. Casarin. Low-energy basis preconditioning for elliptic substruc-
tured solvers based on unstructured spectral/hp element discretization. Journal of
Computational Physics, 171(1):394–417, 2001.
[118] J. R. Shewchuk. An introduction to the conjugate gradient method without the
agonizing pain. Technical report, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
[119] J. Slotnick, A. Khodadoust, J. Alonso, D. Darmofal, W. Gropp, E. Lurie, and
D. Mavriplis. CFD vision 2030 study: a path to revolutionary computational aero-
sciences. 2014.
[120] B. F. Smith. A parallel implementation of an iterative substructuring algorithm for
problems in three dimensions. SIAM Journal on Scientific Computing, 14(2):406–423,
1993.
[121] T. Steinke, A. Reinefeld, and T. Schütt. Experiences with high-level programming of
FPGAs on Cray XD1. Cray User Group (CUG 2006), 2006.
[122] J. Stiller. Robust multigrid for high-order discontinuous Galerkin methods: A fast
Poisson solver suitable for high-aspect ratio Cartesian grids. Journal of Computational
Physics, 327:317–336, 2016.
[123] J. Stiller. Nonuniformly weighted Schwarz smoothers for spectral element multigrid.
Journal of Scientific Computing, 72(1):81–96, 2017.
[124] J. Stiller. Robust multigrid for Cartesian interior penalty DG formulations of the
Poisson equation in 3D. In Spectral and High Order Methods for Partial Differential
Equations ICOSAHOM 2016, pages 189–201. Springer, 2017.
[125] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for
heterogeneous computing systems. Computing in science & engineering, 12(3):66–73,
2010.
[126] H. Sundar, G. Stadler, and G. Biros. Comparison of multigrid algorithms for high-order
continuous finite element discretizations. Numerical Linear Algebra with Applications,
22(4):664–680, 2015.
[127] A. Susungi, N. A. Rink, J. Castrillon, I. Huismann, A. Cohen, C. Tadonki, J. Stiller,
and J. Fröhlich. Towards compositional and generative tensor optimizations. In Conf.
Generative Programming: Concepts & Experience (GPCE’17), pages 169–175, 2017.
[128] The MPI Forum. MPI: A message passing interface version 3.1, 2015.
[129] T. N. Theis and H.-S. P. Wong. The end of Moore’s law: A new beginning for infor-
mation technology. Computing in Science & Engineering, 19(2):41–50, 2017.
[130] U. Trottenberg, C. Oosterlee, and A. Schüller. Multigrid. Academic Press, 2001.
[131] M. Völp, S. Klüppelholz, J. Castrillon, H. Härtig, N. Asmussen, U. Assmann,
F. Baader, C. Baier, G. Fettweis, J. Fröhlich, A. Goens, S. Haas, D. Habich, M. Hasler,
I. Huismann, T. Karnagel, S. Karol, W. Lehner, L. Leuschner, M. Lieber, S. Ling,
S. Märcker, J. Mey, W. Nagel, B. Nöthen, R. Peñaloza, M. Raitza, J. Stiller,
A. Ungethüm, and A. Voigt. The Orchestration Stack: The impossible task of designing
software for unknown future post-CMOS hardware. In Proceedings of the 1st Interna-
tional Workshop on Post-Moore’s Era Supercomputing (PMES), Co-located with The
International Conference for High Performance Computing, Networking, Storage and
Analysis (SC16), Salt Lake City, USA, Nov. 2016.
[132] Wikipedia contributors. List of Intel CPU microarchitectures — Wikipedia, the free
encyclopedia, 2018. [Online available at https://en.wikipedia.org/w/index.php?
title=List_of_Intel_CPU_microarchitectures&oldid=864965902; accessed 27th
October 2018].
[133] M. V. Wilkes. The memory gap and the future of high performance memories. ACM
SIGARCH Computer Architecture News, 29(1):2–7, 2001.
[134] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual perfor-
mance model for multicore architectures. Communications of the ACM, 52(4):65–76,
2009.
[135] E. L. Wilson. The static condensation algorithm. International Journal for Numerical
Methods in Engineering, 8(1):198–203, 1974.
[136] C. S. Woodward. A Newton-Krylov-multigrid solver for variably saturated flow prob-
lems. WIT Transactions on Ecology and the Environment, 24, 1998.
[137] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang,
Z. Wang, W. Liu, and X. Cheng. Collaborating CPU and GPU for large-scale high-
order CFD simulations with complex grids on the TianHe-1A supercomputer. Journal
of Computational Physics, 278:275–297, 2014.
[138] S. Yakovlev, D. Moxey, R. Kirby, and S. Sherwin. To CG or to HDG: A comparative
study in 3D. Journal of Scientific Computing, pages 1–29, 2015.
[139] C. Yang et al. Adaptive optimization for petascale heterogeneous CPU/GPU com-
puting. In Cluster Computing (CLUSTER), 2010 IEEE International Conference on,
pages 19–28. IEEE, 2010.
[140] Z. Zhong, V. Rychkov, and A. Lastovetsky. Data partitioning on multicore and
multi-GPU platforms using functional performance models. IEEE Transactions on
Computers, 64(9):2506–2518, 2015.