8/3/2019 01Front Cai
1/22
Region-based Techniques for Modeling and Enhancing Cluster
OpenMP Performance
Jie Cai
August 2011
A thesis submitted for the degree of Doctor of Philosophy
of the Australian National University
© Jie Cai 2011
This document was produced using TeX, LaTeX and BibTeX
For my wife, Ruru, who greatly supported my PhD research...
...and my loving parents.
Declaration
I declare that the work in this thesis is entirely my own and that to the best of
my knowledge it does not contain any materials previously published or written by
another person except where otherwise indicated.
Jie Cai
Acknowledgements
During my PhD, many people have offered kind help and generous support. I would
like to thank them and appreciate their help.
Supervisors
Dr. Peter Strazdins for the guidance and advice during my whole doctoral
research; Dr. Alistair Rendell for the time we spent together before paper
deadlines; Dr. Eric McCreath for the useful comments.
Readers
For reading my thesis and providing valuable feedback, thank you.
Warren Armstrong, Muhammad Atif, Michael Chapman, Pete Janes, Josh
Milthorpe, Peter Strazdins, and Jin Wong.
Computer System Group Members
For the cheerful four years of my PhD, thank you geeks.
Joseph Anthony, Ting Cao, Elton Tian, Xi Yang, Fangzhou Xiao and more ...
Industry Partners
For their generous financial contribution to support my research.
Australian Research Council, Intel, Sun Microsystems (Oracle)
Last but definitely not least
For being so supportive, NCI NF colleagues.
Ben Evans, Robin Humble, Judy Jenkinson and David Singleton.
Abstract
Cluster OpenMP enables the use of the OpenMP shared memory programming
model on distributed memory cluster environments. Intel has released a cluster
OpenMP implementation called Intel Cluster OpenMP (CLOMP). While this offers
better programmability than message passing alternatives such as the Message
Passing Interface (MPI), such convenience comes with overheads resulting from
having to maintain the consistency of underlying shared memory abstractions.
CLOMP is no exception. This thesis introduces models for understanding these
overheads of cluster OpenMP implementations like CLOMP and proposes tech-
niques for enhancing their performance.
Cluster OpenMP systems are usually implemented using page-based software
distributed shared memory (sDSM) systems, which create and maintain virtual
global shared memory spaces in pages. A key issue for such systems is maintaining
the consistency of the shared memory space. This forms a major source of overhead,
and it is driven by detecting and servicing page faults.
To investigate and understand these systems, we evaluate their performance
with different OpenMP applications, and we also develop a benchmark, called
MCBENCH, to characterize the memory consistency costs. Using MCBENCH, we
discover that this overhead is proportional to the number of writers to the same
shared page and the number of shared pages.
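As a rough illustration of how such a benchmark can vary the number of writers per page, the following Python sketch models only the chunk-to-thread assignment of the Change and Read phases, following the description given for Figure 3.7 in the List of Figures. It is not the MCBENCH source: the function names, the `shift` counter and the direction of the neighbour shift are assumptions made for this sketch, and no memory is actually touched.

```python
from typing import List, Tuple


def chunk_owner(chunk: int, shift: int, num_threads: int) -> int:
    """Thread that touches `chunk` under a round-robin distribution.

    `shift` rotates the distribution; the rotation direction is an
    assumption of this sketch.
    """
    return (chunk + shift) % num_threads


def mcbench_schedule(num_chunks: int, num_threads: int,
                     iterations: int) -> List[Tuple[List[int], List[int]]]:
    """Return, per iteration, (Change owners, Read owners) for every chunk."""
    schedule = []
    shift = 0
    for _ in range(iterations):
        # Change phase: each thread writes the chunks it currently owns.
        change = [chunk_owner(c, shift, num_threads) for c in range(num_chunks)]
        # The distribution shifts only when moving from Change to Read,
        # so each thread reads a chunk that a neighbour has just written.
        shift += 1
        read = [chunk_owner(c, shift, num_threads) for c in range(num_chunks)]
        schedule.append((change, read))
    return schedule
```

With a page size of 4KB (an assumed value) and a chunk size of 4 bytes, consecutive chunks on one page belong to different threads, so the number of writers per page equals the thread count; with 4KB chunks each page has a single writer.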
Furthermore, we divide an OpenMP program into separate parallel and serial
regions. Based on the regions, we develop two region-based models to rationalize
the numbers and types of the page faults and their associated costs to performance.
The models highlight the fact that the major overhead comes from servicing the type of
page fault that requires data (a page or its modifications, known as diffs) to
be transferred across a network.
With this understanding, we have developed three region-based prefetch
(ReP) techniques based on the execution history of each parallel and sequential
region. The first ReP technique (TReP) considers temporal page faulting behaviour
between consecutive executions of the same region. The second technique (HReP)
considers both the temporal page faulting behaviour between consecutive execu-
tions of the same region and the spatial paging behaviour within an execution of
a region. The last technique (DReP) utilizes our proposed novel stride-augmented
run length encoding (sRLE) method to address both the temporal and spatial
page faulting behaviour between consecutive executions of the same region. These
techniques effectively reduce the number of page faults and aggregate data (pages
and diffs) into larger transfers, which leverages the network bandwidth provided
by interconnects.
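To make the idea behind sRLE more concrete, the sketch below shows a greedy first-level encoding of a sorted list of faulted page IDs into (StartPageID, CommonStride, RunLength) records, the record format named in the caption of Figure 5.7. It is only an illustrative sketch: the thesis's own algorithms are listed in Appendix A, and the greedy grouping, the function names and the reading of RunLength as "number of pages in the run" are assumptions made here.

```python
from typing import List, Tuple

# (StartPageID, CommonStride, RunLength); RunLength is assumed to count pages.
Record = Tuple[int, int, int]


def srle_encode(pages: List[int]) -> List[Record]:
    """Greedily group sorted, distinct page IDs into constant-stride runs."""
    records: List[Record] = []
    i, n = 0, len(pages)
    while i < n:
        start = pages[i]
        if i + 1 == n:
            # A trailing single page becomes a run of length 1 with stride 0.
            records.append((start, 0, 1))
            break
        stride = pages[i + 1] - start
        run = 2
        # Extend the run while the next page keeps the same stride.
        while i + run < n and pages[i + run] - pages[i + run - 1] == stride:
            run += 1
        records.append((start, stride, run))
        i += run
    return records


def srle_decode(records: List[Record]) -> List[int]:
    """Expand records back into the list of page IDs they describe."""
    return [start + k * stride
            for start, stride, run in records
            for k in range(run)]
```

For example, `srle_encode([10, 11, 12, 13, 20, 24, 28, 40])` yields `[(10, 1, 4), (20, 4, 3), (40, 0, 1)]`, and `srle_decode` restores the original list. Regular access patterns therefore collapse into a few records, which is what makes a compact per-region fault history practical.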
All three ReP techniques are implemented in the runtime libraries of CLOMP
to enhance its performance. Both the original and the enhanced CLOMP are
evaluated using the NAS Parallel Benchmark OpenMP (NPB-OMP) suite, and two
LINPACK OpenMP benchmarks on different hardware platforms, including two
clusters connected with Ethernet and InfiniBand interconnects. The performance
data is quantitatively analyzed and modeled. Also, MCBENCH is used to evaluate
the impact of ReP techniques on memory consistency cost.
The evaluation results demonstrate that, on average, CLOMP spends 75% and
55% of the overall elapsed time of the NPB-OMP benchmarks on page fault
handling over Gigabit Ethernet and double data rate InfiniBand networks
respectively. These ratios are effectively reduced by 60% and 40% after
implementing the ReP techniques in the CLOMP runtime. For the LINPACK
benchmarks, with the assistance of sRLE, DReP significantly outperforms the
other ReP techniques, effectively reducing page fault handling costs by 50%
and 58% on Gigabit Ethernet and InfiniBand networks respectively.
Contents
Declaration v
Acknowledgements vii
Abstract ix
I Introduction and Background 1
1 Introduction 3
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Performance Evaluation of CLOMP . . . . . . . . . . . . . . . 6
1.2.2 Region-based Performance Models . . . . . . . . . . . . . . . . 7
1.2.3 Region-based Prefetch Techniques . . . . . . . . . . . . . . . . 8
1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Background 11
2.1 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Synchronization Operations . . . . . . . . . . . . . . . . . . . . 16
2.2 Cluster OpenMP Systems . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Relaxed Memory Consistency . . . . . . . . . . . . . . . . . . . 18
2.2.2 Software Distributed Shared Memory Systems . . . . . . . . . 19
2.2.3 Intel Cluster OpenMP . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 Alternative Approaches to sDSMs . . . . . . . . . . . . . . . . 26
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Performance Models . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Prefetch Techniques for sDSM Systems . . . . . . . . . . . . . 31
2.3.3 Run-Length Encoding Methods . . . . . . . . . . . . . . . . . . 35
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
II Performance Issues of Intel Cluster OpenMP 39
3 Performance of Original Intel Cluster OpenMP System 41
3.1 Hardware and Software Setup . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Performance of CLOMP . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 NPB OpenMP Benchmarks Sequential Performance . . . . . . 44
3.2.2 Comparison of CLOMP and Intel Native OpenMP on a Single
Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 CLOMP with Single Thread per Compute Node . . . . . . . . 48
3.2.4 CLOMP with Multiple Threads per Compute Node . . . . . . 48
3.2.5 Elapsed Time Breakdown for NPB-OMP Benchmarks . . . . . 53
3.3 Memory Consistency Cost of CLOMP . . . . . . . . . . . . . . . . . . . 55
3.3.1 Memory Consistency Cost Micro-Benchmark MCBENCH . . 56
3.3.2 MCBENCH Evaluation of CLOMP . . . . . . . . . . . . . . . . 57
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Region-Based Performance Models 65
4.1 Regions of OpenMP Programs . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 SIGSEGV Driven Performance (SDP) Models . . . . . . . . . . . . . . 67
4.2.1 Critical Path Model . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Aggregated Model . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.3 Coefficient Measurement . . . . . . . . . . . . . . . . . . . . . . 71
4.3 SDP Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1 Critical Path Model Estimates . . . . . . . . . . . . . . . . . . 73
4.3.2 Aggregate Model Estimates . . . . . . . . . . . . . . . . . . . . 74
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
III Optimizations: Design, Implementation and Evaluation 79
5 Region-Based Prefetch Techniques 81
5.1 Limitations of Current Prefetch Techniques for sDSM Systems . . . 82
5.1.1 Parallel Application Examples . . . . . . . . . . . . . . . . . . 82
5.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.3 Prefetch Technique Design Assumptions . . . . . . . . . . . . . 88
5.2 Evaluation Metrics of Prefetch Techniques . . . . . . . . . . . . . . . 89
5.3 Temporal ReP (TReP) Technique . . . . . . . . . . . . . . . . . . . . . 90
5.4 Hybrid ReP (HReP) Technique . . . . . . . . . . . . . . . . . . . . . . 90
5.5 ReP Technique for Dynamic Memory Accessing Applications (DReP) 93
5.5.1 Stride-augmented Run-length Encoded Page Fault Records . 93
5.5.2 Page Miss Prediction . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6 Offline Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6.2 Simulation Results and Discussions . . . . . . . . . . . . . . . 98
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6 Implementation and Evaluation 111
6.1 ReP Prefetch Techniques Implementation Issues . . . . . . . . . . . . 112
6.1.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.2 New Region Notification . . . . . . . . . . . . . . . . . . . . . . 114
6.1.3 Record Encoding and Flush Filter enabled Decoding . . . . . . 116
6.1.4 Prefetch Page Prediction . . . . . . . . . . . . . . . . . . . . . . 116
6.1.5 Prefetch Request and Event Handling . . . . . . . . . . . . . . 117
6.1.6 Page State Transition . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.7 Garbage Collection Mechanism . . . . . . . . . . . . . . . . . . 119
6.2 Theoretical Performance of the ReP Enhanced CLOMP . . . . . . . . 120
6.3 Performance Evaluation of the ReP Enhanced CLOMP . . . . . . . . 123
6.3.1 MCBENCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3.2 NPB OpenMP Benchmarks . . . . . . . . . . . . . . . . . . . . 130
6.3.3 LINPACK Benchmarks . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.4 ReP Techniques with Multiple Threads per Process . . . . . . 142
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
IV Conclusions and Future Work 147
7 Conclusions and Future Work 149
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.1.1 Performance Evaluation of CLOMP . . . . . . . . . . . . . . . 150
7.1.2 SIGSEGV Driven Performance Models . . . . . . . . . . . . . . 152
7.1.3 Performance Enhancement by RePs . . . . . . . . . . . . . . . 152
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . 156
7.2.2 Performance Optimizations . . . . . . . . . . . . . . . . . . . . 156
7.2.3 Adapting ReP Techniques to the Latest Technologies . . . . . 156
7.2.4 Potential Use of sRLE . . . . . . . . . . . . . . . . . . . . . . . 157
V Appendices 159
A Algorithms Used in DReP 161
A.1 Stride-augmented Run-length Encoding Algorithms . . . . . . . . . . 162
A.1.1 Algorithm 1: Page Fault Record Reconstruction Step (a) . . . 162
A.1.2 Algorithm 2: Page Fault Record Reconstruction Step (b) . . . 162
A.1.3 Algorithm 3: Page Fault Record Reconstruction Step (c) . . . . 163
A.2 Algorithm 4: DReP Predictor . . . . . . . . . . . . . . . . . . . . . . . 163
B $T_{segv,local}$ and $N_f^{total}$ for Theoretical ReP Speedup Calculation 165
B.1 NPB-OMP Benchmarks Datasheet . . . . . . . . . . . . . . . . . . . . 166
B.2 LINPACK Benchmarks Datasheet . . . . . . . . . . . . . . . . . . . . 166
C TReP and DReP Performance Results of the NPB-OMP benchmarks
on a 4-node Intel Cluster 169
C.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
C.2 Sequential Elapsed Time . . . . . . . . . . . . . . . . . . . . . . . . . . 170
C.3 TReP and DReP Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 170
C.3.1 Elapsed Time over Gigabit Ethernet . . . . . . . . . . . . . . . 170
C.3.2 Elapsed Time over DDR InfiniBand . . . . . . . . . . . . . . . 173
D MultiRail Networks Optimization for the Communication Layer 177
D.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
D.2 Micro-Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
D.2.1 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
D.2.2 Single-Rail Benchmark . . . . . . . . . . . . . . . . . . . . . . . 179
D.2.3 Multirail Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 180
D.3 Bandwidth and Latency Experiments . . . . . . . . . . . . . . . . . . 181
D.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 182
D.3.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
D.3.3 Uni-directional Bandwidth . . . . . . . . . . . . . . . . . . . . . 185
D.3.4 Bi-directional Bandwidth . . . . . . . . . . . . . . . . . . . . . 186
D.3.5 Elapsed Time Breakdown . . . . . . . . . . . . . . . . . . . . . 188
D.4 Related Work on Multirail InfiniBand Network . . . . . . . . . . . . . 190
D.5 Challenge and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 191
E Performance of CAL 193
E.1 Bandwidth and Latency of CAL . . . . . . . . . . . . . . . . . . . . . . 194
E.2 Comparison Between OpenMPI and CAL . . . . . . . . . . . . . . . . 194
Bibliography 197
List of Figures
2.1 OpenMP fork-join multi-threading parallelism mechanism [93] . . . . 13
2.2 OpenMP parallel directives and associated clauses in C and C++. . . . 13
2.3 OpenMP for directives and associated clauses in C and C++. . . . . . 14
2.4 An example OpenMP program in C using parallel for directives. . . . . 15
2.5 OpenMP synchronization directives in C and C++ languages: (a)
barrier, and (b) flush. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 OpenMP threadprivate directive in C and C++ languages. . . . . . . . 16
2.7 Processes and threads in CLOMP . . . . . . . . . . . . . . . . . . . . . 23
2.8 State machine of CLOMP (derived from [47], [38], and experimental
observation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9 Illustration of two prefetch modes for Adaptive++ techniques. . . . . 34
3.1 Comparison of performance between native Intel OpenMP and
CLOMP on a XE compute node. . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Comparison of performance between native Intel OpenMP and
CLOMP on a VAYU compute node. . . . . . . . . . . . . . . . . . . . . 46
3.3 Performance of CLOMP on XE with a single thread per compute node. 49
3.4 Performance of CLOMP on VAYU with a single thread per compute
node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Performance of CLOMP on XE with multi-threads per compute node. 52
3.6 Performance of CLOMP on VAYU with multi-threads per compute
node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7 MCBENCH: An array of size a bytes is divided into chunks of c bytes.
The benchmark consists of Change and Read phases that can
be repeated for multiple iterations. Entering the Change phase of
the first iteration, the chunks are distributed to the available threads
(four in this case) in a round-robin fashion. In the Read phase after
the barrier, each thread reads from the chunk that its neighbour had
written to. This is followed by a barrier which ends the first iteration.
For the subsequent iteration, the chunks to Change are the same
as in the previous Read phase. That is, the shifting of the chunk
distribution only takes place when moving from the Change to Read
phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.8 MCBENCH evaluation results of CLOMP on XE with both Ethernet
and InfiniBand interconnects: 64KB, 4MB and 8MB array sizes
are used in these three figures respectively; comparison among
different chunk sizes 4B, 2KB and 4KB is illustrated in each figure
for both Ethernet and InfiniBand. . . . . . . . . . . . . . . . . . . . . . 59
4.1 Illustration of regions in an OpenMP parallel program. . . . . . . . . 67
4.2 Schematic illustration of timing breakdown for parallel region using
the SDP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 The algorithm used to determine the SDP coefficients. The code
shown is in a parallel region. R is a private array while S is a shared
one. Variables Dw and Dr represent reference times for accessing
private array R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Pseudo code to demonstrate the memory access patterns of the naive
LINPACK OpenMP benchmark implementation for an n × n
column-major matrix A with blocking factor nb. . . . . . . . . . . . . . 83
5.2 Naive OpenMP LINPACK program with an n × n matrix: (a) memory
access areas for different iterations. (b) page fault areas for different
iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Pseudo code to demonstrate the memory access patterns of the
optimized LINPACK OpenMP benchmark implementation for an
n × n column-major matrix A with blocking factor nb. . . . . . . . . . 86
5.4 Optimized OpenMP LINPACK program: (a) memory access areas for
different iterations illustrated on an n × n matrix panel. (b) page fault
areas for different iterations illustrated on the n × n matrix panel. . 87
5.5 The page fault record entry for TReP and HReP prefetch techniques. 90
5.6 A flowchart of the HReP predictor. . . . . . . . . . . . . . . . . . . . . 92
5.7 Two levels of stride-augmented run-length encoding (sRLE) method:
(a) Based on strides between consecutive pages, sorted missed
pages are broken into small sub-arrays, and those consecutive
pages with the same stride are stored in the same array. (b)
The sub-arrays are compressed into the first level sRLE records
in a (StartPageID, CommonStride, RunLength) format. (c) Based
on the stride between the start pages of consecutive first level
sRLE records, they are further compressed into the second level
sRLE format, (FirstLevelRecord, CommonStride, RunLength) (more
details in Section 5.5.1). . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.8 Page fault record of region execution reconstructed via run-length
encoding method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.9 The effective page miss rate reduction for different prefetch tech-
niques on 2 threads (a), 4 threads (b) and 8 threads (c). . . . . . . . . 103
6.1 Intel Cluster OpenMP runtime structure. . . . . . . . . . . . . . . . . 112
6.2 Data structure for stride-augmented run-length encoded page fault
records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3 ReP prefetch record data structure. . . . . . . . . . . . . . . . . . . . . 115
6.4 User interactive interface of new region notification. . . . . . . . . . . 115
6.5 The round-robin prefetch request communication pattern. . . . . . . 118
6.6 New page state machine after introducing the Prefetched diff and
Prefetched page states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.7 RePs VS. Original CLOMP: MCBENCH with 4B chunk size over both
the GigE and IB networks. (a) 64KB array size, (b) 4MB array size,
(c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.8 RePs VS. Original CLOMP: MCBENCH with 2KB chunk size
over both the GigE and IB networks. (a) 64KB array size, (b) 4MB
array size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . 127
6.9 RePs VS. Original CLOMP: MCBENCH with 4KB chunk size over
both the GigE and IB networks. (a) 64KB array size, (b) 4MB array
size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.10 RePs VS. Original CLOMP: BT speedup comparison on both GigE
and IB networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.11 RePs VS. Original CLOMP: the naive LINPACK evaluation results
comparison using an N × N matrix (N = 4096) with blocking factor NB =
64 via both GigE and IB. . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.12 RePs VS. Original CLOMP: the optimized LINPACK evaluation
results comparison using an N × N matrix (N = 8192) with blocking
factor NB = 64 via both GigE and IB. . . . . . . . . . . . . . . . . . . 141
6.13 DReP vs Original CLOMP: the optimized LINPACK benchmark (N =
8192 and NB = 64) results comparison with multiple threads per
process via both GigE and IB. (a) 2 threads per process, (b) 4 threads
per process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.1 Speedup of the BT and CG benchmarks over Gigabit Ethernet. . . . 171
C.2 Speedup of IS and LU benchmarks over Gigabit Ethernet. . . . . . . 172
C.3 Speedup of BT and CG benchmarks over DDR InfiniBand. . . . . . . 174
C.4 Speedup of IS and LU benchmarks over DDR InfiniBand. . . . . . . . 175
D.1 Single-rail bandwidth benchmark . . . . . . . . . . . . . . . . . . . . . 179
D.2 Multirail communication memory access pattern. . . . . . . . . . . . 180
D.3 Non-threaded multirail bandwidth benchmark . . . . . . . . . . . . . 182
D.4 Threaded multirail benchmark design. . . . . . . . . . . . . . . . . . . 183
D.5 RDMA write latency comparison. . . . . . . . . . . . . . . . . . . . . . 184
D.6 Uni-directional multi-port bandwidth. . . . . . . . . . . . . . . . . . . 185
D.7 Uni-directional multi-HCA bandwidth. . . . . . . . . . . . . . . . . . . 186
D.8 Bi-directional multi-port bandwidth. . . . . . . . . . . . . . . . . . . . 187
D.9 Bi-directional multi-HCA bandwidth. . . . . . . . . . . . . . . . . . . 188
D.10 Benchmarks elapsed time breakdown for 512-byte message. . . . . . 188
D.11 Benchmarks elapsed time breakdown for 4KB message. . . . . . . . . 189
D.12 Different ways to configure an InfiniBand multirail network [62]. . . 190
List of Tables
2.1 OpenMP synchronization operations. . . . . . . . . . . . . . . . . . . . 17
3.1 Evaluation experimental hardware platforms. . . . . . . . . . . . . . 43
3.2 Sequential elapsed time (sec) of NPB with CLOMP. . . . . . . . . . . 44
3.3 Page fault handling cost (SEGV Cost) of CLOMP for NPB benchmarks
as a ratio to the corresponding elapsed time with a single thread
per process on XE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Page faults handling cost breakdown for CLOMP for class A NPB
benchmarks with multiple threads per process on XE. SEGV
represents the ratio of page faults handling cost to the corresponding
elapsed time; SEGV Lock in turn represents a ratio of pthread mutex
cost within SEGV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1 Critical path page faults counts for the NPB-OMP benchmarks run
using CLOMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Comparison between observed and estimated speedup for running
NPB class A and C on the AMD cluster with CLOMP . . . . . . . . . 77
4.3 Average relative errors for the predicted NPB speedups evaluated
using the critical path and aggregate (f = 0) SDP models and data
from Table 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Threshold effects of ReP techniques for naive LINPACK benchmark. 98
5.2 Simulation prefetch efficiency (E) and coverage (Nu/Nf) for Adapt-
ive++, TODFCM (1 page), TReP, HReP and DReP techniques. . . . . 108
5.3 Breakdown of prefetches issued by different prefetch modes and
chosen list deployed in HReP. . . . . . . . . . . . . . . . . . . . . . . . 109
5.4 Comparison of F-HReP and HReP with the LU benchmark. . . . . . 109
6.1 Bandwidth and latency measured by the communication layer (CAL)
of CLOMP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 ReP techniques prefetch efficiency and coverage for MCBENCH with
4MB array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3 Message transfer counts (×1000) comparison between RePs enhanced
CLOMP and the original CLOMP for MCBENCH with 4B chunk . . 126
6.4 Message transfer counts (×1000) comparison between RePs enhanced
CLOMP and the original CLOMP for MCBENCH with 2KB chunk . 128
6.5 Message transfer counts (×1000) comparison between RePs enhanced
CLOMP and the original CLOMP for MCBENCH with 4KB chunk . 129
6.6 Page fault handling costs comparison for BT benchmark among the
original CLOMP, the theoretical and the ReP techniques enhanced
CLOMP. The computation part of elapsed time is common to all
compared items. The page fault handling cost of the original
CLOMP is presented in seconds, and those of the others are presented as a
reduction ratio (e.g. $\frac{\text{Orig} - \text{TReP}}{\text{Orig}}$). . . . . . . . . . . . . . . . . . . . . . . 132
6.7 Page fault handling costs reduction ratio ($\frac{T^{orig}_{segv} - T_{segv}}{T^{orig}_{segv}}$) comparison for
other NPB benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.8 Detailed Tsegv breakdown analysis of the IS Class A Benchmark
for the ReP techniques. Overall Tsegv stands for overall CLOMP
overhead. TMK Comm stands for the communication time spent
by TMK for data transfer. TMK local stands for the local software
overhead of TMK layer. ReP Comm stands for the communication
time spent on prefetching data. ReP local stands for the local
software overhead introduced by using the ReP prefetch techniques.
Tsegv is presented in seconds and its components are presented as a
ratio to the overall Tsegv . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.9 Detailed Tsegv breakdown analysis of the IS Class C Benchmark
for the ReP techniques. Overall Tsegv stands for overall CLOMP
overhead. TMK Comm stands for the communication time spent
by TMK for data transfer. TMK local stands for the local software
overhead of TMK layer. ReP Comm stands for the communication
time spent on prefetching data. ReP local stands for the local
software overhead introduced by using the ReP prefetch techniques.
Tsegv is presented in seconds and its components are presented as a
ratio to the overall Tsegv . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.10 Sequential elapsed time for LINPACK benchmarks . . . . . . . . . . 138
6.11 Page fault handling costs comparison for LINPACK benchmarks
among the original CLOMP, the theoretical and the ReP techniques
enhanced CLOMP. The computation part of elapsed time is common
to all compared items. The page fault handling cost of the original
CLOMP is presented in seconds, and those of the others are presented as a
reduction ratio (e.g. $\frac{\text{Orig} - \text{TReP}}{\text{Orig}}$). . . . . . . . . . . . . . . . . . . . . . . 139
6.12 Page faults handling cost comparison between DReP and the ori-
ginal CLOMP for the optimized LINPACK benchmark with multiple
threads per process. SEGV represents the ratio of page faults
handling cost to the corresponding elapsed time; SEGV Lock in
turn represents a ratio of pthread mutex cost within SEGV. . . . . . 142
B.1 Tsegv,local (sec) for some NPB-OMP benchmarks with different num-
ber of processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
B.2 $N_f^{total}$ for some NPB-OMP benchmarks with different number of
processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
B.3 $T_{segv}^{total}$ (sec) for LINPACK benchmarks with different number of
processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
B.4 $N_f^{total}$ for LINPACK benchmarks with different number of processes. 167
C.1 Elapsed Time (sec) of some NPB-OMP Benchmarks on one thread. . 170
E.1 Complete bandwidth and latency measured by the communication
layer (CAL) of CLOMP on XE. . . . . . . . . . . . . . . . . . . . . . . . 194
E.2 Comparison of CAL and OpenMPI: bandwidth and latency measured
on XE via GigE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
E.3 Comparison of CAL and OpenMPI: bandwidth and latency measured
on XE via DDR IB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195