%
% $HeadURL$
% $Id$
%
% Copyright ((c)) 2002-2022, Rice University
% All rights reserved.
% See file LICENSE for details.
%
% ***************************************************************************
%
% ***************************************************************************
\documentclass[11pt,twoside,letterpaper]{report}
\setcounter{topnumber}{4}
\setcounter{totalnumber}{4}
\renewcommand{\floatpagefraction}{0.7}
% ***************************************************************************
% Standard packages
% ***************************************************************************
\usepackage{fixltx2e}
%\usepackage{fixpdftex}
% ==========================================================
% formatting, graphics, tables, figures, etc.
% ==========================================================
\usepackage[toc]{appendix}
%\usepackage{geometry}
\usepackage{comment}
\usepackage{fullpage}
\usepackage{indentfirst}
\usepackage[bf,normalsize]{caption}
\usepackage{subcaption}
\usepackage{setspace} % setspace (better than doublespace)
\usepackage{cite}
\usepackage{verbatim,moreverb,fancyvrb}
\usepackage{listings}
\lstset{%
%basicstyle=\small, % print whole listing small
%keywordstyle=\color{black}\bfseries\underbar,
language=C++,
columns=fullflexible,
numbers=left, numberstyle=\scriptsize, stepnumber=1, numbersep=5pt, %\tiny
escapeinside={(@}{@)}
}
\usepackage[table]{xcolor}
\definecolor{clr:bluegrey1}{HTML}{F1F5FA}
\definecolor{clr:bluegrey2}{HTML}{ECF3FE}
% Generally load hyperref last, unless otherwise specified
\usepackage[breaklinks=true]{hyperref} %bookmarksopen,bookmarksnumbered
%\hypersetup{
% colorlinks=false,%
% pdfborder = 0 0 0
%}
% Cf. http://www.nersc.no/~knutal/latex_tips.html
% To use epstopdf: pdflatex --shell-escape <*.tex>
\usepackage{ifpdf}
\ifpdf
\usepackage[pdftex]{graphicx}
\usepackage{epstopdf}
\usepackage{pdfpages}
\else
\usepackage[dvips]{graphicx}
\usepackage{breakurl} % fix hyperref
\fi
% ==========================================================
% symbols, etc.
% ==========================================================
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
% ***************************************************************************
% Customizations
% ***************************************************************************
\specialcomment{BookVDM}{}{}
\excludecomment{BookVDM}
\setlength{\textwidth}{6.0 in}
\setlength{\oddsidemargin}{0.5 in}
\setlength{\evensidemargin}{0.5 in} % 0.0 for twoside
\clubpenalty=10000
\widowpenalty=10000
% Sanitize placement of figures
\renewcommand{\topfraction}{0.85}
\renewcommand{\textfraction}{0.1}
\renewcommand{\floatpagefraction}{0.75}
\input{myconfig}
% ***************************************************************************
% Document
% ***************************************************************************
\begin{document}
% ***************************************************************************
% ***************************************************************************
\title{\HPCToolkit{} User's Manual\\[.5in]Version 2024.01}
%\subtitle{}
\author{John Mellor-Crummey,\\
Laksono Adhianto,
Jonathon Anderson,
Mike Fagan,
Dragana Grbic,
Marty Itzkowitz,\\
Mark Krentel,
Xiaozhu Meng,
Nathan Tallent,
Keren Zhou\\
\\
\\
Rice University\\
}
\date{March 2024}
\maketitle
% ***************************************************************************
% ***************************************************************************
\pagenumbering{roman}
\setcounter{page}{1}
%\cleardoublepage
%\chapter*{Preface}
% ***************************************************************************
% ***************************************************************************
%\chapter*{Acknowledgements}
% ***************************************************************************
% ***************************************************************************
\
\begin{singlespace}
% \newpage
\pagestyle{empty}
\thispagestyle{empty}
\tableofcontents
% \newpage
% \pagestyle{empty}
% \thispagestyle{empty}
% \listoffigures
% \newpage
% \pagestyle{empty}
% \thispagestyle{empty}
% \listofalgorithms
\end{singlespace}
% ***************************************************************************
% ***************************************************************************
%\newpage
% \cleardoublepage
\pagestyle{plain}
\pagenumbering{arabic}
\chapter{Introduction}
\setcounter{page}{1}
\HPCToolkit{}~\cite{Adhianto-etal:2010:CPE-hpctoolkit,hpctoolkit-www} is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the world's largest supercomputers.
\HPCToolkit{} provides accurate measurements of a program's work, resource consumption, and inefficiency, correlates these metrics with the program's source code, works with multilingual, fully optimized binaries, has low measurement overhead, and scales to large parallel systems.
\HPCToolkit{}'s measurements provide support for analyzing a program's execution cost, inefficiency, and scaling characteristics both within and across nodes of a parallel system.
\HPCToolkit{} principally monitors an execution of a multithreaded and/or multiprocess program using asynchronous sampling, unwinding thread call stacks, and attributing the metric value associated with a sample event in a thread to the calling context of the thread/process in which the event occurred. \HPCToolkit{}'s asynchronous sampling is typically triggered by the expiration of a Linux timer or a hardware performance monitoring unit event, such as reaching a threshold value for a hardware performance counter.
Sampling has several advantages over instrumentation for measuring program performance: it requires no modification of source code, it avoids potential blind spots (such as code available in only binary form), and it has lower overhead.
\HPCToolkit{} typically adds measurement overhead of only a few percent to an execution for reasonable sampling rates~\cite{Tallent-MC-Fagan:2009:PLDI-hpctoolkit-binary-analysis}.
Sampling enables fine-grain measurement and attribution of costs in both serial and parallel programs.
For parallel programs, one can use HPCToolkit to measure
the fraction of time threads are idle, working, or communicating.
To obtain detailed information about a program's computation
performance, one can collect samples using a processor's built-in performance monitoring
units to measure metrics such as
operation counts, pipeline stalls, cache misses, and data movement
between processor sockets. Such detailed measurements are essential
to understand the performance characteristics of applications
on modern multicore microprocessors that employ instruction-level
parallelism, out-of-order execution, and complex memory hierarchies.
With \HPCToolkit{}, one can also easily compute derived metrics such as cycles
per instruction, waste, and relative efficiency to provide insight
into a program's shortcomings.
\begin{figure}[t]
\centering{\includegraphics[width=.8\textwidth]{fig/hpctoolkit-code-centric}}
\caption{A code-centric view of an execution of the University of Chicago's FLASH code executing on 8192 cores of a Blue Gene/P. This bottom-up view shows that 16\% of the execution time was spent in IBM's DCMF messaging layer. By tracking these costs up the call chain, we can see that most of this time was spent on behalf of calls to {\tt pmpi\_allreduce} on line 419 of {\tt amr\_comm\_setup}.}
\label{fig:code-centric}
\end{figure}
A unique capability of \HPCToolkit{} is its ability to unwind the call stack of a thread executing highly optimized code to attribute time, hardware counter metrics, as well as software metrics (e.g., context switches) to a full calling context.
Call stack unwinding is often difficult for highly optimized code~\cite{Tallent-MC-Fagan:2009:PLDI-hpctoolkit-binary-analysis}. For accurate call stack unwinding, HPCToolkit employs two strategies:
interpreting compiler-recorded information in DWARF Frame Descriptor Entries (FDEs) and binary analysis
to compute unwind recipes directly from an application's machine instructions.
On ARM processors, HPCToolkit uses {\tt libunwind} exclusively. On Power processors, HPCToolkit uses
binary analysis exclusively.
On x86\_64 processors, HPCToolkit employs both strategies in an integrated fashion.
\begin{figure}[t]
\centering{\includegraphics[width=.8\textwidth]{fig/hpctoolkit-thread-centric}}
\caption{A thread-centric view of the performance of a parallel radix sort application executing on 960 cores of a Cray XE6. The bottom pane shows a calling context for {\tt usort} in the execution. The top pane shows a graph of how much time each thread spent executing calls to {\tt usort} from the highlighted context. On a Cray XE6, there is one MPI helper thread for each compute node in the system; these helper threads spent no time executing {\tt usort}. The graph shows that some of the MPI ranks spent twice as much time in {\tt usort} as others. This happens because the radix sort divides up the work into 1024 buckets. In an execution on 960 cores, 896 cores work on one bucket and 64 cores work on two. The middle pane shows an alternate view of the thread-centric data as a histogram.}
\label{fig:thread-centric}
\end{figure}
\HPCToolkit{} assembles performance measurements into a call path profile that associates the costs of each function call with its full calling context.
In addition, \HPCToolkit{} uses binary analysis to attribute program performance metrics with detailed precision -- full dynamic calling contexts augmented with information about call sites, inlined functions and templates, loops, and source lines.
Measurements can be analyzed in a variety of ways: top-down in a calling context tree, which associates costs with the full calling context in which they are incurred; bottom-up in a view that apportions costs associated with a function to each of the contexts in which the function is called; and in a flat view that aggregates all costs associated with a function independent of calling context.
This multiplicity of code-centric perspectives is essential to understanding a program's performance for tuning under various circumstances.
\HPCToolkit{} also supports a thread-centric perspective, which enables one to see how a performance metric for a calling context differs across threads, and a time-centric perspective, which enables a user to see how an execution unfolds over time. Figures~\ref{fig:code-centric}--\ref{fig:time-centric} show samples of HPCToolkit's code-centric, thread-centric, and time-centric views.
\begin{figure}[t]
\centering{\includegraphics[width=.8\textwidth]{fig/hpctoolkit-time-centric}}
\caption{A time-centric view of part of an execution of the University of Chicago's FLASH code on 256 cores of a Blue Gene/P. The figure shows a detail from the end of the initialization phase and part of the first iteration of the solve phase. The largest pane in the figure shows the activity of cores 2--95 in the execution during a time interval ranging from 69.376s to 85.58s. Time lines for threads are arranged from top to bottom and time flows from left to right. The color at any point in time for a thread indicates the procedure that the thread is executing at that time. The right pane shows the full call stack of thread 85 at 84.82s into the execution, corresponding to the selection shown by the white crosshair; the outermost procedure frame of the call stack is shown at the top of the pane and the innermost frame is shown at the bottom. This view highlights that even though FLASH is an SPMD program, the behavior of threads over time can be quite different. The purple region highlighted by the cursor, which represents a call by all processors to {\tt mpi\_allreduce}, shows that the time spent in this call varies across the processors. The variation in time spent waiting in {\tt mpi\_allreduce} is readily explained by an imbalance in the time processes spend in a prior prolongation step, shown in yellow. Further left in the figure, one can see differences among ranks executing on different cores in each node as they await the completion of an {\tt mpi\_allreduce}. A rank executing on one core of each node waits in {\tt DCMF\_Messager\_advance} (which appears as blue stripes) while ranks executing on other cores in each node wait in a helper function (shown in green). In this phase, ranks await the delayed arrival of a few of their peers who have extra work to do inside {\tt simulation\_initblock} before they call {\tt mpi\_allreduce}.}
\label{fig:time-centric}
\end{figure}
By working at the machine-code level, \HPCToolkit{} accurately measures and attributes costs in executions of multilingual programs, even if they are linked with libraries available only in binary form.
\HPCToolkit{} supports performance analysis of fully optimized code.
It measures and attributes performance metrics to shared libraries that are dynamically loaded at run time.
The low overhead of \HPCToolkit{}'s sampling-based measurement is particularly important
for parallel programs because measurement overhead can distort program behavior.
\HPCToolkit{} is also especially good at pinpointing scaling losses in parallel codes, both within multicore nodes and across the nodes in a parallel system.
Using differential analysis of call path profiles collected on different numbers of threads or processes enables one to quantify scalability losses and pinpoint their causes to individual lines of code executed in particular calling contexts~\cite{Coarfa-MC:2007:ICS-scalability}.
We have used this technique to quantify scaling losses in leading science applications across thousands of processor cores on Cray and IBM Blue Gene systems, associate them with individual lines of source code in full calling context~\cite{Tallent-MC-etal:2009:SC-hpctoolkit-petascale,Tallent-MC-etal:2010:SC-hpctoolkit-load-imbalance}, and quantify scaling losses in science applications within compute nodes at the loop nest level due to competition for memory bandwidth in multicore processors~\cite{Tallent-etal:2008:SciDAC-hpctoolkit}.
We have also developed techniques for efficiently attributing the idleness in one thread to its cause in another thread~\cite{Tallent-MC:2009:PPoPP-hpctoolkit-work-stealing,Tallent-MC-Porterfield:2010:PPoPP-hpctoolkit-lock-contention}.
\HPCToolkit{} is deployed on many DOE supercomputers, including
the Sierra supercomputer (IBM Power9 + NVIDIA V100 GPUs) at Lawrence Livermore National Laboratory;
Cray XC40 systems at Argonne's Leadership Computing Facility and the National Energy
Research Scientific Computing Center; and the Summit supercomputer (IBM Power9 + NVIDIA V100 GPUs) at the Oak Ridge Leadership Computing Facility,
as well as other clusters and supercomputers based on x86\_64, Power, and ARM processors.
% ***************************************************************************
% ***************************************************************************
\cleardoublepage
\chapter{\HPCToolkit{} Overview}
\begin{figure}[t]
\centering{\includegraphics[width=.8\textwidth]{fig/hpctoolkit-gpu-workflow}}
\caption{Overview of \HPCToolkit{}'s tool work flow.}
\label{fig:hpctoolkit-overview:a}
\end{figure}
\HPCToolkit{}'s work flow is organized around four principal capabilities, as shown in Figure~\ref{fig:hpctoolkit-overview:a}:
\begin{enumerate}
\item \emph{measurement} of context-sensitive performance metrics using call-stack unwinding
while an application executes;
\item \emph{binary analysis} to recover program structure from the application binary and the shared libraries
and GPU binaries used in the run;
\item \emph{attribution} of performance metrics by correlating dynamic performance metrics with static program structure; and
\item \emph{presentation} of performance metrics and associated source code.
\end{enumerate}
To use \HPCToolkit{} to measure and analyze an application's performance, one first compiles and links the application for a production run, using \emph{full} optimization and including debugging symbols.%
\footnote{%
For the most detailed attribution of application performance data using \HPCToolkit{}, one should ensure that the compiler includes line map information in the object code it generates. While \HPCToolkit{} does not need this information to function, it can be helpful to users trying to interpret the results. Since compilers can usually provide line map information for fully optimized code, satisfying this requirement need not involve a special build process. For instance, with the Intel compiler we recommend using \texttt{-g -debug inline\_debug\_info}.}
Second, one launches an application with \HPCToolkit{}'s measurement tool, \hpcrun{}, which uses statistical sampling to collect a performance profile.
Third, one invokes \hpcstruct{}, \HPCToolkit{}'s tool for analyzing an application binary and any shared objects and GPU binaries
it used in the data collection run, as stored in the measurements directory. It recovers
information about source files, procedures, loops, and inlined code.
Fourth, one uses \hpcprof{} to combine information about an application's structure with dynamic performance measurements to produce a performance database.
Finally, one explores a performance database with \HPCToolkit{}'s \hpcviewer{} and/or \hpctraceviewer{} graphical presentation tools.
The rest of this chapter briefly discusses unique aspects of \HPCToolkit{}'s measurement, analysis and presentation capabilities.
\section{Asynchronous Sampling and Call Path Profiling}
Without accurate measurement, performance analysis results may be of questionable value.
As a result, a principal focus of work on \HPCToolkit{} has been the design and implementation of techniques to provide accurate fine-grain measurements of production applications running at scale.
For tools to be useful on production applications on large-scale parallel systems, large measurement overhead is unacceptable.
For measurements to be accurate, performance tools must avoid introducing measurement error.
Both source-level and binary instrumentation can distort application performance through a variety of mechanisms~\cite{Mytkowicz:2009:PWD:2528521.1508275}.
Frequent calls to small instrumented procedures can lead to considerable measurement overhead.
Furthermore, source-level instrumentation can distort application performance by interfering with inlining and template optimization.
To avoid these effects, many instrumentation-based tools intentionally refrain from instrumenting certain procedures.
Ironically, the more this approach reduces overhead, the more it introduces \emph{blind spots}, \ie{}, intervals of unmonitored execution.
For example, a common selective instrumentation technique is to ignore small frequently executed procedures --- but these may be just the thread synchronization library routines that are critical.
Sometimes, a tool unintentionally introduces a blind spot.
A typical example is that source code instrumentation suffers from blind spots when source code is unavailable, a common condition for math and communication libraries.
To avoid these problems, \HPCToolkit{} eschews instrumentation and favors the use of \emph{asynchronous sampling} to measure and attribute performance metrics.
During a program execution, sample events are triggered by periodic interrupts induced by an interval timer or overflow of hardware performance counters.
One can sample metrics that reflect work (\eg{}, instructions, floating-point operations), consumption of resources (\eg{}, cycles, bandwidth consumed in the memory hierarchy by data transfers in response to cache misses), or inefficiency (\eg{}, stall cycles).
For reasonable sampling frequencies, the overhead and distortion introduced by sampling-based measurement is typically much lower than that introduced by instrumentation~\cite{Froyd-MC-Fo:2005:ICS-csprof}.
For all but the most trivially structured programs, it is important to associate the costs incurred by each procedure with the contexts in which the procedure is called.
Knowing the context in which each cost is incurred is essential for understanding why the code performs as it does.
This is particularly important for code based on application frameworks and libraries.
For instance, costs incurred for calls to communication primitives (\eg{}, \mytt{MPI_Wait}) or code that results from instantiating C++ templates for data structures can vary widely depending how they are used in a particular context.
Because there are often layered implementations within applications and libraries, it is insufficient either to insert instrumentation at any one level or to distinguish costs based only upon the immediate caller.
For this reason, \HPCToolkit{} uses call path profiling to attribute costs to the full calling contexts in which they are incurred.
\HPCToolkit{}'s \hpcrun{} call path profiler uses call stack unwinding to attribute execution costs of optimized executables to the full calling context in which they occur.
Unlike other tools, to support asynchronous call stack unwinding during execution of optimized code, \hpcrun{} uses on-line binary analysis to locate procedure bounds and compute an unwind recipe for each code range within each procedure~\cite{Tallent-MC-Fagan:2009:PLDI-hpctoolkit-binary-analysis}.
These analyses enable \hpcrun{} to unwind call stacks for optimized code with little or no information other than an application's machine code.
\begin{comment}
To attribute performance back to source code, \HPCToolkit{} combines a call path profile with information gleaned
through post-mortem analysis of an application's object code and its debugging sections.
This post-mortem analysis of an executable recovers its program structure and reconstructs a mapping from
instructions back to source lines, loops, inlined functions, and procedures.
\HPCToolkit{}'s ability to attribute costs to dynamic call paths, including loops and inlined functions,
for optimized code without a special-purpose compiler is unique.
\end{comment}
The output of a run with \hpcrun{} is a \emph{measurements directory} containing the data, and the information necessary
to recover the names of all shared libraries and GPU binaries.
\section{Recovering Static Program Structure}
To enable effective analysis, call path profiles for executions of optimized programs must be correlated
with important source code abstractions.
Since measurements refer only to instruction addresses within the running application,
it is necessary to map measurements back to the program source.
The mappings include those of the application and any shared libraries referenced during the
run, as well as those for any GPU binaries executed on GPUs during the run.
To associate measurement data with the static structure of fully-optimized executables,
we need a mapping between object code and its associated source code structure.\footnote{This object to source code mapping should be contrasted with the binary's line map, which
(if present) is typically fundamentally line based.}
\HPCToolkit{} constructs this mapping using binary analysis; we call this process
\emph{recovering program structure}~\cite{Tallent-MC-Fagan:2009:PLDI-hpctoolkit-binary-analysis}.
\HPCToolkit{} focuses its efforts on recovering source files, procedures, inlined functions and templates, as well as
loop nests as the most important elements of source code structure.
To recover program structure, \HPCToolkit's \hpcstruct{} utility parses a binary's machine instructions,
reconstructs a control flow graph, combines line map and DWARF information about inlining with interval
analysis on the control flow graph in a way that enables it to relate machine code after optimization
back to the original source.
One important benefit accrues from this approach.
\HPCToolkit{} can expose the structure of and assign metrics to the code that is actually executed, \emph{even if source code is unavailable}.
For example, \hpcstruct{}'s program structure naturally reveals transformations such as loop fusion and scalarization
loops that arise from compilation of Fortran 90 array notation.
Similarly, it exposes calls to compiler support routines and wait loops in communication libraries of which one would otherwise be unaware.
\section{Reducing Performance Measurements}
\HPCToolkit{} combines (post-mortem) the recovered static program structure with dynamic call paths to expose inlined frames and loop nests.
This enables us to attribute the performance of samples in their full static and dynamic context and correlate it with source code.
The data reduction is done by \HPCToolkit's \hpcprof{} utility, invoked on the \emph{measurements directory} recorded by \hpcrun{} and augmented with program structure information by \hpcstruct{}.
From the measurements and structure, \hpcprof{} generates a \emph{database directory} containing performance data presentable by \hpcviewer{}.
In most cases \hpcprof{} is able to complete the reduction in a matter of minutes; however, for especially large experiments (more than about 100,000 threads or GPU streams~\cite{10.1145/3524059.3532397}), its multi-node sibling \hpcprofmpi{} may be substantially faster.
\hpcprofmpi{} is an MPI application identical to \hpcprof{}, except that it additionally can exploit multiple compute nodes during the reduction.
In our experience, exploiting 8-10 compute nodes via \hpcprofmpi{} can be as much as $5\times$ faster than \hpcprof{} for sufficiently large experiments.
\section{Presenting Performance Measurements}
To enable an analyst to rapidly pinpoint and quantify performance bottlenecks, tools must present the performance measurements in a way that engages the analyst, focuses attention on what is important, and automates common analysis subtasks to reduce the mental effort and frustration of sifting through a sea of measurement details.
To enable rapid analysis of an execution's performance bottlenecks, we have carefully designed \hpcviewer{},
a code-centric presentation tool~\cite{Adhianto-MC-Ta:2010:PSTI-hpcviewer} that also includes a
time-centric tab~\cite{Tallent-MC-etal:2011:ICS-hpctoolkit-scalable-tracing}.
\hpcviewer{} combines a relatively small set of complementary presentation techniques that, taken together, rapidly focus an analyst's attention on performance bottlenecks rather than on unimportant information.
To facilitate the goal of rapidly focusing an analyst's attention on performance bottlenecks, \hpcviewer{}
extends several existing presentation techniques.
In particular, \hpcviewer{} (1) synthesizes and presents three complementary views of calling-context-sensitive metrics;
(2) treats a procedure's static structure as first-class information with respect to both performance metrics
and constructing views; (3) enables a large variety of user-defined metrics to describe performance inefficiency;
and (4) automatically expands hot paths based on arbitrary performance metrics --- through calling contexts and static structure --- to rapidly highlight important performance data.
The trace tab enables an application developer to visualize how a parallel execution unfolds over time.
This view facilitates identification of important inefficiencies such as serialization and load imbalance, among others.
% ***************************************************************************
% ***************************************************************************
\cleardoublepage
\chapter{Quick Start}
\label{chpt:quickstart}
This chapter provides a rapid overview of analyzing the performance of an application using \HPCToolkit{}.
It assumes an operational installation of \HPCToolkit{}.
% ===========================================================================
% ===========================================================================
\section{Guided Tour}
\label{chpt:quickstart:tour}
\begin{figure}[t]
\centering{\includegraphics[width=.8\textwidth]{fig/hpctoolkit-gpu-workflow}}
\caption{Overview of \HPCToolkit{}'s tool work flow.}
\label{fig:hpctoolkit-overview:b}
\end{figure}
\HPCToolkit{}'s work flow is summarized in Figure~\ref{fig:hpctoolkit-overview:b} (on page~\pageref{fig:hpctoolkit-overview:b}) and is organized around four principal capabilities:
\begin{enumerate}
\item \emph{measurement} of context-sensitive performance metrics while an application executes;
\item \emph{binary analysis} to recover program structure from CPU and GPU binaries;
\item \emph{attribution} of performance metrics by correlating dynamic performance metrics with static program structure; and
\item \emph{presentation} of performance metrics and associated source code.
\end{enumerate}
To use \HPCToolkit{} to measure and analyze an application's performance, one first compiles and links the application for a production run, using \emph{full} optimization.
Second, one launches an application with \HPCToolkit{}'s measurement tool, \hpcrun{}, which uses statistical sampling to collect a performance profile.
Third, one applies \hpcstruct{} to an application's measurement directory to recover program structure information from any CPU or GPU binary that was measured.
Program structure, which includes information about files, procedures, inlined code, and loops, is used to relate performance measurements to source code.
Fourth, one uses \hpcprof{} to combine information about an application's structure with dynamic performance measurements to produce a performance database.
Finally, one explores a performance database with \HPCToolkit{}'s graphical user interface: \hpcviewer{} which presents
both a code-centric analysis of performance metrics and a time-centric (trace-based) analysis of an execution.
The following subsections explain \HPCToolkit{}'s work flow in more detail.
% ==========================================================
% ==========================================================
\subsection{Compiling an Application}
For the most detailed attribution of application performance data using \HPCToolkit{}, one should compile so as to include line map information in the generated object code.
This usually means compiling with options similar to `\texttt{-g -O3}'. Check your compiler's documentation for information about the right set of options to have the compiler record information about inlining and the mapping of machine instructions to source lines. We advise picking options that indicate they will record information that relates machine instructions to source code without compromising optimization. For instance, with the Portland Group (PGI) compilers, use \texttt{-gopt} in place of \texttt{-g} to collect this information without interfering with optimization.
While \HPCToolkit{} does not need information about the mapping between machine instructions and source code to function,
having such information included in the binary code by the compiler can be helpful to users trying to interpret performance measurements.
Since compilers can usually provide information about line mappings and inlining for fully-optimized code,
this requirement usually involves a one-time trivial adjustment to an application's build scripts
to provide a better experience with tools. Such mapping information enables tools such as \HPCToolkit{},
race detectors, and memory analysis tools to attribute information more precisely.
For statically linked executables, such as those often used on Cray supercomputers, the final link step is done with \hpclink{}.
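As an illustrative sketch (the compiler wrapper and source file names here are hypothetical), the build commands for the dynamically and statically linked cases might look like:
\begin{quote}
\begin{verbatim}
mpicc -g -O3 -o app app.c util.c          # dynamically linked build
hpclink mpicc -g -O3 -o app app.o util.o  # statically linked: final link step
\end{verbatim}
\end{quote}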
% ==========================================================
% ==========================================================
\subsection{Measuring Application Performance}
\label{chpt:quickstart:tour:measurement}
Measurement of application performance takes two different forms depending on whether your application is dynamically or statically linked.
To monitor a dynamically linked application, simply use \hpcrun{} to launch the application.
To monitor a statically linked application, the data to be collected is specified by environment variables.
In either case, the application may be sequential, multithreaded or based on MPI.
The commands below give examples for an application named \texttt{app}.
%
\begin{itemize}
\item Dynamically linked applications:\hfill
Simply launch your application with \hpcrun{}:
\begin{quote}
\verb|[<mpi-launcher>] hpcrun [hpcrun-options] app [app-arguments]|
\end{quote}
Of course, \texttt{<mpi-launcher>} is only needed for MPI programs; it is typically a program such as \texttt{mpiexec} or \texttt{mpirun}, or a workload manager utility such as Slurm's {\tt srun} or IBM's Job Step Manager utility {\tt jsrun}.
\item Statically linked applications:\hfill
First, link \hpcrun{}'s monitoring code into \texttt{app}, using \hpclink{}:
\begin{quote}
\verb|hpclink <linker> -o app <linker-arguments>|
\end{quote}
Then monitor \texttt{app} by passing \hpcrun{} options through environment variables.
For instance:
\begin{quote}
\begin{verbatim}
export HPCRUN_EVENT_LIST="CYCLES"
[<mpi-launcher>] app [app-arguments]
\end{verbatim}
\end{quote}
\hpclink{}'s \mytt{--help} option gives a list of environment variables that affect monitoring.
See Chapter~\ref{chpt:statically-linked-apps} for more information.
\end{itemize}
%
Any of these commands will produce a measurements directory that contains separate measurement information for each MPI rank and thread in the application.
The directory is named according to the form:
\begin{quote}
\verb|hpctoolkit-app-measurements[-<jobid>]|
\end{quote}
If the application \texttt{app} is run under control of a recognized batch job scheduler (such as Slurm, Cobalt, or IBM's Job Manager), the name of the measurements directory will contain the corresponding job identifier \texttt{<jobid>}.
Currently, the measurements directory contains measurement files for each thread that are named using the following templates:
\begin{quote}
\verb|app-<mpi-rank>-<thread-id>-<host-id>-<process-id>.<generation-id>.hpcrun|
\verb|app-<mpi-rank>-<thread-id>-<host-id>-<process-id>.<generation-id>.hpctrace|
\end{quote}
\subsubsection{Specifying CPU Sample Sources}
\HPCToolkit{} primarily monitors an application using asynchronous sampling.
Consequently, the most common option to \hpcrun{} is a list of sample sources that define how samples are generated.
A sample source takes the form of an event name $e$ and a specifier \texttt{howoften}, written as \texttt{$e$@howoften}. The specifier \texttt{howoften} may
be a number indicating a period (\eg{} \mytt{CYCLES@4000001}), or \texttt{f} followed by a number (\eg{} \mytt{CYCLES@f200}) indicating a frequency in samples/second.
For a sample source with event $e$ and period $p$, after every $p$ instances of $e$, a sample is generated that causes \hpcrun{} to inspect and record information about the monitored application.
To configure \hpcrun{} with two sample sources, \texttt{$e_1$@howoften$_1$} and \texttt{$e_2$@howoften$_2$}, use the following options:
\begin{quote}
\texttt{--event $e_1$@howoften$_1$ --event $e_2$@howoften$_2$}
\end{quote}
To use the same sample sources with an \hpclink{}-ed application, use a command similar to:
\begin{quote}
\texttt{export HPCRUN\_EVENT\_LIST="$e_1$@howoften$_1$ $e_2$@howoften$_2$"}
\end{quote}
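For example, assuming the PAPI presets \mytt{PAPI_TOT_CYC} and \mytt{PAPI_TOT_INS} are available on the machine (these are the events used for the cycles-per-instruction example in Chapter~\ref{chpt:effective-performance-analysis}), a dynamically linked application could be sampled on both events as follows:
\begin{quote}
\begin{verbatim}
hpcrun --event PAPI_TOT_CYC@4000001 --event PAPI_TOT_INS@4000001 \
    app [app-arguments]
\end{verbatim}
\end{quote}
For an \hpclink{}-ed executable, the same two sample sources would instead be listed, separated by a space, in the \texttt{HPCRUN\_EVENT\_LIST} environment variable before launching the application.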
\subsubsection{Measuring GPU Computations}
One can simply profile and optionally trace computations offloaded onto AMD, Intel, and NVIDIA GPUs by using one of the following event specifiers:
\begin{itemize}
\item {\tt -e gpu=nvidia} is used with CUDA and OpenMP on NVIDIA GPUs
\item {\tt -e gpu=amd} is used with HIP and OpenMP on AMD GPUs
\item {\tt -e gpu=level0} is used with Intel's Level Zero runtime for Data Parallel C++ and OpenMP
\item {\tt -e gpu=opencl} can be used on any of the GPU platforms.
\end{itemize}
Adding a {\tt -t} to \hpcrun{}'s command line when profiling GPU computations will trace them as well.
For more information about how to use PC sampling (NVIDIA GPUs only) or binary instrumentation (Intel GPUs) for instruction-level performance measurement of GPU kernels, see Chapter~\ref{chpt:gpu}.
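For example, to both profile and trace a program that offloads computation onto NVIDIA GPUs, one might launch it as follows:
\begin{quote}
\verb|[<mpi-launcher>] hpcrun -e gpu=nvidia -t app [app-arguments]|
\end{quote}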
% ==========================================================
% ==========================================================
\subsection{Recovering Program Structure}
Typically, \hpcstruct{} is launched without any options, with an argument that is a \HPCToolkit{} \emph{measurements directory}.
\hpcstruct{} identifies the application as well as any shared libraries and GPU binaries it invokes.
It processes each of them and records information about its program structure in the \emph{measurements directory}.
Program structure for a binary includes information about its source files, procedures, inlined code, loop nests, and statements.
When applied to a measurements directory, \hpcstruct{} analyzes multiple binaries concurrently by default.
It analyzes each small binary using a few threads and each large binary using more threads.
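For example, to recover program structure for all CPU and GPU binaries used in the run recorded in a measurements directory named \texttt{hpctoolkit-app-measurements}, one would run:
\begin{quote}
\verb|hpcstruct hpctoolkit-app-measurements|
\end{quote}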
Although not usually necessary, one can apply \hpcstruct{} to recover program structure information for a single CPU or GPU binary.
To recover static program structure for a single binary \texttt{b}, use the command:
\begin{quote}
\verb|hpcstruct b|
\end{quote}
This command analyzes the binary and saves this information in a file named \texttt{b.hpcstruct}.
% ==========================================================
% ==========================================================
\subsection{Analyzing Measurements \& Attributing Them to Source Code}
To analyze \HPCToolkit{}'s measurements and attribute them to the application's source code, use \hpcprof{}, typically invoked as follows:
\begin{quote}
\begin{verbatim}
hpcprof hpctoolkit-app-measurements
\end{verbatim}
\end{quote}
This command will produce an \HPCToolkit{} performance database with the name \texttt{hpctoolkit-app-database}.
If this database directory already exists, \hpcprof{} will form a unique name by appending a random hexadecimal qualifier.
\hpcprof{} performs this analysis in parallel using multithreading.
By default all available threads are used.
If this is not wanted (\eg{}, when sharing a single machine), the thread count can be specified with \texttt{-j <threads>}.
\hpcprof{} usually completes this analysis in a matter of minutes.
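For example, to limit the analysis to 16 threads when sharing a login node with others (the thread count here is only illustrative), one could run:
\begin{quote}
\verb|hpcprof -j 16 hpctoolkit-app-measurements|
\end{quote}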
For especially large experiments (applications using thousands of threads and/or GPU streams), the sibling \hpcprofmpi{} may produce results faster by exploiting additional compute nodes\footnote{
We recommend running \hpcprofmpi{} across 8-10 compute nodes. More than this may not improve or may degrade the overall speed of the analysis.
}.
Typically \hpcprofmpi{} is invoked as follows, using 8 ranks, one per compute node:
\begin{quote}
\begin{verbatim}
<mpi-launcher> -n 8 hpcprof-mpi hpctoolkit-app-measurements
\end{verbatim}
\end{quote}
Note that additional options may be needed to grant \hpcprofmpi{} access to all threads on each node; check the documentation for your scheduler and MPI implementation for details.
If possible, \hpcprof{} will copy the sources for the application and any libraries into the resulting database.
If the source code has been moved since the application was compiled, or is mounted at a different location than at compile time, the resulting database may be missing some important source files.
In these cases, the \texttt{-R/--replace-path} option may be specified to provide substitute paths based on prefixes.
For example, if the application was compiled from source at \texttt{/home/joe/app/src/} but it is mounted at \texttt{/extern/homes/joe/app/src/} when running \hpcprof{}, the source files can be made available by invoking \hpcprof{} as follows:
\begin{quote}
\begin{verbatim}
hpcprof -R '/home/joe/app/src/=/extern/homes/joe/app/src/' \
    hpctoolkit-app-measurements
\end{verbatim}
\end{quote}
Note that on systems where MPI applications are restricted to a scratch file system, it is the user's responsibility to copy any wanted source files and make them available to \hpcprof{}.
% ==========================================================
% ==========================================================
\subsection{Presenting Performance Measurements for Interactive Analysis}
To interactively view and analyze an \HPCToolkit{} performance database, use \hpcviewer{}.
\hpcviewer{} may be launched from the command line or by double-clicking on its icon on MacOS or Windows.
The following is an example of launching from a command line:
\begin{quote}
\verb|hpcviewer hpctoolkit-app-database|
\end{quote}
Additional help for \hpcviewer{} can be found in a help pane available from \hpcviewer{}'s \emph{Help} menu.
% ==========================================================
% ==========================================================
\subsection{Effective Performance Analysis Techniques}
To effectively analyze application performance, consider using one of the following strategies, which are described in more detail in Chapter~\ref{chpt:effective-performance-analysis}.
\begin{itemize}
\item
A waste metric, which represents the difference between achieved performance and potential peak performance, is a good way of understanding the potential for tuning the node performance of codes (Section~\ref{sec:effective-performance-analysis:inefficiencies}).
\hpcviewer{} supports synthesis of derived metrics to aid analysis.
Derived metrics are specified within \hpcviewer{} using spreadsheet-like formulas.
See the \hpcviewer{} help pane for details about how to specify derived metrics.
\item
Scalability bottlenecks in parallel codes can be pinpointed by differential analysis of two profiles with different degrees of parallelism (Section~\ref{sec:effective-performance-analysis:scalability}).
\end{itemize}
% ===========================================================================
% ===========================================================================
\section{Additional Guidance}
For additional information, consult the rest of this manual and other documentation:
First, we summarize the available documentation and command-line help:
\begin{description}
\item[Command-line help.]\hfill
Each of \HPCToolkit{}'s command-line tools can generate a help message summarizing the tool's usage, arguments and options.
To generate this help message, invoke the tool with \mytt{-h} or \mytt{--help}.
\item[Man pages.]\hfill
Man pages are available either via the Internet (\url{http://hpctoolkit.org/documentation.html}) or from a local \HPCToolkit{} installation (\mytt{<hpctoolkit-installation>/share/man}).
\item[Manuals.]\hfill
Manuals are available either via the Internet (\url{http://hpctoolkit.org/documentation.html}) or from a local \HPCToolkit{} installation (\mytt{<hpctoolkit-installation>/share/doc/hpctoolkit/documentation.html}).
\item[Articles and Papers.]\hfill
There are a number of articles and papers that describe various aspects of \HPCToolkit{}'s measurement, analysis, attribution and presentation technology.
They can be found at \url{http://hpctoolkit.org/publications.html}.
\end{description}
% ***************************************************************************
% ***************************************************************************
\cleardoublepage
\chapter{Effective Strategies for Analyzing Program Performance}
\label{chpt:effective-performance-analysis}
This chapter describes some proven strategies for using performance measurements to identify performance bottlenecks in both serial and parallel codes.
% ===========================================================================
% ===========================================================================
\section{Monitoring High-Latency Penalty Events}
\label{sec:effective-performance-analysis:penalty-events}
A very simple and often effective methodology is to profile with respect to cycles and high-latency penalty events.
If \HPCToolkit{} attributes a large number of penalty events with a particular source-code statement, there is an extremely high likelihood of significant exposed stalling.
This is true even though (1) modern out-of-order processors can overlap the stall latency of one instruction with nearby independent instructions and (2) some penalty events ``over count''.%
\footnote{For example, performance monitoring units often categorize a prefetch as a cache miss.}
If a source-code statement incurs a large number of penalty events and it also consumes a non-trivial amount of cycles, then this region of code is an opportunity for optimization.
Examples of good penalty events are last-level cache misses and TLB misses.
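As a sketch, assuming the PAPI presets \mytt{PAPI_L3_TCM} (last-level cache misses) and \mytt{PAPI_TLB_DM} (data TLB misses) are available and can be counted together with cycles on the machine at hand, one might profile as follows and then look for statements that rank high on both the penalty events and cycles:
\begin{quote}
\begin{verbatim}
hpcrun -e PAPI_TOT_CYC@4000001 -e PAPI_L3_TCM@100003 \
    -e PAPI_TLB_DM@100003 app [app-arguments]
\end{verbatim}
\end{quote}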
% ===========================================================================
% ===========================================================================
\section{Computing Derived Metrics}
\label{sec:effective-performance-analysis:derived-metrics}
Modern computer systems provide access to a rich set of hardware performance counters that can directly measure various aspects of a program's performance.
Counters in the processor core and memory hierarchy enable one to collect measures of work (\eg, operations performed), resource consumption (\eg, cycles), and inefficiency (\eg, stall cycles).
One can also measure time using system timers.
Values of individual metrics are of limited use by themselves.
For instance, knowing the count of cache misses for a loop or routine is of little value by itself; only when combined with other information such as the number of instructions executed or the total number of cache accesses does the data become informative.
While a developer might not mind using mental arithmetic to evaluate the relationship between a pair of metrics for a particular program scope (\eg, a loop or a procedure), doing this for many program scopes is exhausting.
To address this problem, \hpcviewer{} supports calculation of derived metrics.
\hpcviewer{} provides an interface that enables a user to specify spreadsheet-like formulas that can be used to calculate a derived metric for every program scope.
% For instance, if one wants to compute the cache miss rate in a scope, one could divide the total number of cache misses in a scope by the sum of counts of loads and stores in the scope. On the other hand, if one wanted to compute the fraction of a program's cache misses that occurred in a particular scope, one could divide the number of misses in the scope by the total number of misses in the program.
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/cycles-per-inst.png}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Computing a derived metric (cycles per instruction) in \hpcviewer{}.}
\label{fig:cycles-per-inst}
\end{figure}
Figure~\ref{fig:cycles-per-inst} shows how to use \hpcviewer{} to compute a \emph{cycles}/\emph{instruction} derived metric from measured metrics \mytt{PAPI_TOT_CYC} and \mytt{PAPI_TOT_INS}; these metrics correspond to {\em cycles} and {\em total instructions executed} measured with the PAPI hardware counter interface.
To compute a derived metric, one first depresses the button marked $f(x)$ above the metric pane; that will cause the pane for computing a derived metric to appear.
Next, one types in the formula for the metric of interest.
When specifying a formula, existing columns of metric data are referred to using a positional name \$$n$ to refer to the $n^{th}$ column, where the first column is written as \$0.
The metric pane shows the formula $\$1/\$3$.
Here, \$1 refers to the column of data representing the exclusive value for \mytt{PAPI_TOT_CYC} and \$3 refers to the column of data representing the exclusive value for \mytt{PAPI_TOT_INS}.%
\footnote{An {\em exclusive} metric for a scope refers to the quantity of the metric measured for that scope alone; an \emph{inclusive} metric for a scope represents the value measured for that scope as well as costs incurred by any functions it calls. In \hpcviewer{}, inclusive metric columns are marked with ``(I)'' and exclusive metric columns are marked with ``(E).''}
Positional names for metrics you use in your formula can be determined using the \emph{Metric} pull-down menu in the pane.
If you select your metric of choice using the pull-down, you can insert its positional name into the formula using the {\em insert metric} button, or you can simply type the positional name directly into the formula.
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/cycles-per-inst-2}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Displaying the new {\em cycles/instruction} derived metric in \hpcviewer{}.}
\label{fig:cycles-per-inst-2}
\end{figure}
At the bottom of the derived metric pane, one can specify a name for the new metric.
One also has the option to indicate that the derived metric column should report for each scope what percent of the total its quantity represents; for a metric that is a ratio, computing a percent of the total is not meaningful, so we leave the box unchecked.
After clicking the OK button, the derived metric pane will disappear and the new metric will appear as the rightmost column in the metric pane.
If the metric pane is already filled with other columns of metric, you may need to scroll right in the pane to see the new metric.
Alternatively, you can use the metric check-box pane (selected by depressing the button to the right of $f(x)$ above the metric pane) to hide some of the existing metrics so that there will be enough room on the screen to display the new metric.
Figure~\ref{fig:cycles-per-inst-2} shows the resulting \hpcviewer{} display after clicking OK to add the derived metric.
The following sections describe several types of derived metrics that are of particular use to gain insight into performance bottlenecks and opportunities for tuning.
% ===========================================================================
% ===========================================================================
\section{Pinpointing and Quantifying Inefficiencies}
\label{sec:effective-performance-analysis:inefficiencies}
While knowing where a program spends most of its time or executes most of its floating point operations may be interesting, such information may not suffice to identify the biggest targets of opportunity for improving program performance.
For program tuning, it is less important to know how many resources (\eg, time, instructions) were consumed in each program context than to know where resources were consumed {\em inefficiently}.
To identify performance problems, it might initially seem appealing to compute ratios to see how many events per cycle occur in each program context.
For instance, one might compute ratios such as FLOPs/cycle, instructions/cycle, or cache miss ratios.
However, using such ratios as a sorting key to identify inefficient program contexts can misdirect a user's attention.
There may be program contexts (\eg, loops) in which computation is terribly inefficient (\eg, with low operation counts per cycle); however, some or all of the least efficient contexts may not account for a significant amount of execution time.
Just because a loop is inefficient doesn't mean that it is important for tuning.
The best opportunities for tuning are where the aggregate performance losses are greatest.
For instance, consider a program with two loops.
The first loop might account for 90\% of the execution time and run at 50\% of peak performance.
The second loop might account for 10\% of the execution time, but only achieve 12\% of peak performance.
In this case, the total performance loss in the first loop accounts for 50\% of the first loop's execution time, which corresponds to 45\% of the total program execution time.
The 88\% performance loss in the second loop would account for only 8.8\% of the program's execution time.
In this case, tuning the first loop has a greater potential for improving the program performance even though the second loop is less efficient.
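In equation form, the fractions of total execution time lost in the two loops are
\[
0.90 \times (1 - 0.50) = 0.45
\qquad\mbox{versus}\qquad
0.10 \times (1 - 0.12) = 0.088,
\]
so the first loop wastes roughly five times as much of the total execution time as the second.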
A good way to focus on inefficiency directly is with a derived {\em waste} metric.
Fortunately, it is easy to compute such useful metrics.
However, there is no one {\em right} measure of waste for all codes.
Depending upon what one expects as the rate-limiting resource (\eg, floating-point computation, memory bandwidth, etc.), one can define an appropriate waste metric (\eg, FLOP opportunities missed, bandwidth not consumed) and sort by that.
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/fpwaste.png}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Computing a floating point waste metric in \hpcviewer{}.}
\label{fig:fpwaste}
\end{figure}
For instance, in a floating-point intensive code, one might consider keeping the floating point pipeline full as a metric of success.
One can directly quantify and pinpoint losses from failing to keep the floating point pipeline full {\em regardless of why this occurs}.
One can pinpoint and quantify losses of this nature by computing a {\em floating-point waste} metric, calculated as the difference between the number of calculations that could have been performed had the computation run at its peak rate and the number that actually were performed.
To compute the number of calculations that could have been completed in each scope, multiply the total number of cycles spent in the scope by the peak rate of operations per cycle.
Using \hpcviewer{}, one can specify a formula to compute such a derived metric and it will compute the value of the derived metric for every scope.
Figure~\ref{fig:fpwaste} shows the specification of this floating-point waste metric for a code.\footnote{Unfortunately, many recent processors have trouble counting floating-point operations accurately. If your processor can't accurately count floating-point operations, a floating-point waste metric will be less useful.}
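As a concrete sketch (the metric column numbers and the peak rate here are hypothetical; your database and processor will differ), if column 1 held total cycles, column 2 held measured floating-point operations, and the processor could complete 4 floating-point operations per cycle, the waste formula entered in \hpcviewer{} would take the form
\begin{quote}
{\tt 4 * \$1 - \$2}
\end{quote}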
Sorting by a waste metric will rank order scopes so that those with the greatest waste appear first.
Such scopes correspond directly to those that contain the greatest opportunities for improving overall program performance.
A waste metric will typically highlight
\begin{itemize}
\item loops in which a lot of time is spent computing relatively efficiently, yet the aggregate inefficiencies still accumulate into a significant loss,
\item loops in which less time is spent computing, but the computation is quite inefficient, and
\item scopes, such as copy loops, that contain no computation at all and therefore represent a complete waste according to a metric such as floating point waste.
\end{itemize}
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/fp-efficiency.png}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Computing floating point efficiency in percent using \hpcviewer{}.}
\label{fig:fpefficiency}
\end{figure}
Beyond identifying and quantifying opportunities for tuning with a waste metric, one can compute a companion derived metric, {\em relative efficiency}, to help understand how easy it might be to improve performance.
A scope running at very high efficiency will typically be much harder to tune than one running at low efficiency.
For our floating-point waste metric, one can compute a companion floating point efficiency metric by dividing measured FLOPs by potential peak FLOPs and multiplying the quotient by 100.
Figure~\ref{fig:fpefficiency} shows the specification of this floating-point efficiency metric for a code.
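Continuing the hypothetical column assignments used above (column 1 for cycles, column 2 for measured FLOPs, and a peak rate of 4 FLOPs per cycle), the corresponding efficiency formula would take a form along the lines of
\begin{quote}
{\tt 100 * \$2 / (4 * \$1)}
\end{quote}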
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/fp-efficiency-loop.png}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Using floating point waste and the percent of floating point efficiency to evaluate opportunities for optimization.}
\label{fig:fpefficiency-loop}
\end{figure}
Scopes that rank high according to a waste metric and low according to a companion relative efficiency metric often make the best targets for optimization.
Figure~\ref{fig:fpefficiency-loop} shows an \hpcviewer{} display of the top two routines, which collectively account for 32.2\% of the floating point waste in a reactive turbulent combustion code.
The second routine (\mytt{ratt}) is expanded to show the loops and statements within.
While the overall floating point efficiency for \mytt{ratt} is at 6.6\% of peak (shown in scientific notation in the \hpcviewer{} display), the most costly loop in \mytt{ratt} that accounts for 7.3\% of the floating point waste is executing at only 0.114\% efficiency.
Identifying such sources of inefficiency is the first step towards improving performance via tuning.
% ===========================================================================
% ===========================================================================
\section{Pinpointing and Quantifying Scalability Bottlenecks}
\label{sec:effective-performance-analysis:scalability}
On large-scale parallel systems, identifying impediments to scalability is of paramount importance.
On today's systems fashioned out of multicore processors, two kinds of scalability are of particular interest:
\begin{itemize}
\item scaling within nodes, and
\item scaling across the entire system.
\end{itemize}
\HPCToolkit{} can be used to readily pinpoint both kinds of bottlenecks.
Using call path profiles collected by \hpcrun{}, it is possible to quantify and pinpoint scalability bottlenecks of any kind, {\em regardless of cause}.
To pinpoint scalability bottlenecks in parallel programs, we use {\em differential profiling} --- mathematically combining corresponding buckets of two or more execution profiles.
Differential profiling was first described by McKenney~\cite{McKenney:1999:differential}; he used differential profiling to compare two {\em flat} execution profiles.
Differencing of flat profiles is useful for identifying what parts of a program incur different costs in two executions.
Building upon McKenney's idea of differential profiling, we compare call path profiles of parallel executions at different scales to pinpoint scalability bottlenecks.
Differential analysis of call path profiles pinpoints not only differences between two executions (in this case scalability losses), but the contexts in which those differences occur.
Associating changes in cost with full calling contexts is particularly important for pinpointing context-dependent behavior.
Context-dependent behavior is common in parallel programs.
For instance, in message passing programs, the time spent by a call to \mytt{MPI_Wait} depends upon the context in which it is called.
Similarly, how the performance of a communication event scales as the number of processors in a parallel execution increases depends upon a variety of factors such as whether the size of the data transferred increases and whether the communication is collective or not.
% ==========================================================
% ==========================================================
\subsection{Scalability Analysis Using Expectations}
Application developers have expectations about how the performance of their code should scale as the number of processors in a parallel execution increases.
Namely,
\begin{itemize}
\item when different numbers of
processors are used to solve the same problem (strong scaling), one
expects an execution's speedup to increase linearly with the number of processors employed;
\item when
different numbers of processors are used but the amount of computation
per processor is held constant (weak scaling), one expects the execution
time on a different number of processors to be the same.
\end{itemize}
In both of these situations, a code developer can express their expectations for how performance will scale as a formula that can be used to predict execution performance on a different number of processors.
One's expectations about how overall application performance should scale can be applied to each context in a program
to pinpoint and quantify deviations from expected scaling.
Specifically, one can scale and difference the performance of an application on different numbers of processors to pinpoint contexts that are not scaling ideally.
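Stated as formulas (using notation introduced here only for illustration), if $T_p$ and $T_q$ denote the time per processor in executions on $p$ and $q$ processors with $q > p$, these expectations are
\[
T_q = \frac{p}{q}\,T_p \quad \mbox{(strong scaling)} \qquad \mbox{and} \qquad T_q = T_p \quad \mbox{(weak scaling).}
\]
Measured costs that exceed these predictions represent scaling loss.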
To pinpoint and quantify scalability bottlenecks in a parallel application, we first use \hpcrun{} to collect a call path profile for the application on two different numbers of processors.
Let $E_p$ be an execution on $p$ processors and $E_q$ be an execution on $q$ processors.
Without loss of generality, assume that $q > p$.
In our analysis, we consider both {\it inclusive} and {\it exclusive} costs for CCT nodes.
The inclusive cost at $n$ represents the sum of all costs attributed to $n$ and any of its descendants in the CCT, and is denoted by $I(n)$.
The exclusive cost at $n$ represents the sum of all costs attributed strictly to $n$, and we denote it by $E(n)$.
If $n$ is an interior node in a CCT, it represents an invocation of a procedure.
If $n$ is a leaf in a CCT, it represents a statement inside some procedure. For leaves, their inclusive and exclusive costs are equal.
It is useful to perform scalability analysis for both inclusive and exclusive costs; if the loss of scalability attributed to the inclusive costs of a function invocation is roughly equal to the loss of scalability due to its exclusive costs, then we know that the computation in that function invocation does not scale.
However, if the loss of scalability attributed to a function invocation's inclusive costs outweighs the loss of scalability accounted for by exclusive costs, we need to explore the scalability of the function's callees.
Given CCTs for an ensemble of executions, the next step to analyzing the scalability of their performance is to clearly define our expectations.
Next, we describe performance expectations for weak scaling and intuitive metrics that represent how much performance deviates from our expectations.
More information about our scalability analysis technique can be found elsewhere~\cite{Coarfa-MC:2007:ICS-scalability,Tallent-MC-etal:2009:SC-hpctoolkit-petascale}.
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/flash-scalability.png}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Computing the scaling loss when weak scaling a white dwarf detonation simulation with FLASH3 from 256 to 8192 cores. For weak scaling, the time on an MPI rank in each of the simulations will be the same. In the figure, column 0 represents the inclusive cost for one MPI rank in a 256-core simulation; column 2 represents the inclusive cost for one MPI rank in an 8192-core simulation. The difference between these two columns, computed as {\tt \$2-\$0},
represents the excess work present in the larger simulation for each unique program context in the calling context tree. Dividing that by the total time in the 8192-core execution {\tt @2} gives the fraction of wasted time. Multiplying through by 100 gives the percent of the time wasted in the 8192-core execution, which corresponds to the \%~scalability loss.}
\label{fig:scaling-loss}
\end{figure}
\begin{figure}[t]
\center{\includegraphics[width=1.0\textwidth]{fig/scaling-loss-2.png}}
%Two possible representations for the call path fragment $\ldots s_1 \rightarrow s_2 \ldots$, where $s_1$ and $s_2$ are call sites and where $s_1$ represents a call from $p$ to $q$ and $s_2$ a call from $q'$ to $r$.
\caption{Using the scalability loss metric of Figure~\ref{fig:scaling-loss}, expressed as a fraction, to rank order loop nests by their scaling loss.}
\label{fig:scaling-loss-2}
\end{figure}
\subsubsection*{Weak Scaling}
Consider two weak scaling experiments executed on $p$ and $q$ processors, respectively, $p<q$.
Figure~\ref{fig:scaling-loss} shows how we can use a derived metric to compute and attribute scalability losses.
Here, we compute the difference in inclusive cycles spent on one core of an 8192-core run and one core of a 256-core run in a weak scaling experiment.
If the code had perfect weak scaling, the time for an MPI rank in each of the executions would be identical. In this case, they are not.
We compute the excess work by computing, for each scope, the difference between the time on the 8192-core run and the time on the 256-core run.
We normalize the differences in time spent in the two runs by dividing them by the total time spent in the 8192-core run. This yields the fraction of wasted effort
for each scope when scaling from 256 to 8192 cores. Finally, we multiply these results by 100 to compute the \% scalability loss.
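Written symbolically, using the inclusive cost notation introduced earlier with subscripts identifying the execution (this is simply a restatement of the recipe above), the metric attributed to a scope $n$ is
\[
\mbox{\% scalability loss}(n) \;=\; 100 \times \frac{I_{8192}(n) - I_{256}(n)}{T_{8192}},
\]
where $I_{256}(n)$ and $I_{8192}(n)$ denote the inclusive costs of $n$ in the 256-core and 8192-core executions and $T_{8192}$ is the total time of the 8192-core execution.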
This example shows how one can compute a derived metric that pinpoints and quantifies scaling losses across different node counts of a Blue Gene/P system.
A similar analysis can be applied to compute scaling losses between jobs that use different numbers of cores on individual processors.
Figure~\ref{fig:scaling-loss-2} shows the result of computing the scaling loss for each loop nest when scaling from one to eight cores on a multicore node and rank ordering loop nests by their scaling loss metric. Here, we simply compute the scaling loss as the difference between the cycle counts of the eight-core and the one-core runs, divided by the aggregate cost of the process executing on eight cores. This figure shows the scaling loss written in scientific notation as a fraction rather than multiplied by 100 to yield a percent.
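In \hpcviewer{}'s derived-metric syntax, and assuming (hypothetically) that column 0 holds the one-core cycle count and column 1 holds the eight-core cycle count, this fractional scaling loss would be written along the lines of
\begin{quote}
{\tt (\$1 - \$0) / @1}
\end{quote}
where {\tt @1} denotes the aggregate eight-core cost, analogous to the use of {\tt @2} in Figure~\ref{fig:scaling-loss}.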
In this figure, we examine scaling losses in the flat view, showing them for each loop nest.
The source pane shows the loop nest responsible for the greatest scaling loss when scaling from one to eight cores.
Unsurprisingly, the loop with the worst scaling loss is very memory intensive.
Memory bandwidth is a precious commodity on multicore processors.
While we have shown how to compute and attribute the fraction of excess work in a weak scaling experiment, one can compute a similar quantity for experiments with strong scaling. When differencing the costs summed across all of the threads in a pair of strong-scaling experiments, one uses exactly the same approach as shown in Figure~\ref{fig:scaling-loss}. If comparing weak scaling costs summed across all ranks in $p$ and $q$ core executions, one can simply scale the aggregate costs by $1/p$ and $1/q$ respectively before differencing them.
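As a sketch of this last point, if $C_p(n)$ and $C_q(n)$ denote the costs for scope $n$ summed across all ranks of the $p$-core and $q$-core weak scaling executions, the quantity to difference is
\[
\frac{C_q(n)}{q} \;-\; \frac{C_p(n)}{p}.
\]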
\subsubsection{Exploring Scaling Losses}
Scaling losses can be explored in \hpcviewer{} using any of its three views.
\begin{itemize}
\item {\em Top-down view.} This view represents the dynamic calling contexts (call paths) in which costs were incurred.
\item {\em Bottom-up view.} This view enables one to look upward along call paths. This view is particularly useful for understanding the performance of software components or procedures that are used in more than one context, such as communication library routines.
\item {\em Flat view.} This view organizes performance measurement data according to the static structure of an application. All costs incurred in {\em any} calling context by a procedure are aggregated together in the flat view.
\end{itemize}
\hpcviewer{} enables developers to explore top-down, bottom-up, and flat views of CCTs annotated with costs, helping to quickly pinpoint performance bottlenecks.
Typically, one begins analyzing an application's scalability and performance using the top-down calling context tree view.
Using this view, one can readily see how costs and scalability losses are associated with different calling contexts.
If costs or scalability losses are associated with only a few calling contexts, then this view suffices for identifying the bottlenecks.
When scalability losses are spread among many calling contexts, \eg, among different invocations of \mytt{MPI_Wait}, it is often useful to switch to the bottom-up view of the data to see if many losses are due to the same underlying cause.
In the bottom-up view, one can sort routines by their exclusive scalability losses and then look upward to see how these losses accumulate from the different calling contexts in which the routine was invoked.
Scaling loss based on excess work is intuitive; perfect scaling corresponds to an excess work value of $0$, sublinear scaling yields positive values, and superlinear scaling yields negative values.
Typically, CCTs for SPMD programs have similar structure.
If CCTs for different executions diverge, using \hpcviewer{} to compute and report excess work will highlight these program regions.
Inclusive excess work and exclusive excess work serve as useful measures of scalability associated with nodes in a calling context tree (CCT).
By computing both metrics, one can determine whether the application scales well or not at a CCT node and also pinpoint the cause of any lack of scaling.
If a node for a function in the CCT has comparable positive values for both inclusive excess work and exclusive excess work, then the loss of scaling is due to computation in the function itself.
However, if the inclusive excess work for the function outweighs that accounted for by its exclusive costs, then one should explore the scalability of its callees.
To isolate code that is an impediment to scalable performance, one can use the {\em hot path} button in \hpcviewer{} to trace a path down through the CCT to see where the cost is incurred.
% ===========================================================================
% ===========================================================================
%\section{Identifying Load Imbalance}
%\label{sec:effective-performance-analysis:load-imbalance}
% ***************************************************************************
% ***************************************************************************
\cleardoublepage
\chapter{Monitoring Dynamically-linked Applications with \hpcrun{}}
\label{chpt:hpcrun}
\input{hpcrun}
% ***************************************************************************
% ***************************************************************************
\cleardoublepage
\chapter{Monitoring Statically Linked Applications with \hpclink{}}
\label{chpt:statically-linked-apps}
% ===========================================================================
% ===========================================================================
On modern Linux systems, dynamically linked executables are the default.