% -*- Mode: latex; -*-
% $HeadURL$
% $Id$
This chapter describes the mechanics of using \hpcrun{} and \hpclink{}
to profile an application and collect performance data. For advice on
how to choose events, perform scaling studies, etc., see
Chapter~\ref{chpt:effective-performance-analysis} {\it Effective
Strategies for Analyzing Program Performance}.
\section{Using \hpcrun{}}
The \hpcrun{} launch script is used to run an application and collect
call path profiles and call path traces for {\it dynamically linked\/} binaries. For
dynamically linked programs, this requires no change to the program
source and no change to the build procedure. You should build your
application natively with full optimization. \hpcrun{} inserts its
profiling code into the application at runtime via \verb|LD_PRELOAD|.
\hpcrun{} monitors the execution of applications on a CPU using asynchronous sampling. If \hpcrun{} is used without any arguments to measure a program
\begin{quote}
\begin{verbatim}
hpcrun app arg ...
\end{verbatim}
\end{quote}
\noindent
it will measure the program's execution by sampling its CPUTIME and collect a call path profile for each thread in the execution. More about the CPUTIME metric can be found in Section~\ref{linux-timers}.
In addition to a call path profile, \hpcrun{} can collect a call path trace of an execution if the \verb|-t| (or \verb|--trace|) option is used to
turn on tracing. The following use of \hpcrun{} will collect both a call path profile and a call path trace of CPU execution using the default CPUTIME sample source.
\begin{quote}
\begin{verbatim}
hpcrun -t app arg ...
\end{verbatim}
\end{quote}
\noindent
Traces are most useful for understanding the execution dynamics of multithreaded or multi-process applications; however, you may find a trace of a single-threaded application to be useful to understand how an execution unfolds over time.
While CPUTIME is used as the default sample source if no other sample source is specified, many other sample sources are available.
Typically, one uses the \verb|-e| (or \verb|--event|) option to
specify a sample source and sampling rate.\footnote{GPU and OpenMP measurement events don't accept a rate.}
Sample sources are specified as `\verb|event@howoften|'
where \verb|event| is the name of the source and \verb|howoften| is either
a number specifying the period (threshold) for that event, or \verb|f| followed by a number, \eg{}, \verb|@f100|
specifying a target sampling frequency for the event in samples/second.\footnote{Frequency-based sampling and
the frequency-based notation for {\tt howoften} are only
available for sample sources managed by Linux {\tt perf\_events}. For Linux {\tt perf\_events}, \HPCToolkit{} uses
a default sampling frequency of 300 samples/second.}
Note that a higher period implies a lower rate of sampling.
The \verb|-e| option may be used multiple times to specify that multiple
sample sources be used for measuring an execution.
The basic syntax for profiling an application with
\hpcrun{} is:
\begin{quote}
\begin{verbatim}
hpcrun -t -e event@howoften ... app arg ...
\end{verbatim}
\end{quote}
For example, to profile an application using hardware counter sample sources
provided by Linux \verb|perf_events| and sample cycles at 300 times/second (the default sampling frequency) and sample every 4,000,000 instructions,
you would use:
\begin{quote}
\begin{verbatim}
hpcrun -e CYCLES -e INSTRUCTIONS@4000000 app arg ...
\end{verbatim}
\end{quote}
The units for timer-based sample sources (\verb|CPUTIME| and \verb|REALTIME|) are microseconds,
so to sample an application with tracing every 5,000 microseconds
(200~times/second), you would use:
\begin{quote}
\begin{verbatim}
hpcrun -t -e CPUTIME@5000 app arg ...
\end{verbatim}
\end{quote}
\hpcrun{} stores its raw performance data in a {\it measurements}
directory with the program name in the directory name. On systems
with a batch job scheduler (e.g., PBS) the name of the job is appended
to the directory name.
\begin{quote}
\begin{verbatim}
hpctoolkit-app-measurements[-jobid]
\end{verbatim}
\end{quote}
It is best to use a different measurements directory for each run.
So, if you're using \hpcrun{} on a local workstation without a job
launcher, you can use the `\verb|-o dirname|' option to specify an
alternate directory name.
For programs that use their own launch script (e.g., \verb|mpirun| or
\verb|mpiexec| for MPI), put the application's run script on the
outside (first) and \hpcrun{} on the inside (second) on the command
line. For example,
\begin{quote}
\begin{verbatim}
mpirun -n 4 hpcrun -e CYCLES mpiapp arg ...
\end{verbatim}
\end{quote}
Note that \hpcrun{} is intended for profiling dynamically linked {\it
binaries}. It will not work well if used to profile a shell script.
At best, you would be profiling the shell interpreter, not the script
commands, and sometimes this will fail outright.
It is possible to use \hpcrun{} to launch a statically linked binary,
but there are two problems with this. First, it is still necessary to
build the binary with \hpclink{}. Second, static binaries are
commonly used on parallel clusters that require running the binary
directly and do not accept a launch script. However, if your system
allows it, and if the binary was produced with \hpclink, then
\hpcrun{} will set the correct environment variables for profiling
statically or dynamically linked binaries. All that \hpcrun{} really
does is set some environment variables (including \verb|LD_PRELOAD|)
and \verb|exec| the binary.
\subsection{If \hpcrun{} causes your application to fail}
\hpcrun{} can cause applications to fail in certain circumstances. Here, we describe two kinds of failures that may arise and how to sidestep them.
\subsubsection{\hpcrun{} causes failures related to loading or using shared libraries}
\label{hpcrun-audit}
Unfortunately, the Glibc implementations used today on most platforms have known bugs that affect monitoring the loading and unloading of shared libraries and calls to a shared library's API.
While the best approach for coping with these problems is to use a system running Glibc 2.35 or later, for most people, this is not an option:
the system administrator picks the operating system version, which determines the Glibc version available to developers.
To understand what kinds of problems you may encounter with shared libraries and how you can work around them, it is helpful to understand how \HPCToolkit{} monitors shared libraries.
On Power and \verb|x86_64| architectures, by default \hpcrun{} uses \verb|LD_AUDIT| to monitor an application's use of dynamic libraries.
Use of \verb|LD_AUDIT| is the only strategy for monitoring shared libraries that will not cause a change in application behavior when libraries contain a \verb|RUNPATH|. However, Glibc's implementation of \verb|LD_AUDIT| has a number of bugs that may crash the application:
\begin{itemize}
\item Until Glibc 2.35, most applications running on ARM will crash. This was caused by a fatal flaw in Glibc's PLT handler for ARM, where an argument register that should have been saved was instead replaced with a junk pointer value. This register is used to return C/C++ \verb|struct| values from functions and methods, including some C++ constructors.
\item Until Glibc 2.35, applications and libraries using \texttt{dlmopen} will crash. While most applications do not use \texttt{dlmopen}, an example of a library that does is Intel's GTPin, which \hpcrun{} uses to instrument Intel GPU code.
\item Applications and libraries using significant amounts of static TLS space may crash with the message ``\texttt{cannot allocate memory in static TLS block}.''
This is caused by a flaw in Glibc causing it to allocate insufficient static TLS space when \verb|LD_AUDIT| is enabled.
For Glibc 2.35 and newer, setting the environment variable
\begin{verbatim}
export GLIBC_TUNABLES=glibc.rtld.optional_static_tls=0x1000000
\end{verbatim}
will instruct Glibc to allocate $16$\,MB of static TLS memory per thread; in our experience, this is far more than any application will use (the value can be adjusted freely).
For older Glibc, the only option is to disable \hpcrun{}'s use of \verb|LD_AUDIT|.
\end{itemize}
% If your program uses {\tt dlmopen}, you have two alternatives: ask \hpcrun{} to use an alternative strategy for monitoring shared libraries, namely by wrapping {\tt dlopen} and {\tt dlclose}.
The following options direct \hpcrun{} to adjust the strategy it uses for monitoring dynamic libraries. We suggest using these options only if your program fails with \hpcrun{}'s defaults.
\begin{description}
\item{{\tt --disable-auditor}} This option instructs \hpcrun{}
to track dynamic library operations by intercepting
{\tt dlopen} and {\tt dlclose} instead of using \verb|LD_AUDIT|. Note
that this alternate approach can cause problems with
libraries and applications that specify a \verb|RUNPATH|.
\item{{\tt --enable-auditor}} This option is the default, except on ARM or when Intel GTPin instrumentation is enabled.
Passing this option instructs \hpcrun{} to use \verb|LD_AUDIT| in all cases.
\item{{\tt --disable-auditor-got-rewriting}}
When an \texttt{LD\_AUDIT} auditor is active, Glibc unnecessarily intercepts
every call to a function in a shared library. \hpcrun{}
avoids this overhead by rewriting each shared library's
global offset table (GOT). Such rewriting is tricky.
This option can be used to disable GOT rewriting if
it is believed that the rewriting is causing the
application to fail.
\item{{\tt --namespace-single}} {\tt dlmopen} may load a shared library into an alternate
namespace, which crashes on Glibc versions before 2.35. This option instructs
\hpcrun{} to override \texttt{dlmopen} to instead load all
shared libraries within the application namespace.
This may significantly change application behavior, but may help avoid crashes.
This option is the default when Intel GTPin instrumentation is enabled.
\item{{\tt --namespace-multiple}} This option is the opposite of \texttt{--namespace-single}: it instructs \hpcrun{} to \textit{not} override \texttt{dlmopen}, retaining its normal function.
This option is the default except when Intel GTPin instrumentation is enabled.
\end{description}
If your application fails to find libraries when \hpcrun{} monitors it by wrapping {\tt dlopen} and {\tt dlclose} rather than using \verb|LD_AUDIT|, you can sidestep this problem by adding any library paths listed in the {\tt RUNPATH} of your application or library to your \verb|LD_LIBRARY_PATH| environment variable before launching \hpcrun{}.
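For example, if an application that uses a \verb|RUNPATH| fails to load its libraries while \hpcrun{} is wrapping {\tt dlopen} and {\tt dlclose}, one might work around the failure as follows (the library path shown is illustrative):
\begin{quote}
\begin{verbatim}
export LD_LIBRARY_PATH=/path/to/app/libs:$LD_LIBRARY_PATH
hpcrun --disable-auditor -e CYCLES app arg ...
\end{verbatim}
\end{quote}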
\subsubsection{\hpcrun{} causes your application to fail when {\tt gprof} instrumentation is present}
When an application has been compiled with the compiler flag \verb|-pg|,
the compiler adds instrumentation to collect performance measurement data for
the \verb|gprof| profiler. Measuring application performance with
\HPCToolkit{}'s measurement subsystem and \verb|gprof| instrumentation
active in the same execution may cause the execution
to abort. One can detect \verb|gprof| instrumentation in an
application by looking for the \verb|__monstartup| and \verb|_mcleanup| symbols
in an executable.
One can disable \verb|gprof| instrumentation when measuring the performance of
a dynamically-linked application by using the \verb|--disable-gprof|
argument to \hpcrun{}.
% ===========================================================================
\section{Hardware Counter Event Names}
HPCToolkit uses libpfm4\cite{libpfm-www} to translate from an event name string to an event code recognized by the kernel.
An event name is case insensitive and is defined as follows:
\begin{verbatim}
[pmu::][event_name][:unit_mask][:modifier|:modifier=val]
\end{verbatim}
\begin{itemize}
\item \textbf{pmu}. Optional name of the PMU (group of events) to which the event belongs. This is useful to disambiguate events in case events from different sources have the same name. If no pmu is specified, the first matching event is used.
\item \textbf{event\_name}. The name of the event. It must be the complete name; partial matches are not accepted.
\item \textbf{unit\_mask}. Some events can be refined using sub-events. A {\tt unit\_mask} designates an optional sub-event. An event may have multiple unit masks, and for some events it is possible to combine them by repeating the \texttt{:unit\_mask} pattern.
\item \textbf{modifier}. A modifier is an optional filter that restricts when an event counts.
The form of a modifier may be either \texttt{:modifier} or \texttt{:modifier=val}.
For modifiers without a value, the presence of the modifier is
interpreted as a restriction. Events may allow use of multiple modifiers
at the same time.
\begin{itemize}
\item \textbf{hardware event modifiers}. Some hardware events support one or more modifiers that restrict counting to a subset of events. For instance, on an Intel Broadwell EP, one can add a modifier to \verb|MEM_LOAD_UOPS_RETIRED| to count only load operations that are
an \verb|L2_HIT| or an \verb|L2_MISS|. For information about all modifiers for hardware events,
one can direct \HPCToolkit{}'s measurement subsystem to list all native events and their modifiers
as described in Section~\ref{sample-sources}.
\item \textbf{precise\_ip}. For some events, it is possible to control the amount of skid.
Skid is a measure of how many instructions may execute between an event and the PC where the event is reported.
Smaller skid enables more accurate attribution of events to instructions. Without a skid modifier, \hpcrun{} allows arbitrary skid because some architectures
don't support anything more precise. One may optionally specify one of the following as a skid modifier:
\begin{itemize}
\item \verb|:p| : a sample must have constant skid.
\item \verb|:pp| : a sample is requested to have 0 skid.
\item \verb|:ppp| : a sample must have 0 skid.
\item \verb|:P| : autodetect the least skid possible.
\end{itemize}
NOTE: If the kernel or the hardware does not support the specified value of the skid, no error message will be reported
but no samples will be recorded.
\end{itemize}
\end{itemize}
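Putting these pieces together: on an Intel Broadwell EP, one might sample L2 load misses 100 times per second with the least possible skid using a specification such as the one sketched below (event and unit mask names vary by processor):
\begin{quote}
\begin{verbatim}
hpcrun -e MEM_LOAD_UOPS_RETIRED:L2_MISS:P@f100 app arg ...
\end{verbatim}
\end{quote}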
% ===========================================================================
\section{Sample Sources}
\label{sample-sources}
This section provides an overview of how to use sample sources supported by HPCToolkit. To
see a list of the available sample sources and events that \hpcrun{}
supports, use `\verb|hpcrun -L|' (dynamic) or set
`\verb|HPCRUN_EVENT_LIST=LIST|' (static). Note that on systems with
separate compute nodes, it is best to run this on a compute node.
\newcommand{\perfevents}{{\tt perf\_events}}
\subsection{Linux \perfevents}
Linux \perfevents{} provides a powerful interface that supports
measurement of both application execution and kernel activity.
Using
\perfevents{}, one can measure both hardware and software events.
Using a processor's hardware performance monitoring unit (PMU), the
\perfevents{} interface can measure an execution using any hardware counter
supported by the PMU. Examples of hardware events include cycles, instructions
completed, cache misses, and stall cycles. Using instrumentation built in to the Linux kernel,
the \perfevents{} interface can measure software events. Examples of software events include page
faults, context switches, and CPU migrations.
\subsubsection{Capabilities of HPCToolkit's \perfevents{} Interface}
\paragraph{Frequency-based sampling.}
The Linux \perfevents{} interface supports frequency-based sampling.
With frequency-based sampling, the kernel automatically selects and adjusts an event period with the
aim of delivering samples for that event at a target sampling frequency.\footnote{The
kernel may be unable to deliver the desired frequency if
there are fewer events per second than the desired frequency.}
Unless a user explicitly specifies an event count threshold for an event,
HPCToolkit's measurement interface will use frequency-based sampling by default.
HPCToolkit's default sampling frequency is ${\rm min}(300,M-1)$, where $M$ is the
value specified in the system configuration file \verb|/proc/sys/kernel/perf_event_max_sample_rate|.
For circumstances where the user wants to use frequency-based sampling but
HPCToolkit's default sampling frequency is inappropriate,
one can specify the target sampling frequency for a particular event using the notation
{\em event}{\tt @f}{\em rate} when specifying an event or change the default sampling frequency.
When measuring a dynamically-linked executable using {\tt hpcrun}, one can change the default sampling frequency using {\tt hpcrun}'s {\tt -c} option. To set a new default sampling frequency for a statically-linked executable instrumented with {\tt hpclink}, set the \verb|HPCRUN_PERF_COUNT| environment variable.
The section below entitled {\em Launching} provides
examples of how to monitor an execution using frequency-based sampling.
\paragraph{Multiplexing.}
Using multiplexing enables one to monitor more events
in a single execution than the number of hardware counters a processor
can support for each thread. The number of events that can be monitored in
a single execution is only limited by the maximum number of concurrent
events that the kernel will allow a user to multiplex using the
\perfevents{} interface.
When more events are specified than can be monitored simultaneously
using a thread's hardware counters,\footnote{How many events can be
monitored simultaneously on a particular processor may depend on the
events specified.} the kernel will employ multiplexing and divide
the set of events to be monitored into groups, monitor only one group
of events at a time, and cycle repeatedly through the groups
as a program executes.
For applications that have very regular,
steady state behavior, e.g., an iterative code with lots of
iterations, multiplexing will yield results that are suitably representative
of execution behavior. However, for executions that consist of
unique short phases, measurements collected using multiplexing may
not accurately represent the execution behavior. To obtain
more accurate measurements, one can run an application multiple times and in
each run collect a subset of events that can be measured without multiplexing.
Results from several such executions can be imported into HPCToolkit's \hpcviewer{}
and analyzed together.
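For example, one might split four events across two runs so that each run fits within the available hardware counters, directing each run's data to its own measurements directory (event names and directory names here are illustrative):
\begin{quote}
\begin{verbatim}
hpcrun -o run1.m -e CYCLES -e INSTRUCTIONS app arg ...
hpcrun -o run2.m -e CACHE-MISSES -e CACHE-REFERENCES app arg ...
\end{verbatim}
\end{quote}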
\paragraph{Thread blocking.} When a program executes,
a thread may block waiting for the kernel to complete some operation on its behalf.
For instance, a thread may block waiting for data to become available so that a {\tt read} operation
can complete. On systems running Linux 4.3 or newer, one can use the \perfevents{} sample source to monitor how much time a thread is blocked and where the blocking occurs. To measure
the time a thread spends blocked, one can profile with the \verb|BLOCKTIME| event and
another time-based event, such as \verb|CYCLES|. The \verb|BLOCKTIME| event shouldn't have any frequency or period specified, whereas \verb|CYCLES| may have a frequency or period specified.
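For example, the following sketch profiles \verb|app| with \verb|CYCLES| at its default frequency together with \verb|BLOCKTIME|:
\begin{quote}
\begin{verbatim}
hpcrun -e CYCLES -e BLOCKTIME app arg ...
\end{verbatim}
\end{quote}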
\subsubsection{Launching}
\label{sec:perf-launching}
When sampling with native events, by default hpcrun will profile using \perfevents{}.
To force HPCToolkit to use PAPI rather than \perfevents{} to oversee monitoring of a PMU event
(assuming that HPCToolkit has been configured to include support for PAPI),
one must prefix the event with \lq{\verb|papi::|}\rq{} as follows:
\begin{quote}
\begin{verbatim}
hpcrun -e papi::CYCLES
\end{verbatim}
\end{quote}
\noindent For PAPI presets, there is no need to prefix the event with
\lq{\verb|papi::|}\rq. For instance, it is sufficient to specify the \verb|PAPI_TOT_CYC| event
without any prefix to profile using PAPI. For more information about using PAPI, see Section~\ref{section:papi}.
Below, we provide some examples of various ways to measure \verb|CYCLES|
and \verb|INSTRUCTIONS| using \HPCToolkit{}'s \perfevents{} measurement substrate:
To sample an execution 100 times per second (frequency-based sampling) counting \verb|CYCLES|
and 100 times a second counting \verb|INSTRUCTIONS|:
\begin{quote}
\begin{verbatim}
hpcrun -e CYCLES@f100 -e INSTRUCTIONS@f100 ...
\end{verbatim}
\end{quote}
To sample an execution every 1,000,000 cycles and every 1,000,000 instructions using period-based sampling:
\begin{quote}
\begin{verbatim}
hpcrun -e CYCLES@1000000 -e INSTRUCTIONS@1000000
\end{verbatim}
\end{quote}
By default, hpcrun uses frequency-based sampling at a rate of
300 samples per second per event type. Hence, the following command causes \HPCToolkit{} to
sample \verb|CYCLES| at 300 samples per second and \verb|INSTRUCTIONS| at 300 samples per second:
\begin{quote}
\begin{verbatim}
hpcrun -e CYCLES -e INSTRUCTIONS ...
\end{verbatim}
\end{quote}
One can specify a different default sampling period or frequency using the \verb|-c| option.
The command below will sample \verb|CYCLES| and \verb|INSTRUCTIONS| at 200 samples per second each:
\begin{quote}
\begin{verbatim}
hpcrun -c f200 -e CYCLES -e INSTRUCTIONS ...
\end{verbatim}
\end{quote}
\subsubsection{Notes}
\begin{itemize}
\item
Linux \perfevents{} uses one file descriptor for each event to be monitored in each thread.
Furthermore, \hpcrun{} generates one hpcrun file for each thread, plus an additional hpctrace file per thread if tracing is enabled.
Hence, for $e$ events and $t$ threads, the required number of file descriptors is:
\begin{quote}
$t \times e + t + t$ (the last term applies only if tracing is enabled)
\end{quote}
For instance, if one profiles a multi-threaded program that executes with 500 threads using 4 events,
then the required number of file descriptors is
\begin{quote}
500 threads $\times$ 4 events + 500 hpcrun files + 500 hpctrace files \\
= 3000 file descriptors
\end{quote}
If the number of file descriptors exceeds the maximum number of open files allowed, then the program will crash.
To remedy this issue, one needs to increase the maximum number of open files allowed.
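For example, on most Linux systems one can inspect and, up to the administrator-imposed hard limit, raise the per-process limit with the shell's \verb|ulimit| builtin:
\begin{quote}
\begin{verbatim}
ulimit -n        # show the current limit on open files
ulimit -n 4096   # raise the limit for this shell and its children
\end{verbatim}
\end{quote}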
\item
\sloppy
When a system is configured with suitable permissions, HPCToolkit will sample call stacks
within the Linux kernel in addition to application-level call stacks. This feature can be useful to measure kernel activity on behalf of a thread (e.g., zero-filling allocated pages when they are first touched)
or to observe where, why, and how long a thread blocks.
For a user to be able to sample kernel call stacks, the configuration file
\verb|/proc/sys/kernel/perf_event_paranoid| must have a value $\leq 1$. To associate addresses
in kernel call paths with function names, the value of
\verb|/proc/sys/kernel/kptr_restrict| must be 0 (number zero). If these settings are not configured in this way on your system, you will need someone with administrator privileges to change them for you to
be able to sample call stacks within the kernel.
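For example, one can check the current settings by reading the files above; an administrator can change them with \verb|sysctl| (such changes revert at reboot unless made persistent):
\begin{quote}
\begin{verbatim}
cat /proc/sys/kernel/perf_event_paranoid
cat /proc/sys/kernel/kptr_restrict
sudo sysctl -w kernel.perf_event_paranoid=1
sudo sysctl -w kernel.kptr_restrict=0
\end{verbatim}
\end{quote}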
\item
Due to a limitation present in all Linux kernel versions currently available,
HPCToolkit's measurement subsystem can only approximate a thread's blocking time.
At present, Linux reports when a thread blocks but does not report when a thread resumes execution.
For that reason, HPCToolkit's measurement subsystem approximates the time a thread spends blocked as the time between when the thread blocks and when the thread receives its first sample
after resuming execution.
\item
Users need to be cautious when considering measured counts of events that have been collected using
hardware counter multiplexing. Currently, it is not obvious to a user
if a metric was measured using a multiplexed counter. This information is
present in the measurements but is not currently visible in \hpcviewer.
\end{itemize}
\subsection{PAPI}
\label{section:papi}
PAPI, the Performance API, is a library that provides access to
hardware performance counters. PAPI aims to provide a
consistent, high-level interface that consists of a universal set of event names that can be used
to measure performance on any processor, independent of any processor-specific event names.
In some cases, PAPI event names
represent quantities synthesized by combining measurements based on multiple native events
available on a particular processor.
For instance, in some cases PAPI reports
total cache misses by measuring and combining data misses and instruction misses.
PAPI is available from the University of Tennessee at \url{http://icl.cs.utk.edu/papi}.
PAPI focuses mostly on in-core CPU events: cycles, cache misses,
floating point operations, mispredicted branches, etc. For example,
the following command samples total cycles and L2 cache misses.
\begin{quote}
\begin{verbatim}
hpcrun -e PAPI_TOT_CYC@15000000 -e PAPI_L2_TCM@400000 app arg ...
\end{verbatim}
\end{quote}
The precise set of PAPI preset and native events is highly system
dependent. Commonly, there are events for machine cycles, cache
misses, floating point operations and other more system specific
events. However, there are restrictions both on how many events can
be sampled at one time and on what events may be sampled together and
both restrictions are system dependent. Table~\ref{tab:papi-events}
contains a list of commonly available PAPI events.
To see what PAPI events are available on your system, use the
\verb|papi_avail| command from the \verb|bin| directory in your PAPI
installation. The event must be both available and not derived to be
usable for sampling. The command \verb|papi_native_avail| displays
the machine's native events. Note that on systems with separate
compute nodes, you normally need to run \verb|papi_avail| on one of
the compute nodes.
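For example, one might check whether a particular preset is available, and whether it is derived, by filtering the output of \verb|papi_avail|:
\begin{quote}
\begin{verbatim}
papi_avail | grep PAPI_TOT_CYC
\end{verbatim}
\end{quote}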
\begin{table}
\begin{center}
\begin{tabular}{|l|l|}
\hline
\verb|PAPI_BR_INS| & Branch instructions \\
\hline
\verb|PAPI_BR_MSP| & Conditional branch instructions mispredicted \\
\hline
\verb|PAPI_FP_INS| & Floating point instructions \\
\hline
\verb|PAPI_FP_OPS| & Floating point operations \\
\hline
\verb|PAPI_L1_DCA| & Level 1 data cache accesses \\
\hline
\verb|PAPI_L1_DCM| & Level 1 data cache misses \\
\hline
\verb|PAPI_L1_ICH| & Level 1 instruction cache hits \\
\hline
\verb|PAPI_L1_ICM| & Level 1 instruction cache misses \\
\hline
\verb|PAPI_L2_DCA| & Level 2 data cache accesses \\
\hline
\verb|PAPI_L2_ICM| & Level 2 instruction cache misses \\
\hline
\verb|PAPI_L2_TCM| & Level 2 cache misses \\
\hline
\verb|PAPI_LD_INS| & Load instructions \\
\hline
\verb|PAPI_SR_INS| & Store instructions \\
\hline
\verb|PAPI_TLB_DM| & Data translation lookaside buffer misses \\
\hline
\verb|PAPI_TOT_CYC| & Total cycles \\
\hline
\verb|PAPI_TOT_IIS| & Instructions issued \\
\hline
\verb|PAPI_TOT_INS| & Instructions completed \\
\hline
\end{tabular}
\end{center}
\caption{Some commonly available PAPI events.
The exact set of available events is system dependent.}
\label{tab:papi-events}
\end{table}
When selecting the period for PAPI events, aim for a rate of
approximately a few hundred samples per second. So, choose a period of roughly several
million or tens of millions for total cycles, or a few hundred thousand
for cache misses. PAPI and \hpcrun{} will tolerate sampling rates as
high as 1,000 or even 10,000 samples per second (or more). However, rates
higher than a few hundred samples per second will only increase measurement
overhead and distort the execution of your program; they won't yield more
accurate results.
Beginning with Linux kernel version 2.6.32,
support for accessing performance counters
using the Linux \perfevents{} performance monitoring subsystem is
built into the kernel. \perfevents{} provides a measurement substrate for PAPI on Linux.
On modern Linux systems that include support
for \verb|perf_events|, PAPI is only recommended for monitoring
events outside the scope of the \verb|perf_events| interface.
\paragraph{Proxy Sampling}
\HPCToolkit{} supports proxy sampling for derived PAPI events.
For \HPCToolkit{} to sample a PAPI event directly, the event must not be
derived and must trigger hardware interrupts when a threshold is exceeded.
For events that cannot trigger interrupts directly, HPCToolkit's proxy sampling
samples on another event that is supported directly and then reads the
counter for the derived event. In this case,
a native event can serve as a proxy for one or more derived events.
To use proxy sampling, specify the \hpcrun{} command line as usual and
be sure to include at least one non-derived PAPI event. The derived
events will be accumulated automatically when processing a sample trigger for a native event.
We recommend adding \verb|PAPI_TOT_CYC| as a native event when using proxy sampling, but
proxy sampling will gather data as long as the event set contains at least one
non-derived PAPI event. Proxy sampling requires one non-derived PAPI event to serve as the proxy;
a Linux timer can't serve as the proxy for a PAPI derived event.
For example, on newer Intel CPUs, often PAPI floating point events are
all derived and cannot be sampled directly. In that case, you could
count FLOPs by using cycles as a proxy event with a command line such as
the following. The period for derived events is ignored and may be
omitted.
\begin{quote}
\begin{verbatim}
hpcrun -e PAPI_TOT_CYC@6000000 -e PAPI_FP_OPS app arg ...
\end{verbatim}
\end{quote}
Attribution of proxy samples is not as accurate as regular samples.
The problem, of course, is that the event that triggered the sample
may not be related to the derived counter. The total count of events
should be accurate, but their location at the leaves in the Calling
Context tree may not be very accurate. However, the higher up the
CCT, the more accurate the attribution becomes. For example, suppose
you profile a loop of mixed integer and floating point operations and
sample on \verb|PAPI_TOT_CYC| directly and count \verb|PAPI_FP_OPS|
via proxy sampling. The attribution of flops to individual statements
within the loop is likely to be off. But as long as the loop is long
enough, the count for the loop as a whole (and up the tree) should be
accurate.
\subsection{REALTIME and CPUTIME}
\label{linux-timers}
\HPCToolkit{} supports two timer-based sample sources: \verb|CPUTIME| and
\verb|REALTIME|.
The unit for periods of these timers is microseconds.
Before describing this capability further, it is worth noting
that the CYCLES event supported by Linux \perfevents{} and PAPI's \verb|PAPI_TOT_CYC|
are generally superior to any of the timer-based sample sources.
\sloppy
The \verb|CPUTIME| and \verb|REALTIME| sample sources are based on the POSIX
timers \verb|CLOCK_THREAD_CPUTIME_ID| and \verb|CLOCK_REALTIME| with
the Linux \verb|SIGEV_THREAD_ID| extension.
\verb|CPUTIME| only counts time when the CPU is running;
\verb|REALTIME| counts
real (wall clock) time, whether the process is running or not.
Signal delivery for these timers is thread-specific, so these timers are suitable for
profiling multithreaded programs.
Sampling using the \verb|REALTIME| sample source
may break some applications that don't handle interrupted syscalls well. In that
case, consider using \verb|CPUTIME| instead.
The following example, which specifies a period of 5,000 microseconds, will sample
each thread in \verb|app| at a rate of approximately 200 times per second.
\begin{quote}
\begin{verbatim}
hpcrun -e REALTIME@5000 app arg ...
\end{verbatim}
\end{quote}
\noindent {\it Note:} do not use more than one timer-based sample source to monitor a program execution.
When using a sample source such as \verb|CPUTIME| or \verb|REALTIME|,
we recommend not using another time-based sampling source such as
Linux \perfevents{} CYCLES or PAPI's \verb|PAPI_TOT_CYC|.
Technically, this is feasible and \hpcrun{} won't die.
However, multiple time-based sample sources would compete with one another to measure the
execution and likely lead to dropped samples and possibly distorted results.
\subsection{IO}
The \verb|IO| sample source counts the number of bytes read and
written. This displays two metrics in the viewer: ``IO Bytes Read''
and ``IO Bytes Written.'' The \verb|IO| source is a synchronous sample
source.
It overrides the functions \verb|read|, \verb|write|, \verb|fread|
and \verb|fwrite| and records the number of bytes read or
written along with their dynamic context synchronously rather
than relying on data collection triggered by interrupts.
To include this source, use the \verb|IO| event (no period). In the
static case, two steps are needed. Use the \verb|--io| option for
\hpclink{} to link in the \verb|IO| library and use the \verb|IO| event
to activate the \verb|IO| source at runtime. For example,
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -e IO app arg ...| \\
(static) & \verb|hpclink --io gcc -g -O -static -o app file.c ...| \\
& \verb|export HPCRUN_EVENT_LIST=IO| \\
& \verb|app arg ...|
\end{tabular}
\end{quote}
The \verb|IO| source is mainly used to find where your program reads or
writes large amounts of data. However, it is also useful for tracing
a program that spends much time in \verb|read| and \verb|write|. The
hardware performance counters do not advance while running in
the kernel, so the trace viewer may misrepresent the amount of time
spent in syscalls such as \verb|read| and \verb|write|. By adding the
\verb|IO| source, \hpcrun{} overrides \verb|read| and \verb|write| and
thus is able to more accurately count the time spent in these
functions.
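For example, one might combine tracing, a timer, and the \verb|IO| source as sketched below:
\begin{quote}
\begin{verbatim}
hpcrun -t -e CPUTIME@5000 -e IO app arg ...
\end{verbatim}
\end{quote}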
\subsection{MEMLEAK}
The \verb|MEMLEAK| sample source counts the number of bytes allocated
and freed. Like \verb|IO|, \verb|MEMLEAK| is a synchronous sample
source and does not generate asynchronous interrupts. Instead, it
overrides the malloc family of functions (\verb|malloc|, \verb|calloc|,
\verb|realloc| and \verb|free| plus \verb|memalign|, \verb|posix_memalign|
and \verb|valloc|) and records the number of bytes
allocated and freed along with their dynamic context.
\verb|MEMLEAK| allows you to find locations in your program that
allocate memory that is never freed. But note that failure to free a
memory location does not necessarily imply that location has leaked
(missing a pointer to the memory). It is common for programs to
allocate memory that is used throughout the lifetime of the process
and not explicitly free it.
To include this source, use the \verb|MEMLEAK| event (no period).
Again, two steps are needed in the static case. Use the \verb|--memleak|
option for \hpclink{} to link in the \verb|MEMLEAK| library
and use the \verb|MEMLEAK| event to activate it at runtime. For
example,
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -e MEMLEAK app arg ...| \\
(static) & \verb|hpclink --memleak gcc -g -O -static -o app file.c ...| \\
& \verb|export HPCRUN_EVENT_LIST=MEMLEAK| \\
& \verb|app arg ...|
\end{tabular}
\end{quote}
If a program allocates and frees many small regions, the \verb|MEMLEAK|
source may result in a high overhead. In this case, you may reduce
the overhead by using the memleak probability option to record only a
fraction of the mallocs. For example, to monitor 10\% of the mallocs,
use:
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -e MEMLEAK --memleak-prob 0.10 app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST=MEMLEAK| \\
& \verb|export HPCRUN_MEMLEAK_PROB=0.10| \\
& \verb|app arg ...|
\end{tabular}
\end{quote}
It might appear that if you monitor only 10\% of the program's
mallocs, then you would have only a 10\% chance of finding the leak.
But if a program leaks memory, then it's likely that it does so many
times, all from the same source location. And you only have to find
that location once. So, this option can be a useful tool if the
overhead of recording all mallocs is prohibitive.
Rarely, for some programs with complicated memory usage patterns, the
\verb|MEMLEAK| source can interfere with the application's memory
allocation causing the program to segfault. If this happens, use the
\verb|hpcrun| debug ({\tt dd}) variable \verb|MEMLEAK_NO_HEADER| as a
workaround.
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -e MEMLEAK -dd MEMLEAK_NO_HEADER app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST=MEMLEAK| \\
& \verb|export HPCRUN_DEBUG_FLAGS=MEMLEAK_NO_HEADER| \\
& \verb|app arg ...|
\end{tabular}
\end{quote}
The \verb|MEMLEAK| source works by attaching a header or a footer to
the application's \verb|malloc|'d regions. Headers are faster but
have a greater potential for interfering with an application. Footers
have higher overhead (require an external lookup) but have almost no
chance of interfering with an application. The
\verb|MEMLEAK_NO_HEADER| variable disables headers and uses only
footers.
% ===========================================================================
\section{Experimental Python Support}
This section provides a brief overview of how to use \HPCToolkit{} to analyze the performance of Python-based applications.
Normally, \hpcrun{} will attribute performance to the CPython implementation, not to the application Python code, as shown in Figure~\ref{fig:python-support}.
This is usually of little interest to an application developer, so \HPCToolkit{} provides experimental support for attributing performance to Python callstacks.
\textbf{NOTE: Python support in HPCToolkit is in its early days. If you compile HPCToolkit to match the version of Python being used by your application, in many cases you will find that you can measure what you want. However, other cases may not work as expected; crashes and corrupted performance data are not uncommon. For these reasons, use it at your own risk.}
\begin{figure}[ht]
\centering
\includegraphics[width=.95\textwidth]{fig/python-comparison.png}
\caption{
Example of a simple Python application measured without (left) and with (right) Python support enabled via \hpcrun{} \texttt{-a python}.
The left database has no source code, since sources were not provided for the CPython implementation.
}
\label{fig:python-support}
\end{figure}
If \HPCToolkit{} has been compiled with Python support enabled, \hpcrun{} is able to replace segments of the C callstacks with the Python code running in those frames.
To enable this transformation, profile your application with the additional \texttt{-a python} flag:
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -a python -e event@howoften python3 app arg ...| \\
\end{tabular}
\end{quote}
As shown in Figure~\ref{fig:python-support}, passing this flag removes the CPython implementation details, replacing them with the much smaller Python callstack.
When Python calls an external C library, \HPCToolkit{} will report both the name of the Python function object and the C function being called; in this example, \texttt{sleep} and Glibc's \texttt{clock\_nanosleep}, respectively.
\subsection{Known Limitations}
This section lists a number of known limitations of the current implementation of the Python support.
Users should be aware of these limitations before attempting to use the Python support in practice.
\begin{enumerate}
\item
Pythons older than 3.8 are not supported by \HPCToolkit{}.
Please upgrade any applications and Python extensions to use a recent version of Python before attempting to enable Python support.
\item
The application should be run with the same Python that was used to compile \HPCToolkit{}.
The CPython ABI can change between patch versions and due to certain build configuration flags.
To ensure \hpcrun{} will not unwittingly crash the application, it is best to use a single Python for both \HPCToolkit{} and the application.
Despite this recommendation, we have had some minor success with cross-version compatibility (e.g., building HPCToolkit with Python 3.11.8 support and using it to measure a program running under Python 3.11.6). However, you should treat cross-version compatibility as a pleasant surprise rather than expect it to work, even across patch releases.
\item
The bottom-up and flat views of \hpcviewer{} may not correctly present Python callstacks, particularly those that call C/C++ extensions.
Some Python functions may be missing, and the metrics attributed to them may be suspect.
In these cases, refer to the top-down view as the known-good source of truth.
\item
Threads spawned by Python's \texttt{threading} and \texttt{subprocess} modules are not fully supported.
Only the main Python thread will attribute performance to Python callstacks; all others will attribute performance to the CPython implementation.
If Python \texttt{threading} is a performance bottleneck, consider implementing the parallelism in a C/C++ extension instead of in Python to avoid \href{https://docs.python.org/3/glossary.html#term-global-interpreter-lock}{contention on the GIL}.
\item
Applications using signals and signal handlers, for example Python's \texttt{signal} module, will experience crashes when run under \hpcrun{}.
The current implementation fails to process the non-sequential modifications to the Python stack that take place when Python handles signals.
\end{enumerate}
% ===========================================================================
\section{Process Fraction}
Although \hpcrun{} can profile parallel jobs with thousands or tens of
thousands of processes, there are two scaling problems that become
prohibitive beyond a few thousand cores. First, \hpcrun{} writes the
measurement data for all of the processes into a single directory.
This results in one file per process plus one file per thread (two
files per thread if using tracing). Unix file systems are not
equipped to handle directories with many tens or hundreds of thousands
of files. Second, the sheer volume of data can overwhelm the viewer
when the size of the database far exceeds the amount of memory on the
machine.
The solution is to sample only a fraction of the processes. That is,
you can run an application on many thousands of cores but record data
for only a few hundred processes. The other processes run the
application but do not record any measurement data. This is what the
process fraction option (\verb|-f| or \verb|--process-fraction|) does.
For example, to monitor 10\% of the processes, use:
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -f 0.10 -e event@howoften app arg ...| \\
(dynamic) & \verb|hpcrun -f 1/10 -e event@howoften app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST='event@howoften'| \\
& \verb|export HPCRUN_PROCESS_FRACTION=0.10| \\
& \verb|app arg ...|
\end{tabular}
\end{quote}
With this option, each process generates a random number and records
its measurement data with the given probability. The process fraction
(probability) may be written as a decimal number (0.10) or as a
fraction (1/10) between 0 and 1. So, in the above example, all three
cases would record data for approximately 10\% of the processes. Aim
for a number of processes in the hundreds.
% ===========================================================================
\section{Starting and Stopping Sampling}
\HPCToolkit{} supports an API for the application to start and stop
sampling. This is useful if you want to profile only a subset of a
program and ignore the rest. The API supports the following
functions.
\begin{quote}
\begin{verbatim}
void hpctoolkit_sampling_start(void);
void hpctoolkit_sampling_stop(void);
\end{verbatim}
\end{quote}
For example, suppose that your program has three major phases: it
reads input from a file, performs some numerical computation on the
data and then writes the output to another file. And suppose that you
want to profile only the compute phase and skip the read and write
phases. In that case, you could stop sampling at the beginning of the
program, restart it before the compute phase and stop it again at the
end of the compute phase.
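A minimal C sketch of this pattern appears below; the phase functions are illustrative stand-ins for your program's code:
\begin{quote}
\begin{verbatim}
#include <hpctoolkit.h>
#include <unistd.h>

/* illustrative phase functions standing in for real work */
static void read_input(void)   { sleep(1); }
static void compute(void)      { sleep(1); }
static void write_output(void) { sleep(1); }

int main(void)
{
  hpctoolkit_sampling_stop();   /* sampling starts on; turn it off */
  read_input();                 /* read phase: not measured */

  hpctoolkit_sampling_start();
  compute();                    /* compute phase: measured */
  hpctoolkit_sampling_stop();

  write_output();               /* write phase: not measured */
  return 0;
}
\end{verbatim}
\end{quote}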
This interface is process wide, not thread specific. That is, it
affects all threads of a process. Note that when you turn sampling on
or off, you should do so uniformly across all processes, normally at
the same point in the program. Enabling sampling in only a subset of
the processes would likely produce skewed and misleading results.
And for technical reasons, when sampling is turned off in a threaded
process, interrupts are disabled only for the current thread. Other
threads continue to receive interrupts, but they don't unwind the call
stack or record samples. So, another use for this interface is to
protect syscalls that are sensitive to being interrupted with signals.
For example, some Gemini interconnect (GNI) functions called from
inside \verb|gasnet_init()| or \verb|MPI_Init()| on Cray XE systems
will fail if they are interrupted by a signal. As a workaround, you
could turn sampling off around those functions.
Also, you should use this interface only at the top level for major
phases of your program. That is, the granularity of turning sampling
on and off should be much larger than the time between samples.
Turning sampling on and off down inside an inner loop will likely
produce skewed and misleading results.
To use this interface, put the above function calls into your program
where you want sampling to start and stop. Remember, starting and
stopping apply process wide. For C/C++, include the following header
file from the \HPCToolkit{} \verb|include| directory.
\begin{quote}
\begin{verbatim}
#include <hpctoolkit.h>
\end{verbatim}
\end{quote}
Compile and link your application with \verb|libhpctoolkit|, using \verb|-I| and
\verb|-L| options for the include and library paths. For example,
\begin{quote}
\begin{verbatim}
gcc -I /path/to/hpctoolkit/include app.c ... \
-L /path/to/hpctoolkit/lib/hpctoolkit -lhpctoolkit ...
\end{verbatim}
\end{quote}
The \verb|libhpctoolkit| library provides weak symbol no-op definitions
for the start and stop functions. For dynamically linked programs, be
sure to include \verb|-lhpctoolkit| on the link line (otherwise your
program won't link). For statically linked programs, \hpclink{} adds
strong symbol definitions for these functions. So, \verb|-lhpctoolkit|
is not necessary in the static case, but it doesn't hurt.
To run the program, set the \verb|LD_LIBRARY_PATH| environment
variable to include the \HPCToolkit{} \verb|lib/hpctoolkit| directory.
This step is only needed for dynamically linked programs.
\begin{quote}
\begin{verbatim}
export LD_LIBRARY_PATH=/path/to/hpctoolkit/lib/hpctoolkit
\end{verbatim}
\end{quote}
Note that sampling is initially turned on until the program turns it
off. If you want it initially turned off, then use the \verb|-ds| (or
\verb|--delay-sampling|) option for \hpcrun{} (dynamic) or set the
\verb|HPCRUN_DELAY_SAMPLING| environment variable (static).
\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -ds -e event@howoften app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST='event@howoften'| \\
& \verb|export HPCRUN_DELAY_SAMPLING=1| \\
& \verb|app arg ...|
\end{tabular}
\end{quote}
% ===========================================================================
\section{Environment Variables for \hpcrun{}}
\label{sec:env-vars}
For most systems, \hpcrun{} requires no special environment variable settings.
There are situations, however, where \hpcrun{}, to function correctly,
\emph{must} refer to environment variables. These environment variables, and the
corresponding situations, are:
\begin{description}
\item{\verb|HPCTOOLKIT|} To function correctly, \hpcrun{} must know
the location of the \HPCToolkit{} top-level installation directory.
The \hpcrun{} script uses elements of the installation \verb|lib| and
\verb|libexec| subdirectories. On most systems,
the \hpcrun{} script can find the requisite
components relative to its own location in the file system.
However, some parallel job launchers \emph{copy} the
\hpcrun{} script to a different location as they launch a job. If your
system does this, you must set the \verb|HPCTOOLKIT|
environment variable to the location of the \HPCToolkit{} top-level installation directory
before launching a job.
\end{description}
{\bf Note to system administrators:} if your system provides a module system for configuring
software packages, then constructing
a module for \HPCToolkit{} to initialize these environment variables to appropriate settings
would be convenient for users.
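As a sketch, a minimal Tcl modulefile for an environment-modules or Lmod installation might look like the following (the installation path is illustrative):
\begin{quote}
\begin{verbatim}
#%Module
setenv       HPCTOOLKIT /path/to/hpctoolkit
prepend-path PATH       /path/to/hpctoolkit/bin
\end{verbatim}
\end{quote}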
\section{Cray System Specific Notes}
\label{sec:platform-specific}
%
% system specific notes for titan, keenland?
%
If you are trying to profile a dynamically-linked executable on a Cray that is still using the ALPS job launcher and you see an error like the following
\begin{quote}
\begin{verbatim}
/var/spool/alps/103526/hpcrun: Unable to find HPCTOOLKIT root directory.
Please set HPCTOOLKIT to the install prefix, either in this script,
or in your environment, and try again.
\end{verbatim}
\end{quote}
\noindent
in your job's error log then read on. Otherwise, skip this section.
The problem is that the Cray job launcher copies HPCToolkit's \hpcrun{}
script to a directory somewhere below \verb|/var/spool/alps/|\ and runs
it from there. Moving \hpcrun{} to a different directory
breaks \hpcrun{}'s method for finding \HPCToolkit{}'s install directory.
To fix this problem, in your job script, set \verb|HPCTOOLKIT| to the top-level \HPCToolkit{} installation directory
(the directory containing the \verb|bin|, \verb|lib| and
\verb|libexec| subdirectories) and export it to the environment.
(If launching statically-linked binaries created using \hpclink{}, this step is unnecessary, but harmless.)
Figure~\ref{cray-alps} shows a skeletal job script that sets the \verb|HPCTOOLKIT| environment variable before monitoring
a dynamically-linked executable with \hpcrun{}:
\begin{figure}
\begin{quote}
\begin{verbatim}
#!/bin/sh