-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathenviron.tex
310 lines (242 loc) · 15.2 KB
/
environ.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
\newcommand{\parg}{\vspace{1ex}\noindent}
\chapter{Environment Variables}
\label{sec:env}
\HPCToolkit{}'s measurement subsystem decides what and how to measure
using information it obtains from environment variables.
This chapter describes all of the environment variables that control
\HPCToolkit's measurement subsystem.
When using
\HPCToolkit{}'s \hpcrun{} script to measure the performance of
dynamically-linked executables, \hpcrun{} takes information passed
to it in command-line arguments and communicates it to \HPCToolkit{}'s
measurement subsystem by appropriately setting environment variables.
To measure statically-linked executables, one first adds \HPCToolkit's
measurement subsystem to a binary as it is linked by using \HPCToolkit's
\hpclink{} script. Prior to launching a statically-linked binary that
includes \HPCToolkit's measurement subsystem, a user
must manually set environment variables.
Section~\ref{user-env} describes
environment variables of interest to users. Section~\ref{system-env}
describes environment variables designed for use by \HPCToolkit{}
developers. In some cases, \HPCToolkit's developers will ask a user
to set some of the environment variables described in Section~\ref{system-env} to generate a detailed error
report when problems arise.
\section{Environment Variables for Users}
\label{user-env}
\paragraph{HPCTOOLKIT.}
Under normal circumstances, there is no need to use this environment variable.
However, there are two situations where \hpcrun{}
\emph{must} consult the \verb+HPCTOOLKIT+ environment variable to determine the location
of \HPCToolkit{}'s top-level installation directory to find libraries
and utilities that it needs at runtime.
\begin{itemize}
\item
If you launch the \hpcrun{} script via a file system link,
you must set the \verb+HPCTOOLKIT+ environment variable to
\HPCToolkit{}'s top-level installation directory.
\item Some parallel job launchers (e.g., Cray's aprun) may \emph{copy} the
\hpcrun{} script to a different location. If this is the case, you will need to set the \verb+HPCTOOLKIT+ environment variable to
\HPCToolkit{}'s top-level installation directory.
\end{itemize}
\paragraph{HPCRUN\_EVENT\_LIST.}
This environment variable is used provide a set of (event, period)
pairs that will be used to configure \HPCToolkit's measurement subsystem to perform
asynchronous sampling. The HPCRUN\_EVENT\_LIST environment variable
must be set otherwise HPCToolkit's measurement subsystem will terminate
execution. If an application should run with sampling disabled,
HPCRUN\_EVENT\_LIST should be set to NONE. Otherwise, HPCToolkit's
measurement subsystem expects an event list of the form shown below.
$$event1[@period1];...;eventN[@periodN]$$ As denoted by the
square brackets, periods are optional. The default period is 1
million.
\parg
Flags to add an event with \hpcrun: \verb|-e/--event| $event1[@period1]$
\parg
Multiple events may be specified using multiple instances of \verb|-e/--event| options.
\paragraph{HPCRUN\_TRACE.}
If this environment variable is set, HPCToolkit's measurement
subsystem will collect a trace of sample events as part of a measurement
database in addition to a profile. HPCToolkit's hpctraceviewer
utility can be used to view the trace after the measurement database
are processed with either HPCToolkit's hpcprof or hpcprofmpi
utilities.
\parg
Flags to enable tracing with \hpcrun: \verb|-t/--trace|
\paragraph{HPCRUN\_OUT\_PATH}
If this environment variable is set, HPCToolkit's measurement subsystem
will use the value specified as the name of the directory where
output data will be recorded. The default directory for a command
$command$ running under control of a job launcher with as job ID
$jobid$ is hpctoolkit-$command$-measurements[-$jobid$]. (If no
job ID is available, the portion of the directory name in square
brackets will be omitted. Warning: Without a $jobid$ or an output
option, multiple profiles of the same $command$ will be placed
in the same output directory.
\parg
Flags to set output path with \hpcrun: \verb|-o/--output| $directoryName$
\paragraph{HPCRUN\_PROCESS\_FRACTION}
\sloppy
If this environment variable is set, \HPCToolkit's measurement
subsystem will measure only a fraction of an execution’s processes.
The value of HPCRUN\_PROCESS\_FRACTION may be written as a a floating
point number or as a fraction. So, '0.10' and '1/10' are equivalent.
If HPCRUN\_PROCESS\_FRACTION is set to a value with an unrecognized
format, \HPCToolkit's measurement subsystem will use the default
probability of 0.1. For each process, \HPCToolkit's measurement
subsystem will generate a pseudo-random value in the range [0.0, 1.0).
If the generated random number is less than the value of
HPCRUN\_PROCESS\_FRACTION, then \HPCToolkit{} will collect performance
measurements for that process.
\parg
Flags to set process fraction with \hpcrun: \verb|-f/-fp/--process-fraction| $frac$
\paragraph{HPCRUN\_DELAY\_SAMPLING}
\sloppy
If this environment variable is set, HPCToolkit's measurement subsystem
will initialize itself but not begin measurement using sampling
until the program turns on sampling by calling
\verb|hpctoolkit_sampling_start()|. To measure only a part of a
program, one can bracket that with \verb|hpctoolkit_sampling_start()|
and \verb|hpctoolkit_sampling_stop()|. Sampling may be turned on
and off multiple times during an execution, if desired.
\parg
Flags to delay sampling with \hpcrun: \verb|-ds/--delay-sampling|
\paragraph{HPCRUN\_CONTROL\_KNOBS.}
\hpcrun{} has some settings, known as control knobs, that can be adjusted by a knowledgeable user to tune the operation of \hpcrun's measurement subsystem. Names and default values of the control knobs are shown in Table~\ref{knob-names}
\begin{table}[t]
\begin{center}
\begin{tabular}{|l|r|p{2in}|}\hline
& Default & \\
Name & Value & Description \\\hline\hline
\verb|MAX_COMPLETION_CALLBACK_THREADS| & 1000 & See Note 1. \\\hline
\verb|STREAMS_PER_TRACING_THREAD| & 4 & See Note 2. \\\hline
\verb|HPCRUN_CUDA_DEVICE_BUFFER_SIZE| & 8388608 & See Note 3. \\\hline
\verb|HPCRUN_CUDA_DEVICE_SEMAPHORE_SIZE| & 65536 & See Note 4. \\\hline
\hline
\end{tabular}
\begin{flushleft}
{\bf Note 1}: OpenCL may execute callbacks on helper threads created by the OpenCL runtime. This knob specifies the maximum number of helper threads that can be handled by \hpcrun{}'s OpenCL tracing implementation.
{\bf Note 2}: GPU stream traces are recorded by tracing threads created by \hpcrun{}. Reducing the number of streams per \hpcrun{} tracing thread may make monitoring faster, though it will use more resources.
{\bf Note 3}: Value used as {\tt CUPTI\_ACTIVITY\_ATTR\_DEVICE\_BUFFER\_SIZE}. See \url{https://docs.nvidia.com/cuda/cupti/group__CUPTI__ACTIVITY__API.html}.
{\bf Note 4}: Value used as \verb|CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE|. See \url{https://docs.nvidia.com/cuda/cupti/group__CUPTI__ACTIVITY__API.html}
\end{flushleft}
\caption{Control knob names and default values.}
\label{knob-names}
\end{center}
\end{table}
\parg
Flags to set a control knob for \hpcrun: \verb|-ck/--control-knob| name=setting.
\paragraph{HPCRUN\_MEMLEAK\_PROB}
If this environment variable is set, \HPCToolkit's measurement
subsystem will measure only a fraction of an execution’s memory
allocations, e.g., calls to \verb|malloc|, \verb|calloc|, \verb|realloc|,
\verb|posix_memalign|, \verb|memalign|, and valloc. All allocations
monitored will have their corresponding calls to free monitored as
well. The value of HPCRUN\_MEMLEAK\_PROB may be written as a a
floating point number or as a fraction. So, '0.10' and '1/10' are
equivalent. If HPCRUN\_MEMLEAK\_PROB is set to a value with an
unrecognized format, \HPCToolkit's measurement subsystem will use the
default probability of 0.1. For each memory allocation, \HPCToolkit's
measurement subsystem will generate a pseudo-random value in the range
[0.0, 1.0). If the generated random number is less than the value
of HPCRUN\_MEMLEAK\_PROB, then \HPCToolkit{} will monitor that
allocation.
\parg
Flags to set process fraction with \hpcrun: \verb|-mp/--memleak-prob| $prob$
\paragraph{HPCRUN\_RETAIN\_RECURSION}
Unless this environment variable is set, by default HPCToolkit's
measurement subsystem will summarize call chains from recursive calls
at a depth of two. Typically, application developers have no need
to see performance attribution at all recursion depths when an
application calls recursive procedures such as quicksort. Setting
this environment variable may dramatically increase the size of
calling context trees for applications that employ bushy subtrees
of recursive calls.
\parg
Flags to retain recursion with \hpcrun: \verb|-r/--retain-recursion|
\paragraph{HPCRUN\_MEMSIZE}
If this environment variable is set, HPCToolkit's measurement subsystem
will allocate memory for measurement data in segments using the
value specified for HPCRUN\_MEMSIZE (rounded up to the nearest
enclosing multiple of system page size) as the segment size. The
default segment size is 4M.
\parg
Flags to set memsize with \hpcrun: \verb|-ms/--memsize| $bytes$
\paragraph{HPCRUN\_LOW\_MEMSIZE}
If this environment variable is set, HPCToolkit's measurement subsystem
will allocate another segment of measurement data when the amount
of free space available in the current segment is less than the
value specified by HPCRUN\_LOW\_MEMSIZE. The default for low memory
size is 80K.
\parg
Flags to set low memsize with \hpcrun: \verb|-lm/--low-memsize| $bytes$
\paragraph{HPCTOOLKIT\_HPCSTRUCT\_CACHE}
If this environment variable contains the name of a Linux directory
that is readable and writable to you, \hpcstruct{} will cache any
program structure files it computes in this directory. When invoked
to analyze a binary, \hpcstruct{} will check if program structure
information for the binary exists in the cache. If so, \hpcstruct{}
will return the cached copy. If not, \hpcstruct{} will compute
program structure information for the binary and record it in the
cache.
\section{Environment Variables that May Avoid a Crash}
\paragraph{HPCRUN\_AUDIT\_FAKE\_AUDITOR.}
By default, \hpcrun{} will use \verb|libc|'s \verb|LD_AUDIT| feature to monitor dynamic library operations.
For cases where using \verb|LD_AUDIT| is problematic (e.g. with applications or libraries that require the use of {\tt dlmopen}) ,
\hpcrun{} supports an alternative {\em fake auditor} that monitors shared library operations by wrapping {\tt dlopen} and {\tt dlclose} instead. This variable will be set to 1 if a {\em fake auditor} is used.
If \verb|LD_AUDIT| is not causing your program to crash, we don't recommend using the fake auditor as it may cause your application or shared libraries it loads to ignore any {\tt RUNPATH} set in their binaries.
\parg
Flag to select the fake auditor with \hpcrun: \verb|--disable-auditor|.
\paragraph{HPCRUN\_AUDIT\_DISABLE\_PLT\_CALL\_OPT.}
By default, \hpcrun{} will use \verb|libc|'s \verb|LD_AUDIT| feature to monitor dynamic library operations. The \verb|LD_AUDIT| facility has the unfortunate behavior of intercepting each call to a shared library. Each call to a shared library is dispatched through the {\em Procedure Linkage Table} (PLT). We have observed that allowing the \verb|LD_AUDIT| facility to intercept each call to a shared library is costly: on {\tt x86\_64} we measured a slowdown of $68\times$ for a call to an empty shared library routine.
To avoid this overhead, \hpcrun{} sidesteps \verb|LD_AUDIT|'s monitoring of a load module's calls to a shared library routine by allowing the address of the routine to be cached in the load module's Global Offset Table (GOT). The mechanism for this optimization is complex. If you suspect that this optimization is causing your program to crash, this optimization can be disabled. If your program is not crashing, don't even consider adjusting this!
\parg
Flag to disable optimization of PLT calls when using \verb|LD_AUDIT| to monitor shared library operations with \hpcrun: \verb|--disable-auditor-got-rewriting|.
\section{Environment Variables for Developers}
\label{system-env}
\paragraph{HPCRUN\_WAIT}
If this environment variable is set, HPCToolkit's measurement subsystem
will spin wait for a user to attach a debugger. After attaching a
debugger, a user can set breakpoints or watchpoints in the user
program or HPCToolkit's measurement subsystem before continuing
execution. To continue after attaching a debugger, use the debugger
to set the program variable DEBUGGER\_WAIT=0 and then continue.
Note: Setting HPCRUN\_WAIT can only be cleared by a debugger
if \HPCToolkit{} has been built with debugging symbols.
Building \HPCToolkit{} with debugging symbols requires
configuring \HPCToolkit{} with --enable-develop.
\paragraph{HPCRUN\_DEBUG\_FLAGS}
\HPCToolkit{} supports a multitude of debugging flags that enable a
developer to log information about HPCToolkit's measurement subsystem
as it records sample events. If HPCRUN\_DEBUG\_FLAGS is set, this
environment variable is expected to contain a list of tokens separated
by a space, comma, or semicolon. If a token is the name of a debugging
flag, the flag will be enabled, it will cause HPCToolkit's measurement
subsystem to log messages guarded with that flag as an application
executes. The complete list of dynamic debugging flags can be found
in HPCToolkit's source code in the file
src/tool/hpcrun/messages/messages.flag-defns. A special flag value ``ALL'' enables all flags.
\parg
Note: not all debugging flags are
meaningful on all architectures.
\parg
Caution: turning on debugging flags
will typically result in voluminous log messages, which will typically will
dramatically slow measurement of the execution under study.
\parg
Flags to set debug flags with \hpcrun: \verb|-dd/--dynamic-debug| $flag$
\paragraph{HPCRUN\_ABORT\_TIMEOUT}
If an execution hangs when profiled with HPCToolkit's measurement
subsystem, the environment variable HPCRUN\_ABORT\_TIMEOUT can be
used to specify the number of seconds that an application should
be allowed to execute. After executing for the number of seconds
specified in HPCRUN\_ABORT\_TIMEOUT, HPCToolkit's measurement
subsystem will forcibly terminate the execution and record a core
dump (assuming that core dumps are enabled) to aid in debugging.
\parg
Caution: for a large-scale parallel execution, this might cause a
core dump for each process, depending upon the settings for your
system. Be careful!
\paragraph{HPCRUN\_FNBOUNDS\_CMD}
For dynamically-linked executables, this environment variable must
be set to the full path of a copy of HPCToolkit's hpcfnbounds
utility. There are presently two versions of this utility. One, known as \verb|hpcfnbounds|, analyzes program load modules (the executable and shared libraries) using Dyninst to recover a table of addresses that represent the beginning of each function. A second version of the tool, known as \verb|hpcfnbounds2|, was designed to compute the same set of addresses for a load module using only a lightweight inspection of the load module's symbol table and DWARF information. \verb|hpcfnbounds2| is over a factor of ten faster and uses over a factor of 10 less memory than the original. \verb|hpcfnbounds2| is the default. If \verb|hpcfnbounds2| delivers an unsatisfactory result, a user can employ \verb|hpcfnbounds| instead by setting this environment variable using the \verb|--fnbounds| command line argument to \hpcrun{}.