Lightweight performance and debugging tools
This site contains supporting experimental data for WitchTools. witch_data.xlsx is an Excel spreadsheet that contains the following sheets:
-
Accuracy: This sheet quantifies the overall redundancies in the SPEC CPU2006 benchmarks and compares the redundancy metric reported by each exhaustive tool with that of its sampling counterpart. The sheet contains benchmark names, inputs, and the redundancies measured by both the exhaustive tools (DeadSpy, RedSpy, LoadSpy) and their sampling counterparts (DeadCraft, SilentCraft, and LoadCraft). The redundancy metrics of the sampling tools are collected at five different sampling rates: 500K, 1M, 5M, 10M, and 100M.
-
Slowdown: This sheet quantifies the runtime slowdown caused by different redundancy detection tools. Slowdown is measured as the time taken by the monitored execution divided by the time taken by the original execution. The data contains both exhaustive tools (DeadSpy, RedSpy, LoadSpy) and their sampling counterparts (DeadCraft, SilentCraft, and LoadCraft). The sampling tools have the slowdown values at five different sampling rates: 500K, 1M, 5M, 10M, and 100M.
-
Memory bloat: This sheet quantifies the memory bloat caused by different redundancy detection tools. Memory bloat is measured as the peak resident set size of the monitored execution divided by the peak resident set size of the original execution. The data contains both exhaustive tools (DeadSpy, RedSpy, LoadSpy) and their sampling counterparts (DeadCraft, SilentCraft, and LoadCraft). The sampling tools have their data values at five different sampling rates: 500K, 1M, 5M, 10M, and 100M.
-
TopNComparison: This sheet compares the top N calling context pairs found by the sampling tool (DeadCraft) at the 5M sampling rate against the ground truth, the exhaustive instrumentation scheme (DeadSpy), on the SPEC CPU2006 benchmarks. Since the comparison is non-trivial, we produce multiple metrics. For each benchmark, we first sort all calling contexts by their contribution and pick the first few whose cumulative contribution adds up to more than 90% of the total observed dead writes. To avoid extremely small contributors, we do not count contexts whose individual contribution is below 10% of the overall contribution. Thus the value of N varies from benchmark to benchmark and between the sampling and exhaustive tools. Once we have the top N contexts, we represent each context with a single ASCII letter and all top N contexts in an execution as an ASCII string (e.g., abcdefgh). Col #F and Col #G in the sheet contain the canonical strings for the ground truth and the sampling scheme, respectively. Col #D and Col #E capture the lengths of these strings. We compare the canonical string from the exhaustive tool against the one from the sampling tool. Col #H is the edit distance between the strings in Col #F and Col #G. Col #I quantifies how many contexts are present in the ground truth but missing in the sampled scheme. Col #J quantifies how many contexts are present in the sampled scheme but not in the ground truth. Col #K and Col #L are the fractional contributions (0.0-1.0) of each of the top N contexts for the ground truth and sampling, respectively. Each comma-separated entry maps one-to-one to a letter in the corresponding canonical string. Col #M compares the weights in Col #K and Col #L. Over the comma-separated entries i in Col #K and Col #L, we compute the weight difference in Col #M as \sum_i abs(ColK[i] - ColL[i]). Col #M does not capture the tail entries in Col #L if Col #L has more entries than Col #K. Col #N captures the sum of the weights of these tail entries.
-
Num DBG Registers: This sheet is similar to the "Accuracy" sheet, but instead of always using all four debug registers, we vary the maximum number of debug registers (1, 2, 3, and 4) and compare the accuracy of each configuration against the exhaustive tools. The sheet contains benchmark names, inputs, and the redundancies measured by both the exhaustive tools (DeadSpy, RedSpy, LoadSpy) and their sampling counterparts (DeadCraft, SilentCraft, and LoadCraft). The redundancy metrics of the sampling tools are collected at five different sampling rates: 500K, 1M, 5M, 10M, and 100M. The key takeaway is that our sampling scheme with proportional attribution remains stable and accurate, and the number of debug registers has little effect on accuracy.
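
The TopNComparison metrics described above can be illustrated with a small Python sketch. This is not the script used to produce the sheet; it is a minimal reconstruction under stated assumptions. The function names (top_contexts, edit_distance, compare), the context labels, and the example shares are all hypothetical. We assume the edit distance is the Levenshtein distance, that a context gets the same letter in both canonical strings, that selection stops at the first context below the 10% cutoff, and that the weight difference pairs entries by position up to the shorter list.

```python
from typing import Dict, List, Tuple


def top_contexts(shares: Dict[str, float], coverage: float = 0.90,
                 min_share: float = 0.10) -> List[Tuple[str, float]]:
    """Walk contexts in descending share; stop once the cumulative share
    exceeds `coverage` or the next context falls below `min_share`.
    (The exact interplay of the two cutoffs is an assumption.)"""
    picked, total = [], 0.0
    for ctx, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        if total > coverage or share < min_share:
            break
        picked.append((ctx, share))
        total += share
    return picked


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def compare(truth: List[Tuple[str, float]],
            sampled: List[Tuple[str, float]]) -> Dict[str, object]:
    """Compute analogues of Col #D through Col #N for one benchmark."""
    # One letter per distinct context, shared by both strings (assumption).
    # This sketch only produces lowercase letters for up to 26 contexts.
    letters: Dict[str, str] = {}
    for ctx, _ in truth + sampled:
        letters.setdefault(ctx, chr(ord("a") + len(letters)))
    s_truth = "".join(letters[c] for c, _ in truth)
    s_sampled = "".join(letters[c] for c, _ in sampled)
    k = [w for _, w in truth]    # analogue of Col #K
    l = [w for _, w in sampled]  # analogue of Col #L
    return {
        "len_truth": len(s_truth),                                # Col #D
        "len_sampled": len(s_sampled),                            # Col #E
        "truth_string": s_truth,                                  # Col #F
        "sampled_string": s_sampled,                              # Col #G
        "edit_distance": edit_distance(s_truth, s_sampled),       # Col #H
        "missing_in_sampled": len(set(s_truth) - set(s_sampled)),  # Col #I
        "extra_in_sampled": len(set(s_sampled) - set(s_truth)),    # Col #J
        "weight_diff": sum(abs(x - y) for x, y in zip(k, l)),     # Col #M
        "tail_weight": sum(l[len(k):]),                           # Col #N
    }


# Made-up shares of dead writes per calling context, for illustration only.
truth = top_contexts({"main>f>g": 0.50, "main>h": 0.30,
                      "main>k": 0.15, "main>z": 0.05})
sampled = top_contexts({"main>f>g": 0.45, "main>k": 0.30, "main>h": 0.12,
                        "main>q": 0.10, "main>z": 0.03})
result = compare(truth, sampled)
print(result)
```

In this made-up example the sampled run ranks two contexts in a different order and reports one context that the ground truth does not. The edit distance therefore reflects both reordering and extra entries, while the missing and extra counts ignore ordering.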