Keiko Nakata - Valgrind

Citations

Valgrind runs the application under test (AUT) in a virtual machine, VEX. VEX translates the AUT's machine code on the fly into an Intermediate Representation (IR) and intercepts the system calls and memory accesses required for the analysis.

Cachegrind simulates the CPU caches and counts cache hits and misses. Callgrind counts the CPU instructions executed and records the call graph between functions.

Run callgrind

valgrind --tool=callgrind --collect-atstart=no program-to-run program-arguments

Run the annotator

callgrind_annotate --show=DLmr --sort=DLmr --auto=yes callgrind.out.pid

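--auto=yes breaks the results down per source line (instead of per function). --show=DLmr shows only the DLmr figures (last-level cache data read misses), and --sort=DLmr sorts the output by that event. DLmr is only collected when cache simulation is enabled.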

Enable cache simulation

valgrind --tool=callgrind --cache-sim=yes program-to-run program-arguments

Older Valgrind releases spell this option --simulate-cache=yes.

How to limit the range of collected events

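The collection state at program start can be switched off with --collect-atstart=no. During execution, it can be toggled programmatically with the macro CALLGRIND_TOGGLE_COLLECT; (defined in valgrind/callgrind.h). You can also limit event collection to a specific function with --toggle-collect=function. Separately, --instr-atstart=no starts the program with instrumentation itself switched off, which is faster but means no events can be collected until instrumentation is turned on.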
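As a rough illustration of the programmatic route, the sketch below wraps one region of interest in a pair of CALLGRIND_TOGGLE_COLLECT requests. The functions sum_array and profiled_sum are hypothetical names made up for this example, and the no-op fallback is an assumption added only so the file still compiles where the Valgrind development headers are not installed.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real client-request header when it is available; otherwise fall
 * back to a no-op (an assumption of this sketch, so the file still builds
 * on machines without the Valgrind development headers). */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_TOGGLE_COLLECT
#  define CALLGRIND_TOGGLE_COLLECT ((void)0)
#endif

/* Hypothetical region of interest: sum the elements of an array. */
long sum_array(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Hypothetical wrapper: collect events only while sum_array runs.
 * Intended for a run started with --collect-atstart=no, so the first
 * toggle switches collection on and the second switches it off again. */
long profiled_sum(const int *data, size_t n)
{
    long result;
    CALLGRIND_TOGGLE_COLLECT;   /* collection: off -> on */
    result = sum_array(data, n);
    CALLGRIND_TOGGLE_COLLECT;   /* collection: on -> off */
    return result;
}
```

Called from a main of your own, compiled with -g, and run as valgrind --tool=callgrind --collect-atstart=no ./a.out, only the events between the two toggles should show up in the profile. Outside Valgrind the client requests do nothing, so the program behaves normally.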

I – instruction
Ir – instruction reads (number of instructions × execution frequency)
I1 – L1 instruction cache
LLi – last-level instruction cache

D – data
D1 – L1 data cache
LLd – last-level data cache

Bc – conditional branches executed
Bi – indirect branches executed

By default, the counts are exclusive: the counts for a function include only the cost incurred in that function itself, not in the functions that it calls.

Passing --inclusive=yes to callgrind_annotate makes the counts inclusive.
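To make the distinction concrete, here is a minimal conceptual sketch, not how callgrind_annotate is actually implemented. It models a call tree in which every node carries its exclusive (self) cost, and computes the inclusive cost as self cost plus the inclusive cost of all callees. The struct call_node type and the function inclusive_cost are hypothetical names for this illustration, and the model assumes a plain tree with no recursion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical call-tree node for illustration only. */
struct call_node {
    const char *name;
    unsigned long self_cost;                 /* exclusive cost, e.g. Ir */
    const struct call_node *const *callees;  /* functions called from here */
    size_t n_callees;
};

/* Inclusive cost = the node's own cost plus the inclusive cost of every
 * callee. Assumes an acyclic tree (no recursion in the profiled program). */
unsigned long inclusive_cost(const struct call_node *node)
{
    assert(node != NULL);
    unsigned long total = node->self_cost;
    for (size_t i = 0; i < node->n_callees; i++)
        total += inclusive_cost(node->callees[i]);
    return total;
}
```

For example, if a function with a self cost of 50 calls two leaf functions costing 300 and 200, its exclusive count is 50 and its inclusive count is 550.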

An L1 miss will typically cost around 5-10 cycles, while an L2 miss can cost as much as 100-200 cycles.
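These penalties can be turned into a back-of-the-envelope cycle estimate. The sketch below assumes a deliberately crude model that is not taken from Valgrind itself: one cycle per executed instruction, plus a fixed penalty per miss, all added linearly. The struct event_counts type and the function estimate_cycles are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for counts read off a profile. */
struct event_counts {
    unsigned long long ir;         /* instructions executed */
    unsigned long long l1_misses;  /* L1 misses (instruction + data) */
    unsigned long long l2_misses;  /* L2 misses */
};

/* Crude estimate (an assumption of this sketch): one cycle per instruction
 * plus a fixed per-miss penalty for each cache level. */
unsigned long long estimate_cycles(const struct event_counts *c,
                                   unsigned l1_penalty, unsigned l2_penalty)
{
    assert(c != NULL);
    return c->ir
         + c->l1_misses * l1_penalty
         + c->l2_misses * l2_penalty;
}
```

With 1,000,000 instructions, 20,000 L1 misses and 1,000 L2 misses, the low end of the ranges above (5 and 100 cycles) gives 1,200,000 cycles and the high end (10 and 200 cycles) gives 1,400,000, so the misses add roughly 20-40% on top of the raw instruction count in this made-up case.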