eu-stacktrace: roadmap and discussion thread

Mon May 8 12:33:57 GMT 2023

Hello all,

I wanted to open up public discussion on a project I'm looking to
develop in elfutils, tentatively named eu-stacktrace. I've started to
write code on branch users/serhei/eu-stacktrace.

eu-stacktrace will be a utility to process a stream of raw stack
samples (such as those obtained from the Linux kernel's
PERF_SAMPLE_STACK_USER facility, abbreviated PERF_SAMPLE_STACK below)
into a stream of stack traces (such as the ones obtained from
PERF_SAMPLE_CALLCHAIN), freeing various profiling utilities from
having to implement their own backtracing logic.

My initial goal is to make the tool work with a (slightly modified)
version of the sysprof profiler. If all goes well, I hope to produce a
demonstration of sysprof using elfutils eu-stacktrace and eh_frame
data to produce useful profiles on code compiled with
-fomit-frame-pointer. (I'm aware of the problem of profiling
-fomit-frame-pointer programs being a topic of some fairly contentious
recent discussion, which I'm not looking to rehash; I'm just
interested to see if I can add a viable technical solution to the
mix.) I'm cc:ing chergert and posting a link to this thread on
GNOME Discourse so that sysprof developers can keep track of the
discussion.

For the time being, eu-stacktrace is meant to be fed data from a
profiling tool via a pipe or fifo. We will see how well this idea
works as implementation proceeds.

The eventual goal is to work with various profiler data formats. After
sysprof, supporting perf's native data format is an obvious
prerequisite for merging the users/serhei/eu-stacktrace branch into
elfutils. Ideally, I would like for eu-stacktrace to also convert
between different profile data formats (e.g. taking sysprof data as
input and emitting perf data, and vice-versa), but this may be
out-of-scope given the amount of code that would need to be written to
handle profile data other than stack traces.

Usage instructions will be kept up-to-date in README.eu-stacktrace on
the topic branch:

- https://sourceware.org/cgit/elfutils/tree/README.eu-stacktrace?h=users/serhei/eu-stacktrace

All the best,
  Serhei Makarov

PS. More information follows.

* * *

My current roadmap for the prototype with sysprof is as follows:

# 1. Get build-ids of all executables as sysprof encounters them.

Build-id data can be obtained by coding sysprof to support
PERF_RECORD_MMAP2 rather than PERF_RECORD_MMAP. As far as I
understand, there are indications this would be a welcome patch for
the sysprof project.
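
To make the build-id step concrete, here is a minimal sketch of pulling
a build-id out of a raw PERF_RECORD_MMAP2 record. It is not sysprof or
eu-stacktrace code. The field offsets follow my reading of the record
layout in perf_event_open(2) and should be checked against
linux/perf_event.h; as I understand it, the kernel only fills in the
build-id when the build_id bit is set in perf_event_attr. The function
name parse_mmap2 is hypothetical, and a little-endian host is assumed.

```python
import struct

PERF_RECORD_MMAP2 = 10
# header.misc bit meaning "the inode union holds a build-id instead"
PERF_RECORD_MISC_MMAP_BUILD_ID = 1 << 14


def parse_mmap2(record: bytes):
    """Return (pid, addr, length, pgoff, build_id_hex_or_None, filename)."""
    # perf_event_header: u32 type, u16 misc, u16 size
    rtype, misc, size = struct.unpack_from("<IHH", record, 0)
    if rtype != PERF_RECORD_MMAP2:
        raise ValueError("not a PERF_RECORD_MMAP2 record")
    # u32 pid, u32 tid, u64 addr, u64 len, u64 pgoff
    pid, _tid, addr, length, pgoff = struct.unpack_from("<IIQQQ", record, 8)
    # 24-byte union: maj/min/ino/ino_generation, or size + padding + build-id
    union = record[40:64]
    build_id = None
    if misc & PERF_RECORD_MISC_MMAP_BUILD_ID:
        id_size = union[0]              # build_id_size, then 3 reserved bytes
        build_id = union[4:4 + id_size].hex()
    # prot and flags (two u32) sit at offset 64; the NUL-padded filename
    # starts at offset 72.
    filename = record[72:size].split(b"\0", 1)[0].decode()
    return pid, addr, length, pgoff, build_id, filename
```

The returned build-id string is the kind of key that could later be
used to look up unwind data for the mapped file.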

# 2. Get stack samples with PERF_SAMPLE_STACK; pipe to eu-stacktrace.

Within sysprof, add an option to switch the perf data source to use
PERF_SAMPLE_STACK rather than PERF_SAMPLE_CALLCHAIN. The capture
writer will write the data to a pipe to be processed by eu-stacktrace;
thus the stack samples never hit the disk.

Within eu-stacktrace, I'm implementing the code to accept data in
sysprof format, as defined in the public header (e.g. the
sysprof-devel package on Fedora provides
/usr/include/sysprof-4/sysprof-capture-types.h).

# 3. Implement eh_frame / DWARF-via-debuginfod data retrieval in eu-stacktrace.

I am hoping that eh_frame data will be sufficient, but elfutils
includes support for retrieving DWARF data via debuginfod as a
fallback.

There are a number of use cases relating to executables inside
containers that sysprof handles with clever logic. If I want to match
the profile coverage of plain sysprof with sysprof+eu-stacktrace, some
contemplation is required as to whether I need to duplicate that
logic, or to leverage sysprof's codebase directly from eu-stacktrace.

# 4. Implement and benchmark naive unwinding of all samples as they come in.

Within eu-stacktrace, once we have the stack samples and the .eh_frame
data accessible, use them to unwind each stack sample and output the
resulting compact stack traces as callchain frames in sysprof's
existing format. In the resulting pipeline, sysprof writes raw stack
samples to a pipe, and eu-stacktrace reads them, unwinds them, and
emits callchain frames.

Of course, it is possible that eu-stacktrace is so slow that an
unmanageable amount of data piles up in the pipe. This would be
guaranteed if we need to retrieve data from debuginfod.
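
For readers unfamiliar with what "unwinding a stack sample" involves,
the following toy model may help. It is a sketch under strong
assumptions, not how elfutils does it: each function gets a single
rule "CFA = sp + offset", the return address is assumed to sit at
CFA - 8 (as on x86-64), and memory is little-endian. Real .eh_frame
data has rules per instruction range and per register. UNWIND_TABLE,
find_rule and unwind are hypothetical names.

```python
import struct

# Hypothetical, heavily simplified unwind table: (start_pc, end_pc, cfa_offset)
UNWIND_TABLE = [
    (0x1000, 0x1100, 16),   # "leaf":   CFA = sp + 16
    (0x2000, 0x2100, 32),   # "middle": CFA = sp + 32
    (0x3000, 0x3100, 8),    # "main":   CFA = sp + 8
]


def find_rule(pc):
    """Return the CFA offset for the function containing pc, or None."""
    for start, end, cfa_offset in UNWIND_TABLE:
        if start <= pc < end:
            return cfa_offset
    return None


def unwind(pc, sp, stack_base, stack_bytes, max_frames=64):
    """Unwind from (pc, sp) using a copy of stack memory.

    stack_bytes holds the sampled memory starting at address stack_base.
    Returns the list of program counters, innermost frame first.
    """
    trace = [pc]
    while len(trace) < max_frames:
        cfa_offset = find_rule(pc)
        if cfa_offset is None:
            break                       # no unwind data for this location
        cfa = sp + cfa_offset
        ra_off = cfa - 8 - stack_base   # where the return address is stored
        if ra_off < 0 or ra_off + 8 > len(stack_bytes):
            break                       # outside the fixed-size sample
        (pc,) = struct.unpack_from("<Q", stack_bytes, ra_off)
        sp = cfa                        # caller's sp is the callee's CFA
        trace.append(pc)
    return trace
```

Note how the loop stops as soon as the next return address falls
outside the captured bytes; this is the truncation that step 6 below
is concerned with.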

# 5. If needed, scope out / implement async preparation of unwinder data.

If eu-stacktrace cannot handle all of the stack samples in real time,
there is a scheme that will allow us to reach good-enough profile
coverage (e.g. 90%+ on a long-enough run) by caching data structures
pertaining to a repeatedly-encountered code location and using a
JIT-style 'priming' scheme.

The overall idea: the first time we encounter a code location, we
would drop the sample and initiate whatever preparation procedure
(setting up data structures or retrieving data via debuginfod) is
needed to unwind from that code location successfully.

After the preparation procedure completes, we will be able to unwind
future samples based at that code location.
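
A minimal sketch of this priming idea, assuming that a "code location"
can be reduced to some hashable key (a build-id, for instance; that
choice is my assumption, not something settled above). The class
PrimingCache and its methods are hypothetical names, and the
preparation work is modelled as a queue drained by explicit calls
rather than by a real background thread.

```python
from collections import deque


class PrimingCache:
    def __init__(self):
        self.ready = {}          # key -> prepared unwind data
        self.pending = deque()   # keys waiting for preparation
        self.queued = set()      # keys already in the pending queue
        self.unwound = 0
        self.dropped = 0

    def on_sample(self, key):
        """Return True if the sample can be unwound, False if it is dropped."""
        if key in self.ready:
            self.unwound += 1
            return True
        if key not in self.queued:
            # First encounter: schedule preparation, but do not wait for it.
            self.queued.add(key)
            self.pending.append(key)
        self.dropped += 1
        return False

    def prepare_one(self, load):
        """Run one queued preparation using load(key); False if none queued."""
        if not self.pending:
            return False
        key = self.pending.popleft()
        self.ready[key] = load(key)
        self.queued.discard(key)
        return True
```

With this shape, a slow load (such as a debuginfod fetch) costs only
the samples that arrive while it is in flight, not a stall of the
whole pipe.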

Within sysprof, we could add code to display a percentage indicator of
how many samples in the profile were successfully converted to stack
traces. This could be provided by having eu-stacktrace export the
number to a procfs-style file which sysprof can poll and incorporate
into its live statistic UI that already displays a running total of
the number of samples. As the eu-stacktrace cache is primed with data,
the success rate will rise -- in my simulated scenarios, it routinely
reached 90%+ -- and the sysprof user can keep an eye on the indicator
and stop profiling once the percentage has reached a satisfyingly high
value.
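
One way the counter export could look, as a sketch only: the writer
publishes a small text file by atomic rename, and the poller reads it
back. The file format (one "name value" pair per line) and the names
publish_stats and read_percent are hypothetical, not an agreed
interface between sysprof and eu-stacktrace.

```python
import os
import tempfile


def publish_stats(path, unwound, total):
    """Write the current counters to path without exposing partial writes."""
    pct = 100.0 * unwound / total if total else 0.0
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"unwound {unwound}\ntotal {total}\npercent {pct:.1f}\n")
    # Rename over the old file so a reader sees either the old or the
    # new contents, never a half-written file.
    os.replace(tmp, path)


def read_percent(path):
    """Poller side: return the published percentage, or None if absent."""
    with open(path) as f:
        for line in f:
            key, value = line.split()
            if key == "percent":
                return float(value)
    return None
```

Publishing the raw counts alongside the percentage would let the
poller compute its own rate over a recent window if that turns out to
be more useful than a running total.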

I am sure the details will be complex and interesting to work out, but
I also hope this is not actually needed outside of unusual cases.

# 6. Implement support for stitching stack traces to always reach the root.

For a top-down profile visualization, it's not crucial to accurately
unwind 100% of the samples, but it is important that the
accurately-unwound samples reach the root of the stack. However,
PERF_SAMPLE_STACK only provides a fixed-size sample of the stack,
which may not include the root. This can be worked around with
per-thread caching of the last-known state of the entire stack.
Frank Ch. Eigler and I brainstormed five or six possibilities
for how to maintain this cache.
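
To illustrate the general shape of the problem (and not any of the
specific possibilities mentioned above), here is a naive stitching
heuristic: keep the last complete trace per thread, and when a new
trace is truncated, look for its outermost frames inside the cached
trace and borrow the cached ancestors. Traces are root-first lists of
addresses. The names stitch, complete_trace and last_full are
hypothetical, and the heuristic can guess wrong when the outer stack
has changed since the cached trace was taken.

```python
last_full = {}   # thread id -> last known root-first trace


def stitch(cached, partial):
    """Prepend cached ancestors to a truncated trace, or return None."""
    if not partial:
        return None
    best_i, best_n = None, 0
    for i, addr in enumerate(cached):
        if addr != partial[0]:
            continue
        # Count how many frames agree starting from this position.
        n = 0
        while (i + n < len(cached) and n < len(partial)
               and cached[i + n] == partial[n]):
            n += 1
        if n > best_n:
            best_i, best_n = i, n
    if best_i is None:
        return None                     # nothing to anchor the trace to
    return cached[:best_i] + partial


def complete_trace(tid, trace, reached_root):
    """Return a root-reaching trace for thread tid, or None if unknown."""
    if reached_root:
        last_full[tid] = trace
        return trace
    stitched = stitch(last_full.get(tid, []), trace)
    if stitched is not None:
        last_full[tid] = stitched
    return stitched
```

A sturdier variant would presumably also compare stack pointer values
at the join point, so that a stale cache entry is rejected rather than
silently reused.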

* * *

Based on the above staging, the required changes to sysprof would be
reduced to the following four:

1. Collect build-id data via PERF_RECORD_MMAP2 rather than PERF_RECORD_MMAP
2. Collect stack samples via PERF_SAMPLE_STACK rather than PERF_SAMPLE_CALLCHAIN
3. Output the sample frames to a pipe connected to eu-stacktrace
4. (If needed,) poll a procfs-style file updated by eu-stacktrace to
   receive and display the percentage of successfully-unwound samples