llvm-project.git/lldb/source/Target/Thread.cpp, branch users/fmayer/spr/dfsan-compiler-rt-leave-bufferedstacktrace-uninit

New ThreadPlanSingleThreadTimeout to resolve potential deadlock in single thread stepping (#90930)

2024-08-06T00:26:39+00:00

This PR introduces a new `ThreadPlanSingleThreadTimeout` that will be
used to address potential deadlock during single-thread stepping.

While debugging a target with a non-trivial number of threads (around
5000 threads in one example target), we noticed that a simple step over
can take as long as 10 seconds. Enabling single-thread stepping mode
significantly reduces the stepping time to around 3 seconds. However,
this can introduce deadlock if we try to step over a method that depends
on other threads to release a lock.

To address this issue, we introduce a new
`ThreadPlanSingleThreadTimeout` that can be controlled by the
`target.process.thread.single-thread-plan-timeout` setting during
single-thread stepping mode. The concept involves counting the elapsed
time since the last internal stop to detect overall stepping progress.
Once a timeout occurs, we assume the target is not making progress due
to a potential deadlock, as mentioned above. We then send a new async
interrupt, resume all threads, and `ThreadPlanSingleThreadTimeout`
completes its task.

To support this design, the major changes made in this PR are:
1. `ThreadPlanSingleThreadTimeout` is popped during every internal stop
and reset (re-pushed) to the top of the stack (as a leaf node) during
resume. This is achieved by always returning `true` from
`ThreadPlanSingleThreadTimeout::DoPlanExplainsStop()` and
`ThreadPlanSingleThreadTimeout::MischiefManaged()`.
2. A new thread-specific async interrupt stop is introduced, which can
be detected/consumed by `ThreadPlanSingleThreadTimeout`.
3. The clearing of branch breakpoints in the range thread plan has been
moved from `DoPlanExplainsStop()` to `ShouldStop()`, because
`DoPlanExplainsStop()` is not guaranteed to be called.
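
The timeout idea described above (measure the time since the last internal
stop, and treat a long silence as a possible deadlock) can be sketched as
follows. This is a minimal illustration, not the code from the PR:
`StepProgressTimer`, `NoteInternalStop` and `Expired` are hypothetical names,
and the time points are passed in explicitly so the logic can be exercised
without real waiting.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not an LLDB class: tracks the time since the last
// internal stop and reports when the configured timeout has elapsed.
class StepProgressTimer {
public:
  using Clock = std::chrono::steady_clock;

  StepProgressTimer(std::chrono::milliseconds timeout, Clock::time_point now)
      : m_timeout(timeout), m_last_stop(now) {
    assert(timeout.count() > 0 && "timeout must be positive");
  }

  // Called on every internal stop: the step made progress, so the
  // countdown starts over.
  void NoteInternalStop(Clock::time_point now) { m_last_stop = now; }

  // True once no internal stop has been seen for at least the timeout.
  // A caller would then interrupt the target and resume all threads.
  bool Expired(Clock::time_point now) const {
    return now - m_last_stop >= m_timeout;
  }

private:
  std::chrono::milliseconds m_timeout;
  Clock::time_point m_last_stop;
};
```

Resetting on each internal stop is what makes this a progress check rather
than a cap on the total duration of the step.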

The detailed design is discussed in the RFC below:

[https://discourse.llvm.org/t/improve-single-thread-stepping/74599](https://discourse.llvm.org/t/improve-single-thread-stepping/74599)

---------

Co-authored-by: jeffreytan81

[lldb/Target] Rename ThreadPlanPython into ScriptedThreadPlan (#101931)

2024-08-05T17:43:42+00:00

Following 9a9ec228cdcf, since the ThreadPlanPython class started making
use of the Scripted Interface instead of calling directly into the
python methods, this class can work with other scripting languages (as
long as someone adds the interface for that language ;p).

So it doesn't make sense anymore for it to keep this name and also we
should avoid having language specific related classes outside the plugin
directory.

This patch renames the internal class from `ThreadPlanPython` to
`ScriptedThreadPlan` as it's advertised externally, and also updates the
various log messages.

This should hopefully make the codebase more coherent.

Signed-off-by: Med Ismail Bennani

Revert "[lldb] Change lldb's breakpoint handling behavior (#96260)"

2024-07-20T01:43:53+00:00

This reverts commit 05f0e86cc895181b3d2210458c78938f83353002.

The debuginfo dexter tests are failing, probably because the way
stepping over breakpoints works has changed with my patches. There
are also two API test failures on the ubuntu-arm (32-bit) bot. I'll need
to investigate both of these; neither has an obvious failure reason.

[lldb] Change lldb's breakpoint handling behavior (#96260)

2024-07-20T00:26:13+00:00

lldb today has two rules: When a thread stops at a BreakpointSite, we
set the thread's StopReason to be "breakpoint hit" (regardless of whether we've
actually hit the breakpoint, or if we've merely stopped *at* the
breakpoint instruction/point and haven't tripped it yet). And second,
when resuming a process, any thread sitting at a BreakpointSite is
silently stepped over the BreakpointSite -- because we've already
flagged the breakpoint hit when we stopped there originally.

In this patch, I change lldb to only set a thread's stop reason to
breakpoint-hit when we've actually executed the instruction/triggered
the breakpoint. When we resume, we only silently step past a
BreakpointSite that we've registered as hit. We preserve this state
across inferior function calls that the user may do while stopped, etc.

Also, when a user adds a new breakpoint at $pc while stopped, or changes
$pc to be the address of a BreakpointSite, we will silently step past
that breakpoint when the process resumes. This is purely a UX call, I
don't think there's any person who wants to set a breakpoint at $pc and
then hit it immediately on resuming.

One non-intuitive UX from this change, but I'm convinced it is
necessary: If you're stopped at a BreakpointSite that has not yet
executed and you `stepi`, you will hit the breakpoint and the pc will not
yet advance. This thread has not completed its stepi, and the thread
plan is still on the stack. If you then `continue` the thread, lldb will
now stop and say, "instruction step completed", one instruction past the
BreakpointSite. You can continue a second time to resume execution. I
discussed this with Jim, and trying to paper over this behavior will
lead to more complicated scenarios behaving non-intuitively. And mostly
it's the testsuite that was trying to instruction step past a breakpoint
and getting thrown off -- and I changed those tests to expect the new
behavior.

The bugs driving this change are all from lldb dropping the real stop
reason for a thread and setting it to breakpoint-hit when that was not
the case. Jim hit one where we have an aarch64 watchpoint that triggers
one instruction before a BreakpointSite. On this arch we are notified of
the watchpoint hit after the instruction has been unrolled -- we disable
the watchpoint, instruction step, re-enable the watchpoint and collect
the new value. But now we're on a BreakpointSite so the watchpoint-hit
stop reason is lost.

Another was reported by ZequanWu in
https://discourse.llvm.org/t/lldb-unable-to-break-at-start/78282: we
attach to/launch a process with the pc at a BreakpointSite and
misbehave. Caroline Tice mentioned it is also a problem they've had with
putting a breakpoint on _dl_debug_state.

The change to each Process plugin that does execution control is as follows:

1. If we've stopped at a BreakpointSite that has not been executed yet,
we will call Thread::SetThreadStoppedAtUnexecutedBP(pc) to record
that.  When the thread resumes, if the pc is still at the same site, we
will continue, hit the breakpoint, and stop again.

2. When we've actually hit a breakpoint (enabled for this thread or not),
the Process plugin should call Thread::SetThreadHitBreakpointSite().
When we go to resume the thread, we will push a step-over-breakpoint
ThreadPlan before resuming.
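
The resume-time rules described in this commit message can be summarized with
a small toy model. This is a sketch under stated assumptions, not the patch
itself: `ToyThreadState`, `ResumeAction` and `DecideResume` are hypothetical
names, and the two optional fields only stand in for the state recorded by
Thread::SetThreadStoppedAtUnexecutedBP(pc) and
Thread::SetThreadHitBreakpointSite().

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy model only; all names here are hypothetical.
struct ToyThreadState {
  uint64_t pc = 0;
  // Set when the thread stopped at a site whose breakpoint has not
  // executed yet.
  std::optional<uint64_t> stopped_at_unexecuted_bp;
  // Set when the breakpoint at this address was actually hit.
  std::optional<uint64_t> hit_breakpoint_site;
};

enum class ResumeAction {
  RunNormally,         // no BreakpointSite at the pc
  RunAndHitBreakpoint, // let the pending breakpoint trigger
  StepOverBreakpoint   // silently step past the site first
};

ResumeAction DecideResume(const ToyThreadState &t, bool site_at_pc) {
  if (!site_at_pc)
    return ResumeAction::RunNormally;
  // The site was registered as hit: step past it silently.
  if (t.hit_breakpoint_site == t.pc)
    return ResumeAction::StepOverBreakpoint;
  // Still parked at the same unexecuted site: continue, hit it, stop again.
  if (t.stopped_at_unexecuted_bp == t.pc)
    return ResumeAction::RunAndHitBreakpoint;
  // A site that appeared at the pc while stopped (new breakpoint at $pc,
  // or $pc moved onto a site) is stepped past as a UX choice.
  return ResumeAction::StepOverBreakpoint;
}
```

The order of the checks matters in this model: a recorded hit takes priority
over an earlier "unexecuted" note for the same address.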

The biggest set of changes is to StopInfoMachException where we
translate a Mach Exception into a stop reason. The Mach exception codes
differ in a few places depending on the target (unambiguously), and I
didn't want to duplicate the new code for each target so I've tested
what mach exceptions we get for each action on each target, and
reorganized StopInfoMachException::CreateStopReasonWithMachException to
document these possible values, and handle them without specializing
based on the target arch.

rdar://123942164

[lldb][nfc] Move broadcaster class strings away from ConstString (#89690)

2024-04-24T19:13:18+00:00

These are hardcoded strings that are already present in the data section
of the binary, no need to immediately place them in the ConstString
StringPools. Lots of code still calls `GetBroadcasterClass` and places
the return value into a ConstString. Changing that would be a good
follow-up.

Additionally, calls to these functions are still wrapped in ConstStrings
at the SBAPI layer. This is because we must guarantee the lifetime of
all strings handed out publicly.
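
To illustrate the difference, here is a rough, self-contained sketch using
standard-library types in place of LLDB's. `ToyStringPool` stands in for an
interning pool such as the ConstString StringPools, and `ToyThread` with its
"toy.thread" literal is purely hypothetical; the point is only that returning
a view of a literal costs nothing until a caller chooses to intern it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>

// Toy interning pool: each distinct string is stored once and every
// caller gets a view of that single stored copy.
class ToyStringPool {
public:
  std::string_view Intern(std::string_view s) {
    // Elements of an unordered_set keep their address across rehashes,
    // so the returned view stays valid for the lifetime of the pool.
    return *m_pool.emplace(s).first;
  }
  std::size_t Size() const { return m_pool.size(); }

private:
  std::unordered_set<std::string> m_pool;
};

// Hypothetical broadcaster: the class name is a literal that already
// lives in the binary's data section, so nothing is interned here.
struct ToyThread {
  static std::string_view GetStaticBroadcasterClass() {
    static constexpr std::string_view kClassName = "toy.thread";
    return kClassName;
  }
};
```

In this model, a public API layer that must guarantee string lifetimes can
still call `Intern` on the returned view, which mirrors the wrapping that the
message says remains at the SBAPI layer.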

[lldb] Reland: Store SupportFile in FileEntry (NFC) (#85892)

2024-03-21T15:40:08+00:00

This is another step towards supporting DWARF5 checksums and inline
source code in LLDB. This is a reland of #85468 but without the
functional change of storing the support file from the line table (yet).

Revert "[lldb] Store SupportFile in FileEntry (NFC)" (#85885)

2024-03-20T00:48:46+00:00

Reverts llvm/llvm-project#85468 because @slackito reports this broke
stepping in one of their tests [1] and this patch was meant to be NFC.

[1]
https://github.com/llvm/llvm-project/commit/d5a277d309e92b1d3e493da6036cffdf815105b1#commitcomment-139991120

[lldb] Store SupportFile in FileEntry (NFC) (#85468)

2024-03-15T22:03:54+00:00

This is another step towards supporting DWARF5 checksums and inline
source code in LLDB.

[lldb] Detect a Darwin kernel issue and work around it (#81573)

2024-02-14T21:06:20+00:00

On arm64 machines, when there is a hardware breakpoint or watchpoint
set, and lldb has instruction-stepped a thread, and then done a
Process::Resume, we will sometimes receive an extra "instruction step
completed" mach exception and the pc has not advanced. From a user's
perspective, they hit Continue and lldb stops again at the same spot.
From the testsuite's perspective, this has been a constant source of
testsuite failures for any test using hardware watchpoints and
breakpoints; the arm64 CI bots seem especially good at hitting this
issue.

Jim and I have been slowly looking at this for a few months now, and
finally I decided to try to detect this situation in lldb and silently
resume the process again when it happens.

We were already detecting this "got an insn-step finished mach exception
but this thread was not instruction stepping" combination in
StopInfoMachException where we take the mach exception and create a
StopInfo object for it. We had a lot of logging we used to understand
the failure as it was hit on the bots in assert builds.

This patch adds a new case to `Thread::GetPrivateStopInfo()` to call the
StopInfo's (new) `IsContinueInterrupted()` method. In
StopInfoMachException, where we previously had logging for assert
builds, I now note it in an ivar, and when
`Thread::GetPrivateStopInfo()` asks if this has happened, we check the
full combination of events under which this comes up: we have a hardware
breakpoint or watchpoint, we were not instruction stepping this thread
but got an insn-step mach exception, the pc is the same as the previous
stop's pc. And in that case, `Thread::GetPrivateStopInfo()` returns no
StopInfo -- indicating that this thread would like to resume execution.

The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and
`m_prev_frames_sp`. When a thread resumes execution, we move
`m_curr_frames_sp` into `m_prev_frames_sp` and when it stops executing,
we use `m_prev_frames_sp` to seed the new `m_curr_frames_sp` if most of
the stack is the same as before.

In this same location, I now save the Thread's RegisterContext::GetPC
into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs this
information to check all of the conditions I outlined above for
`IsContinueInterrupted`.
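
The combination of conditions listed above can be written as a single
predicate. The sketch below is an illustration, not the code in
StopInfoMachException: the struct and function names are hypothetical, and the
fields simply name the facts the message says are checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the facts described in the commit message.
struct SpuriousStepInputs {
  bool has_hardware_break_or_watchpoint; // any hw breakpoint/watchpoint set
  bool thread_was_instruction_stepping;  // did we ask this thread to step?
  bool got_insn_step_exception;          // "instruction step completed"
  uint64_t pc;                           // pc at this stop
  uint64_t prev_framezero_pc;            // pc saved at the previous stop
};

// True when the stop looks like the spurious "step completed" report:
// the thread was not stepping, yet a step exception arrived and the pc
// did not move. In that case the thread would like to resume.
bool LooksLikeInterruptedContinue(const SpuriousStepInputs &in) {
  return in.has_hardware_break_or_watchpoint &&
         !in.thread_was_instruction_stepping &&
         in.got_insn_step_exception &&
         in.pc == in.prev_framezero_pc;
}
```

Requiring every condition at once keeps a genuine instruction step, or a stop
at a new pc, from being silently resumed.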

This has passed exhaustive testing and we do not have any testsuite
failures for hardware watchpoints and breakpoints due to this kernel bug
with the patch in place. In focusing on these tests for thousands of
runs, I have found two other uncommon race conditions for the
TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py (which
uses no hardware watchpoint/breakpoints) will sometimes only have 99
breakpoints when it expects 100, and any of the concurrent tests using
the shared harness (I've seen it in
TestConcurrentWatchBreakDelay.py,
TestConcurrentTwoBreakpointsOneSignal.py,
TestConcurrentSignalDelayWatch.py) can fail when the test harness checks
that there is only one thread still running at the end, and it finds two
-- one of them under pthread_exit / pthread_terminate. Both of these
failures happen on github main without my changes, and with my changes -
they are unrelated race conditions in these tests, and I'm sure I'll be
looking into them at some point if they hit the CI bots with frequency.
On my computer, these are in the 0.3-0.5% of the time class. But the CI
bots do have different timing.

[lldb][NFCI] Remove EventData* param from BroadcastEvent (#78773)

2024-01-22T18:46:20+00:00

BroadcastEvent currently takes its EventData* param and shoves it into
an Event object, which takes ownership of the pointer and places it into
a shared_ptr to manage the lifetime.

Instead of relying on `new` and passing raw pointers around, I think it
would make more sense to create the shared_ptr up front.
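
A minimal sketch of that ownership pattern, using toy stand-ins rather than
LLDB's real classes (`ToyEventData`, `ToyEvent` and `ToyBroadcaster` are
hypothetical): the caller creates the `shared_ptr` up front and the
broadcaster only takes shared ownership of it.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Toy stand-ins; these are not LLDB classes.
struct ToyEventData {
  std::string payload;
};
using ToyEventDataSP = std::shared_ptr<ToyEventData>;

struct ToyEvent {
  uint32_t type;
  ToyEventDataSP data_sp;
};

class ToyBroadcaster {
public:
  // Ownership is explicit in the signature: the event shares the data
  // with the caller instead of adopting a raw pointer created by `new`.
  void BroadcastEvent(uint32_t type, ToyEventDataSP data_sp) {
    m_last = ToyEvent{type, std::move(data_sp)};
  }
  const ToyEvent &LastEvent() const { return m_last; }

private:
  ToyEvent m_last{0, nullptr};
};
```

A caller would write something like
`broadcaster.BroadcastEvent(1, std::make_shared<ToyEventData>(ToyEventData{"stopped"}))`,
so no raw owning pointer ever crosses the interface.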