llvm-project.git/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp, branch users/mingmingl-llvm/samplefdo-profile-format

[lldb][debugserver] Upstream to debugserver changes (#155733)

2025-08-28T17:06:46+00:00

Review of diffs from lldb's internal debugserver and llvm.org main found
two orphaned changes that should be upstreamed.

First is in MachTask::ExceptionThread where we want to confirm that a
mach exception messages is from the correct process before we process
it.

Second is that we want to run the arm64 register context through
thread_convert_thread_state() after thread_get_state, and before
thread_set_state, to re-sign fp/sp/lr/pc appropriately for ptrauth
(arm64e) processes.

[lldb][debugserver] Save and restore the SVE/SME register state (#134184)

2025-04-03T22:48:54+00:00

debugserver isn't saving and restoring the SVE/SME register state around
inferior function calls.

Making arbitrary function calls while in Streaming SVE mode is generally
a poor idea because a NEON instruction can be hit and crash the
expression execution, which is how I missed this, but they should be
handled correctly if the user knows it is safe to do.

Re-landing this change after fixing an incorrect behavior on systems
without SME support.

rdar://146886210

Revert "[lldb][debugserver] Save and restore the SVE/SME register state (#134184)"

2025-04-03T06:01:51+00:00

This reverts commit 4e40c7c4bd66d98f529a807dbf410dc46444f4ca.

arm64 CI is getting a failure in
lldb-api.tools/lldb-server.TestGdbRemoteRegisterState.py
with this commit, need to investigate and re-land.

[lldb][debugserver] Save and restore the SVE/SME register state (#134184)

2025-04-03T03:37:07+00:00

debugserver isn't saving and restoring the SVE/SME register state around
inferior function calls.

Making arbitrary function calls while in Streaming SVE mode is generally
a poor idea because a NEON instruction can be hit and crash the
expression execution, which is how I missed this, but they should be
handled correctly if the user knows it is safe to do.

rdar://146886210

[lldb][debugserver] Read/write SME registers on arm64 (#119171)

2024-12-19T17:57:27+00:00

**Note:** The register reading and writing depends on new register
flavor support in thread_get_state/thread_set_state in the kernel, which
will be first available in macOS 15.4.

The Apple M4 line of cores includes the Scalable Matrix Extension (SME)
feature. The M4s do not implement Scalable Vector Extension (SVE),
although the processor is in Streaming SVE Mode when the SME is being
used. The most obvious side effects of being in SSVE Mode are that (on
the M4 cores) NEON instructions cannot be used, and watchpoints may get
false positives, the address comparisons are done at a lowered
granularity.

When SSVE mode is enabled, the kernel will provide the Streaming Vector
Length register, which is a maximum of 64 bytes with the M4. Also
provided are SVCR (with bits indicating if SSVE mode and SME mode are
enabled), TPIDR2, SVL. Then the SVE registers Z0..31 (SVL bytes long),
P0..15 (SVL/8 bytes), the ZA matrix register (SVL*SVL bytes), and the M4
supports SME2, so the ZT0 register (64 bytes).

When SSVE/SME are disabled, none of these registers are provided by the
kernel - reads and writes of them will fail.

Unlike Linux, lldb cannot modify the SVL through a thread_set_state
call, or change the processor state's SSVE/SME status. There is also no
way for a process to request a lowered SVL size today, so the work that
David did to handle VL/SVL changing while stepping through a process is
not an issue on Darwin today. But debugserver should be providing
everything necessary so we can reuse all of David's work on resizing the
register contexts in lldb if it happens in the future. debugbserver
sends svl, svcr, and tpidr2 in the expedited registers when a thread
stops, if SSVE|SME mode are enabled (if the kernel allows it to read the
ARM_SME_STATE register set).

While the maximum SVL is 64 bytes on M4, the AArch64 maximum possible
SVL is 256; this would give us a 64k ZA register. If debugserver sized
all of its register contexts assuming the largest possible SVL, we could
easily use 2MB more memory for the register contexts of all threads in a
process -- and on iOS et al, processes must run within a small memory
allotment and this would push us over that.

Much of the work in debugserver was changing the arm64 register context
from being a static compile-time array of register sets, to being
initialized at runtime if debugserver is running on a machine with SME.
The ZA is only created to the machine's actual maximum SVL. The size of
the 32 SVE Z registers is less significant so I am statically allocating
those to the architecturally largest possible SVL value today.

Also, debugserver includes information about registers that share the
same part of the register file. e.g. S0 and D0 are the lower parts of
the NEON 128-bit V0 register. And when running on an SME machine, v0 is
the lower 128 bits of the SVE Z0 register. So the register maps used
when defining the VFP registers must differ depending on the
capabilities of the cpu at runtime.

I also changed register reading in debugserver, where formerly when
debugserver was asked to read a register, and the thread_get_state read
of that register failed, it would return all zero's. This is necessary
when constructing a `g` packet that gets all registers - because there
is no separation between register bytes, the offsets are fixed. But when
we are asking for a single register (e.g. Z0) when not in SSVE/SME mode,
this should return an error.

This does mean that when you're running on an SME capabable machine, but
not in SME mode, and do `register read -a`, lldb will report that 48 SVE
registers were unavailable and 5 SME registers were unavailable. But
that's only when `-a` is used.

The register reading and writing depends on new register flavor support
in thread_get_state/thread_set_state in the kernel, which is not yet in
a release. The test case I wrote is skipped on current OSes. I pilfered
the SME register setup from some of David's existing SME test files;
there were a few Linux specific details in those tests that they weren't
easy to reuse on Darwin.

rdar://121608074

[lldb] [debugserver] address preprocessor warning, extra arg (#90808)

2024-05-02T22:52:24+00:00

In DNBArchImplARM64.cpp I'm doing
```
#if __has_feature(ptrauth_calls) && defined(__LP64__)
```
And the preprocessor warns that this is not defined behavior. This
checks if ptrauth_calls is available and if this is being compiled
64-bit (i.e. arm64e), and defines a single DEBUGSERVER_IS_ARM64E when
those are both true.

I did have to duplicate one DNBLogThreaded() call which itself is a
macro, and using an ifdef in the middle of macro arguments also got me a
warning from the preprocessor.

While testing this for all the different targets, I found a DNBError
initialization that accepts a c-string but I'm passing in a printf-style
formatter c-string and an argument. Create the string before the call
and pass in the constructed string.

rdar://127129242

[lldb] Add support for large watchpoints in lldb (#79962)

2024-02-01T05:03:38+00:00

This patch is the next piece of work in my Large Watchpoint proposal,
https://discourse.llvm.org/t/rfc-large-watchpoint-support-in-lldb/72116

This patch breaks a user's watchpoint into one or more
WatchpointResources which reflect what the hardware registers can cover.
This means we can watch objects larger than 8 bytes, and we can watched
unaligned address ranges. On a typical 64-bit target with 4 watchpoint
registers you can watch 32 bytes of memory if the start address is
doubleword aligned.

Additionally, if the remote stub implements AArch64 MASK style
watchpoints (e.g. debugserver on Darwin), we can watch any power-of-2
size region of memory up to 2GB, aligned to that same size.

I updated the Watchpoint constructor and CommandObjectWatchpoint to
create a CompilerType of Array when the size of the watched
region is greater than pointer-size and we don't have a variable type to
use. For pointer-size and smaller, we can display the watched granule as
an integer value; for larger-than-pointer-size we will display as an
array of bytes.

I have `watchpoint list` now print the WatchpointResources used to
implement the watchpoint.

I added a WatchpointAlgorithm class which has a top-level static method
that takes an enum flag mask WatchpointHardwareFeature and a user
address and size, and returns a vector of WatchpointResources covering
the request. It does not take into account the number of watchpoint
registers the target has, or the number still available for use. Right
now there is only one algorithm, which monitors power-of-2 regions of
memory. For up to pointer-size, this is what Intel hardware supports.
AArch64 Byte Address Select watchpoints can watch any number of
contiguous bytes in a pointer-size memory granule, that is not currently
supported so if you ask to watch bytes 3-5, the algorithm will watch the
entire doubleword (8 bytes). The newly default "modify" style means we
will silently ignore modifications to bytes outside the watched range.

I've temporarily skipped TestLargeWatchpoint.py for all targets. It was
only run on Darwin when using the in-tree debugserver, which was a proxy
for "debugserver supports MASK watchpoints". I'll be adding the
aforementioned feature flag from the stub and enabling full mask
watchpoints when a debugserver with that feature is enabled, and
re-enable this test.

I added a new TestUnalignedLargeWatchpoint.py which only has one test
but it's a great one, watching a 22-byte range that is unaligned and
requires four 8-byte watchpoints to cover.

I also added a unit test, WatchpointAlgorithmsTests, which has a number
of simple tests against WatchpointAlgorithms::PowerOf2Watchpoints. I
think there's interesting possible different approaches to how we cover
these; I note in the unit test that a user requesting a watch on address
0x12e0 of 120 bytes will be covered by two watchpoints today, a
128-bytes at 0x1280 and at 0x1300. But it could be done with a 16-byte
watchpoint at 0x12e0 and a 128-byte at 0x1300, which would have fewer
false positives/private stops. As we try refining this one, it's helpful
to have a collection of tests to make sure things don't regress.

I tested this on arm64 macOS, (genuine) x86_64 macOS, and AArch64
Ubuntu. I have not modifed the Windows process plugins yet, I might try
that as a standalone patch, I'd be making the change blind, but the
necessary changes (see ProcessGDBRemote::EnableWatchpoint) are pretty
small so it might be obvious enough that I can change it and see what
the Windows CI thinks.

There isn't yet a packet (or a qSupported feature query) for the gdb
remote serial protocol stub to communicate its watchpoint capabilities
to lldb. I'll be doing that in a patch right after this is landed,
having debugserver advertise its capability of AArch64 MASK watchpoints,
and have ProcessGDBRemote add eWatchpointHardwareArmMASK to
WatchpointAlgorithms so we can watch larger than 32-byte requests on
Darwin.

I haven't yet tackled WatchpointResource *sharing* by multiple
Watchpoints. This is all part of the goal, especially when we may be
watching a larger memory range than the user requested, if they then add
another watchpoint next to their first request, it may be covered by the
same WatchpointResource (hardware watchpoint register). Also one "read"
watchpoint and one "write" watchpoint on the same memory granule need to
be handled, making the WatchpointResource cover all requests.

As WatchpointResources aren't shared among multiple Watchpoints yet,
there's no handling of running the conditions/commands/etc on multiple
Watchpoints when their shared WatchpointResource is hit. The goal beyond
"large watchpoint" is to unify (much more) the Watchpoint and Breakpoint
behavior and commands. I have a feeling I may be slowly chipping away at
this for a while.

Re-landing this patch after fixing two undefined behaviors in
WatchpointAlgorithms found by UBSan and by failures on different
CI bots.

rdar://108234227

[lldb] [debugserver] Preserve signing bits on lr in debugserver (#67384)

2023-09-26T00:02:25+00:00

In https://reviews.llvm.org/D136620 I changed debugserver to stop using
the kernel-provided functions
arm_thread_state64_get_{pc,lr,sp,fp} to postprocess those four registers
on aarch64 systems after we thread_get_state() them. The kernel stores
these four registers with signing internally, either from the inferior
process' actual signing, or its own.

When a program had crashed by doing an authenticated BL to an address
with improper signing, the inferior process would crash and that
improperly signed pc would be given to debugserver via thread_get_state.
debugserver would run that through arm_thread_state64_get_pc() and then
debugserver would crash when authenticating & stripping the value, on
newer Mac hardware.

To avoid debugserver crashing on a crashed inferior process, I switched
from using these system functions to strip the values, to simply
clearing the bits outright in debugserver.

However, lr is a special case where the inferior may have signed this
value (against the stack pointer value at the time). Or it may not yet
have any authentication bits, right after a BL. In the latter case, the
kernel will add its own auth bits for while it is stored inside the
kernel. In the case of a user lr value, we cannot authenticate it in
debugserver without knowing the sp value it was signed against (and the
way it is signed is not specified by the ABI) so an "improperly" signed
lr (whatever that means) won't cause debugserver to crash.

debugserver can thread_get_state the inferior's lr, run it through
arm_thread_state64_get_lr(), and get the actual signed 64-bit value that
the inferior process is using. And the specifics of how that lr is
signed may be important for debugging the process, instead of how I am
currently clearing the auth bits outright.

This patch reverts that change for lr only, and also adds a new logging
to debugserver specifically for the four sp/fp/lr/pc values that
thread_get_state hands to us, before we process them at all.

Add AArch64 MASK watchpoint support in debugserver

2023-05-04T20:23:51+00:00

Add suport for MASK style watchpoints on AArch64 in debugserver
on Darwin systems, for watching power-of-2 sized memory ranges.
More work needed in lldb before this can be exposed to the user
(because they will often try watching memory ranges that are not
exactly power-of-2 in size/alignment) but this is the first part
of adding that capability.

Differential Revision: https://reviews.llvm.org/D149792
rdar://108233371

Refactor and generalize AArch64 watchpoint support in debugserver

2023-04-29T01:24:38+00:00

Refactor the debugserver watchpiont support in anticipating of
adding support for AArch64 MASK hardware watchpoints to watch
larger regions of memory.  debugserver already had support for
handling a request to watch an unaligned region of memory up
to 8 bytes using Byte Address Select watchpoints - it would split
an unaligned watch request into two aligned doublewords that
could be watched with two hardware watchpoints using the BAS
specification.

This patch generalizes that code for properly aligning, and
possibly splitting, a watchpoint request into two hardware watchpoints
to handle any size request.  And separates out the specifics
about BAS watchpoints into its own method, so a sibling method
for MASK watchpoints can be dropped in next.

Differential Revision: https://reviews.llvm.org/D149040
rdar://108233371