Add headless runtime tooling and Campaign.win analysis

2026-04-10 01:22:47 -07:00
commit 27172e3786
parent 57bf0666e0
37 changed files with 11867 additions and 302 deletions
--- a/docs/debug-load-workflow.md
+++ b/docs/debug-load-workflow.md
@@ -69,27 +69,321 @@ Compared to the successful manual path:

 So the hook is no longer missing the coordinator entry shape. The remaining question is no longer "can we reach `0x00445ac0`?" but "does the live non-debugger call return successfully and trigger the actual restore transition?"

-## Latest Live Crash
+## Latest Plain-Run Narrowing

-The latest non-debugger auto-load run now reaches:
+The current non-debugger auto-load path no longer looks like the original shell-side crash at
+`0x0053fea6`.

- `rrt-hook: auto load ready gate passed`
- `rrt-hook: auto load restore calling`
+The hook-side state machine is now stable up to the handoff into `shell_transition_mode`:

-and then crashes at:
+- `rrt-hook: auto load shell transition entering`
+- `rrt-hook: auto load shell unpublish entering`
+- `rrt-hook: auto load shell unpublish entry this=0x029b3a08 object=0x026d7b88`

- `0x0053fea6`
+So the old hook-side gating and bad-call-shape problems are no longer the blocker.

-The local disassembly around `0x0053fe90` shows a shell-side list traversal over `[this+0x74]` that walks linked entries and calls a virtual method on each. The crash instruction at `0x0053fea6` dereferences one traversed entry:
+The current runtime probes now push the remaining stall much later than the original old-mode
+teardown inside `shell_transition_mode`:

- `mov eax, DWORD PTR [esi]`
+- `shell_transition_mode` enters
+- old shell-window unpublish at `0x005389c0` enters with:
+  - shell bundle `this = 0x029b3a08`
+  - old object `object = 0x026d7b88`
+- the inner wrapper `0x005400c0(object)` returns
+- the full `0x53fe00 -> 0x53f860` remove-node sweep over `[object+0x74]` returns and clears
+  `[object+0x70/+0x74]`
+- `shell_unpublish` itself then returns cleanly
+- the nearby mode-`2` teardown helper `0x00502720` returns
+- `shell_load_screen_window_construct` `0x004ea620` returns
+- the immediate shell publish through `0x00538e50` returns
+- `shell_transition_mode` itself returns cleanly

-That strongly suggests the current hook is invoking the restore from the right call shape but on the wrong shell-pump turn. The active hypothesis is now timing or re-entrancy:
+At the same time, one later load-side probe still does **not** fire:

- the hook detects readiness and fires restore on the same shell-pump turn
- RT3 later re-enters shell object traversal in a phase where one list entry is still invalid
+- no `shell_active_mode_run_profile_startup_and_load_dispatch` `0x00438890` entry

-So the next experiment is to defer the actual restore by additional ready shell-pump turns instead of firing on the first ready turn.
+So the current live stall is now best read as:
+
+- after the old-object unpublish path at `0x005389c0`
+- after the inner `0x5400c0 -> 0x53fe00 -> 0x53f860` teardown sweep
+- after the nearby mode-`2` teardown helper `0x00502720`
+- after the mode-`4` `LoadScreen.win` constructor and immediate shell publish
+- but still before any trusted runtime evidence that `0x00438890` has entered
+
+The richer plain-run snapshots now tighten the old-object state too:
+
+- the old object is still the expected `Setup.win` instance with vtable `0x005d1664`
+- the shell bundle head and tail both point to that same object
+- `[object+0x54]` and `[object+0x58]` are both null, so the outer unlink state is consistent
+- `[object+0x74]` is non-null and the first two linked nodes recovered from `+0x8a` also look
+  structurally sane:
+  - first node `0x02a74470`: vtable `0x005dd870`, type `0xea72`, owner-ish field `0x02a067b8`,
+    next `0x02a04b38`
+  - second node `0x02a04b38`: vtable `0x005dd870`, type `0xea71`, owner-ish field `0x02a067b8`,
+    next `0x02a03e38`
+
+So the remaining leading hypothesis is no longer "the list head is already garbage." The later
+shared node vcall target `0x540910` is healthy in general and does not fire on the failing
+transition path. The newer direct probes narrow it even further: the failing transition still does
+not reach `0x53fe00` or `0x53f860`. That pushes the current boundary into the tiny wrapper layer
+between `shell_unpublish` entry and the `0x53fe00` call, with `0x5400c0(object)` now the next
+useful direct probe.
+
+The latest plain Wine log also ends with a matching crash:
+
+- `wine: Unhandled page fault on read access to 02E11000 at address 02E11000`
+
+Static disassembly sharpened the remaining boundary one step further, but the newer jump-table
+decode changes the interpretation materially. The startup-runtime slice
+
+- `0x004ea710`
+- `0x0053b070(0x46c40)`
+- `0x004336d0`
+- `0x00438890`
+
+is not owned by mode `4`. It is owned by jump-table entry `1` at `0x483012`. Jump-table entry `4`
+lands at `0x4832e5` instead and only constructs and publishes a plain `LoadScreen.win` object
+through `0x004ea620` and `0x00538e50`.
+
+So the next useful probe is no longer the mode-`4` branch's pre-dispatch runtime-object helper,
+because mode `4` does not own that startup-runtime path at all. The next useful test is the real
+startup-dispatch entrypoint: `shell_transition_mode(1, 0)`.
+
+The latest plain runs tightened that correction one more step:
+
+- the direct `0x004336d0` runtime-reset probe still does **not** fire
+- the direct `0x00438890` startup-dispatch probe still does **not** fire
+- but `shell_transition_mode`, `LoadScreen.win` construction, and the immediate shell publish all
+  still return cleanly
+
+That no longer means the post-construct startup slice is mysteriously skipped inside mode `4`.
+Instead, it matches the corrected static decode exactly: the hook has been entering the plain
+load-screen branch rather than the startup-runtime branch.
+
+The next best runtime target is therefore no longer another allocator cut under mode `4`. It is a
+direct test of `shell_transition_mode(1, 0)`, which is the jump-table arm that statically owns the
+startup-runtime allocation and `0x00438890` dispatch.
+
+## Current Pause Point
+
+Current recorded stop point:
+
+- the old hook-side crash and teardown corruption are resolved
+- the static jump-table decode at `0x48342c` shows the hook had been entering the wrong arm
+- `shell_transition_mode(4, 0)` is only the plain `LoadScreen.win` branch
+- `shell_transition_mode(1, 0)` is the startup-dispatch branch that owns:
+  - `0x004ea710`
+  - `0x0053b070(0x46c40)`
+  - `0x004336d0`
+  - `0x00438890`
+
+So the next live experiment, when this work resumes, should start from the corrected mode-`1`
+transition path rather than adding more probes under mode `4`.
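
As a rough illustration of the corrected decode, the sketch below resolves a requested mode through a table of little-endian dword targets. It assumes the table at `0x48342c` is indexed directly by mode id with 4-byte absolute entries, which these notes imply but do not state outright. Only entries `1` and `4` come from the notes; every other slot is a zero placeholder, and `build_table`, `slot_address`, and `resolve_arm` are hypothetical helper names.

```python
import struct

JUMP_TABLE_ADDR = 0x0048342C  # jump-table address from the static decode

# Only these two arms are recorded in the notes.
KNOWN_ARMS = {
    1: 0x00483012,  # startup-runtime arm (0x004ea710, 0x0053b070, 0x004336d0, 0x00438890)
    4: 0x004832E5,  # plain LoadScreen.win arm (0x004ea620, 0x00538e50)
}


def build_table(entry_count: int) -> bytes:
    """Pack a stand-in table; unknown slots are zero placeholders."""
    return b"".join(
        struct.pack("<I", KNOWN_ARMS.get(index, 0)) for index in range(entry_count)
    )


def slot_address(mode: int) -> int:
    """Address of the table slot for `mode`, assuming 4-byte entries."""
    return JUMP_TABLE_ADDR + 4 * mode


def resolve_arm(table: bytes, mode: int) -> int:
    """Read the branch target for `mode` from the packed table."""
    if mode < 0 or 4 * mode + 4 > len(table):
        raise IndexError(f"mode {mode} is outside the table")
    (target,) = struct.unpack_from("<I", table, 4 * mode)
    return target


table = build_table(5)  # just enough slots to cover entry 4 (real size unknown)
for mode in (1, 4):
    print(f"mode {mode}: slot {slot_address(mode):#010x} -> {resolve_arm(table, mode):#010x}")
```

Under those assumptions, a hook that stages mode `4` can only ever land on the `LoadScreen.win` arm, no matter how many probes are added beneath it.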
+
+Two corrective notes from the allocator probe passes:
+
+- the first allocator experiment at `0x005a125d` was not trustworthy, because that shared cdecl
+  body sits behind the `0x0053b070` thunk and the initial hook used the wrong entry shape and
+  split its first internal `call`
+- the first direct thunk hook on `0x0053b070` was also not trustworthy as implemented, because a
+  copied relative-`jmp` thunk cannot be replayed through an ordinary trampoline
+
+The next trustworthy allocator boundary is still the exact mode-`4`-branch thunk at `0x0053b070`,
+but only with a detour that calls the original target `0x005a125d` directly instead of executing
+the copied thunk bytes.
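
A minimal sketch of why replaying the copied thunk bytes misbehaves, assuming the thunk at `0x0053b070` is a five-byte `E9 rel32` near jump (the notes only say relative `jmp`, so the exact encoding is an assumption). The displacement is measured from the end of the instruction, so the same bytes executed from a trampoline at another address resolve to a different target. The trampoline address is hypothetical.

```python
import struct

THUNK_ADDR = 0x0053B070       # thunk address from the notes
REAL_TARGET = 0x005A125D      # shared cdecl allocator body from the notes
TRAMPOLINE_ADDR = 0x10001000  # hypothetical location of the copied bytes


def encode_jmp_rel32(at: int, target: int) -> bytes:
    """Encode `jmp target` as E9 rel32 for an instruction placed at `at`."""
    rel = (target - (at + 5)) & 0xFFFFFFFF
    return b"\xE9" + struct.pack("<I", rel)


def decode_jmp_rel32(at: int, code: bytes) -> int:
    """Return the absolute target of an E9 rel32 jump located at `at`."""
    if len(code) < 5 or code[0] != 0xE9:
        raise ValueError("not an E9 rel32 jump")
    (rel,) = struct.unpack("<i", code[1:5])
    return (at + 5 + rel) & 0xFFFFFFFF


thunk_bytes = encode_jmp_rel32(THUNK_ADDR, REAL_TARGET)

# Executed in place, the thunk reaches the real allocator body.
in_place = decode_jmp_rel32(THUNK_ADDR, thunk_bytes)

# The same bytes replayed from the trampoline resolve against its address.
replayed = decode_jmp_rel32(TRAMPOLINE_ADDR, thunk_bytes)

print(hex(in_place), hex(replayed))
```

This is why the detour has to call `0x005a125d` directly (or re-encode the displacement for the trampoline address) rather than execute the copied bytes.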
+
+The latest filtered run exposed a more basic gating issue too: the log only reached one
+`gate mask 0x7` line with `mode_id = 2`, and it never advanced into `ready gate passed`, staging,
+or transition. So that run did not actually exercise the load-screen startup subchain; it mostly
+recorded ordinary shell-node activity plus one late ready-state observation. The old default gate
+of `30` ready polls plus `5` deferred polls was therefore too conservative for this workflow. The
+next run now lowers those defaults to `1` and `0`, and adds an explicit ready-count log so the
+trace should either stage immediately or show exactly how far the gate gets.
+
+That gate adjustment worked on the next run: the hook now reaches `ready count`, stages selector
+`3`, enters `shell_transition_mode`, returns from the `LoadScreen.win` construct and publish
+helpers, and reports success again. But the allocator side is still unresolved:
+
+- there is still no trusted `0x46c40` allocator hit from `0x0053b070`
+- there is still no direct `0x004336d0` runtime-reset entry
+- there is still no direct `0x00438890` startup-dispatch entry
+
+So the next clean post-publish boundary is the tiny scalar setter at `0x004ea710`, which is the
+last straightforward callsite in the static mode-`4` branch immediately before the `0x0053b070`
+allocation.
+
+The immediate next runtime check is even more concrete than that helper hook, though: inspect the
+state that `0x004ea710` should leave behind. Right after `shell_transition_mode` returns, the hook
+now logs:
+
+- `0x006d10b0` (`LoadScreen.win` singleton)
+- `[LoadScreen.win+0x78]`
+- `0x006cec78`
+- `[0x006cec74+0x0c]`
+- `[0x006cec7c+0x01]`
+
+If `0x004ea710` really ran on the mode-`4` branch, `[LoadScreen.win+0x78]` should no longer be
+zero after transition return.
+
+The latest run answered that question directly:
+
+- `shell_transition_mode` still returns cleanly
+- `field_active_mode_object` is still the `LoadScreen.win` singleton
+- `0x006cec78` is still null
+- `[LoadScreen.win+0x78]` is still `0`
+- startup selector remains `3`
+
+So the strongest current read is no longer "the helper hooks might be missing a straight-line call."
+At transition return, RT3 still looks like it is parked in the plain `LoadScreen.win` state rather
+than having entered the separate runtime-object path at all. The next useful runtime cut is
+therefore not deeper inside `shell_transition_mode`, but on the later active-mode service cadence:
+does a subsequent service tick on the `LoadScreen.win` object populate `[+0x78]` or promote
+`0x006cec78` into the startup-dispatch object on a later frame?
+
+The next run now logs the first few shell-state service ticks after auto-load is attempted with the
+same state tuple:
+
+- `0x006cec78`
+- `[0x006cec74+0x0c]`
+- `0x006d10b0`
+- `[LoadScreen.win+0x78]`
+- startup selector
+
+So the next question is very narrow: does that tuple stay frozen in the plain `LoadScreen.win`
+shape, or does one later service tick finally promote it into the startup-runtime object path?
+
+The latest service-tick run makes that boundary stronger still:
+
+- the subsequent shell-state service ticks `count=2..8` all keep the same frozen state
+- `0x006cec78` stays `0`
+- `[shell_state+0x0c]` stays the `LoadScreen.win` singleton
+- `[LoadScreen.win+0x78]` stays `0`
+
+So the active-mode service pass itself is not promoting the plain load screen into the startup
+runtime object during those first later frames. The next best runtime boundary is now the
+`LoadScreen.win` message owner `0x004e3a80`, because that is the remaining live owner most likely
+to receive the trigger that seeds page id `[this+0x78]`, allocates the `0x46c40` startup runtime,
+and later publishes `0x006cec78`.
+
+One later run did not reach that boundary at all:
+
+- the new `0x004e3a80` hook installed successfully
+- but there were no `ready count`, staging, transition, post-transition, or load-screen-message
+  lines anywhere in the log
+- the trace only showed ordinary shell node-vcall traffic before the window was closed
+
+So that run is best treated as "auto-load path not exercised", not as evidence that the
+`LoadScreen.win` message owner stayed silent after a successful transition. The next useful runtime
+check is therefore one step earlier again: add a small first-few-calls trace on
+`shell_state_service_active_mode_frame` itself so we can confirm whether that detour is firing on
+the run at all and what mode id and gate mask it sees before the auto-load gate would stage.
+
+That newer service-entry trace now confirms the full cadence:
+
+- the service detour is firing
+- the gate does stage and transition on counts `1 -> 2`
+- the transition returns cleanly
+- later service ticks run with `mode_id = 4`
+
+At the same time, the next two probes are now bounded as negative results on that successful path:
+
+- the `LoadScreen.win` message hook at `0x004e3a80` stayed completely silent
+- the plain post-transition state still stays frozen with:
+  - `0x006cec78 = 0`
+  - `field_active_mode_object = LoadScreen.win`
+  - `[LoadScreen.win+0x78] = 0`
+
+So the next best boundary is no longer the message owner itself. It is the shell-runtime prime call
+at `0x00538b60`, because `0x00482160` still takes that branch on the null-`0x006cec78` service
+path before the later frame-cycle owner `0x00520620`.
+
+The first `0x00538b60` probe run is not trustworthy yet, though:
+
+- the hook installed
+- but the log stopped immediately after the first
+  `shell-state service entry count=1 ... gate_mask=0x7 mode_id=2 ...`
+- there were no ready-count lines, no transition lines, and no runtime-prime entry lines
+
+So that result currently reads as "the new runtime-prime instrumentation likely interrupted the
+first service pass" rather than as a real RT3 boundary shift. The next corrective step is to log
+the matching shell-state service return and to trace the first few `0x00538b60` calls even before
+`AUTO_LOAD_ATTEMPTED` becomes true. That will tell us whether the first service pass actually
+returns and whether the runtime-prime hook is firing at all.
+
+The static branch under `0x00482160` also adds one more caution: `0x00538b60` is conditional, not
+unconditional. The service pass only enters it when the shell runtime at `0x006d401c` is live and
+`[shell_state+0xa0] == 0`. So a silent `0x00538b60` probe does not yet prove the shell is frozen
+before the runtime-prime call; it may simply mean the `+0xa0` gate stayed nonzero on that service
+tick. The next service-entry logs therefore need to include `[shell_state+0xa0]` before we treat
+runtime-prime silence as meaningful.
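
The gating described above can be sketched as a small predicate for reading later logs. This is only an illustration: it assumes "live" means a non-null pointer at `0x006d401c`, and the function names and sample pointer value are hypothetical rather than part of `rrt-hook`.

```python
from typing import Optional


def runtime_prime_eligible(shell_runtime_ptr: int, shell_state_a0: int) -> bool:
    """True when the service pass should take the 0x00538b60 branch."""
    return shell_runtime_ptr != 0 and shell_state_a0 == 0


def read_probe_silence(shell_runtime_ptr: int, shell_state_a0: Optional[int]) -> str:
    """Classify a service tick on which the 0x00538b60 probe stayed silent."""
    if shell_state_a0 is None:
        return "inconclusive: [shell_state+0xa0] was not logged"
    if not runtime_prime_eligible(shell_runtime_ptr, shell_state_a0):
        return "expected: gate closed, branch not taken"
    return "meaningful: gate open but probe silent"


SAMPLE_RUNTIME_PTR = 0x01000000  # hypothetical non-null value of [0x006d401c]

print(read_probe_silence(SAMPLE_RUNTIME_PTR, None))
print(read_probe_silence(SAMPLE_RUNTIME_PTR, 1))
print(read_probe_silence(SAMPLE_RUNTIME_PTR, 0))
```

Only the last case would count as evidence of a real stall before the runtime-prime call.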
+
+The newer run closes that conditional question:
+
+- `[shell_state+0xa0]` is `0` on the first traced service call
+- `0x00538b60` is therefore eligible
+- the runtime-prime probe now shows it entering and returning cleanly on that same service tick
+
+The later run closes the next owner too:
+
+- `0x00520620` `shell_service_frame_cycle` also enters and returns cleanly on the same frozen
+  mode-`4` path
+- the logged state matches the generic frame-service branch:
+  - `[+0x1c] = 0`
+  - `[+0x28] = 0`
+  - `flag_56 = 0`
+  - `[+0x58]` is pulsed and then cleared back to `0`
+  - `0x006cec78` stays `0`
+
+A newer run closes the next owner down, the object walker at `0x0053fda0`, too:
+
+- `0x0053fda0` enters and returns cleanly on the frozen mode-`4` path
+- it is actively servicing the `LoadScreen.win` object itself
+- the serviced object keeps `field_1d = 1`, `field_5c = 1`, and a stable child list
+- the first child vcall target at `+0x18` stays `0x005595d0`
+- `0x006cec78` still stays `0`
+
+So the next live boundary is now the child-service target itself at `0x005595d0`, not the higher
+object walker.
+
+The child-service run narrows that again. The first sixteen `0x005595d0` calls under the serviced
+`LoadScreen.win` object are stable, presentation-heavy child lanes:
+
+- every child points back to the same parent through `[child+0x86] = LoadScreen.win`
+- the early children have `flag_68 = 0x03`, `flag_6a = 0x03`, and return `4`
+- the later siblings have `flag_68 = 0x00`, `flag_6a = 0x03`, and return `0`
+- `field_b0` stays `0`
+- `0x006cec78` still stays `0`
+
+Static disassembly matches that read: `0x005595d0` is gated by `0x00558670` and then spends most
+of its body in draw or overlay helpers like `0x54f710`, `0x54f9f0`, `0x54fdd0`, `0x53de00`, and
+`0x552560`. So this is a presentation-side child service path, not the missing startup-runtime
+promotion.
+
+That moved the next useful runtime target back to the transition-time allocator lane, but the
+later jump-table decode changes what that means. The widened `0x0053b070` window below is now
+best read as evidence for the plain mode-`4` `LoadScreen.win` arm, not as evidence for the
+startup-runtime arm.
+
+The next widened allocator run immediately paid off, but in a narrower way than expected:
+
+- the first traced transition-window allocation is `0x7c`, which matches the static pre-construct
+  `0x48302a -> 0x53b070` call exactly
+- the following `0x111`, `0x84`, `0x3a`, and repeated `0x25` allocations all happen before
+  `LoadScreen.win` construct returns, so they now read as constructor-side child or control setup
+- that means the allocator probe was not disproving the `0x46c40` startup-runtime slice yet; it
+  was simply exhausting its 16-entry log budget inside the constructor before the later
+  post-construct block
+
+The corrected follow-up run, with the allocator log budget reset for the post-construct block, is
+now the decisive one: after `LoadScreen.win` construct returns, there are still no further
+allocator hits before publish and transition return.
+That matches the corrected jump-table decode cleanly, because mode `4` does not own the
+`0x46c40 -> 0x4336d0 -> 0x438890` path at all.
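
The budget effect described above can be sketched as follows. This is an illustration only, not the `rrt-hook` logger: the class name, the phase labels, the count of repeated `0x25` allocations, and the `0x10` post-construct allocation are all hypothetical. In the real corrected run no post-construct allocation appeared; the made-up one here only shows that a reset budget would have recorded it.

```python
class BudgetedProbeLog:
    """Probe log that records at most `budget` events between resets."""

    def __init__(self, budget: int = 16) -> None:
        self.budget = budget
        self.remaining = budget
        self.lines = []   # recorded (phase, size) pairs
        self.dropped = 0  # events lost after the budget ran out

    def record(self, phase: str, size: int) -> None:
        if self.remaining == 0:
            self.dropped += 1
            return
        self.remaining -= 1
        self.lines.append((phase, size))

    def reset(self) -> None:
        """Refill the budget, e.g. at a phase boundary."""
        self.remaining = self.budget


# Sizes from the notes; the number of repeated 0x25 entries is made up.
CONSTRUCT_SIZES = [0x7C, 0x111, 0x84, 0x3A] + [0x25] * 20
LATE_EVENT = ("post-construct", 0x10)  # hypothetical later allocation


def run(reset_at_construct_return: bool) -> BudgetedProbeLog:
    log = BudgetedProbeLog(budget=16)
    for size in CONSTRUCT_SIZES:
        log.record("construct", size)
    if reset_at_construct_return:
        log.reset()
    log.record(*LATE_EVENT)
    return log


without_reset = run(False)
with_reset = run(True)
print(LATE_EVENT in without_reset.lines, LATE_EVENT in with_reset.lines)
```

Without the reset, silence after the constructor is indistinguishable from an exhausted budget; with it, silence becomes a usable negative result.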
+
+The first corrected thunk run also showed one practical problem: the probe became too noisy to be
+useful as a boundary marker, because `0x0053b070` is used widely outside the load-screen path.
+That still mattered, because it showed the hook-driven transition was taking the same `0x7c`
+constructor-side allocation as the plain mode-`4` branch rather than the startup-runtime
+allocation size `0x46c40`.

 ## Manual Owner Tail

@@ -134,6 +428,17 @@ The surrounding mode map is tighter now too:

 That makes `0x00438890(active_mode, 1, 0)` the strongest current RT3-native entry candidate for reproducing the successful manual load branch, because it owns the internal dispatch that later reaches `0x004390cb`.

+The containing shell-mode switcher ABI is tighter now too:
+
+- `0x00482ec0` is not a one-arg mode switch
+- it is a `thiscall` with two stack arguments
+- the grounded world-entry load-screen call shape at `0x443adf..0x443ae3` is `(4, 0)`
+- the function confirms that shape itself by reading the requested mode from `[esp+0x0c]` and
+  returning with `ret 8`
+- the second stack argument is now best read as an old-active-mode teardown flag, because the
+  `0x482fc6..0x482fff` branch only runs when it is nonzero and then releases the old active-mode
+  object through `0x00434300`, `0x00433730`, `0x0053b080`, and finally clears `0x006cec78`
+
 Current static xrefs also tighten the broader ownership split:

 - `0x00443b57` calls `0x00438890` from the world-entry side, but with `(0, 0)` after dismissing the current shell detail panel and servicing `0x4834e0(0, 0)`
@@ -186,10 +491,17 @@ The scripted auto-load debugger run is now useful without manual interaction:
  - `0x00438890`
  - `0x004390cb`
  - `0x00445ac0`
-  - `0x0053fea6`
- but only `0x0053fea6` actually fired in the captured run
+- older runs that also broke on `0x0053fea6` stopped too early on that shell-side crash site
+- the default scripted compare flow now keeps only the owner-chain breakpoints above the real load lane

-So the current non-interactive path is good enough to gather repeatable crash-side state, but it also tells us that the current auto-load code path is still not obviously traversing the larger-owner breakpoints under `winedbg`. The next step is therefore more hook-side logging around the `0x00438890` call itself rather than more manual debugger work.
+So the current non-interactive path is still good enough to gather repeatable crash-side state, but
+on this display setup the owner-chain compare flow is also vulnerable to early X11 death:
+
+- `XF86VidModeClientNotLocal`
+- process termination before the RT3 owner breakpoints fire
+
+That means the current plain-run hook probes are more reliable than `winedbg` for narrowing the
+live stall inside `shell_transition_mode`.

 The latest static pivot also means the next reverse-engineering step does not require a live run:

@@ -256,8 +568,34 @@ RRT_WINEDBG_LOG=/tmp/rt3-manual-load-winedbg.log tools/run_rt3_winedbg.sh
 Ready-made debugger command files are also provided:

 - [winedbg_manual_load_445ac0.cmd](/home/jan/projects/rrt/tools/winedbg_manual_load_445ac0.cmd)
+- [winedbg_auto_load_crash.cmd](/home/jan/projects/rrt/tools/winedbg_auto_load_crash.cmd)
 - [winedbg_auto_load_compare.cmd](/home/jan/projects/rrt/tools/winedbg_auto_load_compare.cmd)

+The default auto-load debugger run is now crash-first. It does not set RT3 owner breakpoints.
+Instead, it:
+
+- continues immediately
+- lets `winedbg` stop on the first exception
+- dumps registers
+- dumps the top four stack dwords
+- prints a backtrace
+
+Use that default when the hook is already known to stage and return from `shell_transition_mode`,
+and the current question is the downstream crash site.
+
+If you specifically want the earlier owner-chain compare flow, override the command file:
+
+```bash
+RRT_WINEDBG_CMD_FILE=/home/jan/projects/rrt/tools/winedbg_auto_load_compare.cmd \
+tools/run_hook_auto_load_winedbg.sh hh
+```
+
+Or use the shorter wrapper:
+
+```bash
+tools/run_hook_auto_load_winedbg_compare.sh hh
+```
+
 If you do not use `RRT_WINEDBG_CMD_FILE`, you can still open those files and paste their contents into the debugger manually.

 Both scripts rebuild `rrt-hook`, copy `dinput8.dll` into the Wine RT3 directory, and launch RT3 under `winedbg`.