Add headless runtime tooling and Campaign.win analysis

Jan Petykiewicz 2026-04-10 01:22:47 -07:00
commit 27172e3786
37 changed files with 11867 additions and 302 deletions


@ -69,27 +69,321 @@ Compared to the successful manual path:
So the hook is no longer missing the coordinator entry shape. The remaining question is no longer "can we reach `0x00445ac0`?" but "does the live non-debugger call return successfully and trigger the actual restore transition?"
## Latest Live Crash
## Latest Plain-Run Narrowing
The latest non-debugger auto-load run now reaches:
The current non-debugger auto-load path no longer looks like the original shell-side crash at
`0x0053fea6`.
- `rrt-hook: auto load ready gate passed`
- `rrt-hook: auto load restore calling`
The hook-side state machine is now stable up to the handoff into `shell_transition_mode`:
and then crashes at:
- `rrt-hook: auto load shell transition entering`
- `rrt-hook: auto load shell unpublish entering`
- `rrt-hook: auto load shell unpublish entry this=0x029b3a08 object=0x026d7b88`
- `0x0053fea6`
So the old hook-side gating and bad-call-shape problems are no longer the blocker.
The local disassembly around `0x0053fe90` shows a shell-side list traversal over `[this+0x74]` that walks linked entries and calls a virtual method on each. The crash instruction at `0x0053fea6` dereferences one traversed entry:
The current runtime probes now push the remaining stall much later than the original old-mode
teardown inside `shell_transition_mode`:
- `mov eax, DWORD PTR [esi]`
- `shell_transition_mode` enters
- old shell-window unpublish at `0x005389c0` enters with:
- shell bundle `this = 0x029b3a08`
- old object `object = 0x026d7b88`
- the inner wrapper `0x005400c0(object)` returns
- the full `0x53fe00 -> 0x53f860` remove-node sweep over `[object+0x74]` returns and clears
`[object+0x70/+0x74]`
- `shell_unpublish` itself then returns cleanly
- the nearby mode-`2` teardown helper `0x00502720` returns
- `shell_load_screen_window_construct` `0x004ea620` returns
- the immediate shell publish through `0x00538e50` returns
- `shell_transition_mode` itself returns cleanly
That strongly suggests the current hook is invoking the restore from the right call shape but on the wrong shell-pump turn. The active hypothesis is now timing or re-entrancy:
At the same time, one later load-side probe still does **not** fire:
- the hook detects readiness and fires restore on the same shell-pump turn
- RT3 later re-enters shell object traversal in a phase where one list entry is still invalid
- no `shell_active_mode_run_profile_startup_and_load_dispatch` `0x00438890` entry
So the next experiment is to defer the actual restore by additional ready shell-pump turns instead of firing on the first ready turn.
So the current live stall is now best read as:
- after the old-object unpublish path at `0x005389c0`
- after the inner `0x5400c0 -> 0x53fe00 -> 0x53f860` teardown sweep
- after the nearby mode-`2` teardown helper `0x00502720`
- after the mode-`4` `LoadScreen.win` constructor and immediate shell publish
- but still before any trusted runtime evidence that `0x00438890` has entered
The richer plain-run snapshots now tighten the old-object state too:
- the old object is still the expected `Setup.win` instance with vtable `0x005d1664`
- the shell bundle head and tail both point to that same object
- `[object+0x54]` and `[object+0x58]` are both null, so the outer unlink state is consistent
- `[object+0x74]` is non-null and the first two linked nodes recovered from `+0x8a` also look
structurally sane:
- first node `0x02a74470`: vtable `0x005dd870`, type `0xea72`, owner-ish field `0x02a067b8`,
next `0x02a04b38`
- second node `0x02a04b38`: vtable `0x005dd870`, type `0xea71`, owner-ish field `0x02a067b8`,
next `0x02a03e38`
So the remaining leading hypothesis is no longer "the list head is already garbage." The later
shared node vcall target `0x540910` is healthy in general and does not fire on the failing
transition path. The newer direct probes narrow it even further: the failing transition still does
not reach `0x53fe00` or `0x53f860`. That pushes the current boundary into the tiny wrapper layer
between `shell_unpublish` entry and the `0x53fe00` call, with `0x5400c0(object)` now the next
useful direct probe.
The latest plain Wine log also ends with a matching crash:
- `wine: Unhandled page fault on read access to 02E11000 at address 02E11000`
Static disassembly sharpened the remaining boundary one step further, but the newer jump-table
decode changes the interpretation materially. The startup-runtime slice
- `0x004ea710`
- `0x0053b070(0x46c40)`
- `0x004336d0`
- `0x00438890`
is not owned by mode `4`. It is owned by jump-table entry `1` at `0x483012`. Jump-table entry `4`
lands at `0x4832e5` instead and only constructs and publishes a plain `LoadScreen.win` object
through `0x004ea620` and `0x00538e50`.
So the next useful probe is no longer the mode-`4` branch's pre-dispatch runtime-object helper,
because mode `4` does not own that startup-runtime path at all. The next useful test is the real
startup-dispatch entrypoint: `shell_transition_mode(1, 0)`.
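
The corrected ownership split can be written down as a small lookup, which makes it easy to check which mode can statically reach a given callee. This is only a sketch of the decode described above: the arm targets and callee addresses come from these notes, while the `ARMS` layout and the `arm_reaches` helper are hypothetical names, not part of `rrt-hook`.

```python
# Hypothetical model of the shell_transition_mode jump-table arms.
# Addresses are taken from the notes; the data layout is illustrative only.
ARMS = {
    1: {  # startup-runtime arm
        "target": 0x483012,
        "owns": [0x004EA710, 0x0053B070, 0x004336D0, 0x00438890],
    },
    4: {  # plain LoadScreen.win arm: construct + publish only
        "target": 0x4832E5,
        "owns": [0x004EA620, 0x00538E50],
    },
}

STARTUP_DISPATCH = 0x00438890


def arm_reaches(mode: int, callee: int) -> bool:
    """Return True if the statically decoded arm for `mode` owns `callee`."""
    arm = ARMS.get(mode)
    return arm is not None and callee in arm["owns"]


print(arm_reaches(4, STARTUP_DISPATCH))  # False
print(arm_reaches(1, STARTUP_DISPATCH))  # True
```

Read this way, a silent `0x00438890` probe under mode `4` is the expected result rather than an anomaly.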
The latest plain runs tightened that correction one more step:
- the direct `0x004336d0` runtime-reset probe still does **not** fire
- the direct `0x00438890` startup-dispatch probe still does **not** fire
- but `shell_transition_mode`, `LoadScreen.win` construction, and the immediate shell publish all
still return cleanly
That no longer means the post-construct startup slice is mysteriously skipped inside mode `4`.
Instead, it matches the corrected static decode exactly: the hook has been entering the plain
load-screen branch rather than the startup-runtime branch.
The next best runtime target is therefore no longer another allocator cut under mode `4`. It is a
direct test of `shell_transition_mode(1, 0)`, which is the jump-table arm that statically owns the
startup-runtime allocation and `0x00438890` dispatch.
## Current Pause Point
Current recorded stop point:
- the old hook-side crash and teardown corruption are resolved
- the static jump-table decode at `0x48342c` shows the hook had been entering the wrong arm
- `shell_transition_mode(4, 0)` is only the plain `LoadScreen.win` branch
- `shell_transition_mode(1, 0)` is the startup-dispatch branch that owns:
- `0x004ea710`
- `0x0053b070(0x46c40)`
- `0x004336d0`
- `0x00438890`
So the next live experiment, when this work resumes, should start from the corrected mode-`1`
transition path rather than adding more probes under mode `4`.
Two corrective notes from the allocator probe passes:
- the first allocator experiment at `0x005a125d` was not trustworthy, because that shared cdecl
body sits behind the `0x0053b070` thunk and the initial hook used the wrong entry shape and
split its first internal `call`
- the first direct thunk hook on `0x0053b070` was also not trustworthy as implemented, because a
copied relative-`jmp` thunk cannot be replayed through an ordinary trampoline
The next trustworthy allocator boundary is still the exact mode-`4`-branch thunk at `0x0053b070`,
but only with a detour that calls the original target `0x005a125d` directly instead of executing
the copied thunk bytes.
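
The reason a copied relative-`jmp` thunk cannot be replayed from a trampoline is that its displacement is relative to the instruction's own address. The sketch below illustrates that with a `jmp rel32` decode. It assumes the thunk is a 5-byte `E9` near jump, and the byte values are reconstructed from the two addresses in these notes rather than read from the binary; `resolve_rel_jmp` and the trampoline address are hypothetical.

```python
import struct


def resolve_rel_jmp(thunk_addr: int, code: bytes) -> int:
    """Return the absolute target of an x86 `jmp rel32` (opcode 0xE9)."""
    if len(code) < 5 or code[0] != 0xE9:
        raise ValueError("not a jmp rel32 thunk")
    (rel,) = struct.unpack_from("<i", code, 1)
    # rel32 is measured from the address of the next instruction (addr + 5).
    return (thunk_addr + 5 + rel) & 0xFFFFFFFF


# Assumed bytes: rel32 chosen so the thunk at 0x0053b070 lands on 0x005a125d.
THUNK_ADDR = 0x0053B070
thunk_bytes = bytes([0xE9]) + struct.pack("<i", 0x005A125D - (THUNK_ADDR + 5))

target = resolve_rel_jmp(THUNK_ADDR, thunk_bytes)
print(hex(target))  # 0x5a125d

# The same bytes executed from a trampoline at another (made-up) address
# jump somewhere else entirely, which is why the detour should call the
# resolved target directly instead of running the copied bytes.
trampoline_addr = 0x10000000
print(hex(resolve_rel_jmp(trampoline_addr, thunk_bytes)))
```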
The latest filtered run exposed a more basic gating issue too: the log only reached one
`gate mask 0x7` line with `mode_id = 2`, and it never advanced into `ready gate passed`, staging,
or transition. So that run did not actually exercise the load-screen startup subchain; it mostly
recorded ordinary shell-node activity plus one late ready-state observation. The old default gate
of `30` ready polls plus `5` deferred polls was therefore too conservative for this workflow. The
next run now lowers those defaults to `1` and `0`, and adds an explicit ready-count log so the
trace should either stage immediately or show exactly how far the gate gets.
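
As a rough illustration of why the old defaults never staged on a short run, the gate can be modelled as two counters. This is a minimal sketch under assumptions: the real gate lives in `rrt-hook`, and the `AutoLoadGate` class, its field names, and the choice not to reset counts on a non-ready poll are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AutoLoadGate:
    ready_polls: int = 1     # ready observations required before staging
    deferred_polls: int = 0  # extra ready polls to wait after that threshold
    ready_count: int = 0
    deferred_count: int = 0

    def poll(self, ready: bool) -> bool:
        """Feed one service poll; return True when the gate would stage."""
        if not ready:
            return False  # assumption: non-ready polls do not reset counts
        if self.ready_count < self.ready_polls:
            self.ready_count += 1
            if self.ready_count < self.ready_polls:
                return False
        if self.deferred_count < self.deferred_polls:
            self.deferred_count += 1
            return False
        return True


def polls_until_stage(gate: AutoLoadGate, limit: int = 100) -> int:
    """Count how many consecutive ready polls are needed before staging."""
    for n in range(1, limit + 1):
        if gate.poll(True):
            return n
    return -1


print(polls_until_stage(AutoLoadGate(ready_polls=30, deferred_polls=5)))  # 35
print(polls_until_stage(AutoLoadGate()))  # 1
```

Under this model a run that only ever observes one ready poll cannot stage with the old `30` plus `5` defaults, but stages immediately with `1` and `0`.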
That gate adjustment worked on the next run: the hook now reaches `ready count`, stages selector
`3`, enters `shell_transition_mode`, returns from the `LoadScreen.win` construct and publish
helpers, and reports success again. But the allocator side is still unresolved:
- there is still no trusted `0x46c40` allocator hit from `0x0053b070`
- there is still no direct `0x004336d0` runtime-reset entry
- there is still no direct `0x00438890` startup-dispatch entry
So the next clean post-publish boundary is the tiny scalar setter at `0x004ea710`, which is the
last straightforward callsite in the static mode-`4` branch immediately before the `0x0053b070`
allocation.
The immediate next runtime check is even more concrete than that helper hook, though: inspect the
state that `0x004ea710` should leave behind. Right after `shell_transition_mode` returns, the hook
now logs:
- `0x006d10b0` (`LoadScreen.win` singleton)
- `[LoadScreen.win+0x78]`
- `0x006cec78`
- `[0x006cec74+0x0c]`
- `[0x006cec7c+0x01]`
If `0x004ea710` really ran on the mode-`4` branch, `[LoadScreen.win+0x78]` should no longer be
zero after transition return.
The latest run answered that question directly:
- `shell_transition_mode` still returns cleanly
- `field_active_mode_object` is still the `LoadScreen.win` singleton
- `0x006cec78` is still null
- `[LoadScreen.win+0x78]` is still `0`
- startup selector remains `3`
So the strongest current read is no longer “the helper hooks might be missing a straight-line call.”
At transition return, RT3 still looks like it is parked in the plain `LoadScreen.win` state rather
than having entered the separate runtime-object path at all. The next useful runtime cut is
therefore not deeper inside `shell_transition_mode`, but on the later active-mode service cadence:
does a subsequent service tick on the `LoadScreen.win` object populate `[+0x78]` or promote
`0x006cec78` into the startup-dispatch object on a later frame?
The next run now logs the first few shell-state service ticks after auto-load is attempted with the
same state tuple:
- `0x006cec78`
- `[0x006cec74+0x0c]`
- `0x006d10b0`
- `[LoadScreen.win+0x78]`
- startup selector
So the next question is very narrow: does that tuple stay frozen in the plain `LoadScreen.win`
shape, or does one later service tick finally promote it into the startup-runtime object path?
The latest service-tick run makes that boundary stronger still:
- the first later shell-state service ticks `count=2..8` all keep the same frozen state
- `0x006cec78` stays `0`
- `[shell_state+0x0c]` stays the `LoadScreen.win` singleton
- `[LoadScreen.win+0x78]` stays `0`
So the active-mode service pass itself is not promoting the plain load screen into the startup
runtime object during those first later frames. The next best runtime boundary is now the
`LoadScreen.win` message owner `0x004e3a80`, because that is the remaining live owner most likely
to receive the trigger that seeds page id `[this+0x78]`, allocates the `0x46c40` startup runtime,
and later publishes `0x006cec78`.
One later run did not reach that boundary at all:
- the new `0x004e3a80` hook installed successfully
- but there were no `ready count`, staging, transition, post-transition, or load-screen-message
lines anywhere in the log
- the trace only showed ordinary shell node-vcall traffic before the window was closed
So that run is best treated as "auto-load path not exercised", not as evidence that the
`LoadScreen.win` message owner stayed silent after a successful transition. The next useful runtime
check is therefore one step earlier again: add a small first-few-calls trace on
`shell_state_service_active_mode_frame` itself so we can confirm whether that detour is firing on
the run at all and what mode id and gate mask it sees before the auto-load gate would stage.
That newer service-entry trace now confirms the full cadence:
- the service detour is firing
- the gate does stage and transition on counts `1 -> 2`
- the transition returns cleanly
- later service ticks run with `mode_id = 4`
At the same time, the next two probes are now bounded as negative results on that successful path:
- the `LoadScreen.win` message hook at `0x004e3a80` stayed completely silent
- the plain post-transition state still stays frozen with:
- `0x006cec78 = 0`
- `field_active_mode_object = LoadScreen.win`
- `[LoadScreen.win+0x78] = 0`
So the next best boundary is no longer the message owner itself. It is the shell-runtime prime call
at `0x00538b60`, because `0x00482160` still takes that branch on the null-`0x006cec78` service
path before the later frame-cycle owner `0x00520620`.
The first `0x00538b60` probe run is not trustworthy yet, though:
- the hook installed
- but the log stopped immediately after the first
`shell-state service entry count=1 ... gate_mask=0x7 mode_id=2 ...`
- there were no ready-count lines, no transition lines, and no runtime-prime entry lines
So that result currently reads as "the new runtime-prime instrumentation likely interrupted the
first service pass" rather than as a real RT3 boundary shift. The next corrective step is to log
the matching shell-state service return and to trace the first few `0x00538b60` calls even before
`AUTO_LOAD_ATTEMPTED` becomes true. That will tell us whether the first service pass actually
returns and whether the runtime-prime hook is firing at all.
The static branch under `0x00482160` also adds one more caution: `0x00538b60` is conditional, not
unconditional. The service pass only enters it when the shell runtime at `0x006d401c` is live and
`[shell_state+0xa0] == 0`. So a silent `0x00538b60` probe does not yet prove the shell is frozen
before the runtime-prime call; it may simply mean the `+0xa0` gate stayed nonzero on that service
tick. The next service-entry logs therefore need to include `[shell_state+0xa0]` before we treat
runtime-prime silence as meaningful.
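
The interpretation rule above can be stated as a small decision helper. It is a sketch only: the two conditions come from the static branch under `0x00482160`, while the function name, parameters, and result strings are hypothetical.

```python
def read_runtime_prime_silence(runtime_live: bool, gate_a0: int, prime_hits: int) -> str:
    """Classify one service tick given the logged state and probe hit count.

    runtime_live: whether the shell runtime at 0x006d401c is live
    gate_a0:      logged value of [shell_state+0xa0]
    prime_hits:   number of 0x00538b60 probe entries seen on that tick
    """
    eligible = runtime_live and gate_a0 == 0
    if not eligible:
        # The branch was never taken, so a silent probe says nothing.
        return "not eligible: silence is uninformative"
    if prime_hits > 0:
        return "eligible and entered"
    return "eligible but silent: real boundary or broken probe"


print(read_runtime_prime_silence(True, 1, 0))
print(read_runtime_prime_silence(True, 0, 1))
print(read_runtime_prime_silence(True, 0, 0))
```

Only the third outcome would justify treating runtime-prime silence as a boundary.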
The newer run closes that conditional question:
- `[shell_state+0xa0]` is `0` on the first traced service call
- `0x00538b60` is therefore eligible
- the runtime-prime probe now shows it entering and returning cleanly on that same service tick
The later run closes the next owner too:
- `0x00520620` `shell_service_frame_cycle` also enters and returns cleanly on the same frozen
mode-`4` path
- the logged state matches the generic frame-service branch:
- `[+0x1c] = 0`
- `[+0x28] = 0`
- `flag_56 = 0`
- `[+0x58]` is pulsed and then cleared back to `0`
- `0x006cec78` stays `0`
The newer run closes the next owner below that too:
- `0x0053fda0` enters and returns cleanly on the frozen mode-`4` path
- it is actively servicing the `LoadScreen.win` object itself
- the serviced object keeps `field_1d = 1`, `field_5c = 1`, and a stable child list
- the first child vcall target at `+0x18` stays `0x005595d0`
- `0x006cec78` still stays `0`
So the next live boundary is now the child-service target itself at `0x005595d0`, not the higher
object walker.
The child-service run narrows that again. The first sixteen `0x005595d0` calls under the serviced
`LoadScreen.win` object are stable, presentation-heavy child lanes:
- every child points back to the same parent through `[child+0x86] = LoadScreen.win`
- the early children have `flag_68 = 0x03`, `flag_6a = 0x03`, and return `4`
- the later siblings have `flag_68 = 0x00`, `flag_6a = 0x03`, and return `0`
- `field_b0` stays `0`
- `0x006cec78` still stays `0`
Static disassembly matches that read: `0x005595d0` is gated by `0x00558670` and then spends most
of its body in draw or overlay helpers like `0x54f710`, `0x54f9f0`, `0x54fdd0`, `0x53de00`, and
`0x552560`. So this is a presentation-side child service path, not the missing startup-runtime
promotion.
That moved the next useful runtime target back to the transition-time allocator lane, but the
later jump-table decode changes what that means. The widened `0x0053b070` window below is now
best read as evidence for the plain mode-`4` `LoadScreen.win` arm, not as evidence for the
startup-runtime arm.
The next widened allocator run immediately paid off, but in a narrower way than expected:
- the first traced transition-window allocation is `0x7c`, which matches the static pre-construct
`0x48302a -> 0x53b070` call exactly
- the following `0x111`, `0x84`, `0x3a`, and repeated `0x25` allocations all happen before
`LoadScreen.win` construct returns, so they now read as constructor-side child or control setup
- that means the allocator probe was not disproving the `0x46c40` startup-runtime slice yet; it
was simply exhausting its 16-entry log budget inside the constructor before the later
post-construct block
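
The masking effect of a fixed log budget is easy to reproduce in isolation. The sketch below is illustrative, not the hook's logger: the allocation sizes and the 16-entry budget come from these notes, but the number of repeated `0x25` entries, the late `0x46c40` request, and the `BudgetedProbeLog` class are made up to show how an exhausted budget hides anything that happens later.

```python
class BudgetedProbeLog:
    """Records at most `budget` entries until reset() is called."""

    def __init__(self, budget: int = 16) -> None:
        self.budget = budget
        self.remaining = budget
        self.entries = []

    def record(self, phase: str, size: int) -> bool:
        if self.remaining == 0:
            return False  # budget exhausted: the event is silently dropped
        self.remaining -= 1
        self.entries.append((phase, size))
        return True

    def reset(self) -> None:
        self.remaining = self.budget


# Illustrative stream: 16 constructor-window allocations fill the budget.
constructor_allocs = [0x7C, 0x111, 0x84, 0x3A] + [0x25] * 12
late_alloc = 0x46C40  # hypothetical post-construct request


def run(reset_after_construct: bool) -> list:
    log = BudgetedProbeLog()
    for size in constructor_allocs:
        log.record("construct-window", size)
    if reset_after_construct:
        log.reset()
    log.record("post-construct", late_alloc)
    return [size for phase, size in log.entries if phase == "post-construct"]


print(run(False))  # []
print([hex(s) for s in run(True)])
```

Without the reset, a late allocation would be invisible whether or not it happened, so only a run with the budget reset at the construct boundary can count as a negative result.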
The corrected follow-up run with the log budget reset is now the decisive one: after `LoadScreen.win`
construct returns, there are still no further allocator hits before publish and transition return.
That matches the corrected jump-table decode cleanly, because mode `4` does not own the
`0x46c40 -> 0x4336d0 -> 0x438890` path at all.
The first corrected thunk run also showed one practical problem: the probe became too noisy to be
useful as a boundary marker, because `0x0053b070` is used widely outside the load-screen path.
That still mattered, because it showed the hook-driven transition was taking the same `0x7c`
constructor-side allocation as the plain mode-`4` branch rather than the startup-runtime
allocation size `0x46c40`.
## Manual Owner Tail
@ -134,6 +428,17 @@ The surrounding mode map is tighter now too:
That makes `0x00438890(active_mode, 1, 0)` the strongest current RT3-native entry candidate for reproducing the successful manual load branch, because it owns the internal dispatch that later reaches `0x004390cb`.
The containing shell-mode switcher ABI is tighter now too:
- `0x00482ec0` is not a one-arg mode switch
- it is a `thiscall` with two stack arguments
- the grounded world-entry load-screen call shape at `0x443adf..0x443ae3` is `(4, 0)`
- the function confirms that shape itself by reading the requested mode from `[esp+0x0c]` and
returning with `ret 8`
- the second stack argument is now best read as an old-active-mode teardown flag, because the
`0x482fc6..0x482fff` branch only runs when it is nonzero and then releases the old active-mode
object through `0x00434300`, `0x00433730`, `0x0053b080`, and finally clears `0x006cec78`
Current static xrefs also tighten the broader ownership split:
- `0x00443b57` calls `0x00438890` from the world-entry side, but with `(0, 0)` after dismissing the current shell detail panel and servicing `0x4834e0(0, 0)`
@ -186,10 +491,17 @@ The scripted auto-load debugger run is now useful without manual interaction:
- `0x00438890`
- `0x004390cb`
- `0x00445ac0`
- `0x0053fea6`
- but only `0x0053fea6` actually fired in the captured run
- older runs that also broke on `0x0053fea6` stopped too early on that shell-side crash site
- the default scripted compare flow now keeps only the owner-chain breakpoints above the real load lane
So the current non-interactive path is good enough to gather repeatable crash-side state, but it also tells us that the current auto-load code path is still not obviously traversing the larger-owner breakpoints under `winedbg`. The next step is therefore more hook-side logging around the `0x00438890` call itself rather than more manual debugger work.
So the current non-interactive path is still good enough to gather repeatable crash-side state, but
on this display setup the owner-chain compare flow is also vulnerable to early X11 death:
- `XF86VidModeClientNotLocal`
- process termination before the RT3 owner breakpoints fire
That means the current plain-run hook probes are more reliable than `winedbg` for narrowing the
live stall inside `shell_transition_mode`.
The latest static pivot also means the next reverse-engineering step does not require a live run:
@ -256,8 +568,34 @@ RRT_WINEDBG_LOG=/tmp/rt3-manual-load-winedbg.log tools/run_rt3_winedbg.sh
Ready-made debugger command files are also provided:
- [winedbg_manual_load_445ac0.cmd](/home/jan/projects/rrt/tools/winedbg_manual_load_445ac0.cmd)
- [winedbg_auto_load_crash.cmd](/home/jan/projects/rrt/tools/winedbg_auto_load_crash.cmd)
- [winedbg_auto_load_compare.cmd](/home/jan/projects/rrt/tools/winedbg_auto_load_compare.cmd)
The default auto-load debugger run is now crash-first. It does not set RT3 owner breakpoints.
Instead, it:
- continues immediately
- lets `winedbg` stop on the first exception
- dumps registers
- dumps the top four stack dwords
- prints a backtrace
Use that default when the hook is already known to stage and return from `shell_transition_mode`,
and the current question is the downstream crash site.
If you specifically want the earlier owner-chain compare flow, override the command file:
```bash
RRT_WINEDBG_CMD_FILE=/home/jan/projects/rrt/tools/winedbg_auto_load_compare.cmd \
tools/run_hook_auto_load_winedbg.sh hh
```
Or use the shorter wrapper:
```bash
tools/run_hook_auto_load_winedbg_compare.sh hh
```
If you do not use `RRT_WINEDBG_CMD_FILE`, you can still open those files and paste their contents into the debugger manually.
Both scripts rebuild `rrt-hook`, copy `dinput8.dll` into the Wine RT3 directory, and launch RT3 under `winedbg`.