Add headless runtime tooling and Campaign.win analysis
parent
57bf0666e0
commit
27172e3786
37 changed files with 11867 additions and 302 deletions
|
|
@@ -69,27 +69,321 @@ Compared to the successful manual path:
|
|||
|
||||
So the hook is no longer missing the coordinator entry shape. The remaining question is no longer "can we reach `0x00445ac0`?" but "does the live non-debugger call return successfully and trigger the actual restore transition?"
|
||||
|
||||
## Latest Live Crash
|
||||
## Latest Plain-Run Narrowing
|
||||
|
||||
The latest non-debugger auto-load run now reaches:
|
||||
The current non-debugger auto-load path no longer looks like the original shell-side crash at
|
||||
`0x0053fea6`.
|
||||
|
||||
- `rrt-hook: auto load ready gate passed`
|
||||
- `rrt-hook: auto load restore calling`
|
||||
The hook-side state machine is now stable up to the handoff into `shell_transition_mode`:
|
||||
|
||||
and then crashes at:
|
||||
- `rrt-hook: auto load shell transition entering`
|
||||
- `rrt-hook: auto load shell unpublish entering`
|
||||
- `rrt-hook: auto load shell unpublish entry this=0x029b3a08 object=0x026d7b88`
|
||||
|
||||
- `0x0053fea6`
|
||||
So the old hook-side gating and bad-call-shape problems are no longer the blocker.
|
||||
|
||||
The local disassembly around `0x0053fe90` shows a shell-side list traversal over `[this+0x74]` that walks linked entries and calls a virtual method on each. The crash instruction at `0x0053fea6` dereferences one traversed entry:
|
||||
The current runtime probes now push the remaining stall much later than the original old-mode
|
||||
teardown inside `shell_transition_mode`:
|
||||
|
||||
- `mov eax, DWORD PTR [esi]`
|
||||
- `shell_transition_mode` enters
|
||||
- old shell-window unpublish at `0x005389c0` enters with:
|
||||
- shell bundle `this = 0x029b3a08`
|
||||
- old object `object = 0x026d7b88`
|
||||
- the inner wrapper `0x005400c0(object)` returns
|
||||
- the full `0x53fe00 -> 0x53f860` remove-node sweep over `[object+0x74]` returns and clears
|
||||
`[object+0x70/+0x74]`
|
||||
- `shell_unpublish` itself then returns cleanly
|
||||
- the nearby mode-`2` teardown helper `0x00502720` returns
|
||||
- `shell_load_screen_window_construct` `0x004ea620` returns
|
||||
- the immediate shell publish through `0x00538e50` returns
|
||||
- `shell_transition_mode` itself returns cleanly
|
||||
|
||||
That strongly suggests the current hook is invoking the restore from the right call shape but on the wrong shell-pump turn. The active hypothesis is now timing or re-entrancy:
|
||||
At the same time, one later load-side probe still does **not** fire:
|
||||
|
||||
- the hook detects readiness and fires restore on the same shell-pump turn
|
||||
- RT3 later re-enters shell object traversal in a phase where one list entry is still invalid
|
||||
- no `shell_active_mode_run_profile_startup_and_load_dispatch` `0x00438890` entry
|
||||
|
||||
So the next experiment is to defer the actual restore by additional ready shell-pump turns instead of firing on the first ready turn.
|
||||
So the current live stall is now best read as:
|
||||
|
||||
- after the old-object unpublish path at `0x005389c0`
|
||||
- after the inner `0x5400c0 -> 0x53fe00 -> 0x53f860` teardown sweep
|
||||
- after the nearby mode-`2` teardown helper `0x00502720`
|
||||
- after the mode-`4` `LoadScreen.win` constructor and immediate shell publish
|
||||
- but still before any trusted runtime evidence that `0x00438890` has entered
|
||||
|
||||
The richer plain-run snapshots now tighten the old-object state too:
|
||||
|
||||
- the old object is still the expected `Setup.win` instance with vtable `0x005d1664`
|
||||
- the shell bundle head and tail both point to that same object
|
||||
- `[object+0x54]` and `[object+0x58]` are both null, so the outer unlink state is consistent
|
||||
- `[object+0x74]` is non-null and the first two linked nodes recovered from `+0x8a` also look
|
||||
structurally sane:
|
||||
- first node `0x02a74470`: vtable `0x005dd870`, type `0xea72`, owner-ish field `0x02a067b8`,
|
||||
next `0x02a04b38`
|
||||
- second node `0x02a04b38`: vtable `0x005dd870`, type `0xea71`, owner-ish field `0x02a067b8`,
|
||||
next `0x02a03e38`
|
||||
|
||||
So the remaining leading hypothesis is no longer "the list head is already garbage." The later
|
||||
shared node vcall target `0x540910` is healthy in general and does not fire on the failing
|
||||
transition path. The newer direct probes narrow it even further: the failing transition still does
|
||||
not reach `0x53fe00` or `0x53f860`. That pushes the current boundary into the tiny wrapper layer
|
||||
between `shell_unpublish` entry and the `0x53fe00` call, with `0x5400c0(object)` now the next
|
||||
useful direct probe.
|
||||
|
||||
The latest plain Wine log also ends with a matching crash:
|
||||
|
||||
- `wine: Unhandled page fault on read access to 02E11000 at address 02E11000`
|
||||
|
||||
Static disassembly sharpened the remaining boundary one step further, but the newer jump-table
|
||||
decode changes the interpretation materially. The startup-runtime slice
|
||||
|
||||
- `0x004ea710`
|
||||
- `0x0053b070(0x46c40)`
|
||||
- `0x004336d0`
|
||||
- `0x00438890`
|
||||
|
||||
is not owned by mode `4`. It is owned by jump-table entry `1` at `0x483012`. Jump-table entry `4`
|
||||
lands at `0x4832e5` instead and only constructs and publishes a plain `LoadScreen.win` object
|
||||
through `0x004ea620` and `0x00538e50`.
|
||||
|
||||
So the next useful probe is no longer the mode-`4` branch’s pre-dispatch runtime-object helper,
|
||||
because mode `4` does not own that startup-runtime path at all. The next useful test is the real
|
||||
startup-dispatch entrypoint: `shell_transition_mode(1, 0)`.
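The corrected decode can be restated as a small lookup. This is only a sketch of the two arms named in these notes; the dictionary layout and the helper name `arm_owns` are illustrative assumptions, not part of RT3 or `rrt-hook`, and arms for other modes are deliberately left out.

```python
# Jump-table arms of shell_transition_mode as described in the notes.
# Only the two decoded arms are listed; "callees" holds just the
# addresses the notes name for each arm, not a full call graph.
JUMP_TABLE_ARMS = {
    1: {
        "arm": 0x00483012,
        "callees": [0x004EA710, 0x0053B070, 0x004336D0, 0x00438890],
    },
    4: {
        "arm": 0x004832E5,
        "callees": [0x004EA620, 0x00538E50],
    },
}


def arm_owns(mode: int, callee: int) -> bool:
    """Return True if the decoded arm for `mode` is noted as reaching `callee`."""
    entry = JUMP_TABLE_ARMS.get(mode)
    return entry is not None and callee in entry["callees"]


print(arm_owns(4, 0x00438890))  # False
print(arm_owns(1, 0x00438890))  # True
```

Read this way, a silent `0x00438890` probe under `shell_transition_mode(4, 0)` is the expected result rather than an anomaly.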
|
||||
|
||||
The latest plain runs tightened that correction one more step:
|
||||
|
||||
- the direct `0x004336d0` runtime-reset probe still does **not** fire
|
||||
- the direct `0x00438890` startup-dispatch probe still does **not** fire
|
||||
- but `shell_transition_mode`, `LoadScreen.win` construction, and the immediate shell publish all
|
||||
still return cleanly
|
||||
|
||||
That no longer means the post-construct startup slice is mysteriously skipped inside mode `4`.
|
||||
Instead, it matches the corrected static decode exactly: the hook has been entering the plain
|
||||
load-screen branch rather than the startup-runtime branch.
|
||||
|
||||
The next best runtime target is therefore no longer another allocator cut under mode `4`. It is a
|
||||
direct test of `shell_transition_mode(1, 0)`, which is the jump-table arm that statically owns the
|
||||
startup-runtime allocation and `0x00438890` dispatch.
|
||||
|
||||
## Current Pause Point
|
||||
|
||||
Current recorded stop point:
|
||||
|
||||
- the old hook-side crash and teardown corruption are resolved
|
||||
- the static jump-table decode at `0x48342c` shows the hook had been entering the wrong arm
|
||||
- `shell_transition_mode(4, 0)` is only the plain `LoadScreen.win` branch
|
||||
- `shell_transition_mode(1, 0)` is the startup-dispatch branch that owns:
|
||||
- `0x004ea710`
|
||||
- `0x0053b070(0x46c40)`
|
||||
- `0x004336d0`
|
||||
- `0x00438890`
|
||||
|
||||
So the next live experiment, when this work resumes, should start from the corrected mode-`1`
|
||||
transition path rather than adding more probes under mode `4`.
|
||||
|
||||
Two corrective notes from the allocator probe passes:
|
||||
|
||||
- the first allocator experiment at `0x005a125d` was not trustworthy, because that shared cdecl
|
||||
body sits behind the `0x0053b070` thunk and the initial hook used the wrong entry shape and
|
||||
split its first internal `call`
|
||||
- the first direct thunk hook on `0x0053b070` was also not trustworthy as implemented, because a
|
||||
copied relative-`jmp` thunk cannot be replayed through an ordinary trampoline
|
||||
|
||||
The next trustworthy allocator boundary is still the exact mode-`4`-branch thunk at `0x0053b070`,
|
||||
but only with a detour that calls the original target `0x005a125d` directly instead of executing
|
||||
the copied thunk bytes.
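The trampoline problem comes down to address arithmetic. The sketch below assumes the thunk is a single 5-byte `E9 rel32` near jump (the notes only say "relative-`jmp` thunk") and uses a made-up trampoline address. It shows that the same bytes resolve to a different target once they are copied elsewhere.

```python
import struct

THUNK_ADDR = 0x0053B070       # thunk address from the notes
TARGET_ADDR = 0x005A125D      # shared cdecl body from the notes
TRAMPOLINE_ADDR = 0x10000000  # hypothetical trampoline buffer address


def encode_jmp_rel32(src: int, dst: int) -> bytes:
    """Encode `jmp dst` as E9 rel32 for an instruction located at `src`."""
    rel = (dst - (src + 5)) & 0xFFFFFFFF
    return b"\xE9" + struct.pack("<I", rel)


def decode_jmp_rel32(src: int, code: bytes) -> int:
    """Return the absolute target of an E9 rel32 jump located at `src`."""
    assert code[0] == 0xE9
    (rel,) = struct.unpack("<i", code[1:5])
    return (src + 5 + rel) & 0xFFFFFFFF


thunk_bytes = encode_jmp_rel32(THUNK_ADDR, TARGET_ADDR)
in_place = decode_jmp_rel32(THUNK_ADDR, thunk_bytes)
copied = decode_jmp_rel32(TRAMPOLINE_ADDR, thunk_bytes)

print(hex(in_place))  # 0x5a125d
print(hex(copied))    # 0x100661ed
```

Because the displacement is relative to the instruction's own address, a detour has to either re-encode the jump for its new location or call `0x005a125d` directly, which is the approach described above.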
|
||||
|
||||
The latest filtered run exposed a more basic gating issue too: the log only reached one
|
||||
`gate mask 0x7` line with `mode_id = 2`, and it never advanced into `ready gate passed`, staging,
|
||||
or transition. So that run did not actually exercise the load-screen startup subchain; it mostly
|
||||
recorded ordinary shell-node activity plus one late ready-state observation. The old default gate
|
||||
of `30` ready polls plus `5` deferred polls was therefore too conservative for this workflow. The
|
||||
next run now lowers those defaults to `1` and `0`, and adds an explicit ready-count log so the
|
||||
trace should either stage immediately or show exactly how far the gate gets.
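The effect of the two gate defaults can be illustrated with a small model. This is a hedged sketch, not the hook's actual code: the class name, the field names, and the choice to reset both counters on a non-ready poll are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoLoadGate:
    ready_polls: int = 1      # ready polls required before staging
    deferred_polls: int = 0   # extra polls to wait once that is reached
    ready_count: int = 0
    deferred_count: int = 0

    def poll(self, ready: bool) -> bool:
        """Feed one service poll; return True when staging may happen."""
        if not ready:
            # Assumption: a non-ready poll restarts the whole gate.
            self.ready_count = 0
            self.deferred_count = 0
            return False
        self.ready_count += 1
        if self.ready_count < self.ready_polls:
            return False
        if self.deferred_count < self.deferred_polls:
            self.deferred_count += 1
            return False
        return True


def polls_until_stage(gate: AutoLoadGate, limit: int = 1000) -> Optional[int]:
    """Count always-ready polls until the gate opens, or None if it never does."""
    for n in range(1, limit + 1):
        if gate.poll(True):
            return n
    return None


print(polls_until_stage(AutoLoadGate()))                                  # 1
print(polls_until_stage(AutoLoadGate(ready_polls=30, deferred_polls=5)))  # 35
```

Under this model the old defaults need 35 uninterrupted ready polls, which is consistent with a run that only observed one late ready-state line and never staged.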
|
||||
|
||||
That gate adjustment worked on the next run: the hook now reaches `ready count`, stages selector
|
||||
`3`, enters `shell_transition_mode`, returns from the `LoadScreen.win` construct and publish
|
||||
helpers, and reports success again. But the allocator side is still unresolved:
|
||||
|
||||
- there is still no trusted `0x46c40` allocator hit from `0x0053b070`
|
||||
- there is still no direct `0x004336d0` runtime-reset entry
|
||||
- there is still no direct `0x00438890` startup-dispatch entry
|
||||
|
||||
So the next clean post-publish boundary is the tiny scalar setter at `0x004ea710`, which is the
|
||||
last straightforward callsite in the static mode-`4` branch immediately before the `0x0053b070`
|
||||
allocation.
|
||||
|
||||
The immediate next runtime check is even more concrete than that helper hook, though: inspect the
|
||||
state that `0x004ea710` should leave behind. Right after `shell_transition_mode` returns, the hook
|
||||
now logs:
|
||||
|
||||
- `0x006d10b0` (`LoadScreen.win` singleton)
|
||||
- `[LoadScreen.win+0x78]`
|
||||
- `0x006cec78`
|
||||
- `[0x006cec74+0x0c]`
|
||||
- `[0x006cec7c+0x01]`
|
||||
|
||||
If `0x004ea710` really ran on the mode-`4` branch, `[LoadScreen.win+0x78]` should no longer be
|
||||
zero after transition return.
|
||||
|
||||
The latest run answered that question directly:
|
||||
|
||||
- `shell_transition_mode` still returns cleanly
|
||||
- `field_active_mode_object` is still the `LoadScreen.win` singleton
|
||||
- `0x006cec78` is still null
|
||||
- `[LoadScreen.win+0x78]` is still `0`
|
||||
- startup selector remains `3`
|
||||
|
||||
So the strongest current read is no longer “the helper hooks might be missing a straight-line call.”
|
||||
At transition return, RT3 still looks like it is parked in the plain `LoadScreen.win` state rather
|
||||
than having entered the separate runtime-object path at all. The next useful runtime cut is
|
||||
therefore not deeper inside `shell_transition_mode`, but on the later active-mode service cadence:
|
||||
does a subsequent service tick on the `LoadScreen.win` object populate `[+0x78]` or promote
|
||||
`0x006cec78` into the startup-dispatch object on a later frame?
|
||||
|
||||
The next run now logs the first few shell-state service ticks after auto-load is attempted with the
|
||||
same state tuple:
|
||||
|
||||
- `0x006cec78`
|
||||
- `[0x006cec74+0x0c]`
|
||||
- `0x006d10b0`
|
||||
- `[LoadScreen.win+0x78]`
|
||||
- startup selector
|
||||
|
||||
So the next question is very narrow: does that tuple stay frozen in the plain `LoadScreen.win`
|
||||
shape, or does one later service tick finally promote it into the startup-runtime object path?
|
||||
|
||||
The latest service-tick run makes that boundary stronger still:
|
||||
|
||||
- the first shell-state service ticks after the transition, `count=2..8`, all keep the same frozen state
|
||||
- `0x006cec78` stays `0`
|
||||
- `[shell_state+0x0c]` stays the `LoadScreen.win` singleton
|
||||
- `[LoadScreen.win+0x78]` stays `0`
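The frozen-state check amounts to scanning the per-tick tuples for the first one in which either promotion signal becomes nonzero. A minimal sketch, assuming the snapshots have already been parsed out of the log; the `ShellSnapshot` type, its field names, and the placeholder pointer value are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class ShellSnapshot(NamedTuple):
    count: int                # service tick counter
    global_6cec78: int        # value read at 0x006cec78
    active_mode_object: int   # [shell_state+0x0c]
    loadscreen_field_78: int  # [LoadScreen.win+0x78]


def first_promotion(snapshots: Sequence[ShellSnapshot]) -> Optional[int]:
    """Return the tick count where either promotion signal turns nonzero."""
    for snap in snapshots:
        if snap.global_6cec78 != 0 or snap.loadscreen_field_78 != 0:
            return snap.count
    return None


LOADSCREEN_PTR = 0x02A00000  # hypothetical pointer value, for illustration only

# Ticks 2..8 as described above: nothing changes.
frozen = [ShellSnapshot(n, 0, LOADSCREEN_PTR, 0) for n in range(2, 9)]
print(first_promotion(frozen))  # None
```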
|
||||
|
||||
So the active-mode service pass itself is not promoting the plain load screen into the startup
|
||||
runtime object during those first later frames. The next best runtime boundary is now the
|
||||
`LoadScreen.win` message owner `0x004e3a80`, because that is the remaining live owner most likely
|
||||
to receive the trigger that seeds page id `[this+0x78]`, allocates the `0x46c40` startup runtime,
|
||||
and later publishes `0x006cec78`.
|
||||
|
||||
One later run did not reach that boundary at all:
|
||||
|
||||
- the new `0x004e3a80` hook installed successfully
|
||||
- but there were no `ready count`, staging, transition, post-transition, or load-screen-message
|
||||
lines anywhere in the log
|
||||
- the trace only showed ordinary shell node-vcall traffic before the window was closed
|
||||
|
||||
So that run is best treated as "auto-load path not exercised", not as evidence that the
|
||||
`LoadScreen.win` message owner stayed silent after a successful transition. The next useful runtime
|
||||
check is therefore one step earlier again: add a small first-few-calls trace on
|
||||
`shell_state_service_active_mode_frame` itself so we can confirm whether that detour is firing on
|
||||
the run at all and what mode id and gate mask it sees before the auto-load gate would stage.
|
||||
|
||||
That newer service-entry trace now confirms the full cadence:
|
||||
|
||||
- the service detour is firing
|
||||
- the gate does stage and transition on counts `1 -> 2`
|
||||
- the transition returns cleanly
|
||||
- later service ticks run with `mode_id = 4`
|
||||
|
||||
At the same time, the next two probes now stand as negative results on that successful path:
|
||||
|
||||
- the `LoadScreen.win` message hook at `0x004e3a80` stayed completely silent
|
||||
- the plain post-transition state still stays frozen with:
|
||||
- `0x006cec78 = 0`
|
||||
- `field_active_mode_object = LoadScreen.win`
|
||||
- `[LoadScreen.win+0x78] = 0`
|
||||
|
||||
So the next best boundary is no longer the message owner itself. It is the shell-runtime prime call
|
||||
at `0x00538b60`, because `0x00482160` still takes that branch on the null-`0x006cec78` service
|
||||
path before the later frame-cycle owner `0x00520620`.
|
||||
|
||||
The first `0x00538b60` probe run is not trustworthy yet, though:
|
||||
|
||||
- the hook installed
|
||||
- but the log stopped immediately after the first
|
||||
`shell-state service entry count=1 ... gate_mask=0x7 mode_id=2 ...`
|
||||
- there were no ready-count lines, no transition lines, and no runtime-prime entry lines
|
||||
|
||||
So that result currently reads as "the new runtime-prime instrumentation likely interrupted the
|
||||
first service pass" rather than as a real RT3 boundary shift. The next corrective step is to log
|
||||
the matching shell-state service return and to trace the first few `0x00538b60` calls even before
|
||||
`AUTO_LOAD_ATTEMPTED` becomes true. That will tell us whether the first service pass actually
|
||||
returns and whether the runtime-prime hook is firing at all.
|
||||
|
||||
The static branch under `0x00482160` also adds one more caution: `0x00538b60` is conditional, not
|
||||
unconditional. The service pass only enters it when the shell runtime at `0x006d401c` is live and
|
||||
`[shell_state+0xa0] == 0`. So a silent `0x00538b60` probe does not yet prove the shell is frozen
|
||||
before the runtime-prime call; it may simply mean the `+0xa0` gate stayed nonzero on that service
|
||||
tick. The next service-entry logs therefore need to include `[shell_state+0xa0]` before we treat
|
||||
runtime-prime silence as meaningful.
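The caution above is a two-input predicate. The sketch below assumes that "live" means a non-null value at `0x006d401c`; the function names and the returned wording are illustrative.

```python
def runtime_prime_eligible(shell_runtime: int, shell_state_a0: int) -> bool:
    """True when the service pass would take the 0x00538b60 branch.

    shell_runtime  : value read at 0x006d401c (assumed: non-null means live)
    shell_state_a0 : value read at [shell_state+0xa0]
    """
    return shell_runtime != 0 and shell_state_a0 == 0


def read_silent_probe(shell_runtime: int, shell_state_a0: int) -> str:
    """Classify a silent 0x00538b60 probe for one service tick."""
    if runtime_prime_eligible(shell_runtime, shell_state_a0):
        return "meaningful: the call was eligible but the probe stayed silent"
    return "inconclusive: the service pass would have skipped the call"


print(read_silent_probe(0x1, 0x1))
# inconclusive: the service pass would have skipped the call
```

Logging both inputs on every traced service entry is what lets a silent probe be read one way or the other.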
|
||||
|
||||
The newer run closes that conditional question:
|
||||
|
||||
- `[shell_state+0xa0]` is `0` on the first traced service call
|
||||
- `0x00538b60` is therefore eligible
|
||||
- the runtime-prime probe now shows it entering and returning cleanly on that same service tick
|
||||
|
||||
The later run closes the next owner too:
|
||||
|
||||
- `0x00520620` `shell_service_frame_cycle` also enters and returns cleanly on the same frozen
|
||||
mode-`4` path
|
||||
- the logged state matches the generic frame-service branch:
|
||||
- `[+0x1c] = 0`
|
||||
- `[+0x28] = 0`
|
||||
- `flag_56 = 0`
|
||||
- `[+0x58]` is pulsed and then cleared back to `0`
|
||||
- `0x006cec78` stays `0`
|
||||
|
||||
The newer run closes the next owner, the object walker at `0x0053fda0`, too:
|
||||
|
||||
- `0x0053fda0` enters and returns cleanly on the frozen mode-`4` path
|
||||
- it is actively servicing the `LoadScreen.win` object itself
|
||||
- the serviced object keeps `field_1d = 1`, `field_5c = 1`, and a stable child list
|
||||
- the first child vcall target at `+0x18` stays `0x005595d0`
|
||||
- `0x006cec78` still stays `0`
|
||||
|
||||
So the next live boundary is now the child-service target itself at `0x005595d0`, not the higher
|
||||
object walker.
|
||||
|
||||
The child-service run narrows that again. The first sixteen `0x005595d0` calls under the serviced
|
||||
`LoadScreen.win` object are stable, presentation-heavy child lanes:
|
||||
|
||||
- every child points back to the same parent through `[child+0x86] = LoadScreen.win`
|
||||
- the early children have `flag_68 = 0x03`, `flag_6a = 0x03`, and return `4`
|
||||
- the later siblings have `flag_68 = 0x00`, `flag_6a = 0x03`, and return `0`
|
||||
- `field_b0` stays `0`
|
||||
- `0x006cec78` still stays `0`
|
||||
|
||||
Static disassembly matches that read: `0x005595d0` is gated by `0x00558670` and then spends most
|
||||
of its body in draw or overlay helpers like `0x54f710`, `0x54f9f0`, `0x54fdd0`, `0x53de00`, and
|
||||
`0x552560`. So this is a presentation-side child service path, not the missing startup-runtime
|
||||
promotion.
|
||||
|
||||
That moved the next useful runtime target back to the transition-time allocator lane, but the
|
||||
later jump-table decode changes what that means. The widened `0x0053b070` window below is now
|
||||
best read as evidence for the plain mode-`4` `LoadScreen.win` arm, not as evidence for the
|
||||
startup-runtime arm.
|
||||
|
||||
The next widened allocator run immediately paid off, but in a narrower way than expected:
|
||||
|
||||
- the first traced transition-window allocation is `0x7c`, which matches the static pre-construct
|
||||
`0x48302a -> 0x53b070` call exactly
|
||||
- the following `0x111`, `0x84`, `0x3a`, and repeated `0x25` allocations all happen before
|
||||
`LoadScreen.win` construct returns, so they now read as constructor-side child or control setup
|
||||
- that means the allocator probe was not disproving the `0x46c40` startup-runtime slice yet; it
|
||||
was simply exhausting its 16-entry log budget inside the constructor before the later
|
||||
post-construct block
|
||||
|
||||
The corrected follow-up run, with the 16-entry log budget reset after the constructor, is now the decisive one: after `LoadScreen.win`
|
||||
construct returns, there are still no further allocator hits before publish and transition return.
|
||||
That matches the corrected jump-table decode cleanly, because mode `4` does not own the
|
||||
`0x46c40 -> 0x4336d0 -> 0x438890` path at all.
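The log-budget pitfall is easy to reproduce in isolation. This is a hedged model rather than the hook's logger: the class, the number of repeated `0x25` entries, and the trailing `0x46c40` allocation are all hypothetical, chosen only to show why a budget that is spent inside the constructor cannot say anything about the post-construct window unless it is reset.

```python
class BudgetedProbeLog:
    """Record at most `budget` allocation sizes between resets."""

    def __init__(self, budget: int = 16) -> None:
        self.budget = budget
        self.remaining = budget
        self.entries = []

    def record(self, size: int) -> bool:
        """Log one allocation size; return False if the budget is spent."""
        if self.remaining == 0:
            return False
        self.remaining -= 1
        self.entries.append(size)
        return True

    def reset(self) -> None:
        """Start a fresh window, e.g. right after the constructor returns."""
        self.remaining = self.budget


# Constructor-side sizes from the notes; the count of 0x25 entries is assumed.
constructor_allocs = [0x7C, 0x111, 0x84, 0x3A] + [0x25] * 12

log = BudgetedProbeLog()
for size in constructor_allocs:
    log.record(size)

print(log.record(0x46C40))  # False: budget already spent in the constructor
log.reset()
print(log.record(0x46C40))  # True: visible once the window is reset
```

In the real follow-up run no such later allocation appeared, which is what makes the negative result trustworthy.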
|
||||
|
||||
The first corrected thunk run also showed one practical problem: the probe became too noisy to be
|
||||
useful as a boundary marker, because `0x0053b070` is used widely outside the load-screen path.
|
||||
That still mattered, because it showed the hook-driven transition was taking the same `0x7c`
|
||||
constructor-side allocation as the plain mode-`4` branch rather than the startup-runtime
|
||||
allocation size `0x46c40`.
|
||||
|
||||
## Manual Owner Tail
|
||||
|
||||
|
|
@@ -134,6 +428,17 @@ The surrounding mode map is tighter now too:
|
|||
|
||||
That makes `0x00438890(active_mode, 1, 0)` the strongest current RT3-native entry candidate for reproducing the successful manual load branch, because it owns the internal dispatch that later reaches `0x004390cb`.
|
||||
|
||||
The containing shell-mode switcher ABI is tighter now too:
|
||||
|
||||
- `0x00482ec0` is not a one-arg mode switch
|
||||
- it is a `thiscall` with two stack arguments
|
||||
- the grounded world-entry load-screen call shape at `0x443adf..0x443ae3` is `(4, 0)`
|
||||
- the function confirms that shape itself by reading the requested mode from `[esp+0x0c]` and
|
||||
returning with `ret 8`
|
||||
- the second stack argument is now best read as an old-active-mode teardown flag, because the
|
||||
`0x482fc6..0x482fff` branch only runs when it is nonzero and then releases the old active-mode
|
||||
object through `0x00434300`, `0x00433730`, `0x0053b080`, and finally clears `0x006cec78`
|
||||
|
||||
Current static xrefs also tighten the broader ownership split:
|
||||
|
||||
- `0x00443b57` calls `0x00438890` from the world-entry side, but with `(0, 0)` after dismissing the current shell detail panel and servicing `0x4834e0(0, 0)`
|
||||
|
|
@@ -186,10 +491,17 @@ The scripted auto-load debugger run is now useful without manual interaction:
|
|||
- `0x00438890`
|
||||
- `0x004390cb`
|
||||
- `0x00445ac0`
|
||||
- `0x0053fea6`
|
||||
- but only `0x0053fea6` actually fired in the captured run
|
||||
- older runs that also broke on `0x0053fea6` stopped too early on that shell-side crash site
|
||||
- the default scripted compare flow now keeps only the owner-chain breakpoints above the real load lane
|
||||
|
||||
So the current non-interactive path is good enough to gather repeatable crash-side state, but it also tells us that the current auto-load code path is still not obviously traversing the larger-owner breakpoints under `winedbg`. The next step is therefore more hook-side logging around the `0x00438890` call itself rather than more manual debugger work.
|
||||
So the current non-interactive path is still good enough to gather repeatable crash-side state, but
|
||||
on this display setup the owner-chain compare flow is also vulnerable to early X11 death:
|
||||
|
||||
- `XF86VidModeClientNotLocal`
|
||||
- process termination before the RT3 owner breakpoints fire
|
||||
|
||||
That means the current plain-run hook probes are more reliable than `winedbg` for narrowing the
|
||||
live stall inside `shell_transition_mode`.
|
||||
|
||||
The latest static pivot also means the next reverse-engineering step does not require a live run:
|
||||
|
||||
|
|
@@ -256,8 +568,34 @@ RRT_WINEDBG_LOG=/tmp/rt3-manual-load-winedbg.log tools/run_rt3_winedbg.sh
|
|||
Ready-made debugger command files are also provided:
|
||||
|
||||
- [winedbg_manual_load_445ac0.cmd](/home/jan/projects/rrt/tools/winedbg_manual_load_445ac0.cmd)
|
||||
- [winedbg_auto_load_crash.cmd](/home/jan/projects/rrt/tools/winedbg_auto_load_crash.cmd)
|
||||
- [winedbg_auto_load_compare.cmd](/home/jan/projects/rrt/tools/winedbg_auto_load_compare.cmd)
|
||||
|
||||
The default auto-load debugger run is now crash-first. It does not set RT3 owner breakpoints.
|
||||
Instead, it:
|
||||
|
||||
- continues immediately
|
||||
- lets `winedbg` stop on the first exception
|
||||
- dumps registers
|
||||
- dumps the top four stack dwords
|
||||
- prints a backtrace
|
||||
|
||||
Use that default when the hook is already known to stage and return from `shell_transition_mode`,
|
||||
and the current question is the downstream crash site.
|
||||
|
||||
If you specifically want the earlier owner-chain compare flow, override the command file:
|
||||
|
||||
```bash
|
||||
RRT_WINEDBG_CMD_FILE=/home/jan/projects/rrt/tools/winedbg_auto_load_compare.cmd \
|
||||
tools/run_hook_auto_load_winedbg.sh hh
|
||||
```
|
||||
|
||||
Or use the shorter wrapper:
|
||||
|
||||
```bash
|
||||
tools/run_hook_auto_load_winedbg_compare.sh hh
|
||||
```
|
||||
|
||||
If you do not use `RRT_WINEDBG_CMD_FILE`, you can still open those files and paste their contents into the debugger manually.
|
||||
|
||||
Both scripts rebuild `rrt-hook`, copy `dinput8.dll` into the Wine RT3 directory, and launch RT3 under `winedbg`.
|
||||