Skip to content

Windows: Implement WOW64/ARM64EC offline compiler backend#5502

Open
bylaws wants to merge 12 commits into
FEX-Emu:mainfrom
bylaws:winofflinecompilerbackedn
Open

Windows: Implement WOW64/ARM64EC offline compiler backend#5502
bylaws wants to merge 12 commits into
FEX-Emu:mainfrom
bylaws:winofflinecompilerbackedn

Conversation

@bylaws

@bylaws bylaws commented May 17, 2026

Copy link
Copy Markdown
Collaborator

Includes many prereq changes, but I think submitting this all at once will probably be easier for review.

This is able to reuse both the WOW64 and ARM64EC image loading APIs by building specialised binaries. This offline compiler is to be invoked once per main image exe, and will compile all prereq libraries under the correct config context into the tagrte codemap dir

@bylaws bylaws force-pushed the winofflinecompilerbackedn branch from 808e7d8 to cbbd654 Compare May 17, 2026 22:07
neobrain and others added 9 commits May 17, 2026 23:08
This allows mapping the code directly into memory for execution.
…ation

This ensures that any code buffer memory owned by the MappedResource is
invalidated before being deallocated.
The lazy code loading refactor replaced LoadData with the new
LoadCache/EnableLoadedSection API but left the Windows path as TODOs.
Implement the wiring: LoadAOTImages now calls LoadCache +
RegisterMappedCodeBuffer for each mapped cache file, and HandleImageMap
calls EnableLoadedSection (with nullptr thread since lazy mapping is not
yet implemented on Windows).
A significant proportion of time with the prior approach was spent in
the unbuffered write calls used once per code page/entrypoint; taking
upwards of 15 seconds to serialize a 700MB P3R cache. Switching to an
mmap approach reduces the CPU time there to less than a second, leaving
I/O to dominate.

A two-pass approach is used:
1. Calculate the required size for the cache file, issue a callback to
   frontend to map a cache file of that size.
2. Write cache data into the buffer provided by that callback.
Merges and sorts relocations from all provided threads before
serializing, enabling multi-threaded offline code cache generation.
Implements an offline JIT compiler backend that compiles x86 code blocks
from a given code map into an ARM64 code cache on Windows. Separate
binaries are built for WOW64 (32-bit) and ARM64EC (64-bit) targets.
@bylaws bylaws force-pushed the winofflinecompilerbackedn branch from cbbd654 to 4e78fc7 Compare May 17, 2026 23:07

// Generic information associated with an executable file.
struct ExecutableFileInfo {
~ExecutableFileInfo();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you figure out what's going on here and fix it properly instead? There can be no reason this change should be needed, and I had flagged this before already.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a proper fix, it's required to enable the use of named init which is much nicer. It's not required in the sense I could do something else, but it would make the user code more ugly? And I don't really sae a point when this works fine

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a proper fix, it's required to enable the use of named init

Nope, why would using designated initializers require this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it deletes the default ctor that's required for designated initialisers. If you try to compile without it you'll see the error

@neobrain neobrain left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a striking argument to make this a separate executable instead of incorporating it into the existing FEXOfflineCompiler? Looking at the diff it seems largely unnecessary to have it separate, when integrating it would more easily enable a future option for automatic cache generation.

I'm also not super thrilled about potentially maintaining the same logic twice, especially with subtle things like atomic cache replacement and stale file deletion going on etc.

Comment on lines +44 to +46
std::optional<FEX::Windows::InvalidationTracker> InvalidationTracker;
std::optional<FEX::Windows::ImageTracker> ImageTracker;
std::optional<FEX::Windows::OvercommitTracker> OvercommitTracker;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be static


const auto CacheDir = std::filesystem::path(FEX::Config::GetCacheDirectory());
const auto ReadyDir = CacheDir / "codemap" / "ready";
const auto MetadataPath = ImageCacheDir / "metadata";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable. Do you not have compiler warnings enabled?

@bylaws

bylaws commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

Was there a striking argument to make this a separate executable instead of incorporating it into the existing FEXOfflineCompiler? Looking at the diff it seems largely unnecessary to have it separate, when integrating it would more easily enable a future option for automatic cache generation.

I'm also not super thrilled about potentially maintaining the same logic twice, especially with subtle things like atomic cache replacement and stale file deletion going on etc.

It needs a bunch of windows specific things, executable loading is entirely different, and we need two separate binaries targeting different architectures. There's also divergence in steamstub using bcrypt. Arm64ec handling requires all sorts of new relocation stuff windows gives us for free, we don't need to deal with flock etc. there's probably more. Atomic cache replacement inherently is different on windows and Linux too so not sure point of unifying there.

I genuinely don't have time to change the fundamental approach here or argue more in either direction. I personally think there's strong justification for doing it this way and it's literally less than 1000 lines, it's not like the windows frontend or invalidation tracker reuses FEX Linux, half of that would need to be reimplemented.

In terms of paths forward, I won't be able to meaningfully contribute enough time to refactor this in such a direction regardless of the outcome of such a discussion. So this so can either go with this design or will be left for someone else to implement. Idm. But I do think that this works perfect rn so

@Sonicadvance1

Copy link
Copy Markdown
Member

Was there a striking argument to make this a separate executable instead of incorporating it into the existing FEXOfflineCompiler? Looking at the diff it seems largely unnecessary to have it separate, when integrating it would more easily enable a future option for automatic cache generation.
I'm also not super thrilled about potentially maintaining the same logic twice, especially with subtle things like atomic cache replacement and stale file deletion going on etc.

It needs a bunch of windows specific things, executable loading is entirely different, and we need two separate binaries targeting different architectures. There's also divergence in steamstub using bcrypt. Arm64ec handling requires all sorts of new relocation stuff windows gives us for free, we don't need to deal with flock etc. there's probably more. Atomic cache replacement inherently is different on windows and Linux too so not sure point of unifying there.

I genuinely don't have time to change the fundamental approach here or argue more in either direction. I personally think there's strong justification for doing it this way and it's literally less than 1000 lines, it's not like the windows frontend or invalidation tracker reuses FEX Linux, half of that would need to be reimplemented.

In terms of paths forward, I won't be able to meaningfully contribute enough time to refactor this in such a direction regardless of the outcome of such a discussion. So this so can either go with this design or will be left for someone else to implement. Idm. But I do think that this works perfect rn so

Seems reasonable enough to keep it as a separate binary to handle Windows specific things.

@neobrain

neobrain commented Jun 1, 2026

Copy link
Copy Markdown
Member

In terms of paths forward, I won't be able to meaningfully contribute enough time to refactor this in such a direction regardless of the outcome of such a discussion. So this so can either go with this design or will be left for someone else to implement.

FTR since you're saying your time constraints don't allow for any sort of iteration, I've integrated this PR into the existing FEXOfflineCompiler and wired up the rest of the code caching infra myself instead. There's actually decent overlap especially with future PE support on the Linux side, just running more end-to-end-testing to ensure it's on par with Linux before PRing this.

@bylaws

bylaws commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator Author

In terms of paths forward, I won't be able to meaningfully contribute enough time to refactor this in such a direction regardless of the outcome of such a discussion. So this so can either go with this design or will be left for someone else to implement.

FTR since you're saying your time constraints don't allow for any sort of iteration, I've integrated this PR into the existing FEXOfflineCompiler and wired up the rest of the code caching infra myself instead. There's actually decent overlap especially with future PE support on the Linux side, just running more end-to-end-testing to ensure it's on par with Linux before PRing this.

this would have been nice to know before I spent my weekend cleaning this up. Presumably your PR is at complete parity with this implementation, including non-executable range handling, volatile metadata handling, arm64x handling, jit guard page handling, multithreading and steam DRM support? as would be strictly required for claiming of decent overlap, for one could not reasonably assess the overlap without full feature parity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants