NtCreateProcess redux

Tue Apr 26 13:49:00 GMT 2011

On 25/04/2011 6:05 PM, Daniel Colascione wrote:
> On 4/25/2011 12:33 PM, Ryan Johnson wrote:
>> I know that folks have looked before into NtCreateProcess as a way of
>> doing a real fork() in cygwin, but it's very unclear from the various
>> list archives why it's still a bad idea today, other than its being
>> undocumented.
>
> It's a bad idea because it doesn't work.  You can certainly create a 
> forked child with NtCreateProcess, but without being able to connect 
> it to csrss and the rest of the win32 subsystem, this new process is 
> useless.  NtCreateProcess-fork works for Interix because it has its 
> own NT subsystem, but Cygwin has to live within win32, and I don't 
> think creating a new subsystem is feasible for anyone without access 
> to the NT source.
That would definitely go in the Bad Things category... I didn't realize 
you had to manually deal with subsystems. Looking at the NT internals 
book, I see now that their fork example is a thoroughly scary hack 
(casting arbitrary hex numbers to function pointers and trying to call 
them... shudder).

> As far as the address space issue goes: when NT creates a new process, 
> the loader, in ntdll, gains control before the entry point is ever 
> called, and this loader is what's responsible for the initial VM 
> layout.  Because ntdll is a "known dll", you can't replace it with a 
> friendlier implementation.  After the loader completes its work, the 
> kernel does some black magic and resets the initial thread's stack so 
> that it begins executing in the ntdll thread startup routine, so you 
> never actually _see_ the loader executing.
Yes, I've noticed that. Windbg can actually trace the load process, 
though I don't have any debug symbols to know what's going on. There are 
even several nameless dlls which get loaded and unloaded before WOW64 
hands over control.

>
> The only thing that might have a chance of working is to unload 
> everything except user32, kernel32, and a few other components, then 
> start fresh with a more constrained module loading strategy.
Unfortunately, AFAICT it's impossible to unload statically-linked DLLs: 
you can call FreeModule() on their handle, and it returns success, but 
the image remains loaded in memory.

However, the main crazy idea I've been toying with uses the same basic 
premise: make the .exe a minimal stub (maybe not even linking 
cygwin1.dll directly) which dynamically loads a .dll containing all the 
application's code and link-time dependencies.  Doing so would minimize 
the number of address space changes the NT loader could impose during 
process startup. Most fork failures I see right now are due to 
statically-linked dlls moving around, which we can't really do anything 
to avoid or fix, other than calling rebaseall with crossed fingers. At 
least with dynamically-loaded dlls we have a semblance of control.
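
To make the stub idea concrete, here is a rough sketch of the loading 
step only, not a worked-out design. The module path, the entry-point 
name and run_from_module() itself are hypothetical names made up for 
illustration. On Windows it goes through LoadLibraryA/GetProcAddress; 
the dlopen branch is only there so the sketch also builds elsewhere.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical entry point exported by the DLL that holds the real app. */
typedef int (*app_main_fn)(int, char **);

#ifdef _WIN32
#include <windows.h>
typedef HMODULE module_t;
static module_t open_module(const char *path) { return LoadLibraryA(path); }
static app_main_fn find_entry(module_t m, const char *name)
{
    return (app_main_fn)GetProcAddress(m, name);
}
static void close_module(module_t m) { FreeLibrary(m); }
#else
#include <dlfcn.h>
typedef void *module_t;
static module_t open_module(const char *path) { return dlopen(path, RTLD_NOW); }
static app_main_fn find_entry(module_t m, const char *name)
{
    /* Object-to-function pointer cast: the usual dlsym idiom. */
    return (app_main_fn)dlsym(m, name);
}
static void close_module(module_t m) { dlclose(m); }
#endif

/*
 * Load the module at run time, look up its entry point and call it.
 * Returns -1 if the module cannot be loaded, -2 if the entry point is
 * missing, otherwise whatever the entry point returns.
 */
int run_from_module(const char *path, const char *entry, int argc, char **argv)
{
    module_t m = open_module(path);
    if (m == NULL) {
        fprintf(stderr, "stub: cannot load %s\n", path);
        return -1;
    }

    app_main_fn fn = find_entry(m, entry);
    if (fn == NULL) {
        fprintf(stderr, "stub: %s has no symbol %s\n", path, entry);
        close_module(m);
        return -2;
    }

    int rc = fn(argc, argv);
    close_module(m);
    return rc;
}
```

The stub's main() would then be little more than a call to 
run_from_module() with the application DLL's path, so the .exe itself 
carries almost no link-time dependencies.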

Not necessarily what you want to do all the time, but for these 
problematic dll-heavy apps which also like to fork... I'll send a 
separate email soon with more details.

>
> [1] If process A has section S, the contents of which we'd like to 
> duplicate in child-process B as S', and B inherits a handle to S, it's 
> slower to remap S in B and memcpy it to S' than it is to just 
> initialize S' from A's address space with NtCopyVirtualMemory.  But 
> that's the single-threaded case.  It turns out that if we have the 
> child map S somewhere and have one thread touch S[0], S'[0], S[4096], 
> S'[4096], etc. while another thread does a memcpy from S to S', we 
> handily beat the NtCopyVirtualMemory approach.
A trade-off between the cost of traps to fault in pages vs. the cost of 
syscalls to do inter-process memory transfers? It seems like the latter 
would win if you copied enough pages at a time (the actual memcpy cost 
should be about the same either way). What happens if both threads just 
call NtCopyVirtualMemory in parallel?
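
To spell out the arithmetic behind that guess, here is a toy cost model. 
The function names and the cost parameters are placeholders in 
arbitrary units, not measurements, and the memcpy cost is left out on 
the assumption that it is about the same either way.

```c
#include <stddef.h>

/* Fault-driven copy: assume one trap for every page that is touched. */
unsigned long long fault_copy_cost(size_t pages, unsigned long long trap_cost)
{
    return (unsigned long long)pages * trap_cost;
}

/*
 * Syscall-driven copy: assume one call per batch of pages, with the
 * last batch possibly partial. A batch size of 0 is treated as 1.
 */
unsigned long long syscall_copy_cost(size_t pages, size_t batch,
                                     unsigned long long call_cost)
{
    if (batch == 0)
        batch = 1;
    size_t calls = (pages + batch - 1) / batch;   /* ceil(pages / batch) */
    return (unsigned long long)calls * call_cost;
}
```

Under this model the syscall path comes out ahead once the batch size 
exceeds call_cost / trap_cost pages. For example, with a call costing 
ten times a trap, copying one page per call loses, but copying 100 
pages per call wins easily.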

Ryan