Bug in cmalloc? Was: Re: Problems with: Improvements to fork handling (2/5)

Ryan Johnson ryan.johnson@cs.utoronto.ca
Sun May 29 08:42:00 GMT 2011


On 29/05/2011 12:37 AM, Daniel Colascione wrote:
> On 5/28/11 7:35 PM, Ryan Johnson wrote:
>> On 28/05/2011 8:23 PM, Christopher Faylor wrote:
>>> On Sat, May 28, 2011 at 06:40:30PM -0400, Ryan Johnson wrote:
>>>> On 28/05/2011 4:50 PM, Christopher Faylor wrote:
>>>>> On Wed, May 11, 2011 at 02:31:37PM -0400, Ryan Johnson wrote:
>>>>>> This patch has the parent sort its dll list topologically by
>>>>>> dependencies. Previously, attempts to load a DLL_LOAD dll risked
>>>>>> pulling
>>>>>> in dependencies automatically, and the latter would then not benefit
>>>>> >from the code which "encourages" them to land in the right places. The
>>>>>> dependency tracking is achieved using a simple class which allows to
>>>>>> introspect a mapped dll image and pull out the dependencies it lists.
>>>>>> The code currently rebuilds the dependency list at every fork rather
>>>>>> than attempt to update it properly as modules are loaded and unloaded.
>>>>>> Note that the topsort optimization affects only cygwin dlls, so any
>>>>>> windows dlls which are pulled in dynamically (directly or indirectly)
>>>>>> will still impose the usual risk of address space clobbers.
>>>>> Bad news.
>>>>>
>>>>> I applied this patch and the one after it but then noticed that zsh
>>>>> started
>>>>> producing:  "bad address: " errors.
>>>>>
>>>>> path:4: bad address: /share/bin/dopath
>>>>> term:1: bad address: /bin/tee
>>>>>
>>>>> The errors disappear when I back this patch out.
>>>>>
>>>>> FWIW, I was running "zsh -l".  I have somewhat complicated
>>>>> .zshrc/.zlogin/.zshenv files.  I'll post them if needed.
>>>>>
>>>>> Until this is fixed, this patch and the subsequent ones which rely on
>>>>> it, can't go in.  I did commit this fix but it has been backed out now.
>>>> Hmm. I also see bad address errors in bash sometimes. However, when I
>>>> searched through the cygwin mailing list archives I saw that other
>>>> people have reported this problem in the past [1], so I figured it was
>>>> some existing, sporadic issue rather than my patch...
>>>>
>>>> Could you tell me what a 'bad address' error is? I'd be happy to debug
>>>> this, but really don't know what kind of bug I'm hunting here, except
>>>> that it might be a problem wow64 and suspending threads [2]. Whatever
>>>> became of these bad address errors the last time(s) they cropped up? I
>>>> can't find any resolution with Google, at least.
>>> If I had any insight beyond "It works without the patch and it doesn't
>>> work with it" I would have shared it.
>> Let me rephrase a bit... The error happens too early in fork for gdb to
>> be any help, and I was hoping you could tell me what part(s) of cygwin
>> are capable of "raising" this error -- it seems to be a linux (not
>> Windows) flavor of error message, but the case-insensitive regexp
>> 'bad.address' does not match anything in the cygwin sources.
> The actual string is in strerror.c in newlib, which is why you didn't
> find it with a grep on winsup/cygwin. The error code is EFAULT. There
> are 127 places in the cygwin1.dll where EFAULT can raised, according to
> grep '\bEFAULT\b' *.cc. In most of them, EFAULT is raised after
> something called san has detected that to Windows raised an unexpected
> structured exception. My hunch would be to look in spawn.cc first, in
> spawn_guts, but I haven't read your patch in enough detail to narrow the
> problem down further. strace might be handy.
Wow. You nailed it. I fired up windbg and had it follow a cmd.exe into 
bash --login and descendants, and sure enough, there was an access 
violation when spawn_guts called build_env and the latter called cmalloc.

Anyway, after a long and twisty debug session, I have reason to believe 
that either cmalloc/cfree is buggy, or the bug resides somewhere else 
entirely and my patch just tickles it: freeing memory in the reverse 
order it was allocated in causes Bad Things (tm) to happen, even on an 
unpatched version of my 12 May CVS snapshot (which predates my fork 
patches).

It is a known/intentional restriction of the HEAP_2_DLL that memory must 
be freed in the order it was allocated? The default cygheap most likely 
has the problem as well -- that's why I was crashing and had to call 
cmalloc instead of crealloc(NULL, ...).

(Hairy details and test case below)

Ryan

Details: Disabling the topsort call fixed the problem (no surprise), as 
did allocating dependencies from static storage rather than calling 
malloc. I worried the latter just masked the bug, and sure enough, 
calling cmalloc/cfree on dummy data during the sort brought the bug back 
-- even though the cmalloc'd memory was never touched (??). However, 
calling cmalloc/cfree once per dll instead of performing the topsort did 
not trigger the bug. Reversing order of the list (with cmalloc'd 
dependencies) instead of topsorting it caused the bug to come back.

So, I defined this small function:

static void break_cmalloc(int depth, int maxdepth) {
     void* x = cmalloc (HEAP_2_DLL, 32);
     cfree(x);
     if (depth < maxdepth)
         break_cmalloc(depth+1, maxdepth);
}

and called it during fork instead of dlls.topsort(), with maxdepth=5. No 
bug (as expected).

Then I moved the call to cfree below the recursion, so memory gets freed 
in reverse order. Bang. Bash goes down and takes mintty with it after 
briefly flashing 'bad address.' Calling bash from cmd.exe hangs the 
latter so badly Windows can't kill it (I guess I'll have to reboot).

I rolled back to the last CVS checkout I did (May 12), and the above 
function reproduces the bug there as well. AFAIK, Corinna had merged my 
/proc/maps patch by then, but the only fork-related changes were to fix 
the uninitialized value in that one constructor, and to make failed 
forkees not call dtors.



More information about the Cygwin-patches mailing list