64-bit emacs crashes a lot

Wed Aug 14 14:04:00 GMT 2013

On 10/08/2013 2:01 PM, Ken Brown wrote:
> On 8/10/2013 11:24 AM, Ryan Johnson wrote:
>> On 10/08/2013 9:59 AM, Ken Brown wrote:
>>> On 8/9/2013 11:28 PM, Ryan Johnson wrote:
>>>> On 08/08/2013 2:00 PM, Ryan Johnson wrote:
>>>>> On 08/08/2013 1:42 PM, Ken Brown wrote:
>>>>>> On 8/5/2013 11:29 AM, Ryan Johnson wrote:
>>>>>>> On 05/08/2013 11:00 AM, Ken Brown wrote:
>>>>>>>> On 8/3/2013 3:05 PM, Ryan Johnson wrote:
>>>>>>>>> On 02/08/2013 8:07 AM, Ryan Johnson wrote:
>>>>>>>>>> On 02/08/2013 7:04 AM, Ken Brown wrote:
>>>>>>>>>>> On 8/2/2013 4:02 AM, Corinna Vinschen wrote:
>>>>>>>>>>>> On Aug  1 22:46, Ryan Johnson wrote:
>>>>>>>>>>>>> Here's a new one... I started a compilation, but before it
>>>>>>>>>>>>> actually invoked the command it started pegging the CPU.
>>>>>>>>>>>>> After ^G^G^G, it crashed with the following:
>>>>>>>>>>>>>> Auto-save? (y or n) y
>>>>>>>>>>>>>>       0 [main] emacs 5076 C:\cygwin64\bin\emacs-nox.exe: ***
>>>>>>>>>>>>>> fatal error - Internal error: TP_NUM_W_BUFS too small 2268032 >= 10.
>>>>>>>>>>>>
>>>>>>>>>>>> That looks like a memory overwrite.  2268032 is 0x229b80, which
>>>>>>>>>>>> looks suspiciously like a stack address.  And the overwritten
>>>>>>>>>>>> value is on the stack, too, well within the cygwin TLS area.  If
>>>>>>>>>>>> *this* value gets overwritten, the TLS is probably totally hosed
>>>>>>>>>>>> at this point.  There's just no way to infer the culprit from
>>>>>>>>>>>> this limited info.
>>>>>>>>>>>
>>>>>>>>>>> Could this be BLODA?  Ryan, I noticed that you wrote in a different
>>>>>>>>>>> thread, "I recently migrated to 64-bit cygwin...and so far have not
>>>>>>>>>>> had to disable Windows Defender; the latter was a recurring source
>>>>>>>>>>> of trouble for my previous 32-bit cygwin install on Win7/64."
>>>>>>>>>> This would be a whole new level of nasty from a BLODA... I thought
>>>>>>>>>> they only interfered with fork()?
>>>>>>>>>>
>>>>>>>>>> However, this *is* Windows Defender we're talking about... service
>>>>>>>>>> disabled and all cygwin processes restarted. I'll let you know in a
>>>>>>>>>> day or so if the crashes go away.
>>>>>>>>> Rats. I just had another crash, the "Fatal error 6" variety. Windows
>>>>>>>>> Defender has not turned itself back on (it's been known to do that),
>>>>>>>>> and a scan of the BLODA list didn't match anything else on my system.
>>>>>>>>>
>>>>>>>>> So I don't think it's BLODA...
>>>>>>>>>
>>>>>>>>> Ideas?
>>>>>>>>
>>>>>>>> Not really, other than the obvious: (a) Find a reproducible way of
>>>>>>>> making emacs-nox crash.  (b) Catch the crash in gdb by setting a
>>>>>>>> suitable break point.
>>>> Got one! Looks like a stack overflow somewhere in the garbage 
>>>> collector:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 5316.0x1af4]
>>>> 0x00000001004df44a in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5903
>>>> 5903            if (CONS_MARKED_P (ptr))
>>>> (gdb) bt
>>>> #0  0x00000001004df44a in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5903
>>>> #1  0x00000001004df66e in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5914
>>>> #2  0x00000001004df593 in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5809
>>>> #3  0x00000001004df66e in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5914
>>>> #4  0x00000001004df66e in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5914
>>>> #5  0x00000001004df585 in mark_object (arg=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5808
>>>> #6  0x00000001004dfa4e in mark_vectorlike (
>>>>      ptr=0x100f66f28 <bss_sbrk_buffer+6955080>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5501
>>>> ... snip ...
>>>> #2606 0x00000001004dfaf4 in mark_buffer (buffer=<optimized out>)
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5552
>>>> #2607 0x00000001004dff2c in Fgarbage_collect ()
>>>>      at /usr/src/debug/emacs-24.3-4/src/alloc.c:5181
>>>> #2608 0x0000000000000000 in ?? ()
>>>
>>> I don't know whether 2608 stack frames is unusual or not.  Is this
>>> enough to cause a stack overflow?
>> I don't know the answer to that for emacs, but in general that's an
>> exceedingly deep stack that would normally indicate some sort of
>> infinite recursion. Would you actually expect an object tree in emacs to
>> be 2000+ pointers deep? No plausible non-bug scenarios leap to mind
>> right off...
>
> I'd be very surprised if there were a bug in the garbage collection 
> routine that's causing this.  If there were, I'd expect to see lots of 
> people reporting this.  Could there be some memory corruption that 
> creeps in when you suspend/resume emacs?  You did say that the crashes 
> are less frequent since you deactivated Windows Defender, so I'm not 
> sure you can rule out BLODA.
>
> By the way, are your crashes always related to suspending and resuming 
> emacs?  I don't recall that you said that before, but you keep 
> mentioning ^Z.  Do you still get crashes if you never suspend emacs? 
> You could also try one of the GUI versions of emacs to see if you get 
> crashes.  "Suspending" in that case simply iconifies the frame.
>
>>>
>>>> I have the full backtrace saved to file, let me know if that would be
>>>> useful (there wasn't anything obvious that I could see, just more of the
>>>> same). Meanwhile, I verified that none of the addresses printed is
>>>> repeated, so it doesn't seem to be due to an obvious cycle in the object
>>>> graph.
>>>
>>> From what you've shown, it appears that most of the addresses have
>>> been optimized out.  I think you would need an unoptimized build in
>>> order to check that, wouldn't you?
>> Probably, yes. That's why I said no "obvious" cycles -- at least the 400
>> pointers that are shown don't show a problem.
>>
>>>
>>>> The crash happened when I foregrounded a stopped emacs. I tried playing
>>>> around with various breakpoints while repeatedly sending ^Z, but no luck
>>>> repeating the "feat" yet.
>>>>
>>>> Ideas?
>>>
>>> Can you trigger the bug by calling garbage collection manually (M-x
>>> garbage-collect)?  What happens if you put a breakpoint at
>>> Fgarbage_collect and step through it?  (Again, you might need an
>>> unoptimized build before that will be useful.)
>> I tried breaking on Fgarbage_collect and hitting ^Z, but no luck. I also
>> tried setting a breakpoint on one of those other internal functions,
>> with an ignore count intended to trigger it deep in a GC cycle. It
>> triggered some tens of frames deep and ^Z there didn't cause trouble
>> either. I wonder if the GC cycle just happened to coincide with
>> reactivating emacs (perhaps triggered by some internal timeout that
>> elapsed while it was stopped?)
>>
>>>
>>> There are lots of lisp variables that can be used to control garbage
>>> collection and get information about it.  See the section on garbage
>>> collection in the elisp manual.  For example, you could try
>>> customizing garbage-collection-messages.  Or you could play with
>>> gc-cons-threshold.
>> I didn't see anything glaringly useful there... the messages just
>> announce a GC run, which gdb can catch just fine. There doesn't seem to
>> be any way of tracking how deep an object tree emacs traversed, or how
>> many objects were freed.
>
> Sorry, I misread what the message would be.  I should have said that 
> you could look directly at the output from garbage-collect, which you 
> can see if you evaluate (garbage-collect) in the *scratch* buffer.  
> But, as I said above, I'm not sure that garbage collection is the 
> underlying problem here.
Agreed, it's probably not GC itself. GC would just tend to trip over any
bad pointers that were lurking around.

After a rash of crashes where I either forgot to attach gdb or forgot to
set appropriate breakpoints, I finally managed to catch the stack trace
below. It occurred during M-x compile, while emacs was parsing the
compilation's rather copious output. That is by far the most common type
of crash I've been getting lately. I have no idea how to interpret the
backtrace, though.
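
For what it's worth, the decimal values gdb prints can be compared against
stack pointers mechanically. The sketch below converts a value to hex and
checks whether it lies near one of the pointers gdb printed in hex in the
backtrace below (0x22a180, 0x22a188, 0x22a3b0). The helper name and the
64 KiB window are assumptions chosen for illustration, not anything taken
from cygwin or emacs, and a match is only a hint, not a diagnosis.

```python
# Stack pointers that gdb printed in hex in the backtrace below
# (args=0x22a180, arg_vector=0x22a188, args=0x22a3b0).
KNOWN_STACK_POINTERS = [0x22A180, 0x22A188, 0x22A3B0]


def near_stack(value, window=0x10000):
    """Return True if value is within `window` bytes of a known stack pointer.

    The window size is an arbitrary assumption, not a real stack bound.
    """
    return any(abs(value - p) <= window for p in KNOWN_STACK_POINTERS)


# Values that gdb or the cygwin error message printed in decimal.
for raw in (2268032, 2268896, 4294962344):
    print(hex(raw), near_stack(raw))
```

Here 2268032 is the value from the earlier TP_NUM_W_BUFS error, and 2268896
and 4294962344 are the vector and bytestr arguments shown in frame #4.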

What should I try next? I assume I'll need a debug-compiled emacs so the
backtrace isn't garbage? If so, (a) what is the most straightforward way
to compile emacs-nox that way, and (b) what would I be looking for if I
encountered the stack trace below in a debug build?
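
The repeated-address check mentioned earlier in the thread (no argument
address showing up in more than one frame) can be sketched as follows. This
is a minimal sketch that assumes the backtrace was saved as plain text with
each frame starting with a "#N" marker. The function name is made up for
illustration, and frames whose arguments print as <optimized out> cannot be
checked at all, so an empty result does not rule out a cycle.

```python
import re
from collections import Counter

# Matches argument values printed in hex, e.g. "ptr=0x100f66f28".
# Arguments shown as "<optimized out>" or in decimal are ignored.
ARG_HEX = re.compile(r"\b\w+=(0x[0-9a-fA-F]+)")
FRAME_START = re.compile(r"#\d+\s")


def repeated_arg_addresses(backtrace_text):
    """Return {address: frame_count} for hex arguments seen in more than one frame."""
    counts = Counter()
    frame_args = set()
    for line in backtrace_text.splitlines():
        if FRAME_START.match(line):
            # A new frame begins: record the previous frame's addresses once each.
            counts.update(frame_args)
            frame_args = set()
        frame_args.update(int(m, 16) for m in ARG_HEX.findall(line))
    counts.update(frame_args)  # flush the last frame
    return {hex(addr): n for addr, n in counts.items() if n > 1}


sample = """\
#0  0x01 in mark_object (arg=<optimized out>)
#1  0x02 in mark_vectorlike (ptr=0x100f66f28 <bss_sbrk_buffer+6955080>)
#2  0x03 in mark_vectorlike (ptr=0x100f66f30)
#3  0x04 in mark_vectorlike (ptr=0x100f66f28)
"""
print(repeated_arg_addresses(sample))
```

The sample text is invented to exercise the function; to check a real trace,
read the saved backtrace file and pass its contents in instead.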

Thanks,
Ryan

Breakpoint 2, 0x000000010055d190 in kill ()
(gdb) bt
#0  0x000000010055d190 in kill ()
#1  0x000000010053702e in process_send_signal (process=process@entry=25781889629, signo=signo@entry=2, current_group=<optimized out>, nomsg=nomsg@entry=0)
     at /usr/src/debug/emacs-24.3-4/src/process.c:5948
#2  0x0000000100537198 in Finterrupt_process (process=25781889629, current_group=<optimized out>)
     at /usr/src/debug/emacs-24.3-4/src/process.c:5966
#3  0x00000001004f7761 in Ffuncall (nargs=<optimized out>, args=<optimized out>)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:2781
#4  0x000000010052b5ed in exec_byte_code (bytestr=4294962344, vector=2268896, maxdepth=2, args_template=4303595040, nargs=4304157760, args=0x100902032 <bss_sbrk_buffer+250194>)
     at /usr/src/debug/emacs-24.3-4/src/bytecode.c:900
#5  0x00000001004f7293 in funcall_lambda (fun=25778101277, nargs=nargs@entry=0, arg_vector=arg_vector@entry=0x22a188)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:3010
#6  0x00000001004f75cb in Ffuncall (nargs=nargs@entry=1, args=args@entry=0x22a180)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:2839
#7  0x00000001004f8bef in apply1 (fn=25778613730, fn@entry=4304161216, arg=arg@entry=4304412722)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:2539
#8  0x00000001004f3567 in Fcall_interactively (function=4304161216, record_flag=4304412722, keys=4299711881)
     at /usr/src/debug/emacs-24.3-4/src/callint.c:377
#9  0x00000001004f7752 in Ffuncall (nargs=nargs@entry=4, args=args@entry=0x22a3b0)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:2785
#10 0x00000001004f91b7 in call3 (fn=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, arg3=<optimized out>)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:2603
#11 0x00000001004883cd in Fcommand_execute (cmd=<optimized out>, record_flag=<optimized out>, keys=<optimized out>, special=<optimized out>)
     at /usr/src/debug/emacs-24.3-4/src/keyboard.c:10241
#12 0x0000000100494ae8 in command_loop_1 ()
     at /usr/src/debug/emacs-24.3-4/src/keyboard.c:1587
#13 0x00000001004f5c2e in internal_condition_case (bfun=bfun@entry=0x100494740 <command_loop_1>, handlers=4304470642, hfun=hfun@entry=0x10048ae40 <cmd_error>)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:1289
#14 0x000000010048630a in command_loop_2 (ignore=ignore@entry=4304412722)
     at /usr/src/debug/emacs-24.3-4/src/keyboard.c:1168
#15 0x00000001004f5aef in internal_catch (tag=<optimized out>, func=func@entry=0x1004862e0 <command_loop_2>, arg=4304412722)
     at /usr/src/debug/emacs-24.3-4/src/eval.c:1060
#16 0x000000010048a914 in command_loop ()
     at /usr/src/debug/emacs-24.3-4/src/keyboard.c:1147
#17 recursive_edit_1 ()
     at /usr/src/debug/emacs-24.3-4/src/keyboard.c:779
#18 0x000000010048ac47 in Frecursive_edit ()
     at /usr/src/debug/emacs-24.3-4/src/keyboard.c:843
#19 0x000000010055e8ef in main (argc=<optimized out>, argv=<optimized out>)
     at /usr/src/debug/emacs-24.3-4/src/emacs.c:1537
