[Cialug] Crashing with errors in mcelog

Daniel A. Ramaley daniel.ramaley at drake.edu
Wed Mar 4 14:54:37 CST 2009


On 2009-03-03 at 13:55:17, Daniel A. Ramaley wrote:
>On 2009-03-03 at 13:25:56, Aaron Porter wrote:
>>When running memtest on "server grade" hardware it's important to
>>disable ECC in the system BIOS. I've had vendors swear up and down
>>that there are no memory issues as they were testing the error
>>correction and not the memory itself.
>
>Thanks for the hint.
>
>My desktop machine at home is "server grade" hardware, to the limit of
>what budget i was able to rationalize at the time. It does have ECC
>RAM. I'll be sure to disable the ECC when running memtest86.

I'm going to call this thread's issue most likely resolved. Today i 
installed the memtest86 Debian package (which then shows up in the grub 
menu on boot). I rebooted and went into the BIOS. After disabling ECC, 
something interesting happened. The BIOS made a large number of beeps 
on boot and printed a message about memory failure. Interesting. I 
tried rebooting, same behavior. The memory problems didn't prevent the 
machine from booting, however, and it was still able to boot up to a 
Linux desktop. But, recalling that some of the mcelog errors mentioned 
"DIMM3", i figured i'd remove 2 of the 4 DIMMs. (I don't know which slot 
"3" refers to, since i don't know whether mcelog counts from 0 or from 
1, but DIMMs should be installed in pairs in this machine, so either way 
it points at the higher pair.) Upon removing the 2 higher DIMMs, the 
machine started working perfectly, albeit with half the RAM it had 
before. No more BIOS beeps or complaints about bad memory. It is running 
now and presumably will not have more problems.
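
A rough way to see which DIMM labels come up most often in logged errors 
is to tally them. The sketch below is only an illustration: it assumes 
the labels appear as "DIMM" followed by a number (as in the "DIMM3" 
mentioned above), and the function name and sample text are made up for 
the example rather than taken from real mcelog output.

```python
import re
from collections import Counter

# Assumption: DIMM labels look like "DIMM3" or "DIMM 3" in the log text.
# The real wording depends on the platform and the mcelog version.
DIMM_PATTERN = re.compile(r"\bDIMM\s*(\d+)\b", re.IGNORECASE)


def count_dimm_mentions(log_text):
    """Return a Counter mapping DIMM number -> how often it is mentioned."""
    return Counter(int(m.group(1)) for m in DIMM_PATTERN.finditer(log_text))


# Hypothetical placeholder text, not real mcelog output.
sample = (
    "placeholder entry mentioning DIMM3\n"
    "another placeholder entry, also DIMM3\n"
)

for dimm, n in count_dimm_mentions(sample).most_common():
    print(f"DIMM{dimm}: {n} mention(s)")
```

To use it on a real machine, read the log file into a string first and 
pass that to count_dimm_mentions(). The tally only says which label is 
reported most; whether that label counts from 0 or from 1 still has to 
be checked against the motherboard manual.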

After checking how cheap that RAM has become (even the large 2 GB DIMMs 
that my machine has), i just ordered a couple replacements. Probably 
only one of the 2 DIMMs i pulled is bad, but RAM is so cheap that it 
isn't worth my time trying to determine which is which.

My previous statement about memtest86 just flat out not working still 
stands, however. Both before and after removing the bad RAM, when i 
select memtest86 in grub, the screen goes black and less than a second 
later the machine reboots. So, memtest86 is functionally similar to the computer's 
reset button. I was expecting a memory test, not another way to kick 
the machine. I might try booting memtest86 from a CD and see if that 
works any better, but since i seem to have discovered the memory 
problems without it, i probably won't bother. For now i'm running with 
ECC turned off, so memory errors are no longer silently corrected; i 
think that should cause a crash if there are also problems with the 
lower pair of DIMMs.

------------------------------------------------------------------------
Dan Ramaley                            Dial Center 118, Drake University
Network Programmer/Analyst             2407 Carpenter Ave
+1 515 271-4540                        Des Moines IA 50311 USA

