[Cialug] Crashing with errors in mcelog

Daniel A. Ramaley daniel.ramaley at drake.edu
Sun Mar 1 13:19:03 CST 2009


My main machine at home crashed twice this morning. A few months ago it 
also crashed twice in one day. Normally the machine doesn't crash at 
all. When a crash happens, the screen freezes, the keyboard lights 
start blinking, and the network dies, so the machine is totally 
unresponsive to anything other than the reset button. I opened it up 
and all fans are working properly. But I did notice that 
/var/log/mcelog contains a bunch of entries. The last ~100 lines are 
below. Any idea what these messages actually mean? Some of them mention 
the North Bridge, others look like a problem with one of the DIMMs. Do 
I have a bad DIMM, a bad North Bridge, or a bad something else? The 
RAM in the computer is 4 sticks of 2 GB ECC (8 GB total). The machine 
is running Debian Testing amd64 with kernel 2.6.26.


DDR2 DIMM 333 Mhz Synchronous Width 64 Data Width 72 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Manufacturer: Manufacturer3
Serial Number: SerNum3
Asset Tag: AssetTagNum3
Part Number: PartNum3
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache TSC c38ea32751
ADDR 1d7e62740 
  Data cache ECC error (syndrome 15)
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS d40ac00000000833 MCGSTATUS 0
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
DDR2 DIMM 333 Mhz Synchronous Width 64 Data Width 72 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Manufacturer: Manufacturer3
Serial Number: SerNum3
Asset Tag: AssetTagNum3
Part Number: PartNum3
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit TSC c38ea32c2d
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC c38ea3305a
MISC c008000f00000000 ADDR 1d7e63358 
  Northbridge RAM ECC error
  ECC syndrome = 15
       bit33 = err cpu1
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9c0ac00200000813 MCGSTATUS 0
DDR2 DIMM 333 Mhz Synchronous Width 64 Data Width 72 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Manufacturer: Manufacturer3
Serial Number: SerNum3
Asset Tag: AssetTagNum3
Part Number: PartNum3
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit TSC c38ea19306
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 2d7050dbd77
MISC c008001100000000 ADDR 1d7e63358 
  Northbridge RAM ECC error
  ECC syndrome = 15
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9c0ac00100000813 MCGSTATUS 0
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
DDR2 DIMM 333 Mhz Synchronous Width 64 Data Width 72 Size 2 GB
Device Locator: DIMM3
Bank Locator: BANK3
Manufacturer: Manufacturer3
Serial Number: SerNum3
Asset Tag: AssetTagNum3
Part Number: PartNum3
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit TSC 2d704fc7e14
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS 9000400000000863 MCGSTATUS 0
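
As a cross-check on the decoded lines above, the flag bits can be read
directly out of the STATUS values. A minimal Python sketch follows. The
names decode_status and FLAGS are hypothetical, the bit meanings are
copied from the mcelog text above, and the placement of the ECC syndrome
in bits 47-54 is an assumption that happens to reproduce the syndrome 15
reported for the data cache and northbridge records.

```python
# Bit positions and names copied from the mcelog output in this message.
# Bits 32 and 33 are only meaningful for the northbridge records.
FLAGS = {
    32: "err cpu0",
    33: "err cpu1",
    46: "corrected ecc error",
    59: "misc error valid",
    62: "error overflow (multiple errors)",
}


def decode_status(status_hex):
    """Return (syndrome, flag names) for one STATUS value given as hex text."""
    status = int(status_hex, 16)
    # Collect the name of every known flag bit that is set, lowest bit first.
    flags = [name for bit, name in sorted(FLAGS.items())
             if (status >> bit) & 1]
    # Assumption: the 8-bit ECC syndrome sits in bits 47-54.
    syndrome = (status >> 47) & 0xFF
    return syndrome, flags


if __name__ == "__main__":
    # The five distinct STATUS values from the log above.
    for value in ("d40ac00000000833", "d000400000000863",
                  "9c0ac00200000813", "9c0ac00100000813",
                  "9000400000000863"):
        syndrome, flags = decode_status(value)
        print(value, "syndrome=%x" % syndrome, flags)
```

Run on the five STATUS values above, this reports the same flags that
mcelog printed. The L2 cache records have no bits set in the assumed
syndrome field, which is consistent with mcelog printing no syndrome
for them.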

-- 
------------------------------------------------------------------------
Dan Ramaley                            Dial Center 118, Drake University
Network Programmer/Analyst             2407 Carpenter Ave
+1 515 271-4540                        Des Moines IA 50311 USA

