[LDM #IIX-893121]: Question...
- Subject: [LDM #IIX-893121]: Question...
- Date: Tue, 30 Oct 2007 15:53:52 -0600
Hi Gerry,
I tried to call you a couple of times earlier today; the first time your phone
was busy and the second time I got the canned message "all circuits are busy".
Anyway, we believe that we have found the problem on sasquatch... there is some
sort of a memory problem that likely caused a reboot on October 15 at around
15:00. Here is the relevant snippet from /var/log/messages for the 15th:
Oct 15 14:25:13 sasquatch snmpd[30901]: transport socket = 12
Oct 15 14:25:13 sasquatch snmpd[30901]: Connection from - 10.2.9.37
Oct 15 14:25:13 sasquatch snmpd[30901]: transport socket = 12
Oct 15 15:00:54 sasquatch syslogd 1.4.1: restart.
Oct 15 15:00:54 sasquatch syslog: syslogd startup succeeded
Oct 15 15:00:54 sasquatch kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 15 15:00:54 sasquatch kernel: Bootdata ok (command line is ro root=/dev/md1 console=tty0, console=ttyS0,9600n8)
Oct 15 15:00:54 sasquatch kernel: Linux version 2.6.9-55.0.2.ELsmp (address@hidden) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 26 14:14:47 EDT 2007
Oct 15 15:00:54 sasquatch kernel: BIOS-provided physical RAM map:
After the machine came back up, a 'general bus error' started being reported:
Oct 15 15:01:02 sasquatch kernel: hub 2-0:1.0: USB hub found
Oct 15 15:01:02 sasquatch kernel: hub 2-0:1.0: 3 ports detected
Oct 15 15:01:02 sasquatch kernel: md: Autodetecting RAID arrays.
Oct 15 15:01:02 sasquatch kernel: md: autorun ...
Oct 15 15:01:02 sasquatch kernel: md: ... autorun DONE.
Oct 15 15:01:02 sasquatch kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Oct 15 15:01:02 sasquatch kernel: MC1: CE page 0x37ab4c, offset 0x438, grain 8, syndrome 0x6d, row 3, channel 1, label "": k8_edac
Oct 15 15:01:02 sasquatch kernel: MC1: CE - no information available: k8_edac Error Overflow set
Oct 15 15:01:02 sasquatch kernel: EDAC k8 MC0: extended error code: ECC error
We think (but are not positive) that the messages being written to
/var/log/messages indicate that one bank of memory is bad.
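One way to test that suspicion is to tally the corrected-error (CE) lines by memory controller, row, and channel: if nearly all of them point at the same location, a single DIMM or bank is the likely culprit. The following is only a sketch. The regular expression is modeled on the one CE line quoted above, so other kernel or EDAC versions may log a different format, and the function name tally_corrected_errors is made up for this illustration.

```python
import re
from collections import Counter

# Pattern modeled on the single CE line quoted in this message; this is an
# assumption about the format, not a definitive EDAC log grammar.
CE_PATTERN = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def tally_corrected_errors(lines):
    """Count CE reports per (memory controller, row, channel)."""
    counts = Counter()
    for line in lines:
        match = CE_PATTERN.search(line)
        if match:
            key = (
                int(match.group("mc")),
                int(match.group("row")),
                int(match.group("channel")),
            )
            counts[key] += 1
    return counts


# The CE line from the snippet above, as it would appear unwrapped in the log.
sample = [
    'Oct 15 15:01:02 sasquatch kernel: MC1: CE page 0x37ab4c, offset 0x438, '
    'grain 8, syndrome 0x6d, row 3, channel 1, label "": k8_edac',
    "Oct 15 15:01:02 sasquatch kernel: md: autorun ...",
]

for (mc, row, channel), count in tally_corrected_errors(sample).most_common():
    print(f"MC{mc} row {row} channel {channel}: {count} corrected error(s)")
```

To run it against the real log, pass an open file object for /var/log/messages to tally_corrected_errors() in place of the sample list. Lines of the "CE - no information available" form carry no row or channel and are deliberately skipped.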
That this situation has been ongoing since the 15th can be deduced from the
sizes of the /var/log/messages files:
[root@sasquatch ~]# ls -alt /var/log/messages*
-rw------- 1 root root 281057310 Oct 30 16:33 /var/log/messages
-rw------- 1 root root 215566774 Oct 22 04:03 /var/log/messages.1
-rw------- 1 root root 1151633 Oct 15 04:02 /var/log/messages.2
-rw------- 1 root root 1196676 Oct 14 04:02 /var/log/messages.3
-rw------- 1 root root 1186767 Oct 13 04:02 /var/log/messages.4
-rw------- 1 root root 1164272 Oct 12 04:02 /var/log/messages.5
-rw------- 1 root root 1161643 Oct 11 04:02 /var/log/messages.6
-rw------- 1 root root 1140437 Oct 10 04:02 /var/log/messages.7
-rw------- 1 root root 1021564 Oct 9 04:02 /var/log/messages.8
-rw------- 1 root root 1163906 Oct 8 04:02 /var/log/messages.9
-rw------- 1 root root 1184374 Oct 7 04:02 /var/log/messages.10
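To put a number on the jump, the sizes in the listing above can be compared against the average of the nine files that predate the reboot. This is a rough sketch using only the byte counts shown; the variable names and the growth_factor helper are illustrative.

```python
# Sizes in bytes copied from the ls listing above.
# messages.2 through messages.10 predate the October 15 reboot.
baseline_sizes = [
    1151633, 1196676, 1186767, 1164272, 1161643,
    1140437, 1021564, 1163906, 1184374,
]

# The two files written after the reboot.
flooded_sizes = {
    "/var/log/messages.1": 215566774,
    "/var/log/messages": 281057310,
}


def growth_factor(size, typical_size):
    """Return how many times larger `size` is than `typical_size`."""
    return size / typical_size


typical = sum(baseline_sizes) / len(baseline_sizes)
print(f"typical pre-reboot size: {typical:,.0f} bytes")

for name, size in flooded_sizes.items():
    factor = growth_factor(size, typical)
    print(f"{name}: {size:,} bytes, about {factor:.0f}x the typical size")
```

Both post-reboot files come out well over a hundred times larger than the earlier ones, which is consistent with the error messages being logged continuously since the 15th.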
The end result of the errors being logged in /var/log/messages is that
the delivery of EXP data products to machines downstream of sasquatch
is showing very high latencies.
Aside: the latency for weather.renci.org was capped at 3600 seconds
because that is the default maximum latency for the LDM. I restarted the
LDM on weather.renci.org specifying '-m 10000' as the maximum allowable
latency, and the latencies shown for the EXP feed have been
steadily climbing towards 10000 seconds:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?EXP+weather.renci.org
What to do? I would say that it is time to take sasquatch down and try
to figure out if a bad memory DIMM really is the problem. I would first
reseat all DIMMs; if that doesn't fix the problem, I would then try to
figure out which DIMM(s) is(are) bad.
Cheers,
Tom
****************************************************************************
Unidata User Support UCAR Unidata Program
(303) 497-8642 P.O. Box 3000
address@hidden Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage http://www.unidata.ucar.edu
****************************************************************************
Ticket Details
===================
Ticket ID: IIX-893121
Department: Support LDM
Priority: Normal
Status: Closed