This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hi Gerry,

I tried to call you a couple of times earlier today; the first time your phone was busy, and the second time I got the canned message "all circuits are busy".

Anyway, we believe that we have found the problem on sasquatch... there is some sort of memory problem that likely caused a reboot on October 15 at around 15:00. Here is the relevant snippet from /var/log/messages for the 15th:

Oct 15 14:25:13 sasquatch snmpd[30901]: transport socket = 12
Oct 15 14:25:13 sasquatch snmpd[30901]: Connection from - 10.2.9.37
Oct 15 14:25:13 sasquatch snmpd[30901]: transport socket = 12
Oct 15 15:00:54 sasquatch syslogd 1.4.1: restart.
Oct 15 15:00:54 sasquatch syslog: syslogd startup succeeded
Oct 15 15:00:54 sasquatch kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 15 15:00:54 sasquatch kernel: Bootdata ok (command line is ro root=/dev/md1 console=tty0, console=ttyS0,9600n8)
Oct 15 15:00:54 sasquatch kernel: Linux version 2.6.9-55.0.2.ELsmp (address@hidden) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 26 14:14:47 EDT 2007
Oct 15 15:00:54 sasquatch kernel: BIOS-provided physical RAM map:

After the machine came back up, a 'general bus error' started being reported:

Oct 15 15:01:02 sasquatch kernel: hub 2-0:1.0: USB hub found
Oct 15 15:01:02 sasquatch kernel: hub 2-0:1.0: 3 ports detected
Oct 15 15:01:02 sasquatch kernel: md: Autodetecting RAID arrays.
Oct 15 15:01:02 sasquatch kernel: md: autorun ...
Oct 15 15:01:02 sasquatch kernel: md: ... autorun DONE.
Oct 15 15:01:02 sasquatch kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Oct 15 15:01:02 sasquatch kernel: MC1: CE page 0x37ab4c, offset 0x438, grain 8, syndrome 0x6d, row 3, channel 1, label "": k8_edac
Oct 15 15:01:02 sasquatch kernel: MC1: CE - no information available: k8_edac Error Overflow set
Oct 15 15:01:02 sasquatch kernel: EDAC k8 MC0: extended error code: ECC error

We think (but are not positive) that the messages being written to /var/log/messages indicate that one bank of memory is bad. That this situation has been ongoing since the 15th can be deduced from the sizes of the /var/log/messages files:

[root@sasquatch ~]# ls -alt /var/log/messages*
-rw------- 1 root root 281057310 Oct 30 16:33 /var/log/messages
-rw------- 1 root root 215566774 Oct 22 04:03 /var/log/messages.1
-rw------- 1 root root   1151633 Oct 15 04:02 /var/log/messages.2
-rw------- 1 root root   1196676 Oct 14 04:02 /var/log/messages.3
-rw------- 1 root root   1186767 Oct 13 04:02 /var/log/messages.4
-rw------- 1 root root   1164272 Oct 12 04:02 /var/log/messages.5
-rw------- 1 root root   1161643 Oct 11 04:02 /var/log/messages.6
-rw------- 1 root root   1140437 Oct 10 04:02 /var/log/messages.7
-rw------- 1 root root   1021564 Oct  9 04:02 /var/log/messages.8
-rw------- 1 root root   1163906 Oct  8 04:02 /var/log/messages.9
-rw------- 1 root root   1184374 Oct  7 04:02 /var/log/messages.10

The end result of the errors being logged in /var/log/messages is that the delivery of EXP data products to machines downstream of sasquatch is showing very high latencies.

Aside: the reason that the latency for weather.renci.org was capped at 3600 seconds is that 3600 seconds is the default maximum latency for the LDM.
I restarted the LDM on weather.renci.org specifying '-m 10000' as the maximum allowable latency, and the latencies shown for the EXP feed have been steadily climbing towards 10000 seconds:

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?EXP+weather.renci.org

What to do? I would say that it is time to take sasquatch down and try to figure out if a bad DIMM really is the problem. I would first reseat all DIMMs; if that doesn't fix the problem, I would then try to figure out which DIMM(s) is (are) bad.

Cheers,

Tom

****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************

Ticket Details
===================
Ticket ID: IIX-893121
Department: Support LDM
Priority: Normal
Status: Closed
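
The reply above infers the onset of the memory errors from the jump in the sizes of the /var/log/messages files. One rough way to cross-check that inference would be to count the EDAC-related lines per day directly. The following is only a sketch, not something used in the original exchange: the function name count_edac_by_day and the marker strings "EDAC k8" and "k8_edac" are assumptions taken from the log excerpt, and the pattern assumes the classic "Mon DD HH:MM:SS host ..." syslog layout shown there.

```python
import re
from collections import Counter

# Leading timestamp of a classic syslog line, e.g. "Oct 15 15:01:02 host ...".
SYSLOG_DAY = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s")

# Assumption: these substrings, copied from the excerpt in the email,
# are enough to identify memory-controller (EDAC) reports.
EDAC_MARKERS = ("EDAC k8", "k8_edac")


def count_edac_by_day(lines):
    """Count EDAC-related lines per 'Mon D' day, in order of first appearance."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in EDAC_MARKERS):
            continue  # not a memory-controller report
        match = SYSLOG_DAY.match(line)
        if match:
            month, day = match.group(1), int(match.group(2))
            counts[f"{month} {day}"] += 1
    return counts
```

Passing an open file works because the function only iterates over lines, for example `count_edac_by_day(open("/var/log/messages", errors="replace"))` run as root. A day on which the count goes from zero to many would be consistent with the October 15 reboot being the start of the problem, though it would not by itself identify which DIMM is at fault.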