Re: 20010123: Strange LDM freezes
- Date: Wed, 31 Jan 2001 17:36:01 -0700
Pete Stamus wrote:
>
> Hi Anne...
>
> We had another freeze last night, around 0350z (like last time). I have a
> script running that checks the size of the metar files, and if they're
> missing or too small it emails my pager, so I started looking at the system
> around 0415z. The system seemed to be responding normally (wasn't sluggish
> or anything). So, I started collecting stuff that you suggested...here are
> a few hundred lines of it :)
>
> The first thing I did was an "iostat"
>
>       tty       fd0           sd0          sd16          nfs1          cpu
>  tin tout  kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
>    0    1    0   0    0  326  14  110    0   0    0    0   0    0   7  6  4 84
>
> Then I did an "ldmadmin check", which just sat there. After about 2 min with
> no messages or any signs of activity, I did a ^y out of it. Then I did an
> "ldmadmin queuecheck", which returned without a message. Next I did a
> "ps -elf"
>
> F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
> 19 T root 0 0 0 0 SY fec14660 0 Jan 10 ? 0:05 sched
> 8 S root 1 0 0 41 20 e0a5e728 462 e0a5e94c Jan 10 ? 6:05 /etc/init -
> 19 S root 2 0 0 0 SY e0a5e008 0 fec2bc50 Jan 10 ? 0:00 pageout
> 19 S root 3 0 0 0 SY e0a61730 0 fec7b008 Jan 10 ? 13:01 fsflush
> 8 S root 248 237 0 40 20 e0aac738 509 e07d1af8 Jan 10 ? 0:00 /opt/nport/bin/inge
> 8 S root 233 1 0 41 20 e0d6c778 367 e0729216 Jan 10 ? 0:01 /usr/lib/nfs/nfsd -a 16
> 8 S root 240 1 0 41 20 e0a61010 336 e07d1d38 Jan 10 ? 0:01 /usr/lib/saf/sac -t 300
> 8 S daemon 147 1 0 67 20 e0aac018 495 e07296c6 Jan 10 ? 0:00 /usr/lib/nfs/statd
> 8 S root 197 1 0 41 20 e0ab6020 522 e0729df6 Jan 10 ? 0:00 /usr/sbin/vold
> 8 S root 107 1 0 40 20 e0c3d748 479 e0729996 Jan 10 ? 1:24 /usr/sbin/rpcbind
> 8 S root 109 1 0 46 20 e0c3d028 458 e07298f6 Jan 10 ? 0:00 /usr/sbin/keyserv
> 8 S root 151 1 0 51 20 e0c9f750 409 e07297b6 Jan 10 ? 0:00 /usr/sbin/inetd -s
> 8 S root 183 1 0 41 20 e0c9f030 681 e07294e6 Jan 10 ? 0:00 /usr/lib/lpsched
> 8 S root 154 1 0 41 20 e0cae758 816 e07295d6 Jan 10 ? 2:45 /usr/sbin/syslogd
> 8 S root 231 1 0 41 20 e0ab6740 609 8052dd0 Jan 10 ? 0:02 /usr/lib/nfs/mountd
> 8 S root 143 1 0 67 20 e0cd9760 370 e07298a6 Jan 10 ? 0:00 /usr/lib/nfs/lockd
> 8 S root 163 1 0 41 20 e0cd9040 364 e07d1eb8 Jan 10 ? 0:00 /usr/sbin/cron
> 8 Z root 244 237 0 0 0:00 <defunct>
> 8 S root 196 1 0 41 20 e0d41048 205 e0729536 Jan 10 ? 0:01 /usr/lib/utmpd
> 8 S root 16766 1 0 51 20 e0d41768 361 e0729bc6 Jan 29 console 0:00 /usr/lib/saf/ttymon -g -h -p noaapo
> 8 S root 225 1 0 40 8 e0d4c050 434 e0d4c274 Jan 10 ? 0:08 /usr/lib/inet/xntpd
> 8 S root 237 1 0 40 20 e0d6c058 685 e07d1a38 Jan 10 ? 1996:04 /opt/nport/bin/inge
> 8 S root 243 240 0 41 20 e0d6e780 358 e0d6e9a4 Jan 10 ? 0:01 /usr/lib/saf/ttymon
> 8 S ldm 26412 26410 0 41 20 e1235758 31534 e123597c Jan 26 ? 7:38 pqact
> 8 S ldm 26413 26410 0 41 20 e0726108 31537 e0db2606 Jan 26 ? 27:11 pqing -5 -f WMO -v -P 1501 noaaport
> 8 S ldm 26410 1 0 40 20 e0bd6008 31507 e0dcd17e Jan 26 ? 0:00 rpc.ldmd -v -q /home/ldm/data/ldm.p
> 8 S nobody 1277 151 0 41 20 e0d4c770 185 e07d1978 Jan 10 ? 17:36 cat /tmp/jmb.fifo.2
> 8 R ldm 13574 13572 0 41 20 e1b431c0 433 21:13:41 pts/0 0:00 -csh
> 8 S nobody 26414 151 0 41 20 e1b438e0 185 e0db2a26 Jan 26 ? 0:35 cat /tmp/jmb.fifo.1
> 8 S root 13747 13746 0 0 RT e1be48a8 233 e0ab05b8 21:22:07 ? 0:00 /opt/nport/bin/inge
> 8 S ldm 26428 26410 0 41 20 e0ffd028 31507 e0ffd24c Jan 26 ? 4:11 rpc.ldmd -v -q /home/ldm/data/ldm.p
> 8 O root 13748 13574 0 41 20 e0726828 329 21:22:09 pts/0 0:00 ps -elf
> 8 S root 13746 13745 0 0 RT e17ae170 493 e072c74c 21:22:02 ? 0:00 /opt/nport/bin/inge
> 8 S ldm 26411 26410 0 41 20 e0d6e060 31572 e0d6e284 Jan 26 ? 3:16 pqbinstats
> 8 S root 13572 151 0 61 20 e0bd6728 410 e0dcddae 21:13:41 ? 0:00 in.telnetd
> 8 S root 13745 1 0 0 RT e16d2178 493 e0729064 21:21:57 ? 0:00 /opt/nport/bin/inge
>
> Then I did an "lsof -p <pid>" on the rpc.ldmd
>
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> rpc.ldmd 26410 ldm cwd VDIR 63,3 10240 36160 /var/data/ldm/logs
> rpc.ldmd 26410 ldm txt VREG 63,6 1393856 392858 /home (/dev/dsk/c1t0d0s6)
> rpc.ldmd 26410 ldm txt VREG 63,3 127057920 72262 /var/data/ldm/ldm.pq
> rpc.ldmd 26410 ldm txt VREG 63,4 227076 6194 /usr/lib/libresolv.so.2
> rpc.ldmd 26410 ldm txt VREG 63,4 11488 6212 /usr/lib/nss_dns.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 26392 6213 /usr/lib/nss_files.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 9940 6217 /usr (/dev/dsk/c1t0d0s4)
> rpc.ldmd 26410 ldm txt VREG 63,4 17388 6183 /usr/lib/libmp.so.2
> rpc.ldmd 26410 ldm txt VREG 63,4 936736 6175 /usr/lib/libc.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 52988 6201 /usr/lib/libsocket.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 684112 6186 /usr/lib/libnsl.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 26256 156841 /usr/ucblib/librpcsoc.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 5636 6210 /usr (/dev/dsk/c1t0d0s4)
> rpc.ldmd 26410 ldm txt VREG 63,4 59624 6223 /usr/lib/libm.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 4372 6167 /usr/lib/libdl.so.1
> rpc.ldmd 26410 ldm txt VREG 63,4 173272 6064 /usr/lib/ld.so.1
> rpc.ldmd 26410 ldm 0u inet 0xe13f0018 0t0 TCP *:ldm (LISTEN)
> rpc.ldmd 26410 ldm 1u VREG 63,3 127057920 72262 /var/data/ldm/ldm.pq
> rpc.ldmd 26410 ldm 2w VCHR 21,0 0t0 60206 /devices/pseudo/log@0:conslog->LOG
>
> and on the pqing process
>
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> pqing 26413 ldm cwd VDIR 63,3 10240 36160 /var/data/ldm/logs
> pqing 26413 ldm txt VREG 63,6 1074520 392842 /home (/dev/dsk/c1t0d0s6)
> pqing 26413 ldm txt VREG 63,3 127057920 72262 /var/data/ldm/ldm.pq
> pqing 26413 ldm txt VREG 63,4 26392 6213 /usr/lib/nss_files.so.1
> pqing 26413 ldm txt VREG 63,4 17388 6183 /usr/lib/libmp.so.2
> pqing 26413 ldm txt VREG 63,4 936736 6175 /usr/lib/libc.so.1
> pqing 26413 ldm txt VREG 63,4 52988 6201 /usr/lib/libsocket.so.1
> pqing 26413 ldm txt VREG 63,4 684112 6186 /usr/lib/libnsl.so.1
> pqing 26413 ldm txt VREG 63,4 26256 156841 /usr/ucblib/librpcsoc.so.1
> pqing 26413 ldm txt VREG 63,4 59624 6223 /usr/lib/libm.so.1
> pqing 26413 ldm txt VREG 63,4 4372 6167 /usr/lib/libdl.so.1
> pqing 26413 ldm txt VREG 63,4 173272 6064 /usr/lib/ld.so.1
> pqing 26413 ldm 0u inet 0xe13f0398 0x2531eaf2 TCP noaaport.colorado-research.com:32951->noaaport.colorado-research.com:nporta (ESTABLISHED)
> pqing 26413 ldm 1w VCHR 0,0 0t0 60201 /devices/pseudo/cn@0:console
> pqing 26413 ldm 2w VCHR 21,0 0t0 60206 /devices/pseudo/log@0:conslog->LOG
> pqing 26413 ldm 3u VREG 63,3 127057920 72262 /var/data/ldm/ldm.pq
>
> Here are the last system messages from /var/adm/messages
>
> Jan 28 10:59:59 noaaport pqing[26413]: Not a WMO format message. 145 976 @RU00 KWBC 281059
> Jan 28 11:03:43 noaaport pqing[26413]: Not a WMO format message. 108 146 @RU00 KWBC 281103
> Jan 28 12:04:15 noaaport pqing[26413]: Not a WMO format message. 108 719 @RU00 KWBC 281204
> Jan 28 13:05:03 noaaport pqing[26413]: Not a WMO format message. 108 341 @RU00 KWBC 281305
> Jan 28 16:01:44 noaaport pqing[26413]: Not a WMO format message. 108 469 @RU00 KWBC 281601
> Jan 28 16:56:10 noaaport pqing[26413]: Not a WMO format message. 100 188 @RU00 KWBC 281656
> Jan 28 17:02:54 noaaport pqing[26413]: Not a WMO format message. 100 004 @RU00 KWBC 281702
> Jan 28 17:04:04 noaaport pqing[26413]: Not a WMO format message. 108 591 @RU00 KWBC 281704
> Jan 28 19:37:00 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 28 19:50:34 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 28 20:25:21 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 28 23:03:37 noaaport pqing[26413]: Not a WMO format message. 114 557 @RU00 KWBC 282303
> Jan 28 23:03:59 noaaport pqing[26413]: Not a WMO format message. 114 681 @RU00 KWBC 282303
> Jan 29 00:14:42 noaaport pqing[26413]: Not a WMO format message. 114 665 @RU00 KWBC 290014
> Jan 29 01:08:52 noaaport pqing[26413]: Not a WMO format message. 114 491 @RU00 KWBC 290108
> Jan 29 02:26:48 noaaport pqing[26413]: Not a WMO format message. 114 321 @RU00 KWBC 290226
> Jan 29 03:09:45 noaaport pqing[26413]: Not a WMO format message. 114 042 @RU00 KWBC 290309
> Jan 29 04:51:13 noaaport pqing[26413]: Not a WMO format message. 99 850 @RU00 KWBC 290451
> Jan 29 08:44:21 noaaport pqing[26413]: Not a WMO format message. 114 149 @RU00 KWBC 290844
> Jan 29 10:12:54 noaaport pqing[26413]: Not a WMO format message. 108 179 @RU00 KWBC 291012
> Jan 29 10:47:01 noaaport pqing[26413]: Not a WMO format message. 118 188 @RU00 KWBC 291047
> Jan 29 13:06:45 noaaport pqing[26413]: Not a WMO format message. 108 140 @RU00 KWBC 291306
> Jan 29 15:10:01 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 29 15:10:39 noaaport last message repeated 1 time
> Jan 29 08:26:33 noaaport unix: WARNING: iprb0: no MII link detected
> Jan 29 08:26:38 noaaport unix: NOTICE: iprb0: 100 Mbps full-duplex link up
> Jan 29 20:10:38 noaaport pqing[26413]: Not a WMO format message. 108 129 @RU00 KWBC 292010
> Jan 29 21:34:08 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 29 22:14:10 noaaport pqing[26413]: Not a WMO format message. 122 428 @RU00 KWBC 292214
> Jan 29 22:46:17 noaaport pqing[26413]: Not a WMO format message. 132 597 @RU00 KWBC 292246
> Jan 30 00:06:34 noaaport pqing[26413]: Not a WMO format message. 117 509 @RU00 KWBC 300006
> Jan 30 01:35:48 noaaport pqing[26413]: Not a WMO format message. 114 078 @RU00 KWBC 300135
> Jan 30 02:13:01 noaaport pqing[26413]: Not a WMO format message. 114 294 @RU00 KWBC 300213
> Jan 30 02:20:46 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 03:04:17 noaaport pqing[26413]: Not a WMO format message. 114 022 @RU00 KWBC 300304
> Jan 30 04:56:45 noaaport pqing[26413]: Not a WMO format message. 132 954 @RU00 KWBC 300456
> Jan 30 07:38:52 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 07:40:15 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 08:23:34 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 14:07:48 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 16:54:50 noaaport pqing[26413]: Not a WMO format message. 146 122 @RU00 KWBC 301654
> Jan 30 20:03:25 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 30 22:53:06 noaaport pqing[26413]: Not a WMO format message. 152 626 @RU00 KWBC 302253
> Jan 31 00:06:00 noaaport pqing[26413]: Not a WMO format message. 114 753 @RU00 KWBC 310006
> Jan 31 01:17:49 noaaport pqing[26413]: Not a WMO format message. 114 829 @RU00 KWBC 310117
> Jan 31 02:07:20 noaaport pqing[26413]: Not a WMO format message. 114 791 @RU00 KWBC 310207
> Jan 31 02:44:22 noaaport pqing[26413]: scan_wmo_binary: length 15 too short
> Jan 31 03:08:26 noaaport pqing[26413]: Not a WMO format message. 114 117 @RU00 KWBC 310308
>
> Then I did an "ldmadmin stop", and it said that it was stopping the ldm.
> When it was done, I did an "ldmadmin ps", and it said that there were no
> ldm processes running. But I did a "ps -ef" and both the 'rpc.ldmd' and
> 'pqing' processes were still there
>
> F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
> 19 T root 0 0 0 0 SY fec14660 0 Jan 10 ? 0:05 sched
> 8 S root 1 0 0 41 20 e0a5e728 462 e0a5e94c Jan 10 ? 6:05 /etc/init -
> 19 S root 2 0 0 0 SY e0a5e008 0 fec2bc50 Jan 10 ? 0:00 pageout
> 19 S root 3 0 0 0 SY e0a61730 0 fec7b008 Jan 10 ? 13:01 fsflush
> 8 S root 248 237 0 40 20 e0aac738 509 e07d1af8 Jan 10 ? 0:00 /opt/nport/bin/inge
> 8 S root 233 1 0 41 20 e0d6c778 367 e0729216 Jan 10 ? 0:01 /usr/lib/nfs/nfsd -a 16
> 8 S root 240 1 0 41 20 e0a61010 336 e07d1d38 Jan 10 ? 0:01 /usr/lib/saf/sac -t 300
> 8 S daemon 147 1 0 67 20 e0aac018 495 e07296c6 Jan 10 ? 0:00 /usr/lib/nfs/statd
> 8 S root 197 1 0 41 20 e0ab6020 522 e0729df6 Jan 10 ? 0:00 /usr/sbin/vold
> 8 S root 107 1 0 41 20 e0c3d748 479 e0729996 Jan 10 ? 1:24 /usr/sbin/rpcbind
> 8 S root 109 1 0 46 20 e0c3d028 458 e07298f6 Jan 10 ? 0:00 /usr/sbin/keyserv
> 8 S root 151 1 0 51 20 e0c9f750 409 e07297b6 Jan 10 ? 0:00 /usr/sbin/inetd -s
> 8 S root 183 1 0 41 20 e0c9f030 681 e07294e6 Jan 10 ? 0:00 /usr/lib/lpsched
> 8 S root 154 1 0 41 20 e0cae758 816 e07295d6 Jan 10 ? 2:45 /usr/sbin/syslogd
> 8 S root 231 1 0 41 20 e0ab6740 609 8052dd0 Jan 10 ? 0:02 /usr/lib/nfs/mountd
> 8 S root 143 1 0 67 20 e0cd9760 370 e07298a6 Jan 10 ? 0:00 /usr/lib/nfs/lockd
> 8 S root 163 1 0 41 20 e0cd9040 364 e07d1eb8 Jan 10 ? 0:00 /usr/sbin/cron
> 8 Z root 244 237 0 0 0:00 <defunct>
> 8 S root 196 1 0 41 20 e0d41048 205 e0729536 Jan 10 ? 0:01 /usr/lib/utmpd
> 8 S root 16766 1 0 51 20 e0d41768 361 e0729bc6 Jan 29 console 0:00 /usr/lib/saf/ttymon -g -h -p noaapo
> 8 S root 225 1 0 40 8 e0d4c050 434 e0d4c274 Jan 10 ? 0:08 /usr/lib/inet/xntpd
> 8 S root 237 1 0 40 20 e0d6c058 685 e07d1a38 Jan 10 ? 1996:04 /opt/nport/bin/inge
> 8 S root 243 240 0 41 20 e0d6e780 358 e0d6e9a4 Jan 10 ? 0:01 /usr/lib/saf/ttymon
> 8 O root 13857 13574 0 41 20 e0ffd028 329 21:28:15 pts/0 0:00 ps -elf
> 8 S ldm 26413 26410 0 41 20 e0726108 31537 e0db2606 Jan 26 ? 27:11 pqing -5 -f WMO -v -P 1501 noaaport
> 8 S ldm 26410 1 0 45 20 e0bd6008 487 e0bd6074 Jan 26 ? 0:00 rpc.ldmd -v -q /home/ldm/data/ldm.p
> 8 S nobody 1277 151 0 41 20 e0d4c770 185 e07d1978 Jan 10 ? 17:36 cat /tmp/jmb.fifo.2
> 8 R ldm 13574 13572 0 41 20 e1b431c0 433 21:13:41 pts/0 0:00 -csh
> 8 S nobody 26414 151 0 41 20 e1b438e0 185 e0db2a26 Jan 26 ? 0:35 cat /tmp/jmb.fifo.1
> 8 S root 13854 1 0 0 RT e1be4188 493 e0729064 21:28:03 ? 0:00 /opt/nport/bin/inge
> 8 S root 13572 151 0 61 20 e0bd6728 410 e0dcddae 21:13:41 ? 0:00 in.telnetd
> 8 S root 13856 13855 0 0 RT e1235758 233 e0ab05b8 21:28:13 ? 0:00 /opt/nport/bin/inge
> 8 S root 13855 13854 0 0 RT e0726828 493 e072c59c 21:28:08 ? 0:00 /opt/nport/bin/inge
>
> Then I tried a "kill -HUP" on the rpc.ldmd (no effect) and the pqing (no
> effect either) processes; when they didn't do anything I did a "kill -9" and
> they both went away (one at a time). I did another "ldmadmin queuecheck",
> which returned with no comments, then did an "ldmadmin start" which started
> everything back up just fine.
>
> I didn't see anything obvious in the log files; here's the last 25 lines
>
> Jan 31 03:53:28 noaaport pqing[26413]: 128 20010131035328.881 WMO 895 SAUS41 KLWX 310353 /pMTRDCA
> Jan 31 03:53:28 noaaport pqing[26413]: 127 20010131035328.888 WMO 896 SAUS41 KOKX 310353 /pMTRTEB
> Jan 31 03:53:29 noaaport pqing[26413]: 1820 20010131035329.282 WMO 903 FZHW50 PHFO 310353 /pCWFHI
> Jan 31 03:53:29 noaaport pqing[26413]: 1491 20010131035329.709 WMO 911 FPUS71 KBGM 310353 /pNOWBGM
> Jan 31 03:53:36 noaaport pqing[26413]: 152 20010131035336.867 WMO 960 SAUS41 KILN 310353 /pMTRCVG
> Jan 31 03:53:36 noaaport pqing[26413]: 1779 20010131035336.915 WMO 961 FPUS71 KILN 310353 /pNOWILN
> Jan 31 03:53:36 noaaport pqing[26413]: 125 20010131035336.946 WMO 962 SAUS41 KBOX 310353 /pMTRBDL
> Jan 31 03:53:37 noaaport pqing[26413]: 123 20010131035337.128 WMO 963 SAUS41 KAKQ 310353 /pMTRORF
> Jan 31 03:53:39 noaaport pqing[26413]: 144 20010131035339.695 WMO 975 SAUS43 KMKX 310353 /pMTRMKE
> Jan 31 03:53:39 noaaport pqing[26413]: 358 20010131035339.914 WMO 982 SXUS70 KWAL 310351
> Jan 31 03:53:39 noaaport pqing[26413]: Product already in queue
> Jan 31 03:53:41 noaaport pqing[26413]: 142 20010131035341.038 WMO 008 SAUS41 KRLX 310353 /pMTRHTS
> Jan 31 03:53:41 noaaport pqing[26413]: 581 20010131035341.080 WMO 011 SXUS81 KCLE 310353 /pOMRCLE
> Jan 31 03:53:41 noaaport pqing[26413]: 139 20010131035341.109 WMO 012 SAUS45 KPIH 310353 /pMTRSNT
> Jan 31 03:53:45 noaaport pqing[26413]: 128 20010131035345.372 WMO 024 SAUS41 KOKX 310353 /pMTRLGA
> Jan 31 03:53:45 noaaport pqing[26413]: 123 20010131035345.420 WMO 025 SAUS41 KBOX 310353 /pMTRPVD
> Jan 31 03:53:45 noaaport pqing[26413]: 136 20010131035345.443 WMO 026 SAUS41 KPHI 310353 /pMTRABE
> Jan 31 03:53:45 noaaport pqing[26413]: 140 20010131035345.474 WMO 028 SAUS41 KOKX 310353 /pMTRJFK
> Jan 31 03:53:50 noaaport pqing[26413]: 128 20010131035350.827 WMO 079 SAUS45 KLKN 310353 /pMTRAWH
> Jan 31 04:26:08 noaaport rpc.ldmd[26410]: Exiting
> Jan 31 04:26:08 noaaport rpc.ldmd[26410]: Terminating process group
> Jan 31 04:26:08 noaaport fen00(feed)[26428]: Exiting
> Jan 31 04:26:08 noaaport pqbinstats[26411]: Exiting
> Jan 31 04:26:08 noaaport pqact[26412]: Exiting
> Jan 31 04:28:48 noaaport rpc.ldmd[26410]: _NOT_ ReReading configuration file /home/ldm/etc/ldmd.conf
>
> I have the complete set of logs for this cycle...from when I restarted things
> on the 26th. Let me know just how much (if any) you'd like to see...even
> compressed they're a bit large...and I can get them to you. Any suggestions
> what to look for??
>
> Thanks for your help...drop me a line or give me a call if there's anything
> else I can provide.
>
> ps
> -------------------------------------------------------------------------
> Pete Stamus | Phone: (303) 415-9701 x224
> Colorado Research Associates (CoRA)* | Fax: (303) 415-9702
> 3380 Mitchell Lane | email: address@hidden
> Boulder, Colorado 80301 USA | *( CoRA is a division of NWRA )
> -------------------------------------------------------------------------
> You can't trust your eyes when your imagination is out of focus.
> -- Mark Twain
> -------------------------------------------------------------------------

Hi Pete,

I've been comparing this information from your site with the setup on
our ingest machine, desi. Both are using the SSEC ingest card and
software, but we have not had these problems on desi.

It's a bit of a long shot, but I suspect that your pqing is getting
a binary character when it's expecting only text.

First, do you have the most recent version of the software? Our
/opt/nport/bin/inge binary is dated August 21, and our
/opt/nport/exceptions file is dated Feb. 2, 2000. I'm not sure if they
still use the exceptions file, but if they do it's probably
best to have the most recent version. I just heard today that a new
version of the code (called SDI?) is available from SSEC. Robb said he
would forward the announcement to me. When I get it I'll forward it to
you.

I'm also wondering about your invocation of pqing. I know that WMO is a
feed type handled by pqing, but on desi we are running two separate
invocations of pqing, one for binary products and one for text:

    pqing -f HRS /tmp/jmb.fifo.2
    pqing -f IDS|DDPLUS /tmp/jmb.fifo.1

Since WMO includes HRS and IDS|DDPLUS, I'm not sure how it determines
whether a product is binary or not. I know that it's not uncommon for
text products, many of which are generated by hand, to have binary
characters in them.
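
To make the binary-character idea easier to check, here is a minimal sketch (not part of the LDM or pqing) that scans a chunk of captured data for bytes outside an assumed "text" set. The function name find_non_text_bytes, the allowed control characters, and the sample bytes are all assumptions made for illustration; adjust the allowed set to whatever your feed legitimately contains.

```python
# ASSUMPTION: "text" means printable ASCII plus SOH, ETX, tab, LF and CR.
# This set is illustrative only and is not taken from pqing's source.
ALLOWED_CONTROL = {0x01, 0x03, 0x09, 0x0A, 0x0D}


def find_non_text_bytes(data: bytes):
    """Return (offset, byte value) pairs for bytes outside the text set."""
    suspicious = []
    for offset, value in enumerate(data):
        printable = 0x20 <= value <= 0x7E
        if not printable and value not in ALLOWED_CONTROL:
            suspicious.append((offset, value))
    return suspicious


if __name__ == "__main__":
    # Hypothetical sample: a text heading followed by two stray binary bytes.
    sample = b"SAUS41 KLWX 310353\r\r\nMETAR\x00\xff"
    for offset, value in find_non_text_bytes(sample):
        print(f"offset {offset}: 0x{value:02x}")
```

Running something like this over data captured around the time of a freeze would show whether unexpected bytes are arriving at all, which is the first thing to establish before looking deeper.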

You'll see that we're reading from a FIFO rather than directly from a
port as you are. I'm not sure how to tell inge to write to the FIFO (I
assume that's what needs to happen), but I could find out if you want to
give that a try. I see that you are 'cat'ing a FIFO, but are you using
that elsewhere?

It's interesting that ldmadmin check did not return. Maybe it's because
your log files are very large with all the verbose logging. I suspect
it would return eventually. It might be worthwhile to wait longer next
time.

I don't have other ideas to propose at this point. If it happens again
at a convenient time I would still like to take a look. Do you build
the software from source or do you get a binary distribution? If it's
from source and you have a debugger on your machine I could see where
it's hung up.

By the way, sending a 'kill -HUP' won't have an effect on rpc.ldmd. For
daemon processes like rpc.ldmd, SIGHUP is commonly used to tell them
to reread their configuration file. (That's why you see the line in the
logs saying "_NOT_ ReReading configuration file": rpc.ldmd no longer
rereads ldmd.conf on the fly.)
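
The SIGHUP convention described above can be illustrated with a small generic sketch. This is not how rpc.ldmd is implemented; install_handler, run_once, and the returned strings are hypothetical names chosen for the example. The point is only that a process which installs a SIGHUP handler survives the signal and treats it as a request rather than as a reason to exit.

```python
import os
import signal
import time

reload_requested = False


def on_sighup(signum, frame):
    """Only record the request; the main loop acts on it later."""
    global reload_requested
    reload_requested = True


def install_handler():
    """Replace the default SIGHUP action (terminate) with our handler."""
    signal.signal(signal.SIGHUP, on_sighup)


def run_once():
    """One pass of a hypothetical daemon loop."""
    global reload_requested
    if reload_requested:
        reload_requested = False
        return "reloaded configuration"
    return "did normal work"


if __name__ == "__main__":
    install_handler()
    print(run_once())
    os.kill(os.getpid(), signal.SIGHUP)  # same effect as 'kill -HUP <pid>'
    time.sleep(0.1)                      # give the handler a moment to run
    print(run_once())
```

A daemon that has dropped on-the-fly rereading would simply log a message in the handler's place, which is consistent with the log line quoted above.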

Try sending both rpc.ldmd and pqing a plain 'kill', which sends
SIGTERM, the normal, non-brutal termination signal that
allows processes to die gracefully (if they can, indeed, die). If that
doesn't work, try 'kill -9'. When you use 'kill -9' on rpc.ldmd you run
the risk of corrupting your queue: SIGKILL cannot be caught, so rpc.ldmd
gets no chance to finish writing a product and shut down cleanly when it
receives that signal.
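
The "SIGTERM first, 'kill -9' only as a last resort" sequence can be sketched as follows. This is an illustration under stated assumptions, not an LDM tool: terminate_gracefully, process_exists, and the default 10-second grace period are made up for the example.

```python
import os
import signal
import time


def process_exists(pid):
    """Return True if a process with this PID is still present.

    Note: an exited child that has not been reaped (a zombie) still counts.
    """
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def terminate_gracefully(pid, grace_seconds=10.0, poll_interval=0.2):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the name of the signal that ended the process.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not process_exists(pid):
            return "SIGTERM"
        time.sleep(poll_interval)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "SIGTERM"  # it exited just as the grace period ended
    return "SIGKILL"
```

For example, terminate_gracefully(26413) would give a process with that PID ten seconds to exit on its own before resorting to SIGKILL. The grace period is the design choice that matters: it should be long enough for a process to finish an in-progress write.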

I hope this is helpful. Please let me know what transpires.

Anne
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************