[TIGGE #CUA-629523]: Re: dataportal not receiving data from tigge-ldm.ecmwf.int
- Date: Fri, 14 Apr 2006 13:21:44 -0600
Manuel,
> I have tried that from tigge-ldm and I get:
> ldm@tigge-ldm:~> /usr/sbin/rpcinfo -n 388 -t tigge-ldm.ecmwf.int 300029
> rpcinfo: RPC: Timed out
> program 300029 version 0 is not available
Well, at least this is consistent with Dataportal not being able to connect to
Tigge-ldm. You might run snoop(1) or tcpdump(1) in another window while you do
this to diagnose the problem.
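A plain TCP connection attempt to port 388 can also help tell a refused
connection apart from a silent timeout. Here is a minimal Python sketch of
such a check. It is not part of the LDM; the function name probe_tcp and the
5-second default timeout are illustrative choices of mine, and it only tests
TCP reachability, not the LDM RPC service itself.

```python
import socket


def probe_tcp(host, port=388, timeout=5.0):
    """Try a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    This checks TCP reachability only; it does not speak the RPC protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote host answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return f"error: {exc}"
```

Calling probe_tcp("tigge-ldm.ecmwf.int") from Dataportal and again on
Tigge-ldm itself would show whether the two views differ. A "refused" result
usually means nothing is listening on the port or something is actively
rejecting the connection; a "timeout" is more typical of packets being
dropped or of a listener that never accepts.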
> > Manuel, verify that any firewall rules on Tigge-ldm will allow incoming
> > connections to port 388 from an arbitrary, remote port.
> Last Monday, when a similar problem occurred, the only thing I did was
> to restart LDM (remember I had to kill some LDM processes that were not
> stopped gracefully by ldmadmin). This cleared the problem.
> So I'm reluctant to think it is network related, but more likely a
> process that is preventing those connections. It may have been a network
> glitch that got it into this state, though.
I think it's best if we discover the cause of the problem now to prevent it
from reoccurring in the future.
> If I 'ps -fu ldm' on both tigge-ldm and tigge-portal, I get different
> results. On tigge-ldm:
> UID PID PPID C STIME TTY TIME CMD
> ldm 31408 1 0 Mar27 ? 00:00:00 vi stats.pl
> ldm 18258 18252 0 Apr05 ? 00:00:00 sshd: ldm@pts/0
>
> ldm 18259 18258 0 Apr05 pts/0 00:00:00 -bash
> ldm 23339 23337 0 Apr10 ? 00:00:00 sshd: ldm@pts/1
>
> ldm 23340 23339 0 Apr10 pts/1 00:00:00 -bash
> ldm 30862 30860 0 Apr11 ? 00:00:00 sshd: ldm@pts/6
>
> ldm 30863 30862 0 Apr11 pts/6 00:00:00 -bash
> ldm 31903 1 0 Apr11 ? 00:04:13 pqact -f ANY -v -l
> log/ldmd.log -p missing etc/tigge_pqact.conf
I'm surprised that you're using pqact(1)'s "-l" option because that utility
should log to the LDM log file by default.
> ldm 31905 1 0 Apr11 ? 00:00:06 /usr/bin/perl
> /usr/local/ldm/tigge/send
> ldm 31906 1 0 Apr11 ? 00:00:12 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm 31907 1 0 Apr11 ? 00:21:33 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm 32091 1 0 Apr11 ? 00:08:01 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm 32145 1 0 Apr11 ? 00:03:47 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm 32147 1 0 Apr11 ? 00:07:41 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
That's odd. The above indicates that 5 top-level LDM servers are running. The
parent process ID of the top-level LDM server is "1", whereas all upstream and
downstream LDM child processes have the PID of that LDM server as their parent
process ID. Having more than one top-level server should not occur and
indicates a serious problem.
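This check can be done mechanically by listing the rpc.ldmd processes whose
parent is init. A minimal Python sketch follows. It assumes input in the
format produced by "ps -eo pid,ppid,args" (a header line, then one process
per line); the function name top_level_ldmd is illustrative and not an LDM
utility. On a healthy system I would expect it to return exactly one PID.

```python
def top_level_ldmd(ps_text, name="rpc.ldmd"):
    """Return PIDs of `name` processes whose parent PID is 1 (init).

    Expects the output of `ps -eo pid,ppid,args`: a header line, then
    one line per process with PID, PPID, and the full command line.
    """
    pids = []
    for line in ps_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # header or malformed line
        pid, ppid = int(fields[0]), int(fields[1])
        command = fields[2].split()[0]
        # Compare the executable name only, ignoring any directory prefix.
        if ppid == 1 and command.rsplit("/", 1)[-1] == name:
            pids.append(pid)
    return pids
```

Feeding it the text of "ps -eo pid,ppid,args" from each host gives a quick
comparison between Tigge-ldm and Tigge-portal.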
> ldm 21139 21137 0 Apr11 ? 00:00:00 sshd: ldm@pts/4
>
> ldm 21140 21139 0 Apr11 pts/4 00:00:00 -bash
> ldm 18695 1 0 Apr12 ? 00:00:59 rpc.ldmd -P 388 -v -q
> /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
> ldm 22068 22066 0 09:35 ? 00:00:00 sshd: ldm@pts/2
>
> ldm 22069 22068 0 09:35 pts/2 00:00:00 -bash
> ldm 31904 1 0 Apr11 ? 00:03:13 rtstats -h
> rtstats.unidata.ucar.edu
> ldm 2507 22069 0 19:54 pts/2 00:00:00 ps -fu ldm
>
> while on tigge-portal:
> UID PID PPID C STIME TTY TIME CMD
> ldm 29317 1 0 Apr05 ? 00:00:00 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm 29318 29317 3 Apr05 ? 06:43:31 pqact -f EXP -p tigge
> etc/pqact.conf_tigge
> ldm 29321 29317 0 Apr05 ? 00:33:23 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm 29322 29317 0 Apr05 ? 00:33:11 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm 29323 29317 0 Apr05 ? 00:20:27 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm 29325 29317 0 Apr05 ? 00:20:10 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm 29326 29317 0 Apr05 ? 00:00:47 rpc.ldmd -P 388 -v -m
> 18000 -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/e
> ldm 29362 29317 0 Apr05 ? 00:01:38 [rpc.ldmd] <defunct>
> ldm 31349 29317 0 Apr06 ? 00:00:00 [rpc.ldmd] <defunct>
> ldm 31801 31799 0 Apr06 ? 00:00:00 sshd: ldm@pts/0
>
> ldm 31802 31801 0 Apr06 pts/0 00:00:00 -bash
> ldm 17254 17252 0 Apr10 ? 00:00:00 sshd: ldm@pts/2
>
> ldm 17255 17254 0 Apr10 pts/2 00:00:00 -bash
> ldm 30953 30951 0 10:14 ? 00:00:00 sshd: ldm@pts/3
>
> ldm 30954 30953 0 10:14 pts/3 00:00:00 -bash
> ldm 32552 30954 0 19:54 pts/3 00:00:00 ps -fu ldm
>
>
> So on tigge-portal we have a master process rpc.ldmd (pid 29317) which
> is the parent of all other rpc.ldmd processes. On tigge-ldm, all
> rpc.ldmd don't show a parent, but init. Is this normal ?
Definitely not! It might be the cause of your problem -- although I don't see
exactly how.
Can your netstat(1) show you PIDs? If so, then use it to discover which of
the top-level LDM processes on Tigge-ldm are not listening on port 388, and
kill those processes. They will have PID 1 as their parent PID and will
be listening on ports other than 388.
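If your netstat is the Linux net-tools version, "netstat -tlnp" prints a
final "PID/Program name" column (run it as the ldm user or as root so the
PIDs are visible). Here is a minimal Python sketch of the comparison,
assuming that output format. The names pids_listening_on and kill_candidates
are illustrative only, and the list of top-level PIDs is something you would
supply yourself from the ps(1) listing.

```python
def pids_listening_on(netstat_text, port=388):
    """Return the set of PIDs that hold a listening TCP socket on `port`.

    Expects Linux `netstat -tlnp` output, whose last column is
    "PID/Program name" (shown as "-" when the PID is not visible).
    """
    pids = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # banner, header, or non-TCP line
        local_address, state, owner = fields[3], fields[5], fields[6]
        if state != "LISTEN" or "/" not in owner:
            continue
        # The port is whatever follows the last ":" of the local address.
        if local_address.rsplit(":", 1)[-1] != str(port):
            continue
        pid = owner.split("/", 1)[0]
        if pid.isdigit():
            pids.add(int(pid))
    return pids


def kill_candidates(top_level_pids, netstat_text, port=388):
    """Return the top-level PIDs that are NOT listening on `port`."""
    listening = pids_listening_on(netstat_text, port)
    return [pid for pid in top_level_pids if pid not in listening]
```

The result is only a list of candidates; please check each one by hand
before killing anything.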
Regards,
Steve Emmerson
Ticket Details
===================
Ticket ID: CUA-629523
Department: Support IDD TIGGE
Priority: Normal
Status: On Hold