[LDM #SAE-848662]: Fwd: LDM dies after couple days, can't restart
- Subject: [LDM #SAE-848662]: Fwd: LDM dies after couple days, can't restart
- Date: Tue, 03 Mar 2015 10:26:30 -0700
corepuncher,
> Hi thanks for taking my question.
>
> I have a machine where LDM runs well, but only for a day or two. Then, it
> suddenly shuts off. Well...seemingly. There is no "pqact" or "noaaportinge"
> when I run "top", and data is not flowing.
The best ways to determine if data is flowing are "ldmadmin watch" and
"notifyme -vl-".
> Just happened a few minutes ago. So I try to do an "ldm clean", and it says
> The LDM system is running, and to stop it first.
>
> So I do ldmadmin stop, and I just get a perpetual:
>
> Stopping the LDM server...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
It can take a while to stop an LDM system. If it doesn't stop within a minute,
however, then something's wrong.
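The one-minute rule of thumb can be checked from a script rather than by
watching the terminal. The following is only a minimal sketch: the function
name wait_for_exit is made up for illustration, and it assumes you already
know the PID of the top-level LDM server (for example, from the pid file
mentioned later in this thread) and are running as the user that owns it.

```shell
# Sketch: wait up to a given number of seconds for a process to exit.
# Returns 0 if the process is gone, 1 if it is still alive at the deadline.
# Assumption: the caller may signal the process; "kill -0" only tests for
# existence and sends no signal.
wait_for_exit() {
    pid=$1
    timeout=${2:-60}    # default: one minute
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # still running after the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0            # process no longer exists
}

# Hypothetical usage, with the PID taken from wherever your pid file lives:
#   wait_for_exit "$ldm_pid" 60 || echo "LDM server still running after 60 s"
```

A non-zero return here means only that the process is still present; it says
nothing about why it has not terminated.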
> So ^C to stop it.
>
> Here is the last thing shown in ldmd.log:
>
> Mar 2 12:00:43 newton noaaportIngester[15971] ERROR: [gb22gem.c:74] [GB 1]
> Couldn't get parameter values
> Mar 2 12:00:43 newton noaaportIngester[15971] ERROR: [gb2param.c:89] [GB -1]
> Couldn't get parameter info: disc=0, cat=16, id=3, pdtn=0
> Mar 2 12:00:43 newton noaaportIngester[15971] ERROR: [gb22gem.c:74] [GB 1]
> Couldn't get parameter values
> Mar 2 12:00:43 newton noaaportIngester[15971] WARN: Gap in packet sequence:
> 1052210005 to 1052210549 [skipped 543]
> Mar 2 12:00:43 newton noaaportIngester[15971] ERROR: Missing fragment in
> sequence, last 565/66075757 this 1109/66075757
> Mar 2 12:00:43 newton noaaportIngester[15971] WARN: Gap in packet sequence:
> 1052210549 to 1052214590 [skipped 4040]
> Mar 2 12:00:43 newton noaaportIngester[15971] WARN: Gap in packet sequence:
> 1052214590 to 1052214802 [skipped 211]
Aside from missing some GEMPAK GRIB2 table entries, this looks normal.
> I did look at the "ldm pid" file, and found the number. Then, I went
> into TOP, and although I could not see it, I did a kill on that pid,
> and it worked!
A SIGINT sent to the top-level LDM server should stop the system quickly -- at
the risk of corrupting the product-queue.
> So that gets it to restart, but doesn't explain why it stops suddenly.
> The crazy part is, I have another server, so 2 cords coming from Novra
> receiver. The other machine never has this issue...so it must be a
> software issue?
>
> From address@hidden Mon Mar 2 11:38:29 2015
>
> Actually, I take that back. Even though it "seemed" to start after killing
> that PID listed in the file:
>
> The product-queue is OK.
> Checking pqact(1) configuration-file(s)...
> /home/ldm/etc/pqact.conf: syntactically correct
> etc/pqact.gempak: syntactically correct
> etc/pqact.grlevelx: syntactically correct
> Checking LDM configuration-file (/home/ldm/etc/ldmd.conf)...
> Starting the LDM server...
>
> Again, there is no pqact or noaaportinge process running under top. So
> alas, only thing I can do is reboot.
>
> The log, after getting a "fake" ldm start, shows this:
>
> Mar 2 12:35:37 pqact[518] NOTE: Starting from insertion-time 2015-03-02
> 18:01:12.401276 UTC
> Mar 2 12:35:37 noaaportIngester[520] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[520] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1201
> Mar 2 12:35:37 noaaportIngester[520] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 noaaportIngester[521] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[521] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1202
> Mar 2 12:35:37 noaaportIngester[521] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 noaaportIngester[523] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[523] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1204
> Mar 2 12:35:37 noaaportIngester[523] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 noaaportIngester[522] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[522] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1203
> Mar 2 12:35:37 noaaportIngester[522] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 ldmd[516] NOTE: child 520 exited with status 1:
> noaaportIngester -m 224.0.1.1 -I 10.0.0.3
> Mar 2 12:35:37 noaaportIngester[524] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[524] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1205
> Mar 2 12:35:37 noaaportIngester[524] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 ldmd[516] NOTE: child 521 exited with status 1:
> noaaportIngester -m 224.0.1.2 -I 10.0.0.3
> Mar 2 12:35:37 ldmd[516] NOTE: child 522 exited with status 1:
> noaaportIngester -m 224.0.1.3 -I 10.0.0.3
> Mar 2 12:35:37 noaaportIngester[525] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[525] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1206
> Mar 2 12:35:37 ldmd[516] NOTE: child 523 exited with status 1:
> noaaportIngester -m 224.0.1.4 -I 10.0.0.3
> Mar 2 12:35:37 noaaportIngester[525] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 noaaportIngester[526] ERROR: Address already in use
> Mar 2 12:35:37 noaaportIngester[526] ERROR: [multicastReader.c:97] Couldn't
> bind to port 1207
> Mar 2 12:35:37 noaaportIngester[526] ERROR: [noaaportIngester.c:340]
> Couldn't create multicast-reader
> Mar 2 12:35:37 pqact[527] NOTE: Starting Up
> Mar 2 12:35:37 ldmd[516] NOTE: child 524 exited with status 1:
> noaaportIngester -m 224.0.1.5 -I 10.0.0.3
> Mar 2 12:35:37 ldmd[516] NOTE: child 525 exited with status 1:
> noaaportIngester -m 224.0.1.6 -I 10.0.0.3
> Mar 2 12:35:37 pqact[528] NOTE: Starting Up
> Mar 2 12:35:37 ldmd[516] NOTE: child 526 exited with status 1:
> noaaportIngester -m 224.0.1.7 -I 10.0.0.3
> Mar 2 12:35:37 pqact[528] NOTE: Starting from insertion-time 2015-03-02
> 18:01:12.401276 UTC
> Mar 2 12:35:37 pqact[527] NOTE: Starting from insertion-time 2015-03-02
> 18:01:12.401276 UTC
I suspect that you still have noaaportIngester(1) processes running.
Would it be possible for me to log onto the system in question as the LDM user?
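In the meantime, one way to check for leftover noaaportIngester(1) processes
is to filter the full process listing by command name. This is a rough sketch,
not part of the LDM package: the function names are invented for illustration,
and it assumes a POSIX ps(1) that accepts "-e -o pid= -o args=". It matches on
the first word of the command line because a truncated name column (as in the
"noaaportinge" seen under top) is easy to misread.

```shell
# Sketch: print the "PID ARGS" lines whose command is noaaportIngester,
# whether invoked by bare name or by full path. Reads ps-style lines on stdin.
filter_ingesters() {
    awk '$2 == "noaaportIngester" || $2 ~ /\/noaaportIngester$/ { print }'
}

# List every process on the system and keep only the ingesters.
# Assumption: ps supports the POSIX -e and -o options.
list_leftover_ingesters() {
    ps -e -o pid= -o args= | filter_ingesters
}
```

Calling list_leftover_ingesters after "ldmadmin stop" should print nothing if
the system stopped cleanly. Any lines it does print would be consistent with
the "Address already in use" errors in the log above.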
Regards,
Steve Emmerson
Ticket Details
===================
Ticket ID: SAE-848662
Department: Support LDM
Priority: Normal
Status: Closed