[LDM #WSJ-190258]: queue size question
- Subject: [LDM #WSJ-190258]: queue size question
- Date: Thu, 04 Apr 2019 13:16:06 -0600
John,
> We keep having the gempak images (radar/satellite) just stop after LDM runs
> for approximately 2.5 days. Here's the only errors I see during that period.
> Any ideas you may have are greatly appreciated. Thanks,
>
> LDM log from that time:
>
> 20190403T023521.100032Z idd.unidata.ucar.edu[7193] WARN error.c:236:err_log()
> Couldn't connect to LDM on idd.unidata.ucar.edu using either port 388 or
> portmapper; : RPC: Remote system error - Connection timed out
The above message means that receiving LDM process 7193 couldn't connect to
sending LDM idd.unidata.ucar.edu because that system was offline. UCAR had
networking problems yesterday.
> 20190403T102901.679888Z 199.133.78.4(feed)[11548] ERROR pq.c:3377:fd_lock()
> Interrupted system call
> 20190403T102901.679901Z 199.133.78.4(feed)[11548] ERROR pq.c:3377:fd_lock()
> fcntl F_RDLCK failed for rgn (0 SEEK_SET, 4096) 4
> 20190403T102901.679910Z 199.133.78.4(feed)[11548] ERROR up6.c:532:up6_run()
> Product send failure: Interrupted system call
The above messages mean that sending LDM process 11548 couldn't send a
data-product to its receiving LDM because it was interrupted by a signal --
most likely due to an "ldmadmin stop".
> 20190403T140901.828355Z 199.133.78.4(feed)[21358] ERROR pq.c:3377:fd_lock()
> Interrupted system call
> 20190403T140901.828370Z 199.133.78.4(feed)[21358] ERROR pq.c:3377:fd_lock()
> fcntl F_RDLCK failed for rgn (0 SEEK_SET, 4096) 4
> 20190403T140901.828382Z 199.133.78.4(feed)[21358] ERROR up6.c:532:up6_run()
> Product send failure: Interrupted system call
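Same as above, this time for sending LDM process 21358: it was interrupted by a
signal while sending a data-product.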
> 20190403T194347.889234Z 199.133.78.4[16693] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403194347.731079
> TS_ENDT {{ANY, ".*"}} -> 20190403194347.731079 TS_ENDT {{EXP, ".*"}}
The above message means that the subscription request by receiving LDM 16693
for ANY data-products was reduced by its sending LDM to EXP due to the ALLOW
entries in the sending LDM's configuration-file.
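The reduction described above can be pictured as a set intersection. The
following is only a simplified model, not the LDM's actual code: the function
name reduce_subscription is hypothetical, the feedtype list is an illustrative
subset, and the ".*" pattern part of the request is ignored.

```python
# Simplified model of subscription reduction (not the LDM implementation).
# A subscription is modeled as a set of feedtype names; the names below are
# an illustrative subset, and "ANY" is modeled as all of them.
ANY = frozenset({"EXP", "NEXRAD3", "HDS", "NIMAGE"})


def reduce_subscription(requested, allowed):
    """Return the part of the requested feedtypes that the sender allows."""
    return frozenset(requested) & frozenset(allowed)


# Hypothetical case matching the log: the receiver asks for ANY, but the
# sender's ALLOW entries for that host only cover EXP.
granted = reduce_subscription(ANY, {"EXP"})
reduced = granted != ANY  # True here, which is when the WARN would be logged
```

In this model the WARN message corresponds to the case where the granted set
is smaller than the requested one.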
> 20190403T194601.846940Z 199.133.78.4(noti)[16693] ERROR
> forn5_svc.c:273:noti5_sqf() /home/ldm/cellphon/CELLPHON.DAT: RPC: Unable to
> receive
> 20190403T194601.846964Z 199.133.78.4(noti)[16693] ERROR
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno =
> 5)
The above messages mean that notifying LDM process 16693 was disconnected from
its receiving LDM. This was likely because the receiving LDM terminated.
> 20190403T194744.878579Z 199.133.78.4[30952] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403194744.721951
> TS_ENDT {{ANY, ".*"}} -> 20190403194744.721951 TS_ENDT {{EXP, ".*"}}
> 20190403T195401.932271Z 199.133.78.4(noti)[30952] ERROR
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso031945.mtr: RPC: Unable to receive
> 20190403T195401.932304Z 199.133.78.4(noti)[30952] ERROR
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno =
> 5)
> 20190403T195547.587961Z 199.133.78.4[27688] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403195547.431097
> TS_ENDT {{ANY, ".*"}} -> 20190403195547.431097 TS_ENDT {{EXP, ".*"}}
> 20190403T195653.093268Z 199.133.78.4[31798] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403195652.935582
> TS_ENDT {{ANY, ".*"}} -> 20190403195652.935582 TS_ENDT {{EXP, ".*"}}
> 20190403T195901.611237Z 199.133.78.4(noti)[27688] ERROR
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso031950.mtr: RPC: Unable to receive
> 20190403T195901.611269Z 199.133.78.4(noti)[27688] ERROR
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno =
> 5)
> 20190403T200401.265716Z 199.133.78.4(noti)[31798] ERROR
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso031955.mtr: RPC: Unable to receive
> 20190403T200401.265740Z 199.133.78.4(noti)[31798] ERROR
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno =
> 5)
> 20190403T200439.508639Z 199.133.78.4[27895] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403200439.340047
> TS_ENDT {{ANY, ".*"}} -> 20190403200439.340047 TS_ENDT {{EXP, ".*"}}
> 20190403T201601.871599Z 199.133.78.4(noti)[27895] ERROR
> forn5_svc.c:273:noti5_sqf() /home/ldm/cellphon/CELLPHON.DAT: RPC: Unable to
> receive
> 20190403T201601.871624Z 199.133.78.4(noti)[27895] ERROR
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno =
> 5)
> 20190403T204634.154906Z 199.133.78.4[26788] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403204633.989049
> TS_ENDT {{ANY, ".*"}} -> 20190403204633.989049 TS_ENDT {{EXP, ".*"}}
> 20190403T210901.479227Z 199.133.78.4(noti)[26788] ERROR
> forn5_svc.c:542:forn_5_svc() nullproc5(199.133.78.4): RPC: Unable to receive
> 20190403T210926.592774Z 199.133.78.4[17846] WARN forn.c:41:logIfReduced()
> Subscription reduced by one or more ALLOW entries: 20190403210401.246756
> TS_ENDT {{ANY, ".*"}} -> 20190403210401.246756 TS_ENDT {{EXP, ".*"}}
> 20190403T211901.233617Z 199.133.78.4(noti)[17846] ERROR
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso032110.mtr: RPC: Unable to receive
> 20190403T211901.233639Z 199.133.78.4(noti)[17846] ERROR
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno =
> 5)
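Same as the previous two explanations: each WARN records a subscription request
for ANY being reduced to EXP by the sending LDM's ALLOW entries, and each ERROR
records a notifying LDM process being disconnected from its receiving LDM,
likely because the receiving LDM terminated.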
> 20190404T000001.730946Z pqact[7179] WARN filel.c:3016:reap() Child 16321
> terminated by signal 10
> 20190404T000001.736596Z pqact[7179] WARN filel.c:3016:reap() Child 29866
> terminated by signal 10
> 20190404T000001.736636Z pqact[7179] WARN filel.c:3016:reap() Child 25337
> terminated by signal 10
> 20190404T000001.736673Z pqact[7179] WARN filel.c:3016:reap() Child 9485
> terminated by signal 10
> 20190404T000001.736702Z pqact[7179] WARN filel.c:3016:reap() Child 5313
> terminated by signal 10
> 20190404T000001.736726Z pqact[7179] WARN filel.c:3016:reap() Child 4785
> terminated by signal 10
> 20190404T000001.736749Z pqact[7179] WARN filel.c:3016:reap() Child 2641
> terminated by signal 10
> 20190404T000001.736779Z pqact[7179] WARN filel.c:3016:reap() Child 3026
> terminated by signal 10
> 20190404T000001.736815Z pqact[7179] WARN filel.c:3016:reap() Child 3211
> terminated by signal 10
> 20190404T000001.736842Z pqact[7179] WARN filel.c:3016:reap() Child 3839
> terminated by signal 10
The above messages mean that the indicated child processes of pqact(1) process
7179 terminated due to reception of a USR1 signal. This signal is used by the
LDM system as part of the process to rotate the LDM log file. Because this
happened near 0000Z, the cause was likely a crontab(1) entry with the command
"ldmadmin newlog".
Prior to LDM 6.13.11 (which isn't out yet), programs executed by pqact(1)'s
EXEC or PIPE actions had to block signals USR1 and USR2 to avoid such
termination.
Could this be the cause of your problems?
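If the decoders run by pqact(1) are (or can be wrapped by) Python scripts, the
blocking mentioned above might look like the sketch below. It is only an
illustration under that assumption: the function names are hypothetical, and
the wrapper relies on the POSIX rule that a process's signal mask is preserved
across exec.

```python
import os
import signal

# Signals the reply says a pqact(1) child had to block.
LDM_SIGNALS = {signal.SIGUSR1, signal.SIGUSR2}


def block_ldm_signals():
    """Block SIGUSR1 and SIGUSR2 in the calling thread.

    Returns the previous signal mask so a caller could restore it later.
    """
    return signal.pthread_sigmask(signal.SIG_BLOCK, LDM_SIGNALS)


def exec_with_signals_blocked(argv):
    """Hypothetical wrapper: block the signals, then replace this process
    with the real decoder. The blocked mask carries over to the new program.
    """
    block_ldm_signals()
    os.execvp(argv[0], argv)
```

A decoder written in Python would call block_ldm_signals() before reading its
input; a decoder in another language could be started through a small wrapper
script that calls exec_with_signals_blocked() with the decoder's command line.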
> /var/log/messages:
>
> Apr 3 21:32:17 mammatus abrt-hook-ccpp: Process 9497 (gif) of user 1009
> killed by SIGABRT - dumping core
> Apr 3 21:32:18 mammatus abrt-server: Duplicate: core backtrace
> Apr 3 21:32:18 mammatus abrt-server: DUP_OF_DIR:
> /var/spool/abrt/ccpp-2019-03-25-15:37:40-20396
> Apr 3 21:32:18 mammatus abrt-server: Deleting problem directory
> ccpp-2019-04-03-21:32:17-9497 (dup of ccpp-2019-03-25-15:37:40-20396)
> Apr 3 21:32:18 mammatus abrt-server: Undefined variable outside of [[ ]]
> bracket
>
> Contents of that ABRT trace:
>
> Reason:
>
> gif killed by SIGABRT
>
> Limits:
>
> Limit Soft Limit Hard Limit Units
> Max cpu time unlimited unlimited seconds
> Max file size unlimited unlimited bytes
> Max data size unlimited unlimited bytes
> Max stack size 8388608 unlimited bytes
> Max core file size 0 unlimited bytes
> Max resident set unlimited unlimited bytes
> Max processes 65536 95277 processes
> Max open files 1024 4096 files
> Max locked memory 65536 65536 bytes
> Max address space unlimited unlimited bytes
> Max file locks unlimited unlimited locks
> Max pending signals 95277 95277 signals
> Max msgqueue size 819200 819200 bytes
> Max nice priority 0 0
> Max realtime priority 0 0
> Max realtime timeout unlimited unlimited us
None of the above appears relevant.
Regards,
Steve Emmerson
Ticket Details
===================
Ticket ID: WSJ-190258
Department: Support LDM
Priority: Normal
Status: Closed
===================