This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
>From: Unidata User Support <address@hidden> >Organization: Unidata Program Center/UCAR >Keywords: LDM Fedora Core 2 Linux Hi Steve, David, and Waldenio, David Garrana of UFRJ reported a problem he has been seeing on one of his machines at UFRJ. David's email is included -- with comments -- at the end of this message. Here is the ldmd.log output from an ldm-6.0.14 system at UFRJ. Several notes follow: - system running running Fedora Core 2: $ uname -a Linux cardeal.meteoro.ufrj.br 2.6.6-1.435smp #1 SMP Mon Jun 14 09:31:56 EDT 2004 i686 athlon i386 GNU/Linux - originally, the ~ldm/data directory was owned by ldm:opera, but the user:group for the user was ldm:ldm. This indicates that the primary group for 'ldm' was different when ~ldm/data was made - the permissions on 'ldm's HOME directory are 755 -- no group write permission. I have to believe that this is important given the ownership on the ~ldm/data directory. - LDM queue size was the default 400000000 - ldmadmin queuecheck showed a difference of 4 products in the queue and the number of products in the header (the header showed 4 more) Here is the output from one LDM log file on cardeal: Jun 23 20:45:49 cardeal rpc.ldmd[17204]: Starting Up (version: 6.0.14; built: Jun 21 2004 14:28:21) Jun 23 20:45:49 cardeal pqbinstats[17206]: Starting Up (17204) Jun 23 20:45:49 cardeal solon[17209]: Starting Up(6.0.14): solon.meteoro.ufrj.br: TS_ZERO TS_ENDT {{I DS|DDPLUS, ".*"}} Jun 23 20:45:49 cardeal rtstats[17208]: Starting Up (17204) Jun 23 20:45:49 cardeal pqact[17207]: Starting Up Jun 23 20:45:49 cardeal solon[17209]: Desired product class: 20040623194549.337 TS_ENDT {{IDS|DDPLUS, ".*"}} Jun 23 20:45:49 cardeal solon[17209]: Connected to upstream LDM-6 Jun 23 20:45:49 cardeal solon[17209]: Upstream LDM is willing to feed Jun 23 20:45:49 cardeal solon[17209]: assertion "rl->nempty <= RL_FREE_OVERHEAD" failed: file "pq.c", line 958 Jun 23 20:45:55 cardeal rpc.ldmd[17204]: child 17209 terminated by signal 6 Jun 23 20:45:55 cardeal rpc.ldmd[17204]: Killing (SIGINT) process group Jun 23 20:45:55 cardeal rpc.ldmd[17204]: SIGINT Jun 23 20:45:55 cardeal pqbinstats[17206]: Interrupt Jun 23 20:45:55 cardeal pqbinstats[17206]: Exiting Jun 23 20:45:55 cardeal pqact[17207]: Interrupt Jun 23 20:45:55 cardeal pqact[17207]: Exiting Jun 23 20:45:55 cardeal rtstats[17208]: Interrupt Jun 23 20:45:55 cardeal rtstats[17208]: Exiting Jun 23 20:45:55 cardeal rpc.ldmd[17204]: Terminating process group What I did to try and get around this problem was: - login as 'ldm' - cd ~ldm - ldmadmin stop <- not needed; LDM was not running, but this does the needed 'ldmadmin clean' - ls -alt data/* <- the only thing in the directory was the LDM queue - mkdir data <- now owned by ldm:ldm - ldmadmin mkqueue <- 400000000 queue size - ldmadmin start Relevant uncommented entries in ~ldmd.conf are: [ldm@cardeal ldm]$ grep -v ^# etc/ldmd.conf exec "pqbinstats" exec "pqact" exec "rtstats -h rtstats.unidata.ucar.edu" request IDS|DDPLUS ".*" solon.meteoro.ufrj.br PRIMARY allow ANY ^((localhost|loopback)|(127\.0\.0\.1\.?$)|([a-z].*\.unidata\.ucar\.edu\.?$)) The log file above showed that the LDM exited virtually immediately after startup. After cleanup, the LDM has been running for 17 minutes while I write this message. >Date: Wed, 23 Jun 2004 18:06:06 -0300 (EST) >From: David Garrana Coelho <address@hidden> >To: Waldenio Almeida <address@hidden> >Cc: Tom Yoksas <address@hidden> >Subject: Solon unstable? >Hi Tom, > I'm trying for quite some time to write this e-mail, but since its >a recurring problem, but not a "halt everything" one, it was always >getting delayed. I know the feeling... > For some time now, solon eventually denies connection to cptec, >and as it seems, to brisa and a new machine in UFRJ as well. I just want to point out that the log output you sent does not show that solon 'eventually denies' a connection to cardeal. The problem is that the rpc.ldmd process on the receiving side (cardeal) exits. The sending rpc.ldmd sees this condition and exits -- with no error status. If there had been an error exit, the LDM on solon would have exited. Now, different log messages on solon _do_ show that connections get denied: ... logs/ldmd.log.12:Jun 16 14:20:14 solon rpc.ldmd[3743]: Denying connection from 200.9.149.140 logs/ldmd.log.2:Jun 23 20:00:06 solon rpc.ldmd[86790]: Denying connection from 150.163.141.196 ... logs/ldmd.log.9:Jun 18 12:05:07 solon rpc.ldmd[44018]: Denying connection from 150.163.146.249 These denials show that DNS was not available for those machines when the data request was made. Since you are dealing with a small number of sites feeding off of solon currently, I decided to add allows for these specific machines by IP address to the ~ldm/etc/ldmd.conf file: # Allows by IP in case DNS is not working # 150.163.141.196 -> moingobe.cptec.inpe.br # 150.163.146.249 -> monhane.cptec.inpe.br # 200.9.149.140 -> ? allow ANY ^150\.163\.141\.196$ allow ANY ^150\.163\.146\.249$ allow ANY ^200\.9\.149\.140$ So, if DNS is not available, these machines will be allowed by their IP. If DNS is available, they will be allowed by their fully qualified hostname. >I was(am) >trying to feed GTS as input to (MM5) model initialization, and to do so, I >compiled and installed ldm 6.0.14 on cardeal.meteoro.ufrj.br. Everything >went well till last night, when a strange error appeared: -- log file output from cardeal included above -- >And the log no solon: > >Jun 23 20:45:49 solon cardeal(feed)[19141]: up6.c:331: Starting Up(6.0.14/6): >20040623194549.337 TS_ENDT {{IDS|DDPLUS, ".*"}} >Jun 23 20:45:49 solon cardeal(feed)[19141]: topo: cardeal.meteoro.ufrj.br >IDS|DDPLUS >Jun 23 20:45:49 solon cardeal(feed)[19141]: up6.c:168: HEREIS: RPC: Unable to >send; errno = Broken pipe >Jun 23 20:45:49 solon cardeal(feed)[19141]: up6.c:396: Product send failure: >Input/output error >Jun 23 20:45:55 solon rpc.ldmd[18937]: child 19141 exited with status 6 This log output shows that the process requesting data from solon exited in the middle of a product. The correct action at this point is for the sending rpc.ldmd to exit, and it does. > Which in turn is pretty similar to the messages moingobe and >master generate when fail to access solon. Since I have login access to solon, I could get on and look at the ~ldm/logs/ldmd.log* files. I found multiple instances of problems sending data to moingobe. Here is one example: ldmd.log.9:Jun 17 18:48:00 solon moingobe(feed)[124]: topo: moingobe.cptec.inpe.br WMO ldmd.log.9:Jun 17 18:48:01 solon moingobe(feed)[114]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe This is saying that the pipe between solon and moingobe went away, and the feeding process on solon exited after it noticed this. The most likely cause of these is a dropped network connection between solon and moingobe fails. The good news here is that a new requesting process will be started on moingobe, and it will request all data in each datastream that are newer than the last ones it got. >I tried to delqueue and >mkqueue, clean, stop, start on both sides, and no effect. It simply >(apparently) appeared out of thin air. Since I was suspicious of the different ownership of the ~ldm/data directory, I removed and remade the directory as 'ldm'. The LDM has now been running for an hour, and data continues to flow. >Maybe its related to the issues >moingobe & Co. are having with solon, maybe not. Just extra input. I believe that the two problems are not related since they have different symptoms. > So, knowing you are very busy, can you please check the logs on >solon and try to find out why moingobe is getting denied so often, and >even local machines? I did so right after seeing your note. I will monitor the situation on cardeal to see if data ingestion stops. I must tell you that you are forging new territory here by running the LDM under Fedora Core 2. We have one machine running this release, and we just started testing under it on Monday. >Waldenio and me discussed some possible causes, but >none seem plausible so far. No net stoppages or conf changes seems to be >related to the issues. OK. > BTW, you can log on cardeal.meteoro.ufrj.br, account ldm same >password as others ldm accounts, to check if you need/wish. Its a dual >Athlon MP 2100+ running Fedora Core 2 (kernel 2.6), one of the nodes of >our small cluster (which in turn is running South America MM5 model >forecasts in real time, soon to be available to the public). Being able to logon was very helpful, thanks. By the way, in the future you should send LDM problems directly to our Support box, address@hidden. Since three of us read this mailbox, you will be more likely to get a fast response than if you send the note directly to me. We should keep an eye on cardeal to see if the exiting problem reoccurs. > Best regards, >########################################################### ># David Garrana Coelho - address@hidden # ># Pesquisador Associado ao # ># Laboratorio de Prognosticos em Mesoescala - UFRJ # ># http://www.lpm.meteoro.ufrj.br # ># Meteorologista da Equipe do Alerta-Rio/GEORIO # ># http://www2.rio.rj.gov.br/georio/site/alerta/alerta.htm # >########################################################### Cheers, Tom -- +-----------------------------------------------------------------------------+ * Tom Yoksas UCAR Unidata Program * * (303) 497-8642 (last resort) P.O. Box 3000 * * address@hidden Boulder, CO 80307 * * Unidata WWW Service http://www.unidata.ucar.edu/ * +-----------------------------------------------------------------------------+