
20050214: LDM product queue corruption

Subject: 20050214: LDM product queue corruption
Date: Mon, 14 Feb 2005 12:37:07 -0700

Gabe,

>Date: Mon, 14 Feb 2005 13:47:01 -0500 (EST)
>From: Gabe Langbauer <address@hidden>
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20050214: LDM product queue corruption 

The above message contained the following:

> I am unsure where this ldmping initiated from.  My ldm crontab is as
> follows:
> 35 * * * * /usr/local/ldm/bin/ldmadmin dostats
> 0 0 * * * /usr/local/ldm/bin/ldmadmin newlog

"ldmadmin dostats", eh?  That command is no longer useful.  I don't
think it could affect a running LDM, but, just to be sure, do the
following:

    1.  Remove the "ldmadmin dostats" command from the LDM user's
        crontab(1) file.

    2.  Have the following entry enabled in the LDM
        configuration-file, etc/ldmd.conf:

            exec        "pqbinstats"

        The pqbinstats(1) program saves statistics on the LDM system in
        *.stats files in the LDM user's "logs" subdirectory.

    3.  Add the following entry to the scour(1) configuration-file,
        etc/scour.conf:

            ~ldm/logs   1       *.stats

        This ensures that the number of *.stats files won't increase
        indefinitely.
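
As a rough sketch of step 1, the entry can be filtered out of the
crontab listing rather than edited by hand.  The helper name
strip_dostats is made up for illustration and is not part of the LDM
package; the temporary file path is likewise an assumption.

```shell
#!/bin/sh
# Filter a crontab listing on stdin, dropping any "ldmadmin dostats" entry.
# strip_dostats is a hypothetical helper name, not part of the LDM package.
strip_dostats() {
    grep -v 'ldmadmin dostats'
}

# Preview the result first (run as the LDM user):
#   crontab -l | strip_dostats
# Then install it once the preview looks right:
#   crontab -l | strip_dostats > /tmp/ldm.cron && crontab /tmp/ldm.cron
```

Previewing before installing keeps a bad pattern from silently
emptying the crontab.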

> #  Check for incoming data and failover if upstream site is dead
> #10,30,50 * * * * /usr/local/ldm/bin/ldmfail -p stokes.metr.ou.edu -f
> pluto.met.fsu.edu > /dev/null 2>&1 /dev/null
>   
> #  Scour the data directories
> 0 * * * * /usr/local/ldm/bin/ldmadmin scour > /dev/null 2>&1
>   
> #  Rotate and remove the decoder logs - the trailing digit
> #  tells the script how many days of logs to keep
> #
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcacars.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcamos.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcmmos.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcnmos.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcnldn.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcncprof.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dctrop.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcwatch.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcffg.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcstorm.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcgrib.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dchrly.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcsynop_sb.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcsynop_syn.log 1
> 0 0 * * * /usr/local/ldm/bin/newlog data/logs/dcuair.log 1
> 
> So, the only things going on here are rotating logs and some stats.  A
> check of my gempak crontab (ldm and gempak are virtually the only things
> running on the machine) shows nothing occurring at ~20:43 except
> scripts that are called at the same time every hour or possibly
> ngm.csh, which is called at 20:00.  ngm.csh is simply a script that calls
> other ngm scripts to create gempak products.  We source the Gemenviron and
> set the display, then use the 'date' command to get the current time, then
> run gempak.  Nowhere is there any mention of ldm, nor do I believe it would
> have permissions to make a call such as ldmping.

You've got to find out where that ldmping(1) came from to ensure that
whatever's causing it isn't interfering in other ways with the LDM.

> Another interesting development occurred this weekend.  I was able to
> "capture ldm in the act".  LDM crashed around 00:15 UTC and I
> realized that it was down.

Can you send me the log entries for that time?

> I ssh'd in and issued the command ldmadmin
> clean.

Did you ensure that the LDM system wasn't running?  Doing an "ldmadmin
clean" when the LDM is running will cause the *.pid file to be removed
and could result in "orphaned" LDM processes.
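
A minimal sketch of such a check, run as the LDM user, might look like
the following.  The helper pid_file_alive is hypothetical (it is not an
ldmadmin subcommand), and the PID-file path in the usage comment is an
assumption that you would need to adjust to your installation.

```shell
#!/bin/sh
# Succeed (exit 0) if FILE names a PID that belongs to a live process.
# pid_file_alive is a hypothetical helper, not an ldmadmin subcommand.
pid_file_alive() {
    [ -r "$1" ] || return 1             # no PID file: nothing recorded
    pid=$(head -n 1 "$1" | tr -d '[:space:]')
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;        # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null          # signal 0 only tests for existence
}

# Usage (the PID-file path is an assumption; adjust to your installation):
#   if pid_file_alive "$HOME/ldmd.pid"; then
#       echo "LDM still running; run 'ldmadmin stop' first" >&2
#   else
#       ldmadmin clean && ldmadmin start
#   fi
```

Note that kill -0 also fails for a process owned by another user, so
the check is only meaningful when run from the LDM account.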

> This command completed successfully.  I then issued the command
> ldmadmin start; this command appeared to work correctly.  However, when I
> issued ldmadmin watch I was given the message "there is no ldm running on
> this machine".

That's mighty suspicious.  Can you send me the log entries for that
time?

> I tried this same sequence a couple more times, and I
> delqueued and mkqueued and physically removed (via rm) the pid file and so
> forth.  LDM, however, refused to start.  Immediately at 01:00 UTC I issued
> the same command, ldmadmin clean && ldmadmin start, as I had done several
> times during the previous hour.  Magically, this time it worked.  This
> leads me to believe that there was some program running at that time that
> immediately corrupted the ldm.  But I'm unsure what could be responsible.

Any chance of my logging onto the system in question to examine the LDM
setup?

Regards,
Steve Emmerson
