Re: issues with LDM
- Subject: Re: issues with LDM
- Date: Mon, 18 Jun 2007 09:43:38 -0600
Chi and Justin,
You mentioned a rate limit for your I2 connection.
The throughput before Thursday had been up to 4 GB per hour (with a few
periods higher). As of Thursday we are seeing about 500 MB per hour, which
is much closer to T1 capacity:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc1?CONDUIT+atm.cise-nsf.gov
Could a default route or gateway have changed?
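
For reference, a rough back-of-envelope conversion (decimal units assumed):

    4 GB/hour   = 32,000 Mb / 3600 s  ~=  8.9 Mb/s
    500 MB/hour =  4,000 Mb / 3600 s  ~=  1.1 Mb/s   (T1 = 1.544 Mb/s)

If a route did change, comparing "traceroute atm.cise-nsf.gov" from the LDM
host against a known-good trace may show where the traffic now diverges.
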
Steve Chiswell
Unidata User Support
On Mon, 2007-06-18 at 08:03 -0400, Chi Y Kang wrote:
> Justin Cooke wrote:
> > Chi,
> >
> > It looks like you switched us back to ldm1 on Saturday, but according to
> > Steve's graphs the same delays occurred.
>
> Running on ldm2 right now. The send-Q on our end looks okay. All of these
> connections are going via I2; let me see what rate limit is set on the I2
> connection coming out of the campus.
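>
> (For reference, one way to watch the per-connection send-Q -- a sketch,
> assuming the LDM server is on its standard port 388:
>
>     netstat -tn | grep ':388 '
>
> A Send-Q that stays large on the sockets toward the relays would point at
> the path rather than at the LDM itself.)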
>
>
> >
> > Justin
> >
> > Steve Chiswell wrote:
> >> Chi & Justin,
> >>
> >> The latency of data today has been high like yesterday, even with the
> >> switch to ldm2. The throughput looks restricted, either by a router or
> >> by firewall/packet shaping, but I was wondering whether, coincident
> >> with Justin's restart, the connections had to be re-established, so
> >> that the changes took effect at that time.
> >>
> >> Thanks for all your efforts,
> >>
> >> Steve Chiswell
> >> Unidata User Support
> >>
> >>
> >>
> >> On Fri, 15 Jun 2007, Chi Y Kang wrote:
> >>
> >>
> >>> Wait a minute here,
> >>>
> >>> 128.117.140.208 isn't in the mix. The other hosts are.
> >>>
> >>> I updated the LDM access list. Should we just grant access to some
> >>> class C ranges rather than one IP at a time?
> >>>
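> >>> (For example -- a sketch, assuming this LDM's ALLOW entries take an
> >>> extended regular expression that is matched against the dotted-quad
> >>> address -- one ldmd.conf line could cover a whole class C:
> >>>
> >>>     ALLOW ANY ^140\.90\.193\.[0-9]+$
> >>>
> >>> instead of listing each 140.90.193.x host separately.)
> >>>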
> >>> Also, I noticed that the send-Qs are pretty normal on the ldm2 server
> >>> right now but were pretty high on ldm1. It might just be an issue with
> >>> the ACL list.
> >>>
> >>>
> >>> 128.117.12.2
> >>> 128.117.12.3
> >>> 128.117.130.220
> >>> 128.117.140.208
> >>> 128.117.140.220
> >>> 128.117.149.220
> >>> 128.117.156.220
> >>> 128.174.80.16
> >>> 128.174.80.47
> >>> 140.90.193.19
> >>> 140.90.193.227
> >>> 140.90.193.228
> >>> 140.90.193.99
> >>> 140.90.226.201
> >>> 140.90.226.202
> >>> 140.90.226.203
> >>> 140.90.226.204
> >>> 140.90.37.12
> >>> 140.90.37.13
> >>> 140.90.37.15
> >>> 140.90.37.16
> >>> 140.90.37.40
> >>> 144.92.130.88
> >>> 144.92.131.244
> >>> 150.9.117.128
> >>> 192.12.209.57
> >>> 192.58.3.194
> >>> 192.58.3.195
> >>> 192.58.3.196
> >>> 192.58.3.197
> >>> 193.61.196.74
> >>> 198.181.231.53
> >>> 208.64.117.128
> >>>
> >>>
> >>> Justin Cooke wrote:
> >>>
> >>>> Chi,
> >>>>
> >>>> The reboot doesn't seem to have helped. Is there anything else that may
> >>>> be causing these issues? Something network related after I performed the
> >>>> restart of LDM? Steve offered a few possibilities:
> >>>>
> >>>> /It seems to be network related at your end, but strange that it
> >>>> occurred at the time when you restarted the LDM - unless there was some
> >>>> sort of firewall or packet filter that took effect when the LDMs
> >>>> re-connected./
> >>>>
> >>>> Justin
> >>>>
> >>>> Steve Chiswell wrote:
> >>>>
> >>>>> Justin,
> >>>>>
> >>>>> I haven't seen any improvement from ncepldm to the top level relays
> >>>>> daffy.unidata.ucar.edu (Unidata), idd.aos.wisc.edu (U. Wisconsin),
> >>>>> flood.atmos.uiuc.edu (U. Illinois) or atm.cise-nsf.gov (NSF, DC).
> >>>>>
> >>>>> It seems to be network related at your end, but strange that it
> >>>>> occurred at the time when you restarted the LDM - unless there was
> >>>>> some sort of firewall or packet filter that took effect when the LDMs
> >>>>> re-connected.
> >>>>>
> >>>>> Thanks for your time in looking at this,
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>>
> >>>>> On Fri, 2007-06-15 at 15:31 -0400, Justin Cooke wrote:
> >>>>>
> >>>>>
> >>>>>> Steve and Doug,
> >>>>>>
> >>>>>> I just got a call from Chi at the WOC; he rebooted ldm1 after
> >>>>>> noticing an unusual load on the machine. LDM is again running on
> >>>>>> that box and it remains primary. Can you check to see how the
> >>>>>> latencies are now?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Justin
> >>>>>>
> >>>>>> Doug Schuster wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Justin,
> >>>>>>>
> >>>>>>> 28,079 products are missing from the 12Z cycle. You'll be
> >>>>>>> getting the
> >>>>>>> automated email shortly.
> >>>>>>>
> >>>>>>> -Doug
> >>>>>>>
> >>>>>>> On Jun 15, 2007, at 12:48 PM, Justin Cooke wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Steve,
> >>>>>>>>
> >>>>>>>> I've turned off the feed to LDM2.
> >>>>>>>>
> >>>>>>>> There is no other load on the ldm1 system except for LDM.
> >>>>>>>>
> >>>>>>>> Doug, are you missing many of the TIGGE params for 12Z?
> >>>>>>>>
> >>>>>>>> Justin
> >>>>>>>>
> >>>>>>>> Steve Chiswell wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Justin,
> >>>>>>>>>
> >>>>>>>>> That didn't change the behavior. Still seeing latency.
> >>>>>>>>> Perhaps try turning off the other feed. Is there any load
> >>>>>>>>> other than LDM on the system?
> >>>>>>>>>
> >>>>>>>>> Steve
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 2007-06-15 at 12:56 -0400, Justin Cooke wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Steve,
> >>>>>>>>>>
> >>>>>>>>>> I've recreated the queue, let me know if you are still seeing
> >>>>>>>>>> issues.
> >>>>>>>>>>
> >>>>>>>>>> If so I'll turn off the feed to ldm2 to see if that corrects
> >>>>>>>>>> things.
> >>>>>>>>>>
> >>>>>>>>>> Justin
> >>>>>>>>>>
> >>>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Justin,
> >>>>>>>>>>>
> >>>>>>>>>>> I don't know if they saw a disk space problem with
> >>>>>>>>>>> log files not being rotated, but it might just be
> >>>>>>>>>>> best today to build a new queue:
> >>>>>>>>>>>
> >>>>>>>>>>> ldmadmin stop      # stop the LDM server
> >>>>>>>>>>> ldmadmin delqueue  # delete the (possibly corrupt) product queue
> >>>>>>>>>>> ldmadmin mkqueue   # create a fresh, empty queue
> >>>>>>>>>>> ldmadmin start     # restart the LDM server
> >>>>>>>>>>>
> >>>>>>>>>>> That will mean some queued data is lost, but if users aren't
> >>>>>>>>>>> getting it anyway, then it's best to ensure that the queue isn't
> >>>>>>>>>>> corrupt for the weekend.
> >>>>>>>>>>>
> >>>>>>>>>>> Happy Friday....
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Steve
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, 2007-06-15 at 12:13 -0400, Justin Cooke wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Steve,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Our logs on the primary ldm system "ldm1" had not rotated for
> >>>>>>>>>>>> nearly a week. I sent email to the WOC support and this was the
> >>>>>>>>>>>> response:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Looks like the seed file was missing after we brought the
> >>>>>>>>>>>> system back up from the last outage. Should be good now.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justin Cooke wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> WOC,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I noticed that our logs for LDM have not been rotated on
> >>>>>>>>>>>>> machine
> >>>>>>>>>>>>> ldm1
> >>>>>>>>>>>>> since 06/05/2007. We have a cron entry that runs "ldmadmin
> >>>>>>>>>>>>> newlog" at
> >>>>>>>>>>>>> 00Z every day.
> >>>>>>>>>>>>>
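> >>>>>>>>>>>>> (For reference, that entry is presumably something like the
> >>>>>>>>>>>>> following -- a sketch; the relative ldmadmin path and a UTC
> >>>>>>>>>>>>> system clock are assumptions:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     0 0 * * * bin/ldmadmin newlog
> >>>>>>>>>>>>> )
> >>>>>>>>>>>>>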
> >>>>>>>>>>>>> I attempted to run the command by hand and got the
> >>>>>>>>>>>>> following back:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ldm@ldm1:~$ bin/ldmadmin newlog
> >>>>>>>>>>>>> hupsyslog: couldn't open /var/run/syslogd.pid
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I checked, and /var/run/syslogd.pid is not there, but it is
> >>>>>>>>>>>>> on ldm2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Could there be a problem with syslogd on ldm1?
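> >>>>>>>>>>>>>
> >>>>>>>>>>>>> (A quick check -- a sketch, assuming a Linux host:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     ps -C syslogd -o pid=        # is syslogd running at all?
> >>>>>>>>>>>>>     ls -l /var/run/syslogd.pid   # PID file hupsyslog signals
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If syslogd is running but the PID file is missing, restarting
> >>>>>>>>>>>>> syslogd should recreate it.)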
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Also around that time I turned on our backup feed to the ldm2
> >>>>>>>>>>>> system, which had been off since that system had issues a few
> >>>>>>>>>>>> weeks ago (we were asked by WOC to turn it back on). I have
> >>>>>>>>>>>> sent
> >>>>>>>>>>>> email to their support group asking if both ldm1 and ldm2 are
> >>>>>>>>>>>> responding to the ncepldm.woc.noaa.gov address or if something
> >>>>>>>>>>>> else is going on.
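> >>>>>>>>>>>>
> >>>>>>>>>>>> (One quick check from the outside -- a sketch, assuming
> >>>>>>>>>>>> standard DNS tools are available:
> >>>>>>>>>>>>
> >>>>>>>>>>>>     dig +short ncepldm.woc.noaa.gov
> >>>>>>>>>>>>
> >>>>>>>>>>>> If the name round-robins between the two hosts, both addresses
> >>>>>>>>>>>> would show up.)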
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justin
> >>>>>>>>>>>>
> >>>>>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Justin,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yesterday just after 18Z, the data flow from
> >>>>>>>>>>>>> ncepldm.woc.noaa.gov
> >>>>>>>>>>>>> to top level sites at NSF and Unidata both began showing high
> >>>>>>>>>>>>> latency:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Data volume out has dropped as a result:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Since the behavior is similar at both sites at separate
> >>>>>>>>>>>>> locations, the
> >>>>>>>>>>>>> problem would appear to be near your end. Since that coincides
> >>>>>>>>>>>>> with your
> >>>>>>>>>>>>> restart of the LDM, could you fill me in on the issues you
> >>>>>>>>>>>>> were
> >>>>>>>>>>>>> experiencing?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Steve Chiswell
> >>>>>>>>>>>>> Unidata User Support
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, 2007-06-15 at 11:38 -0400, Justin Cooke wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Doug,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I had to restart our LDM yesterday right before the 18Z
> >>>>>>>>>>>>>> cycle; we had an issue with our logging, but none of the
> >>>>>>>>>>>>>> configuration files changed. Could one of your feeds have
> >>>>>>>>>>>>>> lost the connection to our LDM during that restart?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Douglas Schuster wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yes, we've received partial cycles. More than half of the
> >>>>>>>>>>>>>>> expected fields have been missing in each cycle from June 14
> >>>>>>>>>>>>>>> 18Z to June 15 06Z. The number of missing fields varies from
> >>>>>>>>>>>>>>> cycle to cycle.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Doug
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jun 15, 2007, at 9:11 AM, Justin Cooke wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Doug,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Have you received any GEFS data from us today? Or is it
> >>>>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>> certain fields you are missing?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>> --
> >>> Chi Y. Kang
> >>> Contractor
> >>> Principal Engineer
> >>> Phone: 301-713-3333 x201
> >>> Cell: 240-338-1059
> >>>
> >>>
>
>
--
Steve Chiswell <address@hidden>
Unidata