- Subject: Re: 20020906: thelma not too good
- Date: Sat, 07 Sep 2002 13:27:02 -0600
Hi all,
The uptime.log is even weirder today. Look at this snippet:
12:06pm up 1 day(s), 21:17, 6 users, load average: 9.14, 10.77, 11.99
12:07pm up 1 day(s), 21:18, 6 users, load average: 7.82, 10.04, 11.64
12:08pm up 1 day(s), 21:19, 6 users, load average: 5.48, 8.91, 11.14
12:09pm up 1 day(s), 21:20, 6 users, load average: 4.11, 7.91, 10.64
12:10pm up 1 day(s), 21:21, 6 users, load average: 3.32, 7.02, 10.14
12:11pm up 1 day(s), 21:22, 6 users, load average: 4.17, 6.61, 9.80
12:12pm up 1 day(s), 21:23, 6 users, load average: 4.37, 6.23, 9.45
12:13pm up 1 day(s), 21:24, 6 users, load average: 4.88, 6.07, 9.20
12:14pm up 1 day(s), 21:25, 6 users, load average: 3.68, 5.58, 8.82
12:15pm up 1 day(s), 21:26, 6 users, load average: 4.66, 5.47, 8.57
12:16pm up 1 day(s), 21:27, 6 users, load average: 4.48, 5.29, 8.31
12:17pm up 1 day(s), 21:28, 6 users, load average: 3.37, 4.86, 7.96
12:18pm up 1 day(s), 21:29, 6 users, load average: 4.57, 4.86, 7.75
12:19pm up 1 day(s), 21:30, 6 users, load average: 5.41, 5.16, 7.68
12:20pm up 1 day(s), 21:31, 6 users, load average: 3.70, 4.70, 7.36
12:21pm up 1 day(s), 21:32, 6 users, load average: 3.75, 4.52, 7.12
12:22pm up 1 day(s), 21:33, 6 users, load average: 2.58, 4.08, 6.80
12:23pm up 1 day(s), 21:34, 6 users, load average: 12.65, 6.56, 7.50
12:24pm up 1 day(s), 21:35, 6 users, load average: 15.59, 8.51, 8.13
12:25pm up 1 day(s), 21:36, 6 users, load average: 17.77, 10.44, 8.84
12:26pm up 1 day(s), 21:37, 6 users, load average: 18.57, 11.98, 9.50
I can't correlate the load spike that starts at 12:23 with anything in the
LDM logs or the system logs. (/var/adm/messages is practically empty.)
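For the record, something like this ought to pull out the relevant windows
(just a sketch; I'm assuming the LDM log is in the usual ~ldm/logs/ldmd.log
spot with syslog-style timestamps, and that nawk is on the path):

    # Pull the minutes around the 12:23 spike out of the LDM log
    # (assumes syslog-style "Sep  7 12:2X" timestamps in ldmd.log).
    egrep 'Sep  7 12:2[2-6]' ~ldm/logs/ldmd.log

    # Flag any minute in uptime.log where the 1-minute load jumps by
    # more than 5 over the previous sample.  nawk coerces the field
    # "12.65, 6.56, 7.50" to its leading number, 12.65.
    nawk -F'load average: ' 'NR > 1 && $2 - prev > 5 { print }
                             { prev = $2 }' ~logs/uptime.log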
And, here's a traceroute from thelma to Penn State:
/local/ldm% traceroute ldm.meteo.psu.edu
traceroute: Warning: Multiple interfaces found; using 192.52.106.21 @ ge0
traceroute to ldm.meteo.psu.edu (128.118.28.12), 30 hops max, 40 byte packets
1 vbnsr-dmzfnet (192.52.106.10) 0.698 ms 0.690 ms 0.434 ms
2 mlra-n2 (128.117.2.253) 0.382 ms 0.375 ms 0.594 ms
3 gin-n243-72 (128.117.243.73) 0.849 ms 0.735 ms 0.565 ms
4 frgp-gw-1 (128.117.243.34) 1.543 ms 2.415 ms 1.700 ms
5 198.32.11.105 (198.32.11.105) 2.239 ms 1.709 ms 1.509 ms
6 kscy-dnvr.abilene.ucaid.edu (198.32.8.14) 12.183 ms 12.184 ms 12.815 ms
7 ipls-kscy.abilene.ucaid.edu (198.32.8.6) 22.066 ms 21.362 ms 21.394 ms
8 clev-ipls.abilene.ucaid.edu (198.32.8.26) 27.925 ms 27.939 ms 27.706 ms
9 abilene.psc.net (192.88.115.122) 31.138 ms 30.860 ms 31.129 ms
10 bar-beast.psc.net (192.88.115.17) 31.111 ms 30.987 ms 31.156 ms
11 psu-i2.psc.net (192.88.115.98) 57.862 ms 42.568 ms 73.063 ms
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
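To line the path up against the load spikes, a timestamped traceroute loop
could run alongside the uptime cron job (a sketch, assuming plain sh;
traceroute.log is just a name I made up):

    # Log a timestamped traceroute to Penn State every 5 minutes so
    # hop latencies can be compared against uptime.log afterward.
    while true; do
        date
        traceroute ldm.meteo.psu.edu
        sleep 300
    done >> traceroute.log 2>&1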
From the LDM log, they're definitely losing CONDUIT products.
It would probably be helpful to get 5.2.1 in place on thelma, and to get
rtstats from Harry and Art. I think I'll install 5.2.1 on milton this
weekend and let it run a bit to make sure it's in a usable state.
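(For reference, once 5.2.1 is on, turning on rtstats should just be an exec
line in ldmd.conf. This is from memory, so check the template that ships
with 5.2.1:

    # In ~ldm/etc/ldmd.conf: report realtime stats back to Unidata.
    # (From memory -- verify against the etc/ldmd.conf shipped with 5.2.1.)
    exec    "rtstats -h rtstats.unidata.ucar.edu"

Then an "ldmadmin stop" and "ldmadmin start" so the new exec entry takes
effect.)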
Anne
Tom Yoksas wrote:
>
> >From: anne <address@hidden>
> >Organization: UCAR/Unidata
> >Keywords: 200209070333.g873XUj09291
>
> Anne and Jeff,
>
> >While thelma looked pretty good about 6:30 today, with a load average
> >around 5, now it's not looking so good. The load average was about 14,
> >and it was sluggish in responding.
>
> Nuts.
>
> >There are only 71 rpc.ldmds at the moment, less than the 72 that I
> >thought we were able to handle easily before the reboot. There are lots
> >of reclasses to atm, plus some to sunset.aos.wisc.edu.
>
> >(What's 'aos'?).
>
> This appears to be f5.aos.wisc.edu. They are reporting realtime stats,
> and their latencies don't look good. Seems to me that they should
> be feeding from SSEC, no?
>
> >And connections are being dropped.
>
> So, when the load average goes above some level, data stops getting
> delivered reliably and reclass messages ensue.
>
> >I started a cron job to run uptime every minute to track the load
> >average. The resulting log is in ~logs/uptime.log.
>
> The contents of this file are very interesting. The load average comes
> and goes. We now need to correlate that with CONDUIT data volume (or
> anything else).
>
> It seems to me that we need to jump on getting 5.2.1 ready so we can
> get both Washington and Penn State to upgrade to it and run rtstats.
> This should help us understand what is happening at these sites.
>
> The overnight rtstats from atm and f5.aos are really interesting.
> atm looks OK except for NNEXRAD, and f5 looks bad. I don't know
> what to make of this!
>
> Tom
> --
> +-----------------------------------------------------------------------------+
> * Tom Yoksas                                  UCAR Unidata Program            *
> * (303) 497-8642 (last resort)                P.O. Box 3000                   *
> * address@hidden                              Boulder, CO 80307               *
> * Unidata WWW Service                         http://www.unidata.ucar.edu/    *
> +-----------------------------------------------------------------------------+
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************