20030630: IDD feeds from LSU to any non LSU downstream sites (cont.)
- Subject: 20030630: IDD feeds from LSU to any non LSU downstream sites (cont.)
- Date: Mon, 30 Jun 2003 11:52:54 -0600
>From: Robert Leche <address@hidden>
>Organization: LSU
>Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
Bob,
>As we have switched to 'event mode' with the hurricane in the gulf, I
>have had to drop the network investigation. Today is out, and at least
>part of tomorrow. Also, I have lost email over the last 4 days.
>Please resend the emails you sent from Friday on.
>
>Speaking of hurricanes, our computer "Hurricane" died. I am in the
>process of rebuilding it with Gempak to bring to the Office of Emergency
>Preparedness. Murphy's Law.. If it can fail, it will!
The most important thing I asked for in the emails I sent since last Friday
was for you to contact the telecomm folks at LSU and/or LANET to find out
what they may have done over the weekend that first made the HDS latencies
from seistan to zero.unidata.ucar.edu drop significantly starting on
Friday evening, and then rise back up starting on Sunday afternoon.
Whatever was done holds the key to finally closing out the
feed problems being experienced by sites downstream of LSU.
Tom
Here are all of the messages I sent to you since last Friday morning:
From address@hidden Sun Jun 29 20:06:45 2003
To: address@hidden
cc: address@hidden, Kevin Robbins <address@hidden>
Subject: 20030628: HDS feed to/from seistan (cont.)
>From: Unidata Support <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
Bob,
Well, after most of a weekend of pretty good HDS latencies from seistan
to zero.unidata.ucar.edu, the feed problems reappeared. This can be
seen in the 'latency' plot on the real-time statistics page:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?HDS+zero.unidata.ucar.edu
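As an aside, the plot behind that URL can also be pulled down
programmatically if you want to keep snapshots. Here is a minimal Python
sketch; it assumes nothing beyond the CGI answering a plain HTTP GET the
same way it does in a browser:

# Fetch the HDS latency plot for zero.unidata.ucar.edu from the Unidata
# real-time statistics CGI and save whatever it returns for later review.
import urllib.request

URL = ("http://www.unidata.ucar.edu/cgi-bin/rtstats/"
       "iddstats_nc?HDS+zero.unidata.ucar.edu")

with urllib.request.urlopen(URL) as resp:
    data = resp.read()

with open("hds_latency_zero.out", "wb") as out:
    out.write(data)
print("saved", len(data), "bytes from rtstats")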
The questions now are:
- what changed at LSU/LANET on Saturday to make the latencies drop to near
what they should be
- what changed at LSU/LANET on Sunday afternoon to make the latencies
climb to their previous bad levels
I think a call to the LSU telecomm folks is in order. If you can't get
anywhere with them (please try, you should have more clout with them
than us), can you send along their contact information?
From address@hidden Sat Jun 28 07:53:30 2003
To: address@hidden
cc: address@hidden, Kevin Robbins <address@hidden>
Subject: 20030627: HDS feed to/from seistan (cont.)
>From: Robert Leche <address@hidden>
>Organization: LSU
>Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
Hi Bob,
>take a look at the following two cases. Notice the LSU to ULM hop is
>via the network address translation firewall:
>dynip422.nat.ulm.edu (line 6). Interestingly, ULM does not pass through
>the same NAT/firewall process. I believe this could offer a clue.
>The traceroute report is missing the last 3 hops, and until the firewall at
>ULM is opened to allow you to ping tornado we will not have a complete picture.
I don't think that this has anything to do with the feed problems we
were seeing from LSU to others. It only explains the inability to do
complete traceroutes to ULM. This is/was not part of the feed problems
we have been seeing.
>Somehow two different paths are connecting ULM, and this suggests a
>reason why it takes more time to send packets from Seistan to
>Tornado.
It does not explain the asymmetry in the feeds to/from UCAR. ULM has been
out of the picture as far as high-volume data feeds from seistan are
concerned for well over a week now. Ever since I switched them to feed from
CU/CIRES (rainbow.al.noaa.gov), their HDS latencies have been at or
very near zero.
Now, back to the problem at hand. Something significant changed last
night:
- the HDS latencies from seistan to zero.unidata.ucar.edu dropped to
near zero after a spike at around 7Z
- for the first time since setting up the feed test from emo.unidata.ucar.edu
to seistan and then back out to zero.unidata.ucar.edu, all HDS data was
relayed from seistan to zero.unidata.ucar.edu
- latencies for all feeds from seistan to tornado.geos.ulm.edu
(e.g., FSL2, IDS|DDPLUS, UNIWISC, and NNEXRAD) dropped significantly
Given these three observations from the real time statistics page:
http://www.unidata.ucar.edu/staff/chiz/rtstats/siteindex.shtml
for seistan.srcc.lsu.edu, zero.unidata.ucar.edu, and tornado.geos.ulm.edu
I conclude that something changed in the network path out of LSU or
in LANET.
Did you receive a change notification from the LSU telecomm folks? If
not, will you contact them to find out exactly what was done? A complete
picture of what went wrong and its fix will help others if they run into
similar problems.
From address@hidden Fri Jun 27 12:11:39 2003
To: address@hidden
cc: address@hidden, Kevin Robbins <address@hidden>
Subject: 20030627: HDS feed to/from seistan (cont.)
>From: Robert Leche <address@hidden>
>Organization: LSU
>Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
Hi Bob,
re: ULM rerouted their traffic from I2 to "I1"
>I did not know this happened, but it explains why ULM is able to communicate
>with rainbow.al.noaa.gov.
The ULM folks told us that during a total outage at LSU at some point in
the past they fed from thelma.ucar.edu and experienced no problems. This
predated both your and ULM's upgrades to LDM-6 by quite a bit.
Here is a portion of the original note we received about the problems ULM
was having feeding from srcc.lsu.edu:
"For more than a year, we have been having serious data feed problems
when our upstream site is at LSU (sirocco). We have tried everything
that we can, including contacting LSU repeatedly, but cannot seem to
resolve the situation satisfactorily. We have worked extensively with
our network people and believe that the problem is at LSU. We are
basing this conclusion on the fact that, while sirocco was down and we
were feeding from Unidata's thelma machine, everything was fine. We
received all data without significant losses. However, once sirocco
came on-line again and we switched over to them, we began to experience
substantial losses of data. Our fallback site is OU's stokes machine
and we have used them in the past, but they are feeding so many sites
that we tend to fall significantly behind in the data feed.
Can you help us resolve this problem?"
>It would be interesting to also force an I1 connection to LSU and repeat
>the test.
I agree, running feed tests using a different route to/from LSU would
certainly be welcome.
re: "I1"
>Internet one?
That is what we asked.
>A better question in this case is: what is I2 in the context
>of the LANET SONET connecting ULM to LANET?
Here is the route from ULM to seistan.srcc.lsu.edu:
Matt's traceroute [v0.49]
tornado.geos.ulm.edu Fri Jun 27 10:56:14 2003
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. 10.16.0.1 0% 18 18 1 1 1 1
2. 10.1.1.1 0% 18 18 0 0 0 1
3. 198.232.231.1 0% 18 18 0 0 0 1
4. laNoc-ulm.LEARN.la.net 0% 17 17 13 13 19 76
5. lsubr-laNoc.LEARN.la.net 0% 17 17 14 14 15 26
6. howe-e241a-4006-dsw-1.g1.lsu.edu 0% 17 17 18 15 22 50
7. seistan.srcc.lsu.edu 0% 17 17 15 14 19 42
This can be compared with LSU's route from seistan to tornado.geos.ulm.edu:
Matt's traceroute [v0.49]
seistan.srcc.lsu.edu Fri Jun 27 10:58:56 2003
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. 130.39.188.1 0% 11 11 4 1 2 5
2. lsubr1-118-6509-dsw-1.g2.lsu.edu 0% 11 11 1 0 1 1
3. laNoc-lsubr.LEARN.la.net 0% 11 11 2 1 2 4
4. ulm-laNoc.LEARN.la.net 0% 11 11 14 14 36 91
5. 198.232.231.2 0% 11 11 29 14 41 127
6. dynip422.nat.ulm.edu 0% 11 11 16 15 25 61
7. tornado.geos.ulm.edu 0% 10 10 15 14 16 23
Resolver: Received error response 2. (server failure)
>My limited understanding of
>what I2 is, is that traffic is I2 if it passes through Abilene's system.
I believe that is correct.
>That being the case, unless ULM is passing through Abilene's routers, ULM
>is really on I1 anyway.
Please see the route above. This, at least, reflects ULM's current
connection to LSU. UCAR's connection to ULM, however, traverses I2
until Houston where the bridge is made to LEARN.La.Net:
zero.unidata.ucar.edu -> tornado.geos.ulm.edu:
Matt's traceroute [v0.44]
zero.unidata.ucar.edu Fri Jun 27 12:02:58 2003
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. flra-n140.unidata.ucar.edu 0% 71 71 0 0 0 29
2. gin-n243-80.ucar.edu 0% 71 71 0 0 0 6
3. frgp-gw-1.frgp.net 0% 71 71 1 1 2 25
4. 198.32.11.105 0% 71 71 1 1 1 6
5. kscyng-dnvrng.abilene.ucaid.edu 0% 71 71 12 12 13 26
6. hstnng-kscyng.abilene.ucaid.edu 0% 71 71 27 27 27 27
7. laNoc-abileneHou.LEARN.La.Net 0% 71 71 33 32 33 36
8. ulm-laNoc.LEARN.La.Net 0% 70 70 45 45 46 71
9. ???
tornado.geos.ulm.edu -> zero.unidata.ucar.edu
Matt's traceroute [v0.49]
tornado.geos.ulm.edu Fri Jun 27 13:04:05 2003
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. 10.16.0.1 0% 4 4 1 1 1 1
2. 10.1.1.1 0% 4 4 0 0 0 0
3. 198.232.231.1 0% 4 4 0 0 0 0
4. laNoc-ulm.LEARN.la.net 0% 4 4 13 13 13 13
5. abileneHou-laNoc.LEARN.la.net 0% 4 4 18 18 25 45
6. kscyng-hstnng.abilene.ucaid.edu 0% 3 3 34 34 34 34
7. dnvrng-kscyng.abilene.ucaid.edu 0% 3 3 44 44 44 44
8. 198.32.11.106 0% 3 3 44 44 44 45
9. gin.ucar.edu 0% 3 3 46 45 45 46
10. flrb.ucar.edu 0% 3 3 45 45 46 46
11. zero.unidata.ucar.edu 0% 3 3 56 45 49 56
re: ULM rerouted away from the problematic I2 connection
>LANET indicated this trouble ticket
>has been open for "some time". We do not know what "some time" means in terms
>of days or months.
It would be useful to know how long that trouble ticket has been open.
>CRC, and retransmission errors are consistent with delays
>in network traffic.
I agree.
re: is CRC and retransmission (trouble ticket at LANET) affecting LSU also
>I think the communication issue will need to be resolved before we will
>know.
The really strange part is the asymmetry in the problem. We are
feeding seistan.srcc.lsu.edu the HDS stream from emo.unidata.ucar.edu
with no latencies, while at the same time we are _unable_ to feed the
data back to a different machine here at the UPC, zero.unidata.ucar.edu
(zero and emo are in the same room on the same subnet). Given that,
perhaps a look at the route from Unidata to seistan and back again
would be instructive:
zero.unidata.ucar.edu -> seistan.srcc.lsu.edu
Matt's traceroute [v0.44]
zero.unidata.ucar.edu Fri Jun 27 10:16:40 2003
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. flra-n140.unidata.ucar.edu 0% 8 8 10 0 1 10
2. gin-n243-80.ucar.edu 0% 8 8 0 0 0 0
3. frgp-gw-1.frgp.net 0% 8 8 1 1 1 2
4. 198.32.11.105 0% 8 8 1 1 1 1
5. kscyng-dnvrng.abilene.ucaid.edu 0% 8 8 22 12 13 22
6. hstnng-kscyng.abilene.ucaid.edu 0% 8 8 27 27 27 27
7. laNoc-abileneHou.LEARN.La.Net 0% 8 8 33 33 33 33
8. lsubr-laNoc.LEARN.La.Net 0% 8 8 34 34 34 34
9. howe-e241a-4006-dsw-1.g2.lsu.edu 0% 8 8 39 35 37 42
10. seistan.srcc.lsu.edu 0% 7 7 34 34 34 35
seistan.srcc.lsu.edu -> zero.unidata.ucar.edu
Matt's traceroute [v0.49]
seistan.srcc.lsu.edu Fri Jun 27 11:15:53 2003
Keys: D - Display mode R - Restart statistics Q - Quit
Packets Pings
Hostname %Loss Rcv Snt Last Best Avg Worst
1. 130.39.188.1 0% 14 14 1 1 3 16
2. lsubr1-118-6509-dsw-1.g2.lsu.edu 0% 14 14 0 0 1 6
3. laNoc-lsubr.LEARN.la.net 0% 14 14 2 1 2 5
4. abileneHou-laNoc.LEARN.la.net 0% 14 14 8 7 16 46
5. kscyng-hstnng.abilene.ucaid.edu 0% 14 14 23 22 22 23
6. dnvrng-kscyng.abilene.ucaid.edu 0% 14 14 33 33 36 71
7. 198.32.11.106 0% 14 14 34 33 36 59
8. gin.ucar.edu 0% 14 14 35 34 35 45
9. flrb.ucar.edu 0% 14 14 34 34 35 45
10. zero.unidata.ucar.edu 0% 13 13 34 34 36 57
The major difference I notice between the routes is that the route from zero
to seistan goes through howe-e241a-4006-dsw-1.g2.lsu.edu, while the
route from seistan to zero goes through lsubr1-118-6509-dsw-1.g2.lsu.edu.
Perhaps this is a big clue that we are overlooking? Could it be
that there is something amiss on the howe-e241a-4006-dsw-1.g2.lsu.edu
gateway/router?
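If it helps when talking to the telecomm folks, here is a rough Python
sketch that simply prints the two traceroutes above side by side (the hop
names are copied by hand from the mtr output; nothing here queries the
network), which makes the point where the forward and reverse paths diverge
on the LSU campus easy to see:

# Forward and reverse hop lists, copied by hand from the traceroutes above.
forward = [  # zero.unidata.ucar.edu -> seistan.srcc.lsu.edu
    "flra-n140.unidata.ucar.edu", "gin-n243-80.ucar.edu", "frgp-gw-1.frgp.net",
    "198.32.11.105", "kscyng-dnvrng.abilene.ucaid.edu",
    "hstnng-kscyng.abilene.ucaid.edu", "laNoc-abileneHou.LEARN.La.Net",
    "lsubr-laNoc.LEARN.La.Net", "howe-e241a-4006-dsw-1.g2.lsu.edu",
    "seistan.srcc.lsu.edu",
]
reverse = [  # seistan.srcc.lsu.edu -> zero.unidata.ucar.edu
    "130.39.188.1", "lsubr1-118-6509-dsw-1.g2.lsu.edu",
    "laNoc-lsubr.LEARN.la.net", "abileneHou-laNoc.LEARN.la.net",
    "kscyng-hstnng.abilene.ucaid.edu", "dnvrng-kscyng.abilene.ucaid.edu",
    "198.32.11.106", "gin.ucar.edu", "flrb.ucar.edu", "zero.unidata.ucar.edu",
]

# Print the forward path next to the reverse path read backwards, so each
# row is roughly the same leg of the trip as seen from the two ends.
for i, (fwd, rev) in enumerate(zip(forward, reversed(reverse)), start=1):
    print("%2d. %-36s | %s" % (i, fwd, rev))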
re: What did the telecomm folks have to say about the asymmetry seen moving
data to/from srcc.lsu.edu from zero.unidata.ucar.edu?
>The issue of asymmetry was not the paramount issue with telecom. Again, the
>telecom guys want to wait until the communications issues are fixed, as
>they believe the errors in the circuit are causing the problems between LSU
>and ULM.
The problem is not _just_ between LSU and ULM. We (zero.unidata.ucar.edu)
are seeing the exact same problem that ULM was seeing when trying to
feed HDS from seistan.srcc.lsu.edu. Moreover, we saw the exact same
problem during our test of feeding the HDS stream from
seistan.srcc.lsu.edu to the University of South Florida machine,
metlab.cas.usf.edu. The problem most likely exists between seistan
and Jackson State, but we can't verify this because they are not reporting
stats AND we do not have current contact information for them.
If the LSU telecomm folks are under the impression that the only
problem is between LSU and ULM, then they need to be contacted and made
aware of the problems going to such diverse sites as UCAR and USF.
>From address@hidden Fri Jun 27 07:34:11 2003
To: address@hidden
cc: Kevin Robbins <address@hidden>, address@hidden
Subject: 20030626: 20030624: HDS feed to/from seistan (cont.)
>From: Robert Leche <address@hidden>
>Organization: LSU
>Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
Hi Bob,
>In talking with our telecommunications people:
>
>1) The Louisiana Office of Telecommunications ("LANET") was contacted with
>the problem, and LANET reports Bell South (the state's communications provider)
>has an open trouble ticket on the public switched SONET network connecting
>ULM to the LANET. The trouble ticket reports: CRC, Retransmission errors.
>This is a DS-3 Private Virtual Circuit (PVC) on the public switched SONET
>network connecting ULM to the LANET.
This sounds like the problem we uncovered at ULM. They contacted their
service provider and rerouted their traffic from I2 to "I1". We
never did get a reply from them as to what "I1" means. After they
rerouted away from their problematic I2 connection, we were able to feed
all of HDS to them with virtually no latency.
>LANET indicated this trouble ticket
>has been open for "some time". We do not know what "some time" means in terms
>of days or months. CRC, and retransmission errors are consistent with delays
>in network traffic.
Is this also affecting the LSU connection? If not, there is still a
problem to be solved.
>2) Concerning Ping (ICMP):
> A) LSU has limitations placed on ICMP payload sizes to limit "the
>Ping Of Death" hacks. So it is interesting that even though LSU has this
>policy in place we can demonstrate large ICMP traffic to correctly query
>systems other than ULM but not ULM.
OK.
> B) The telecommunications people pointed out that Cisco router interface
>ping (ICMP) buffers have a hard limitation of 18,000 bytes. Unix/Linux
>systems do not have this issue. So the theory goes.... Ping LSU's Cisco
>border router, then LANET's Cisco border router, and problems seem apparent.
>Yet ping a UNIX device with a large payload beyond the Cisco device and
>travel time delays suddenly do not seem excessive.
I understand. Even so, the pings with large ICMP packets from
seistan (RedHat 7.2 Linux) to zero.unidata.ucar.edu (Sun Solaris SPARC
5.9) show dramatic round trip time increases once the ping packet size
exceeds 20KB. The 18000 byte limit you note does seem consistent with what
we were seeing when trying to ping laNoc-lsubr.LEARN.la.net.
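If it would help the telecomm folks reproduce this, the packet-size sweep
is easy to script. A minimal Python sketch, using Linux ping syntax as on
seistan; the target host is zero, and the list of payload sizes is just an
illustrative set bracketing the 18 KB / 20 KB region:

# Send a few ICMP echo requests at increasing payload sizes and print the
# round-trip-time summary line from each run.  Linux ping syntax:
# -c = packet count, -s = payload size in bytes.
import subprocess

HOST = "zero.unidata.ucar.edu"
SIZES = [56, 1400, 8000, 16000, 20000, 24000, 32000]  # illustrative payloads

for size in SIZES:
    result = subprocess.run(
        ["ping", "-c", "3", "-s", str(size), HOST],
        capture_output=True, text=True)
    summary = [ln for ln in result.stdout.splitlines() if "min/avg/max" in ln]
    print("%6d bytes: %s" % (size, summary[0] if summary else "no reply"))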
>3) It would be interesting to know who ULM is feeding HDS from. Chances are,
>the communications circuit they are currently using is the same DS-3 circuit
>that LANET uses.
Right now, ULM is feeding HDS from rainbow.al.noaa.gov (this is a
CU/CIRES lab here in Boulder). We also fed them with no latency from
emo.unidata.ucar.edu.
>4) Limitations placed on ICMP payload sizes on any device in a network's
>path will cause problems in using ICMP round trip time to measure network
>metrics. But at this time, I do not have an alternative method to measure
>network latencies. My network guy said network latencies issues are handled
>by the circuit provider. No help there.
The ping packet size issue was just an interesting observation. The
real issue is the latency when feeding the HDS stream out of LSU as
compared to virtually no latency when feeding the HDS stream _into_
LSU. This observation is something that the telecomm people should be
able to use to help isolate where the throttling is occurring on or near
the LSU campus. The fact that we were able to feed ULM all of the HDS
data from at least two other sites, and that we can feed seistan but
cannot feed HDS from seistan, shows us that the problem is not at ULM,
but at LSU.
What did the telecomm folks have to say about the asymmetry seen moving
data to/from srcc.lsu.edu?
Tom
>From address@hidden Mon Jun 30 12:30:54 2003
>To: Unidata Support <address@hidden>
>Subject: Re: 20030630: IDD feeds from LSU to any non LSU downstream sites
>(cont.)
>Tom,
>thanks for sending the email to me.
>The LSU telcom folks report no changes were made to the LSU network
>configuration over the weekend. The LANET part of this remains to be
>answered, and our telcom folks will contact them.
>Just to let you know, we have not made any changes to Seistan either.