20030409: ldm-6.0.10 issues under irix 6.5
- Subject: 20030409: ldm-6.0.10 issues under irix 6.5
- Date: Wed, 09 Apr 2003 11:24:55 -0600
Pete,
> To: address@hidden,
> To: address@hidden
> From: address@hidden (Pete Pokrandt)
> Subject: ldm-6.0.10 issues under irix 6.5
> Organization: University of Wisconsin
The above message contained the following:
> Built, and am running, ldm-6.0.10 on two linux machines (f5.aos.wisc.edu,
> and profhorn.aos.wisc.edu, mapmaker will be next), all good on that
> front.
>
> However, when I built on our SGI running irix 6.5.15m, using
> gcc compilers (freeware version 3.0.1), I get assertion failures
> and core dumps.
I'll try to duplicate your problem here on our IRIX 6.5 system.
> I'm going to have to fail back to 6.0.2 for now (which, after
> core-dumping before I rebuilt the queue file the first time,
> had been running ok.)
>
>
> Unidata support: here are some excerpts from the log files when 6.0.10
> crashes under irix.
>
> This time, it just died, but did not dump core:
>
> Apr 08 21:33:43 5Q:sunset zeus(feed)[1782]: topo: zeus.lsc.vsc.edu DIFAX
> Apr 08 21:36:54 5Q:sunset kelvin[1991]: ldmprog_4: ldmping from kelvin.ca.uky.edu
> Apr 08 21:43:02 5Q:sunset rpc.ldmd[1697]: child 1710 terminated by signal 9
What was process 1710?
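One way to answer this from the log itself is to search it for entries tagged with that PID. The following is only a sketch: the helper name `names_for_pid` and the regular expression are my own, and it assumes the `name[pid]:` tag layout seen in the excerpts quoted here. An empty result means the process never logged under its own PID in the lines examined.

```python
import re

# Matches the "name[pid]:" tag that follows the host field in these log lines.
# Assumption: the tag layout is the one shown in the quoted excerpts.
TAG = re.compile(r"(\S+)\[(\d+)\]:")

def names_for_pid(lines, pid):
    """Return the distinct process tags that logged under the given PID."""
    names = []
    for line in lines:
        m = TAG.search(line)  # first tag on the line is the logging process
        if m and int(m.group(2)) == pid and m.group(1) not in names:
            names.append(m.group(1))
    return names

# Sample lines copied from the quoted log.
sample = [
    "Apr 08 21:33:43 5Q:sunset zeus(feed)[1782]: topo: zeus.lsc.vsc.edu DIFAX",
    "Apr 08 21:43:02 5Q:sunset rpc.ldmd[1697]: child 1710 terminated by signal 9",
    "Apr 08 21:53:17 5Q:sunset f5[1711]: SIGTERM",
]

print(names_for_pid(sample, 1782))
print(names_for_pid(sample, 1710))
```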
Signal 9 is SIGKILL, which cannot be caught or ignored by a process and
is actually handled by the operating system on behalf of the "receiving"
process. Because this signal isn't used by the LDM package, the only
way a process of the LDM package could be "sent" this signal is by an
outside source.
Who or what "sent" the SIGKILL to process 1710?
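To illustrate why a process can neither catch nor ignore SIGKILL, here is a minimal sketch in Python (not LDM code; `can_install_handler` is a hypothetical helper). It assumes a POSIX system, where the operating system rejects any attempt to install a handler for SIGKILL.

```python
import signal

def can_install_handler(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The operating system refuses handlers for SIGKILL (and SIGSTOP).
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the original disposition back
    return True

print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

Because the receiving process gets no chance to run any code of its own, it cannot log the event; only the parent's "terminated by signal 9" message records it.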
> Apr 08 21:43:15 3Q:sunset DCNLDN[1692]: nldninput(): no data within timeout period: returning EOF
> Apr 08 21:43:15 3Q:sunset DCNLDN[1692]: nldninput(): NLDN read error
> Apr 08 21:43:19 5Q:sunset pqact[1699]: child 1692 exited with status 110
> Apr 08 21:46:54 5Q:sunset kelvin[2688]: ldmprog_4: ldmping from kelvin.ca.uky.edu
> Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: child 1685 terminated by signal 11
What was process 1685?
Signal 11 is SIGSEGV and indicates an invalid memory reference, typically
an attempt to access memory that isn't in the address space of the process.
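The distinction between "terminated by signal N" and "exited with status N" in the log comes from how a parent interprets a child's wait status. The following is a rough sketch in Python rather than the LDM's own code; `describe_exit` is a hypothetical helper, and the child here simply sends itself SIGSEGV to stand in for a real memory fault.

```python
import signal
import subprocess
import sys

def describe_exit(returncode):
    """Render a subprocess return code in the style of the log messages."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        signum = -returncode
        return f"terminated by signal {signum} ({signal.Signals(signum).name})"
    return f"exited with status {returncode}"

# Child process that raises SIGSEGV against itself (a stand-in for a crash).
child = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
)

print(describe_exit(child.returncode))
print(describe_exit(110))
```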
The rest of the log entries are what I would expect.
> Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: Killing (SIGINT) process group
> Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: SIGINT
> Apr 08 21:52:53 5Q:sunset mapmaker[1706]: SIGINT
> Apr 08 21:52:54 5Q:sunset mapmaker[1713]: SIGINT
> Apr 08 21:52:55 3Q:sunset mapmaker[1706]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:52:55 3Q:sunset mapmaker[1713]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:52:55 3Q:sunset mapmaker[1706]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
> Apr 08 21:52:55 3Q:sunset mapmaker[1713]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
> Apr 08 21:53:15 5Q:sunset rpc.ldmd[1697]: Terminating process group
> Apr 08 21:53:15 5Q:sunset mapmaker[1706]: SIGTERM
> Apr 08 21:53:15 5Q:sunset mapmaker[1713]: SIGTERM
> Apr 08 21:53:15 5Q:sunset pqbinstats[1701]: Interrupt
> Apr 08 21:53:15 5Q:sunset io(feed)[1757]: SIGTERM
> Apr 08 21:53:15 5Q:sunset pqact[1699]: Interrupt
> Apr 08 21:53:15 5Q:sunset f5(feed)[1750]: SIGTERM
> Apr 08 21:53:16 5Q:sunset io(feed)[1757]: SIGINT
> Apr 08 21:53:15 5Q:sunset zeus(feed)[1782]: SIGTERM
> Apr 08 21:53:16 5Q:sunset f5(feed)[1750]: SIGINT
> Apr 08 21:53:16 5Q:sunset pqbinstats[1701]: Exiting
> Apr 08 21:53:15 5Q:sunset shadow(feed)[1739]: SIGTERM
> Apr 08 21:53:15 5Q:sunset storm2(feed)[1743]: SIGTERM
> Apr 08 21:53:16 5Q:sunset kelvin(feed)[1763]: SIGTERM
> Apr 08 21:53:15 5Q:sunset accas(feed)[1746]: SIGTERM
> Apr 08 21:53:16 5Q:sunset zeus(feed)[1769]: SIGTERM
> Apr 08 21:53:16 5Q:sunset shadow(feed)[1739]: SIGINT
> Apr 08 21:53:16 5Q:sunset storm2(feed)[1743]: SIGINT
> Apr 08 21:53:16 5Q:sunset kelvin(feed)[1763]: SIGINT
> Apr 08 21:53:16 5Q:sunset accas(feed)[1746]: SIGINT
> Apr 08 21:53:16 5Q:sunset zeus(feed)[1769]: SIGINT
> Apr 08 21:53:16 5Q:sunset zeus(feed)[1782]: SIGINT
> Apr 08 21:53:16 5Q:sunset pqact[1699]: Exiting
> Apr 08 21:53:16 3Q:sunset pqact[1699]: mm0_mtof: Couldn't riul_r_find 0
> Apr 08 21:53:16 5Q:sunset io(feed)[1759]: SIGTERM
> Apr 08 21:53:16 5Q:sunset rtstats[1703]: Interrupt
> Apr 08 21:53:16 5Q:sunset io(feed)[1759]: SIGINT
> Apr 08 21:53:16 5Q:sunset rtstats[1703]: Exiting
> Apr 08 21:53:17 5Q:sunset f5[1711]: SIGTERM
> Apr 08 21:53:17 5Q:sunset f5[1711]: SIGINT
> Apr 08 21:53:18 5Q:sunset thelma[1707]: SIGTERM
> Apr 08 21:53:18 5Q:sunset thelma[1707]: SIGINT
> Apr 08 21:53:18 3Q:sunset thelma[1707]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:53:18 3Q:sunset thelma[1707]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
> Apr 08 21:53:18 3Q:sunset f5[1711]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:53:18 3Q:sunset f5[1711]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
I'll put the rest of your email aside for now while I try to duplicate
your problem and await your answers to the above questions.
Regards,
Steve Emmerson