This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Well I took it one more step, forward this time. I couldn't find 2.2.17,
but instead got 2.2.16-22. It works!! I haven't let it run very long yet,
but so far I can stop and start it at will. With the kernels that didn't
work, I couldn't stop it once it was started without problems. So I am
confident. I'll run it awhile and see how it goes.

Randy

Randy Weatherly wrote:

> Sandy,
> Thanks for the response.
>
> I did upgrade to 2.2.16-4. And also made sure I had the latest drivers
> and even firmware for the hardware. I think it is strictly a kernel
> issue since I have a second Dell quad processor that has different raid
> hardware and it exhibits the same behaviour. I just replied to Robb's
> message about dropping back to an old kernel and getting it to work.
>
> I thought about progressing up through the kernels until I find the one
> that it breaks on and then posting a message on Bugzilla to see if
> anyone knows what changed.
>
> But for now it works with the old kernel, and other than security fixes
> I'm not sure what value the newer kernels have for me.
>
> Randy
>
> Sandy Whitesel wrote:
>
> > since this is obviously an issue in kernel, the first thing i would do
> > is update it. redhat-6.2 has a "patch" which upgrades the kernel to
> > 2.2.16 (be sure to read all the directions, and update your initrd and
> > lilo if you're using it). It looks like you are using the default
> > kernel, but if you need to compile it for some reason, you might as
> > well download and compile 2.2.17. you may consider looking for latest
> > versions of drivers (like for your raid card) also. that would be my
> > first steps, and if that didn't solve the problem, i'd get more
> > creative.
> >
> > -sandy
> >
> > Robb Kambic wrote:
> >
> > > Randy, et al
> > >
> > > I'm not an expert with Linux so I'm cc'ing our sys admin who knows
> > > much more about Linux.
> > > Maybe he can shed some light on the problem that appears to be the
> > > management of the LDM queue (memory mapped file). There are some
> > > other IDD sites that use Linux, maybe they can also add some input.
> > > My gut opinion is your system configuration about memory mapped
> > > files needs to be changed or the LDM interaction with queue needs to
> > > be changed. So maybe you could do some research in that direction.
> > > Is it possible to run the LDM only on one of the four processors?
> > >
> > > Robb...
> > >
> > > On Fri, 22 Sep 2000, Randy Weatherly wrote:
> > >
> > > > Robb,
> > > >
> > > > Thanks for your response to Jason's email. We are still having
> > > > problems and I would like to run a few more things past you.
> > > >
> > > > We have two Redhat Linux 6.2 SMP machines. Both running
> > > > 2.2.14-6.1.1smp (if that matters). Both machines are 4 processor
> > > > Dell servers, Pentium III chips and both have hardware based raid.
> > > > Both exhibit the same behaviour although one of them is more
> > > > problematic.
> > > >
> > > > I've been spending most of my time on ted. It is the more
> > > > problematic of the two. maul runs, but will die occasionally.
> > > > Here's what I've found.
> > > >
> > > > I've tried both binaries and source, 5.0.8 and 5.1.2. All behave
> > > > the same. So I don't think it is a problem with LDM, but rather
> > > > something between ldm and this version of Linux on this type of
> > > > machine.
> > > >
> > > > I am able to get ldm to stop and start pretty well if I stop it,
> > > > delete the queue, make the queue, and start it.
> > > > But if I stop it, then try and start (after making sure no
> > > > processes are left running), then I get the kernel errors in the
> > > > system log file:
> > > >
> > > > Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
> > > > Sep 18 21:47:32 ted kernel: current->tss.cr3 = 346dc000, %cr3 = 346dc000
> > > > Sep 18 21:47:32 ted kernel: *pde = 00000000
> > > > Sep 18 21:47:32 ted kernel: Oops: 0000
> > > > Sep 18 21:47:32 ted kernel: CPU: 0
> > > > Sep 18 21:47:32 ted kernel: EIP: 0010:[locks_remove_flock+14/148]
> > > > Sep 18 21:47:32 ted kernel: EFLAGS: 00010296
> > > > Sep 18 21:47:32 ted kernel: eax: 00000000 ebx: eeee0ca0 ecx: eeee0ca0 edx: 00000000
> > > > Sep 18 21:47:32 ted kernel: esi: 00000006 edi: f8e72cbc ebp: bffff664 esp: f0851f10
> > > > Sep 18 21:47:32 ted kernel: ds: 0018 es: 0018 ss: 0018
> > > > Sep 18 21:47:32 ted kernel: Process rpc.ldmd (pid: 8209, process nr: 143, stackpage=f0851000)
> > > > Sep 18 21:47:32 ted kernel: Stack: f8e72cbc bffff664 00001020 00000000 bffff674 c0143cc0 eeee0ca0 f8797ed0
> > > > Sep 18 21:47:32 ted kernel:        c0129a0a 00000286 00000002 f98f1da0 40016000 f9301ba0 f98f1ddc eeee0ca0
> > > > Sep 18 21:47:32 ted kernel:        c011f5e0 fbf92680 f98f1da0 f98f1da0 00000286 c012ae29 eeee0ca0 ffffffea
> > > > Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>] [ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>] [unmap_fixup+116/348]
> > > > Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72] [sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>] [sys_munmap+61/100] [system_call+52/56]
> > > > Sep 18 21:47:32 ted kernel:        [<00000001>] [<00000006>] [<00001020>] [<00000037>] [<0000002b>] [<0000002b>] [<00000037>] [<00000023>]
> > > > Sep 18 21:47:32 ted kernel:        [<00000296>] [<0000002b>]
> > > > Sep 18 21:47:32 ted kernel: Code: 8b 40 08 89 44 24 14 83 c0 74 89 44 24 10 8b 4c 24 14 8b 6c
> > > >
> > > > Pretty ugly stuff. Since you mentioned that you thought this was a
> > > > corrupt queue, and since if I delete and remake the queue it works
> > > > ok, that seems like the right track. On our other machine, maul, I
> > > > see those error messages in the system log file as well. But it
> > > > doesn't fail very often. I think in the last month, it has failed
> > > > only a couple of times.
> > > >
> > > > Any ideas? I'm sure this isn't your normal everyday stuff, but I
> > > > thought if maybe you knew of others that had problems with SMP
> > > > machines or the SMP version of the kernel it might help.
> > > >
> > > > Thanks in advance
> > > >
> > > > Randy Weatherly
> > > >
> > >
> > > ===============================================================================
> > > Robb Kambic                        Unidata Program Center
> > > Software Engineer III              Univ. Corp for Atmospheric Research
> > > address@hidden                     WWW: http://www.unidata.ucar.edu/
> > > ===============================================================================
>
> --
> Randy Weatherly AWIPS/Computer Systems Analyst
> National Weather Service
> Salt Lake City UT
> address@hidden 801-524-5120 x284

--
Randy Weatherly AWIPS/Computer Systems Analyst
National Weather Service
Salt Lake City UT
address@hidden 801-524-5120 x284
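
For reference, the interim workaround described in the thread (stop the LDM, delete the queue, make the queue, start again) could be scripted roughly as below. This is only a sketch under stated assumptions: it assumes the `ldmadmin` script is on the PATH and accepts the subcommands `stop`, `delqueue`, `mkqueue`, and `start`, which should be checked against the installed LDM version. The names `reset_plan` and `reset_queue` are hypothetical helpers introduced here for illustration. Note that deleting the queue discards any products still held in it.

```python
import subprocess

# Assumed ldmadmin subcommands, in the order described in the thread.
# Verify these names against the ldmadmin script of your LDM release.
RESET_STEPS = ["stop", "delqueue", "mkqueue", "start"]


def reset_plan(ldmadmin="ldmadmin"):
    """Return the command lines for a stop/delete/make/start cycle."""
    return [[ldmadmin, step] for step in RESET_STEPS]


def reset_queue(ldmadmin="ldmadmin", dry_run=True):
    """Print the plan (dry run) or execute it, aborting on the first failure."""
    plan = reset_plan(ldmadmin)
    for cmd in plan:
        if dry_run:
            # Only show what would be executed; nothing is changed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a step fails,
            # so the LDM is never started on a queue that was not remade.
            subprocess.run(cmd, check=True)
    return plan


if __name__ == "__main__":
    reset_queue(dry_run=True)
```

Running it with `dry_run=True` only prints the four commands, which makes it easy to confirm the sequence before letting it touch a live queue.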