kernel oops - system crash


greyhammer
04-06-2005, 11:26 AM
Hello people.

A strange thing happened. I just sshed into a server, and popped back out - I didn't type a single command. Not one - just typed ctrl+D to log out. But from then on, the machine didn't reply to any http, imap, or ssh requests. It was pingable though. Here's the /var/log/messages which looks strange.

Apr 6 04:05:14 eto kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000034
Apr 6 04:05:14 eto kernel: printing eip:
Apr 6 04:05:14 eto kernel: c0158f21
Apr 6 04:05:14 eto kernel: *pde = 00000000
Apr 6 04:05:14 eto kernel: Oops: 0000
Apr 6 04:05:14 eto kernel: ipt_limit ip_conntrack_ftp iptable_mangle iptable_nat ipt_state ip_conntrack iptable_filter ip_tables autofs e100 sg scsi_mod microcode ext3 jbd

more of that till I log in thru ssh as zeus

Apr 6 04:05:20 eto kernel:
Apr 6 04:05:20 eto kernel: Code: 39 6e 44 8b 1b 75 e8 8b 7c 24 34 39 7e 0c 75 df 8b 57 4c 85
Apr 6 04:39:42 eto named[575]: zone soostones.net/IN: refresh: could not set file modification time of 'db.soostones.net': permission denied
Apr 6 07:06:18 eto sshd(pam_unix)[20773]: session opened for user zeus by (uid=2543)
Apr 6 07:06:26 eto sshd(pam_unix)[20773]: session closed for user zeus

and then, the machine just freezes. Doesn't respond to any service requests except ping. Any ideas WHAT this could be? I'll tell you one thing though, this machine does run a lot of memory hogging perl scripts. An lsmod also lists all of the modules above (ipt_limit, ip_conntrack, scsi_mod, microcode, etc.).

I have NO clue. Help would be very much appreciated.

Thanks

XiaoKJ
04-06-2005, 12:56 PM
I don't know what ctrl-d sent to the machine, but did rebooting help? If it did, then refrain from using ctrl-d.

If not, then read your logs. Something seems to have gone wrong and it should be logged.

JavaPunk
04-06-2005, 03:16 PM
Ctrl-D is the logoff command.

bwkaz
04-06-2005, 06:43 PM
Yeah, ctrl-d wouldn't have broken it. Perhaps some part of the kernel's process cleanup code broke, though? When you hit ctrl-d, the kernel had to clean up after both your shell and the ssh process that was handling your connection. Maybe it had something to do with one of those cleanups -- though that's just a guess.

I'd reboot it (you may not have even had a choice; it may be that it required a reboot), and see if it happens again.

greyhammer
04-06-2005, 07:47 PM
Pardon my ignorance, but what exactly is a cleanup?

Well, yeah - it did require a reboot. And I've logged out using ctrl+d thousands of times before. And there were 2 processes mentioned in the messages file before the ssh login which had to do with the 0ops - one was updatedb, and the other was prelink. But that was 2 hours before the ssh login/logout. After which the machine just froze.

Could it also have been that the system was just plain low on memory, because it runs pretty memory intensive perl scripts? Though I can't see what that would have to do with updatedb and prelink.

bwkaz
04-07-2005, 06:30 PM
Originally posted by greyhammer
Pardon my ignorance, but what exactly is a cleanup?

Whenever a process exits, the kernel has to clean up all the data structures that the process was using (stuff like the kernel's task_struct, all the physical memory that the process allocated, etc., etc.). That was the "cleanup" I was referring to.

And there were 2 processes mentioned in the messages file before the ssh login which had to do with the Oops - one was updatedb, and the other was prelink. But that was 2 hours before the ssh login/logout.

Ah ha... that might be it.

If the kernel oops'ed twice on something those two processes did, then it was probably already in an "unstable" state. And if it was in a state like that, then it wouldn't surprise me if that was what caused the lockup after you hit ctrl-d. (Actually, it was probably another large set of oopses, the last one of which froze the whole kernel up.)

To find out the problem, you'd have to look at the stack trace for the first oops. The function at the top of the trace (kernel 2.6 says something like "EIP is at [function-name]+0x***" for some function-name and ***) is where the problem happened, and you might be able to trace backwards through the stack to find the original error.
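
As a rough aid for locating that first oops, here is a minimal Python sketch that scans syslog-style lines for the messages that open an oops report. The two patterns are assumptions taken from the log excerpt quoted earlier in this thread; other kernel versions may word these lines differently, and the function name `first_oops` is made up for illustration.

```python
import re

# Kernel messages that open an oops report in the excerpt quoted in this
# thread. Assumption: other kernels may phrase these lines differently.
OOPS_START = re.compile(r"kernel: (Unable to handle kernel|Oops:)")


def first_oops(lines):
    """Return (index, line) of the first oops-related line, or None."""
    for index, line in enumerate(lines):
        if OOPS_START.search(line):
            return index, line.rstrip("\n")
    return None


# Sample lines copied from the first post in this thread.
sample = [
    "Apr 6 04:05:14 eto kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000034",
    "Apr 6 04:05:14 eto kernel: printing eip:",
    "Apr 6 04:05:14 eto kernel: c0158f21",
    "Apr 6 04:05:14 eto kernel: Oops: 0000",
]

print(first_oops(sample))
```

On a real machine the same function could be fed an open file, for example `first_oops(open("/var/log/messages"))`, and the lines following the returned index would be the place to look for the trace.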

greyhammer
04-10-2005, 01:46 PM
hello.

so what you're saying is this - i have to run something like gdb on prelink or updatedb (whichever one caused the first 0ops)?

I can find out how to do that, but anything else I should be aware of?

Thanks

bwkaz
04-10-2005, 01:56 PM
I don't think gdb will necessarily help. I think either prelink or updatedb put the system in a state that caused the oops, but it was probably not a bug in either of those programs. After all, many other people use those programs without any problems.

I did notice the other day that kernel 2.6.11.7 fixes a race condition in the JBD (journalled block device) code which is used by ext3, which could cause random oopses. I don't know what prelink does, but updatedb definitely exercises the filesystem driver -- if your kernel has that race in it, maybe updatedb triggered it, and caused the oops. If prelink does a lot of file I/O (and I suspect it does), then it might have done the same thing.

So, which kernel are you running? If it's a distro provided kernel, have you checked for updates from your distro recently? If not (and if it's a 2.6 kernel), you might consider upgrading to 2.6.11.7.

greyhammer
04-10-2005, 04:33 PM
2.4.22-1.2174.nptl #1 Wed Feb 18 16:38:32 EST 2004
is the kernel and version returned by uname -rv

and all my filesystems are ext3, distro Fedora Core 1

one other thing I've noticed is that i get a lot of "lame server resolving ..." entries in the messages file - surely BIND couldn't be causing the crash..? There's no mention of named in the 0ops.

Thanks

bwkaz
04-11-2005, 06:36 PM
(oops, not 0ops -- it's an o, not a zero. ;))

The "lame server" messages are caused by other (broken or misconfigured) DNS servers on the Internet. Basically, it means that the reported IP address is the target of an NS record for the domain you're looking up, but when your machine queried that server, it said "no, I'm not the authoritative server for that domain!".

For example, if you look up an A record for www.google.com, your machine first checks its cache. If it's not in the cache, then it asks its configured name server. If that server doesn't do recursion, then your machine will go out and query one of [a-m].root-servers.net for the NS record for "com.". It will query the resulting server for the NS record for "google.com.", and the server it queries should turn the "authoritative" flag on in its response. It will then query the resulting server for an A record for "www.google.com", and the server should set the "authoritative" flag in its response also.

If the "authoritative" flag is ever unset, then your machine will complain about the lame server.
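
The check described above can be modelled without touching the network. This is only a toy sketch in Python: the server names, the zone, and both tables are made up for illustration, and `answers_authoritatively` stands in for sending a real query and reading the "authoritative" flag of the reply.

```python
# Toy model: the zones each server actually answers authoritatively for.
# All names and data below are hypothetical.
AUTHORITATIVE_FOR = {
    "ns1.example-host.test": {"example.test."},
    "ns2.example-host.test": set(),  # delegated to, but not set up for the zone
}

# What the parent zone claims: the NS records for each delegated zone.
DELEGATIONS = {
    "example.test.": ["ns1.example-host.test", "ns2.example-host.test"],
}


def answers_authoritatively(server, zone):
    """Stand-in for querying the server and checking the authoritative flag."""
    return zone in AUTHORITATIVE_FOR.get(server, set())


def find_lame_servers(zone):
    """Return the delegated servers that do not claim authority for the zone."""
    return [
        server
        for server in DELEGATIONS.get(zone, [])
        if not answers_authoritatively(server, zone)
    ]


print(find_lame_servers("example.test."))
```

Here the second server is listed in the delegation but does not claim the zone, which is the situation named logs as a "lame server".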

So, anyway, your error. That kernel is pretty old (more than a year now). Can you upgrade it at all, either through the Fedora you have or by installing a newer version?

greyhammer
04-12-2005, 12:53 PM
I tell you it's a 0 ;) kernel: Oops: 0000 . At least that's what the /var/log/messages file says. Anyway, digressing slightly, because no one so far has been able to answer my lame server question properly (though I realize it probably doesn't belong in this forum):

My nameserver is known only to a few. On it I'm running qmail. Not too many people would use it as their DNS. Further it's authoritative for my domain, and it even has an

allow-recursion {localhost;}; in the /etc/named.conf.

So why and HOW does it try and resolve domains that I've never sent mail to / been sent mail from - never even known existed?? So why the lame server messages then? What causes them I'd like to know.

And no, the machine's not anywhere near me - it's with a server hosting company in florida - on the other side of the planet. I don't relish the idea of upgrading a kernel via ssh :(

Anyway, keeping my fingers crossed - no crashes for 5 days now. :eek:

bwkaz
04-12-2005, 06:50 PM
Originally posted by greyhammer
kernel: Oops: 0000

OK, it's oops number 0, but the word "oops" still starts with an o. ;)

My nameserver is known only to a few. On it I'm running qmail. Not too many people would use it as their DNS. Further it's authoritative for my domain, and it even has an

allow-recursion {localhost;}; in the /etc/named.conf.

So why and HOW does it try and resolve domains that I've never sent mail to / been sent mail from - never even known existed??

Because something running on that machine (or somewhere else, but considering your allow-recursion setting, that shouldn't happen) is looking up those domains. Can you post any of them?

What causes them I'd like to know.

As above: An upstream DNS server is pointing to the offending DNS server saying it's the name server for a certain domain, but the offending DNS server is claiming that it isn't.

Googling for "lame server" also shows some interesting explanations, such as this one (the actual site is down, so I'm linking to the Google cache of the page):

http://64.233.167.104/search?q=cache:2nnkG-nqqjIJ:www.unixguide.net/sun/faq/5.61.shtml+lame+server&hl=en&lr=lang_en

greyhammer
04-13-2005, 01:48 PM
Sure - these are from /var/log/messages. And one thing about them is that a few (I'm not sure about all) of the in-addr.arpa entries are reverse DNS entries for some of the people connecting to the server to participate in online surveys (basically connecting to the apache server, some perl scripts running). Why would apache do nameserver lookups??

Apr 13 04:02:09 bto named[10537]: lame server resolving '33.67.88.199.in-addr.arpa' (in '67.88.199.in-addr.arpa'?): 128.121.101.12#53

Apr 13 04:02:10 bto named[10537]: lame server resolving '7.0/25.148.149.12.in-addr.arpa' (in '0/25.148.149.12.in-addr.arpa'?): 204.70.49.234#53

Apr 13 04:02:10 bto named[10537]: lame server resolving '7.0/25.148.149.12.in-addr.arpa' (in '0/25.148.149.12.in-addr.arpa'?): 208.134.245.2#53

Apr 13 04:02:11 bto named[10537]: lame server resolving '16.0.137.168.in-addr.arpa' (in '137.168.in-addr.arpa'?): 65.201.16.4#53

greyhammer
04-13-2005, 02:13 PM
well, I got it - that's webalizer - analyzing where people are coming from by doing reverse dns lookups (in-addr.arpa). Hmm. So named isn't the culprit for the oops ;) . Thanks bwkaz - I stand much enlightened as far as the workings of lame servers go now!
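
To see which visitor address each of those log entries refers to, the in-addr.arpa name can be turned back into an IP address. A minimal Python sketch, assuming the log line layout shown in the entries posted above; the function names are made up for illustration, and names that are not four plain numeric labels (such as the '0/25' classless-delegation form above) are deliberately left unconverted.

```python
import re

# Layout assumed from the named log lines posted in this thread.
LAME_LINE = re.compile(
    r"lame server resolving '(?P<name>[^']+)' "
    r"\(in '(?P<zone>[^']+)'\?\):\s*(?P<server>[\d.]+)#(?P<port>\d+)"
)


def reverse_name_to_ip(name):
    """Turn 'd.c.b.a.in-addr.arpa' into 'a.b.c.d'; None for other shapes."""
    suffix = ".in-addr.arpa"
    if not name.endswith(suffix):
        return None
    labels = name[: -len(suffix)].split(".")
    if len(labels) != 4:
        return None
    if not all(label.isdecimal() and int(label) <= 255 for label in labels):
        return None
    return ".".join(reversed(labels))


def parse_lame_line(line):
    """Extract the queried name, zone, lame server, and client IP if any."""
    match = LAME_LINE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["client_ip"] = reverse_name_to_ip(info["name"])
    return info


line = (
    "Apr 13 04:02:09 bto named[10537]: lame server resolving "
    "'33.67.88.199.in-addr.arpa' (in '67.88.199.in-addr.arpa'?): 128.121.101.12#53"
)
print(parse_lame_line(line))
```

The `server` field is the address of the DNS server that gave the non-authoritative answer, while `client_ip` is the address whose reverse lookup was being attempted.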

I was looking at top - could apache be crashing the system for too many http connections/perl scripts running?? I've seen load averages above 1.0 sometimes. And what's scared me from top's output, is the CPU% field and the CPU idle % field. The CPU idle % sometimes goes even below 30%. Isn't that a sure indicator of trouble??

Thanks

bwkaz
04-13-2005, 07:23 PM
Originally posted by greyhammer
I was looking at top - could apache be crashing the system for too many http connections/perl scripts running??

Possibly. If your kernel is susceptible to the race condition I was talking about, then the high Apache load might have been the second-level cause. If the high Apache load created a high filesystem load, then the race might have ended the wrong way, and the kernel would then have oopsed.

I've seen load averages above 1.0 sometimes.

Well, while that isn't good from the perspective of someone that would want to use their CPU to do the most work possible, it isn't necessarily bad either.

The load average is (roughly) the average number of processes that are able to run at any time. To get the most work out of your CPU, you want this number to be at the number of CPUs you have installed (because one process can be running on each CPU at a time). If you had a quad-processor machine, you would want the load average to be near 4. Unless, of course, you don't mind under-utilizing your CPU (most desktop users don't) -- then, the lower the load average, the cooler the CPU will be (because it'll be in a power-saving idle state the rest of the time).
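
To put a number on that, the load average can be divided by the CPU count, so that a value near 1.0 means roughly one runnable process per CPU. A small Python sketch, assuming a Unix-like system where `os.getloadavg()` is available; the helper name `load_per_cpu` is made up for illustration.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Scale a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


# os.getloadavg() exists only on Unix-like systems and can raise OSError
# if the load average cannot be obtained.
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
    except OSError:
        print("load average unavailable")
    else:
        cpus = os.cpu_count() or 1
        print(f"1-minute load per CPU: {load_per_cpu(one_min, cpus):.2f}")
```

On the quad-processor example above, a load average of 4 would come out as 1.0 per CPU.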

The CPU idle % sometimes goes even below 30%. Isn't that a sure indicator of trouble??

Not really. If you have 100% CPU usage, the only way that would hurt is if your hardware is flaky to begin with (or isn't cooled well, or something). This is because the high CPU usage will cause the CPU's core temperature to rise a bit, which might cause a malfunction.

It's much more likely, however, that you've actually stumbled across a bug somewhere in the kernel. It may be the race condition above, or it may be something else. Or it may be low quality hardware. (However, I don't think so, because the server hosting company uses reputable hardware, right? ;))

greyhammer
04-16-2005, 03:59 AM
bwkaz - lightning strikes again. The system oopsed. Had to be restarted. At this rate, I'll be fired very soon. :(

/var/log/messages said the same thing - prelink and updatedb oopsed, and then this time, kswapd was also a process which oopsed.

And then it said something about a KERNEL BUG in dcache.c:350

Log watch provides a few pointers though (I've put it in bold)...

/etc/cron.daily/00-logwatch:

Can't open: /etc/log.d/scripts/services/qmail at /etc/cron.daily/00-logwatch line 646.
/etc/cron.daily/00webalizer:

Error: Skipping oversized log record
Error: Skipping oversized log record
/etc/cron.daily/autodld:

Found no new rpms at mirror.hostcentric.net.
Checking selected rpms.
Upgrading rpms:
Failed to upgrade: spamassassin-2.63-0.2.i386.rpm
Would need: perl-Net-DNS spamassassin-tools
RPM Output:
warning: /var/spool/autoupdate/spamassassin-2.63-0.2.i386.rpm: V3 DSA signature: NOKEY, key ID 4f2a6fd2
error: Failed dependencies:
perl-Net-DNS is needed by spamassassin-2.63-0.2
perl(Time::HiRes) is needed by spamassassin-2.63-0.2
perl-Mail-SpamAssassin = 2.61-2 is needed by (installed) spamassassin-tools-2.61-2
/etc/cron.daily/prelink:

/etc/cron.daily/prelink: line 36: 20737 Segmentation fault /usr/sbin/prelink -av $PRELINK_OPTS >>/var/log/prelink.log 2>&1
/etc/cron.daily/slocate.cron:

/etc/cron.daily/slocate.cron: line 3: 20775 Segmentation fault /usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net"

what if I removed those cron jobs? or, could this just be due to bad RAM on the server??

At a complete loss here :mad:

bwkaz
04-16-2005, 10:00 AM
Originally posted by greyhammer
what if I removed those cron jobs?

That might be a good idea. It would depend on whether you need an up-to-date database for locate or slocate -- you probably don't, unless you use locate or slocate a lot, and also add and remove files a lot. It would also depend on whether you need prelinked libraries -- you almost assuredly don't. Prelinking is nothing more than a performance hack; the system will work fine without doing it.

I have absolutely no idea why prelink is running as a cron job to begin with, though -- I believe you only have to prelink libraries once. Once you've done that, it stays in effect unless you recompile the library (or reinstall it, using the package manager, if it doesn't install the prelinked version). Maybe it's running to catch libraries that were (re-)installed later? Seems like a dumb idea to me... but that's just me.

or, could this just be due to bad RAM on the server??

Could be, maybe. It could also be a problem somewhere else in the path between the hard drive and the CPU -- the drive's cache, the IDE controller, the PCI bus, the RAM, the processor's L2 or L1 caches, or the processor itself (though the last two aren't likely -- if the cache or the processor was bad, you'd probably be getting oopses in other places too, not just when you do a lot of hard drive activity).

XiaoKJ
04-16-2005, 02:58 PM
IIRC, prelink, updatedb and kswapd tend to hit the hard discs very heavily. That hints to me that your hard discs are flaky. The kswapd oops may be because you are running very memory-intensive tasks.

I really suggest you remove them right away, then run them separately. If they work separately, then it means that they are conflicting -- running too many intensive tasks together is bound to give you headaches.