root file system problem (HD?)


jjordan
05-05-2003, 11:34 AM
I've had a problem for about 3 weeks now. My system would just reboot for no apparent reason at random times.

I was able to let it run in the CMOS diag screen for several days without it rebooting, but whenever I would let it boot Linux, at some point (2 minutes to 2 days), it would do a hard reboot.

Temps are ok, no shutdown alarms set in CMOS, fans are clean and working properly, reseated all boards and connectors.

Well, last night, when it rebooted on me, it came up with the message that the system had been shut down improperly and asked whether I would like it to do a system integrity check. This time I answered 'yes'. It proceeded to test the root file system and when it got to 30.8%, it froze for about 3 seconds and rebooted. I repeated this 5 times. Every time it would get to 30.8% checking the root file system and then do a hard reboot.

I have searched with Google and several forums including this one, and if the answer is out there, I sure can't seem to find it. Before I go in and try things on my own, I would like to hear from the experts as to what you think I should do.

1. What is going on?
2. What do I do at this point to correct the problem?

If you don't know the answer, do you know where I might find it, or at least where I could ask these questions again?

System: Pentium III, 800 MHz, RH7.3, Matrox 60GB system disk, no windoze stuff whatsoever. The Matrox HD is about 4 months old.

Thanks

madcompnerd
05-05-2003, 11:00 PM
Possibly boot some type of small Linux, like LOAF or alinux or even Knoppix, and run 'fsck' on the drive.
I'm guessing it's ext3; knowing that may help someone who is more of an expert troubleshoot this for ya.

As for your random power downs...... Odd. I had a computer that used to do it, but it was a failing motherboard.

jjordan
05-06-2003, 01:20 PM
It turned out to be a memory problem. I replaced my memory and all the problems disappeared.

So, the next time someone gets strange symptoms like these, check the memory.

madcompnerd
05-06-2003, 02:59 PM
They weren't hard reboots then? The motors didn't all stop then start again? It was more like the OS had rebooted it?

jjordan
05-06-2003, 06:20 PM
I'm sorry, maybe I don't understand what the difference between a hard reboot and a 'soft' reboot is.

If all the motors and power shut off, can a computer really reboot itself from that state?

In my situation, all power seemed to be interrupted for a very brief time, because all the lights went out and the fan noise dropped for a split second before returning to normal, and then the machine did a complete POST from scratch.

I always thought that is what is called a hard reboot, but I'd be curious to hear your definition, or what I should have called it.

I'm never too old to learn...or relearn.

madcompnerd
05-06-2003, 09:37 PM
Hmm, I'm not sure. Well, a hard reboot is when you hit the restart button, and soft is when your software reboots it.
I'm not sure how the software reboots though.
Anybody? I can only guess that there is a BIOS function to do it.

bwkaz
05-07-2003, 09:24 AM
I think it's classified as a soft reboot when (from DOS) you hit ctrl-alt-delete, too. In that case, no software runs to do the reboot -- ctrl-alt-delete just makes the CPU jump to its bootup address (which maps to somewhere in the motherboard's boot PROM).

There are APM functions to power down, but I don't know about rebooting. I think that the reboot function in the Linux kernel just does what ctrl-alt-delete did under DOS, but under software control -- it manually jmps to that address. That reinitializes all the hardware, reads sector 1 off the disk (the MBR), executes the code in it, then goes from there to eventually reload the Linux kernel.

jjordan
05-07-2003, 09:56 AM
After reading everything I can find on the subject, I think bwkaz is close.

A soft reboot is when some software routine actually makes a call to the OS to do the reboot. There are two different types of 'soft' reboots.

One is where a lot of settings are still retained: memory locations still contain information and pointers, HD head positions are known, and the reboot process can still access the HD cache.

The other 'soft' reboot is simply the OS calling the BIOS 'reboot' code. It is up to THAT code as to what it actually does.

A hard reboot, on the other hand, is not under software control. For example, pressing the 'reset' button shorts a certain pin of the CPU to ground, effectively wiping out everything and forcing the BIOS to start the entire boot process from scratch.

Since the memory in my computer was acting up, I've got to believe that there was no software involved in actually making the computer reboot, since things were probably corrupt. But, not knowing the innermost workings of Linux ...yet..., I don't know what it does when it encounters memory locations that it can't read or that don't contain what it expects. I don't have ECC memory, so there is no error detection or correction in my memory (Intel PC based).

So, until I find out what Linux does in those situations, I can't say for sure if it actually was a 'hard' reboot. ;-)

jjordan
05-13-2003, 12:17 PM
I'm back.

Since replacing the memory a week ago, the system has not rebooted once...until 2 days ago.

It rebooted all by itself 8 times in one evening.

Thinking that I may have gotten bad memory, I created a floppy with memtest86 on it, booted from the floppy and ran it.

I let it go through 2 complete cycles without any errors. I then rebooted into Linux. It rebooted at the logon screen. I ran the test again. No problems. Booted Linux. It rebooted after I had been logged in for 2 minutes.

The memory test has now been running for over 36 hours without a single problem.

I plan on letting it run at least another 24 hours.

Now my question is:

What in Linux (RH7.3) would cause it to reboot? I am not running any extra background programs, just X-windows.

A second question: can anything in Linux (the OS) or any program cause the system to reboot? I assume it would have to have 'root' privileges.

If I can run memtest86 for at least 48 hours without any problems, this tells me that it is not the memory, power supply, motherboard, or anything else hardware related.

In my mind, it MUST be software related!

Here's the only other strange thing.

If I manually reboot the system or even shut it down and power back up right away, then go into the CMOS screen and check the CPU temperature, it shows that it's within a degree or two of the normal operating temperature, which is 95 degrees F.

When the system reboots by itself, and I go into the CMOS screen and check the temp, it shows 80 degrees F, then slowly climbs to 95 over the next 8 minutes.

The system has never rebooted when Linux is not running, i.e. when I have the CMOS screen up (24 hours), when I boot from a rescue disk, or when I boot from a floppy and test memory.

This is what leads me to believe that it's actually something in Linux that is causing the reboot problem.

I'm willing to listen to ANY ideas at this stage.

I have reseated all boards and connectors. Cleaned all contacts, etc.

I've even gone so far as to remove all unnecessary equipment such as extra disks, extra cards, such as ethernet, sound, modem, etc. It still reboots.

There are no line spikes. I'm running off a UPS that is off-line (on batteries) and it still rebooted.

I've checked all log files. Nothing. No program is running that shouldn't be.

Help!

retoon
05-13-2003, 12:29 PM
My friend, these intermittent errors are probably the result of a dying power supply. Try going barebones and seeing if it randomly reboots that way: no hard drive attached, no peripherals, only RAM, CPU, video card, and keyboard. Leave the machine on for a little while; if it reboots out of nowhere again, try a new power supply. Also check the front panel connectors and see if your restart jumper is loose. My recommendation for a power supply is Antec. They give you a three-year warranty on their TruePower series, and they are like bricks (rule of thumb: if the PSU has some weight to it, it's good to go). :D Good luck, and I hope I helped!

retoon
05-13-2003, 12:36 PM
Oh, and this might sound strange, but check out your video card settings. The only reason I say this is because you are using it to run X. Try running Linux from the command line (no X) and see if it still reboots.

jjordan
05-13-2003, 01:52 PM
retoon,

I'm running a Matrox G400 and it's set to a very basic 1024x768 mode.

I'm not sure I understand your two replies.

So far, no matter how long I leave the machine on, either showing the CMOS screen or the memtest86 program running from a bootable floppy image, it doesn't reboot. At least it hasn't in over 70 hours. I plan on leaving it on, running the memtest86 program awhile longer, just to make sure.

Wouldn't this rule out what you suggested doing? I can run it fully loaded, with all the drives and cards running, without it rebooting, as long as I don't load the RH7.3 Linux. I have not recompiled the kernel since installing it in December.

This problem started about 4 weeks ago and seems to be rather random except for the sessions where it reboots 6 to 10 times very close together.

BTW, in the boot log, the only thing that shows up is one line that says 'restart' in it.

Does this mean that something (software or hardware) is telling the OS to reboot?
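One rough way to check this from the log itself, as a sketch only: if the OS shuts down in an orderly way, syslogd normally writes a shutdown line before it writes its 'restart' line on the next boot, so a 'restart' line with no shutdown line before it would suggest the machine went down without the OS being involved. The marker strings below ('restart' and 'exiting on signal 15') and the function name classify_boots are assumptions, not something confirmed for this RH7.3 install; compare them with your own log before trusting the result.

```python
# Sketch: label each boot in a syslog-style log as clean or unclean.
# ASSUMPTION: syslogd writes a line containing "restart" at each boot and
# a line containing "exiting on signal 15" during an orderly shutdown.
# Adjust both markers to match what your own log actually contains.
RESTART_MARK = "restart"
SHUTDOWN_MARK = "exiting on signal 15"


def classify_boots(lines):
    """Return a list of (restart_line, was_clean) tuples, one per boot.

    was_clean is True only if a shutdown marker was seen since the
    previous restart line. The first boot in the log is reported as
    unclean unless a shutdown line precedes it.
    """
    results = []
    saw_shutdown = False
    for line in lines:
        text = line.rstrip("\n")
        if SHUTDOWN_MARK in text:
            saw_shutdown = True
        elif "syslogd" in text and RESTART_MARK in text:
            results.append((text, saw_shutdown))
            saw_shutdown = False  # start watching for the next shutdown
    return results
```

To try it, pass it the lines of the log, for example the open file object for /var/log/messages (that path is also an assumption), and print each tuple. Boots flagged as unclean are the ones worth lining up against the times the machine rebooted by itself.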

StarKnight83
05-13-2003, 02:05 PM
I have a similar problem with my computer, but mine is caused by poor wiring. During times of high power demand, almost any additional load will cause my comp to either shut down or reset. It's either from a spike or a low-power reset, not sure which, and I can't afford a UPS :( So I have to accept this behavior until the end of the school year, which by the way is almost here, YEY. If I were you, I'd see if you can borrow a UPS or something like that from a friend and see if that fixes your problem. Hope this helps.

Joe

kly546
05-13-2003, 02:37 PM
retoon is right. You most likely have a power supply problem and should try a different one. It shouldn't matter that there are no resets in CMOS or the memtest program, since they probably don't put the load on the CPU, memory, and hard drive that an operating system would.

jjordan
05-13-2003, 03:01 PM
OK, power supplies aren't very expensive, so I will try a new one and let you know what happens. (hope, hope, hope)

I have to go buy one since I don't have them lying around.

BTW, I was under the impression that running memtest86 puts the CPU and memory under quite a load. The CPU pegs at near 100% and the memory sure is getting a workout. Can you explain your comment that memtest doesn't load the system as much as the OS does, when the computer is often just sitting there, not doing anything, at the moment it reboots?

I'm still open to other possibilities though.

jjordan
05-14-2003, 03:55 PM
OK guys, I put a new power supply in and it didn't help. It still reboots. It even rebooted while it was almost done loading Linux. Not sure at exactly what point.

I exercised and tested my disks with the IBM DFT320 program. No problems found so far. It is still running.

Put your thinking caps on because this one seems to have most people stumped.

I just don't get it.

1. I can let it run a memory tester for 60 hours without a problem.

2. I can run a disk tester/exerciser/fitness program (one run of 1 hour, and one currently running for over 4 hours) without a single problem. This also tests all functions of the interface, etc.

3. I can let the computer sit on the CMOS screen for 72 hours without it rebooting.

4. I've replaced memory and power supply. Still reboots.

Today was the first time that it rebooted DURING the Linux boot process. It got far enough along that it hasn't changed my mind about it being a Linux problem and not a hardware problem. It hadn't yet gotten to loading CUPS or starting xinetd when it rebooted.

Does anyone know how I can do something like re-install the 'basic' RH7.3 package without it affecting most of my programs and data?

In other words, how can I eliminate what I think it is, namely, something in Linux itself?

I still can't believe that it's hardware related since I believe that I'm exercising and have tested all the hardware that the system would normally be using when it reboots.

Most of the time it has rebooted when NOT doing anything, such as reading a web page or at the logon screen, etc.

The technician at the computer store where I bought the power supply said that he thinks it's the MB and not the memory or power supply, when I explained the problem to him.

How can it be? When I'm running all these tests, I'm using everything that the system would normally use.

Is it possible that there is a boot sector virus or some other type of virus that's causing this? I'm behind a firewall that I maintain.

OK, any other ideas, to where I don't have to replace my complete computer piece by piece?

redhat81
05-14-2003, 04:00 PM
Originally posted by jjordan

OK, any other ideas, to where I don't have to replace my complete computer piece by piece?

I think you've reached that last resort already.