I've got a weird issue with one of my test servers. It has happened twice now. Originally, I thought it was related to my UPS battery going bad (and it probably was then), but last night, completely unplugged from the UPS (as I haven't ordered a new battery yet), my test server went down around 1:00 AM.
So far, I have no idea why. I've checked syslog and kern.log (along with the Apache logs), but nothing indicates a direct failure that would cause a system halt.
I've just installed munin and munin-node to create system graphs (described here). What else should I be checking? The CPU and motherboard temperatures look good on reboot, and I've never seen them run hot even under major load.
In my experience, it's usually overheating. I had similar issues a few years back on a tower server; replacing the thermal compound solved it for me.
Alright, I'll have to give that a try. I'm hoping munin will point me in that direction, since it keeps a nice graph of the HD and MB temps, so the next time it occurs I can hopefully verify whether the heat was excessive.
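In the meantime, a quick spot-check doesn't have to wait for the next munin graph. A minimal sketch, assuming `lm-sensors` may or may not be installed (tool names and sysfs paths vary by distro and kernel):

```shell
#!/bin/sh
# Spot-check temperatures: prefer lm-sensors if present, otherwise
# fall back to the kernel's thermal zones (values in millidegrees C).
check_temps() {
    if command -v sensors >/dev/null 2>&1; then
        sensors
    else
        for zone in /sys/class/thermal/thermal_zone*/temp; do
            [ -r "$zone" ] || continue
            printf '%s: %s\n' "$zone" "$(cat "$zone")"
        done
    fi
    echo "temp check complete"
}

check_temps
```

Running it from cron every few minutes and appending to a log gives a crude temperature history even before munin has data.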
If it turns out not to be heat related, it might be worth running a memory integrity test as well.
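For the record, memtest86+ run from boot media is the proper way to do this, since it can exercise memory the running OS would otherwise be using. Purely as an online illustration, a pattern can be written to tmpfs (RAM-backed on most Linux systems) and verified; the path and size here are assumptions:

```shell
#!/bin/sh
# Crude online RAM spot-check: write a 0xAA pattern (octal 252) to a
# file on /dev/shm (tmpfs, i.e. RAM-backed) and verify every byte
# survives the round trip. This touches only a small slice of memory;
# it is an illustration, not a substitute for memtest86+.
ram_spot_check() {
    f=/dev/shm/memcheck.$$
    # 16 MiB of 0xAA bytes
    dd if=/dev/zero bs=1M count=16 2>/dev/null | tr '\0' '\252' > "$f"
    # Deleting every 0xAA byte should leave nothing behind.
    leftover=$(tr -d '\252' < "$f" | wc -c)
    rm -f "$f"
    if [ "$leftover" -eq 0 ]; then
        echo "pattern intact"
    else
        echo "pattern corrupted"
    fi
}

ram_spot_check
```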
In case the heat is degrading the memory? Or...
The heat isn't likely degrading the memory; if it's memory, then it's a bad stick of RAM. Did you add extra hard drives or any other power-hungry equipment to this machine?
Nope, quite the opposite, actually: I removed several when I moved the data to a NAS that is mounted via smbfs.
smbfs could actually be your problem -- SMB implementations are uneven, especially on cheap NAS boxes. I'd prefer iSCSI, or at least NFS.
Unfortunately, the NAS I am using doesn't properly support NFS without a LOT of tinkering (bootstrapping it, installing it, etc.). I'm not motivated enough to go through that process just yet, and I have other Linux machines with the same smbfs mount that are not having a reboot issue.
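One low-effort mitigation while staying on SMB: mount options can at least keep a flaky share from wedging the client. A hypothetical /etc/fstab line (the server name, share, mount point, and credentials file are all placeholders; on modern kernels the filesystem type is cifs rather than smbfs):

```
//nas/backup  /mnt/nas  cifs  credentials=/etc/nas-cred,soft,_netdev  0  0
```

`soft` lets a stalled request eventually fail instead of blocking forever, and `_netdev` delays the mount until networking is up.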
If it is not a power issue, one other thing to check is the capacitors on the motherboard. If they are bulging rather than flat, it could be a capacitor issue. I've seen this on a number of older servers/workstations.
Something I'll keep in mind. The motherboard is probably 7 years old. It does get a lot of load every day between specific hours (load tests and major processing are done to produce static data -- enough that all 4 cores sit at 90-98% for several hours). I rarely reboot this machine; it usually runs for months until a patch requires a reboot, or the power goes out (since my UPS is still without a battery), so it is very obvious when it goes down.
It has only gone down on its own twice in the past 6-8 months. I've restarted it maybe once during that time frame.
A board capacitor issue will happen far more frequently than twice in months; it will take a machine down within half a day of work, even less. This doesn't sound like the issue. If it were, you would need to replace the motherboard at that point.
Have you looked at any of the persistent services on the machine to see if they run routines at intervals that coincide with the crashes? Anything in the logs?
It still smells like a hardware problem, so have you run any hardware diagnostics yet to see if anything comes up?
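Two quick checks help correlate crashes with scheduled work: `last -x` shows the reboot/shutdown history (a reboot with no matching clean shutdown entry means the machine went down hard), and the system crontabs show what was queued around the crash hour. A sketch, guarded so it degrades gracefully where a tool or file is missing:

```shell
#!/bin/sh
# Correlate unexplained halts with scheduled jobs.
crash_correlate() {
    # Reboot/shutdown history: a reboot without a preceding shutdown
    # record means the halt was not clean.
    if command -v last >/dev/null 2>&1; then
        last -x reboot shutdown 2>/dev/null | head -n 20
    fi
    # System cron entries scheduled in the midnight-to-1AM window
    # (first field is the minute, second field the hour).
    awk '$1 ~ /^[0-9*]/ && ($2 == 0 || $2 == 1)' \
        /etc/crontab /etc/cron.d/* 2>/dev/null
    echo "correlation sweep done"
}

crash_correlate
```

If a cron entry lines up with the 1:00 AM halt, that job is the first suspect.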
Nothing in the logs is indicative of the problem yet (including the Apache logs and the logs of the heavily loaded services). I gave up looking at the logs because none of them showed any problems around the time I suspect the server halted, other than that their last entry was just before the halt and the next entry was the startup.
Hardware seems to be good. There is only one primary hard drive that the operating system boots from (replaced earlier this year, February I think, because the prior one was failing); then it mounts a couple of USB drives for backup purposes, and an smbfs network share from a NAS (for both backups and reading a few specific files).
Memory looks to be good. I usually have a screen session with htop running so I can quickly check vitals and memory usage. Swap usage is always low, the network is always available, there's no GUI so no video to deal with, CPU and MB temps are always decent when I look (hoping munin will show otherwise the next time it occurs, though!), and there are no funny noises (so nothing sounds like it is failing or being overworked).
I'm not 100% sure smbfs would be the culprit, because that was a recent conversion; the prior failure a few months ago would have had nothing to do with it.
I do know one of the USB drives unmounts or goes unresponsive fairly regularly (I need to replace it but just haven't had the time), so I have an automated script that runs every hour to check whether it is unresponsive and, if so, remount it. Originally I thought that was the issue several months ago, but removing the script from cron didn't resolve it; it happened again within days (I believe that was a hardware issue, and I replaced a few drives), but it hasn't happened again until last week.
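For what it's worth, a watchdog like that can itself pile up blocked processes against a wedged USB device unless each probe is bounded, so a timeout guard is worth having. A minimal sketch of the idea (the mount point, timeout, and reliance on an /etc/fstab entry are placeholder assumptions, not the actual script):

```shell
#!/bin/sh
# Hourly USB remount watchdog, e.g. from cron:
#   0 * * * * /usr/local/sbin/usb-watchdog.sh
MNT=/mnt/usb-backup   # placeholder mount point

check_and_remount() {
    # A wedged USB drive usually makes any access block forever;
    # timeout keeps successive watchdog runs from stacking up.
    if ! timeout 10 ls "$MNT" >/dev/null 2>&1; then
        umount -l "$MNT" 2>/dev/null   # lazy unmount in case it is wedged
        mount "$MNT" 2>/dev/null       # relies on an /etc/fstab entry
    fi
    echo "watchdog pass complete"
}

check_and_remount
```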
I'm just at a loss, because 1) it happens only on rare occasions, and 2) none of the logs indicate why it happened or whether a halt was forcibly issued.
Granted, this could have been a power surge (small enough to kill my server without affecting any other appliances), but I have several servers along this wall, all without a UPS right now, and only this one went down. Not impossible, but it seems improbable as the cause.
I might never know exactly what happened; I was just looking for other indicators everyone else might consider that I may have overlooked, and I'm getting plenty of suggestions. Thanks, everyone. Hat tip to @PromptSpace for the suggestion on overheating -- that got me to find munin (awesome utility!)
Glad to help