Webserver goes down every few days, want to learn how to find the cause

odoisc · August 19, 2013, 12:32am

Two weeks ago I started renting a new dedicated server, but it has gone ‘down’ every few days. Only an automatic hardware reset (through the site of my hosting company) brings the server back to normal. Sometimes it takes me a while to notice the downtime (once it went down 5 minutes after I went to sleep), so those downtimes are costing me money. I don’t expect anyone to just point out the problem to me, but I would very much like to learn how to find the cause of such a problem and how I can solve it.

Here is some information I gathered :

hardware : Intel® Core i7-4770 quad-core processor (Haswell architecture), 32 GB DDR3 RAM, two 2 TB SATA 6 Gb/s drives
software : Debian 3.2.46-1 x86_64 , Apache/2.2.22 , Mysql 5.5.31-0+wheezy1 , PHP Version 5.4.4-14+deb7u2
The only website on this server used to be hosted on another dedicated server, which hosted just one other website. The two of them together were sometimes a bit slower than they should be, and if something was slow it was hard to tell which of the databases might have a problem, but in the end that one server could handle those 2 websites well. The other website is still running fine now that it has the whole server to itself again, and the new server is doing fine too outside of the downtimes (almost no CPU or memory usage in top).
The website on this server consists of some simple and short PHP files, about 50,000 small static JPEG images (cached by the Apache expires mod) and a small MySQL database with 2 active tables (25,000 rows and 2,000 rows, indexes where needed). Websites like this are not likely to cause load problems.
After the downtimes, I checked several log files in /var/log/. I could usually locate the time of the hardware reset, but in the last few hundred lines before the reset I could not find anything that pointed to a problem. Then again, I’m more of a website developer than a server expert. I don’t really know what to look for, or where to look for it.
A few days ago I wrote a PHP script, run from crontab, that logs the load to a txt file every minute. After the last crash I checked it, and the last line before the downtime was ‘0.16 , 0.11 , 0.13’. The next entry was 14 minutes later, which was the first one after the reboot I had initiated just before. The load in the 10 minutes before that last line was mostly even lower than 0.16. Under normal circumstances, even with heavy traffic, the load is never high. Sometimes it makes little jumps above 0.60, but that’s all.
There seems to be no correlation between the timing of the crashes and the traffic on my website. Sometimes the server stops at peak times, sometimes at quiet times. Based on my experience, it really should be able to handle all the traffic it gets.
I noticed something odd while viewing Google real-time analytics before and during the downtimes. The graph of the usual traffic of 30-50 page hits per minute obviously goes flat when the server goes down, but not completely. Even hours after the start of a downtime, there are moments where it still registers 1 or 2 page hits per minute. I’m not always 100% sure, but I think I can find those visits in the Apache logs too, and they look like regular visits to my site. So whatever goes wrong, a few visitors sometimes randomly get through. My site is not local or timezone-related; there is usually almost equal traffic at all hours. Apache apparently still works occasionally during the downtime. I’m not sure anything else is still working, but I can’t find MySQL errors in the logs that can be matched to those few successful visits. The Google Analytics code is in index.php, and if PHP doesn’t work it cannot load the analytics JavaScript code, so I guess PHP still works at those moments too (but not when my every-minute cron job tries to write the load to a txt file). When I try to connect through FTP during the downtime it gives me ‘connection timed out’, and when I try to connect with SSH it gives me a similar ‘network connection lost’. I saw in the syslog file that even cron jobs are no longer started.
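The per-minute load logging described above could be sketched roughly as follows, here in Python rather than the original PHP script (which is not shown in this thread). This is only a sketch under assumptions: the function names and the log path in the usage note are hypothetical. The one deliberate addition is the `os.fsync` call, so that the last line written before a hang has a better chance of actually being on disk.

```python
import os
import time


def format_load_line(load, now=None):
    """Return one log line: a timestamp plus the 1, 5 and 15 minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s %.2f , %.2f , %.2f\n" % ((stamp,) + tuple(load))


def append_load(path):
    """Append the current load averages to `path` and force the write to disk."""
    line = format_load_line(os.getloadavg())
    with open(path, "a") as handle:
        handle.write(line)
        handle.flush()
        # Ask the OS to write the data out now, so the newest entry
        # is less likely to be lost if the machine hangs right after.
        os.fsync(handle.fileno())
    return line
```

A small script that calls `append_load("/var/log/load-heartbeat.log")` (a made-up path) could be run from cron once a minute. The gap between the last timestamp and the first one after the reboot then marks the window of the hang.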

I used to have servers before that went down during heavy traffic. But that was different, the sites on it just load slow and the server load was dangerously high (10 to 200+). Now there is no sign of high load or heavy traffic. What I see is that most services decide to stop working for 99% of the time without an apparent reason.

So what can I do to find the cause of those downtimes? What should I start logging from now on, and which existing log files could I check? Could it be a hardware problem that I should mention to my hosting company?

SpacePhoenix · August 19, 2013, 1:18am

If you have the access rights on the server, search for the Apache error log (on my local testing setup - WAMP - the error log file is called apache_error.log). Also check MySQL’s error log. With PHP set both websites to log errors to a file (never display php errors on a live site) and crank up the error reporting level to maximum (so that it reports absolutely everything).

Is it always the exact same time of the day when it goes down (any pattern)?

odoisc · August 19, 2013, 2:07am

I have root access, and went through the Apache error log again. The file is huge (165 MB for one day) and is hard to open. However, about 98% of it consists of the same ‘PHP Notice: Undefined offset:’ and ‘PHP Notice: Undefined variable’ over and over again. So it already reports everything, but a bit too much to be workable. I have now turned off the logging of notices in php.ini. Between all those notices, it’s not easy to find a single line that is different and suspicious, but I’m not finding anything other than notices and non-existent files. For example, there are no errors about an ‘execution timeout’, which could be caused by a runaway loop and which is able to bring a webserver to a halt.

There is also ‘other_vhosts_access’, containing all requests for pages and images. That too is a huge file (135 MB) and again it’s hard to find a single line that might be wrong. It’s just visitor after visitor, with some spider bots in between. If there had been a DDoS attack, it would surely have caused thousands of different and noticeable lines, but nothing like that is there.
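One way to make a log of that size workable is to stream it line by line, drop the known notices, and count what is left. The following is a minimal sketch, assuming the Apache 2.2 error log layout of bracketed `[date] [level] [client ...]` prefixes. The noise pattern and function names are illustrative, not something taken from this server.

```python
import re
from collections import Counter

# The repetitive notices mentioned in the post; extend the pattern as needed.
NOISE = re.compile(r"PHP Notice:\s+Undefined (offset|variable|index)")

# Leading bracketed fields such as "[date] [error] [client 1.2.3.4] ".
PREFIX = re.compile(r"^(\[[^\]]*\]\s*)+")


def unusual_lines(lines):
    """Yield only the lines that are not one of the known PHP notices."""
    for line in lines:
        if not NOISE.search(line):
            yield line.rstrip("\n")


def summarise(lines, top=20):
    """Group the remaining lines by message text and return the most common ones."""
    counts = Counter()
    for line in unusual_lines(lines):
        counts[PREFIX.sub("", line)] += 1
    return counts.most_common(top)
```

Because it reads the file as a stream, it never has to hold the 165 MB in memory. Usage might look like `with open("/var/log/apache2/error.log", errors="replace") as f: print(summarise(f))`, where the path is the usual Debian default and may differ on this server.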

No pattern that I can think of. The last 5 downtimes were all at random moments throughout the day or night. Sometimes there were 4 days in between, sometimes 3 hours. For a while I thought it was caused by a crontab PHP script that writes an XML sitemap file (5 MB) every few days, but I disabled that and the next downtime still came. I have played around with MySQL cache settings and Apache process settings too, but that has not stopped the crashes.

ServerStorm · August 23, 2013, 5:48pm

If you are getting a frequent error that may not properly dispose of memory, such as a reference to an array index that does not exist, it could consume most of Apache’s allocated memory and crash the server. Have you located and fixed the ‘undefined index’ error to see whether the server still crashes?

One more thing to look at is whether any backups or maintenance processes run during the night. I’ve seen backup software take down Apache before. I’ve also seen cron scripts cause this kind of error, where they repeatedly call a PHP process but never release the memory they use, or have trouble running because of concurrent connections. You might want to temporarily disable the crontab code to see if it is the culprit.

Steve

Rubble · August 23, 2013, 6:03pm

I would have thought you would be receiving emails about any errors, unless the email address is on the same server, which is not a good idea.

I have problems with SpamAssassin, which stops access to the server, but normally for just 15 minutes at most, and again I get an email informing me there is a problem.

Is it a managed server or are you completely on your own?

odoisc · September 4, 2013, 12:45am

I’ve also seen maintenance/crontab scripts take a server down before, but I’ve already run this server for a few days with all crontabs off and it still crashed. I haven’t looked at fixing those PHP notices yet, but it’s the same code as on the old server and it was never a problem there. I agree it’s not perfect code, but I don’t think it would take so much memory that it could be the cause.

It is a dedicated server. There is no SpamAssassin running; luckily the whole mail part is somewhere else. I receive emails from the hosting company’s site that inform me if there are http/ping errors, but they don’t tell me what exactly might be wrong.

The main reason I’m replying now is that I might finally have some more useful information. Today the server has gone down 3 times already, and the last time I had SSH open with ‘top’ running. When the server crashed I saw some kind of error message mangled in between the latest top results (which stopped updating). This is what I could get out of it:

Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.462263] double fault: 0000 [#1] SMP
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.462723] Stack:9.8m
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.462823] Oops: 0000 [#2] SMP
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.464958] Stack:9.8m
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.465091] Call Trace:
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.465107]  <#DF>
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.465287]  <<EOE>>
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.465292] Code: 6d f8 e8 06 87 33 00 eb 09 41 f7 c5 ff 1f 00 00 74 40 45 85 f6 74 14 41 f6 c6 03 75 0e 48 c7 c7 06 fe 4b 81 31 c0 e8 e2 86 33 00 <49> 8b 75 00 48 c7 c7 26 02 4c 81 31 c0 49 83 c5 08 41 ff c6 e8
Message from syslogd@servername at Sep  4 01:17:47 ...
 kernel:[11053.465452] CR2: 0000000000000028
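Output like this can be searched for automatically in a saved terminal capture or in a kernel log file. Below is a rough sketch, assuming lines that contain `kernel:` followed by a bracketed uptime as in the paste above. The marker list and the function name are my own choices and are not exhaustive. Note that on a hard hang these lines may never reach the disk at all, so an empty result does not prove that nothing happened.

```python
import re

# Matches "kernel:[11053.462263] text" and "kernel: [11053.462263] text".
KERNEL_LINE = re.compile(r"kernel:\s*\[\s*(\d+\.\d+)\]\s*(.*)")

# Strings that commonly appear when the kernel reports a crash.
MARKERS = ("double fault", "Oops", "BUG", "Call Trace", "general protection fault")


def kernel_events(lines):
    """Return (uptime_seconds, message) for kernel lines containing a crash marker."""
    events = []
    for line in lines:
        match = KERNEL_LINE.search(line)
        if not match:
            continue
        message = match.group(2).strip()
        if any(marker in message for marker in MARKERS):
            events.append((float(match.group(1)), message))
    return events
```

The uptime value (seconds since boot) is useful on its own: here the fault came about 11053 seconds, roughly three hours, after the previous reset.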

I have now moved the site back to the old server while I try to understand this one. I also found some kind of automatic hardware-inspection option on the site of my hosting company, so once all the DNS has been updated and all the traffic goes to the old server again, I’ll try to run that. Whatever the result, I’ll then contact my hosting company with the error above. I had ordered a ready-to-run dedicated server with Debian/Apache/MySQL installed, where I only had to install small additional things like some PHP mods or change MySQL configs. It all seemed to run fine at first, but now I have the impression that this is a hardware or kernel-related matter that is beyond my reach and not my responsibility.

If anyone could tell me something about the error above, that would of course be great. I’ll keep this topic updated; thanks for the replies so far.

infinitnet · October 1, 2013, 11:19am

This could either be some TCP based DDoS attack which would overload the kernel (%si in “top” would be interesting), or an issue with the kernel or your hardware. So the first thing I’d do is add some proper monitoring, to find out if anything is happening with the network/traffic when the server goes down and then compile another kernel or just use the previous version from your package manager (apt) and see if that helps. If it doesn’t, check the hardware.
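The %si figure mentioned above (the share of CPU time spent on software interrupts) can also be logged without keeping top open, by sampling the aggregate `cpu` line of `/proc/stat` twice. This is only a sketch, assuming the standard Linux field order of user, nice, system, idle, iowait, irq, softirq, steal. The function names are made up for illustration.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user..steal only; the guest columns are already included in user/nice.
    return [int(value) for value in fields[1:9]]


def softirq_percent(before, after):
    """Percentage of CPU time spent in softirq between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[6] / total  # index 6 is the softirq counter


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which holds the totals for all CPUs."""
    with open(path) as handle:
        return handle.readline()
```

Taking one sample with `parse_cpu_line(read_cpu_line())`, sleeping a few seconds, taking a second one and passing both to `softirq_percent` gives a number comparable to what top shows. Written to a log every minute, a sharp rise just before a hang would point towards network load, while a flat line would point back at the kernel or the hardware.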

tom44 · October 6, 2013, 3:19pm

Hetzner? I have the exact same problem! My server is an i7-4770 Quad-Core Haswell with two SSDs and 32 GB RAM. Every two or three days my server goes down. The only way to bring it back up is the “automatic hardware reset” in the Hetzner panel! Did you find a solution?