I have a web app with a game and a chat that update via Ajax very frequently, sending thousands of XHR requests to the server during a user's session. The problem is that most clients experience intermittent connection failures or timeouts. On the server everything appears to be fine: all requests are logged as having completed successfully and on time, but the client gets either a communication failure or a timeout. This makes the user feel like the game or chat just got stuck.
I use YUI to make the XHR calls, but I replaced it with jQuery with the exact same results. I make sure that only one request is active at a time, so I know I am not hitting the browser's limit of two connections per host. The failure rate may be something like 1 in 200 requests, but once a client starts failing it seems to fail at a higher rate, say 1 in 10, until the problem goes away. I have clients send back an error report when such failures happen, in order to track the problem. I have tried various things to figure out what is wrong, but I have not found anything. I don't think it's the clients' connections, as it happens to a sizable portion of my users, and it doesn't look like it's the server either.
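To be concrete, my polling is serialized roughly like this sketch (names are made up; `sendPoll` stands in for the actual jQuery call, which I wrap so it returns a Promise). The next poll is only issued from the completion of the previous one, so at most one XHR is ever in flight:

```javascript
// Sketch of a serialized polling loop: at most one request in flight.
// `sendPoll` is any function returning a Promise (e.g. a wrapped
// $.ajax call); `onUpdate` receives the response or the error.
function pollLoop(sendPoll, delayMs, onUpdate) {
  let stopped = false;
  function step() {
    if (stopped) return;
    sendPoll()
      .then(onUpdate, err => onUpdate(null, err))      // report failures too
      .then(() => { if (!stopped) setTimeout(step, delayMs); }); // only now schedule the next poll
  }
  step();
  return () => { stopped = true; };                    // call to stop the loop
}
```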
So what I am wondering is whether this failure rate is common or expected when doing heavy Ajax stuff, in which case I need to find a way to work around it, or whether it must be a problem in my system that I need to track down. If the latter, any pointers on where to look?
When transmitting data packets, sometimes you get data loss with no identifiable cause at either end. It's just one of those expected facts of life on the internet, for which timeout-and-retry protocols exist.
Well, yes, but TCP should take care of that, i.e. make sure all packets arrive, retransmitting them if necessary; otherwise you would never be able to download anything bigger than a few megabytes without suffering data loss. So unless the internet connection drops, even momentarily, I would not have expected something like this.
It could be that the huge number of XHRs is causing a memory (or some other kind of) leak in the browser, but it happens in various different browsers, so that can't be it. It doesn't appear to be my application either, because I added a test feed to my page that keeps downloading an empty static file from another server, with similar drop/failure rates.
I guess the only possible theory I have right now is that the server and/or network at my hosting company drops connections. I have used two different servers, but they were both VPSes from the same hosting company (a big US VPS host, so I wouldn't have expected this). I guess I will try a different server somewhere else and repeat the experiment.
The problem with timeout/retry is that, to keep the real-time feel, the timeout would have to be not much longer than the polling interval so the client doesn't "get stuck" when a connection is dropped; but that would probably time out a lot of connections that would otherwise be fine, and the retries would start hitting the system even harder and make performance even worse in the end. That's why I am almost convinced there must be another explanation. (Timeout/retry should still be in place, but for much rarer events.)
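For reference, the timeout/retry scheme I have in mind looks roughly like this sketch (names and parameters are hypothetical; `doRequest` stands in for the actual jQuery call wrapped in a Promise):

```javascript
// Reject a promise if it does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timeout')), ms);
    promise.then(
      v => { clearTimeout(timer); resolve(v); },
      e => { clearTimeout(timer); reject(e); }
    );
  });
}

// Retry a request a few times before giving up, timing out each attempt.
async function requestWithRetry(doRequest, { timeoutMs = 3000, retries = 2 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await withTimeout(doRequest(), timeoutMs);
    } catch (e) {
      lastError = e;   // dropped or slow connection: try again
    }
  }
  throw lastError;     // all attempts failed
}
```

The worry above is exactly about picking `timeoutMs`: too short and healthy-but-slow requests get killed and re-sent, multiplying the load.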
> The problem with timeout/retry is that, to keep the real-time feel, the timeout would have to be not much longer than the polling interval so the client doesn't "get stuck" when a connection is dropped
that doesn't seem so bad to me. worth doing, and likely a fix (so long as you make sure repeated requests don't cause problems)
> but that would probably time out a lot of connections that would otherwise be fine
can't you make that not be the case? just time out that specific connection, not in a batch/bulk way
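by "make sure repeated requests don't cause problems" i mean something like this toy sketch (all names made up): tag each request with a client-side sequence number, and have the server ignore a number it has already applied, so retrying a request whose first attempt actually succeeded is harmless:

```javascript
// Toy sketch of idempotent retries via per-client sequence numbers.
let nextSeq = 0;
function makeRequest(action) {
  return { seq: nextSeq++, action };   // client side: tag each request
}

const applied = new Set();             // server side: seqs already handled
function handleOnServer(req) {
  if (applied.has(req.seq)) return 'duplicate-ignored';
  applied.add(req.seq);
  return 'applied';
}
```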
> the only possible theory I have right now is that the server and/or network in my host company drop connections
without being that knowledgeable, and taking a blind guess, i'd say it's the users' slightly flaky connections; wifi etc.