This is a new area of PHP for me. I'm gearing up for a new role in which PHP will be at my disposal to automate some reporting on some very large DBs. I'll be dumping out some CSVs, and the queries plus the full file export will most likely run 20+ minutes.
I'm a complete beginner when it comes to understanding how PHP handles output while it's still running. I'm curious: when using mysql_unbuffered_query(), is the script going to time out at the set time limit? If so, what are my options? Should I be jacking the PHP timeout up to 2 hours? I'll also want to be able to retrieve task progress while it is running; what options might I have for this?
What might some error handling look like for this?
I don't fully understand what you mean, but by default PHP buffers output in chunks: it sends a chunk when the amount of buffered data exceeds a limit, then continues buffering the next chunk, and so on. If a chunk of data is sent and the browser is not responding (for example, the stop button was pressed or the connection ended), the script is terminated (unless ignore_user_abort() is set). This may work differently if you have zlib compression enabled or some other form of output buffering.

Therefore, if you want the script to run for a long time, it's best not to output anything at all until it finishes, or to start output buffering and then flush the buffer at the end of the script. For such a long-running script I would write any output to a log file; this way I can monitor the progress independently of the script by reading the log. I would certainly not rely on the browser connection staying open for 20+ minutes.
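A minimal sketch of what that setup might look like: suppress early termination, disable the time limit, and write progress to a log file instead of the browser. The log path and messages are assumptions for illustration.

```php
<?php
// Sketch of long-running-script hygiene, per the advice above.
// Assumes /tmp/report.log is writable by the PHP user.

ignore_user_abort(true);   // keep running even if the browser disconnects
set_time_limit(0);         // disable PHP's execution time limit for this script

$log = fopen('/tmp/report.log', 'ab');

function logMsg($log, $msg)
{
    fwrite($log, date('Y-m-d H:i:s') . ' ' . $msg . PHP_EOL);
    fflush($log);          // flush so the log can be tailed while the script runs
}

logMsg($log, 'Export started');
// ... long-running export work here, calling logMsg() periodically ...
logMsg($log, 'Export finished');
fclose($log);
```

You can then watch progress from a shell with `tail -f /tmp/report.log`, completely independently of the running script.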
I think PHP will not complain if you set the right time limit, but the db connection can still be dropped; I think this article should answer your question, especially the second point. IMHO, it would be best not to rely on unbuffered queries but instead to split the queries with LIMIT x, y so that they return smaller result sets in chunks. But YMMV; I think the reliability of long unbuffered queries will vary greatly depending on the server setup.
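A sketch of that chunked approach, assuming a PDO connection and an invented `sales` table. The integer casts keep the LIMIT values safe without relying on bound parameters, which some PDO configurations quote incorrectly in LIMIT clauses:

```php
<?php
// Hedged sketch: export a large table in LIMIT chunks instead of one
// unbuffered query. DSN, credentials, and table/column names are
// placeholders for illustration.

$pdo = new PDO('mysql:host=localhost;dbname=reports', 'user', 'pass');
$out = fopen('export.csv', 'wb');

$chunk  = 10000;
$offset = 0;

do {
    $sql  = sprintf('SELECT id, name, total FROM sales ORDER BY id LIMIT %d, %d',
                    (int) $offset, (int) $chunk);
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_NUM);

    foreach ($rows as $row) {
        fputcsv($out, $row);         // append this chunk to the CSV
    }
    $offset += $chunk;
} while (count($rows) === $chunk);   // a short (or empty) chunk means we're done

fclose($out);
```

Note that large OFFSET values get slower as you go; if the table has a sequential key, `WHERE id > $lastSeenId ... LIMIT $chunk` is usually faster for deep pagination.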
A log file, or progress information written to a db. Of course, the progress viewing should be handled by a separate script; if it is to be displayed on the same page, then ajax could be used to show the current progress.
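One simple shape for this: the worker periodically drops its state into a small JSON file, and a separate script (polled via ajax) reads it back. The path and field names here are assumptions:

```php
<?php
// Worker side: after each chunk of work, record the current state in a
// small file that a separate script can read. Path/fields are illustrative.
$progressFile = '/tmp/report-progress.json';
file_put_contents($progressFile, json_encode(array('done' => 12000, 'total' => 500000)));

// Viewer side (a separate, short-lived script the browser polls):
$state = json_decode(file_get_contents($progressFile), true);
printf("%d of %d rows exported\n", $state['done'], $state['total']);
```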
You can use set_error_handler() to log any errors to a log file.
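A minimal sketch of that, assuming a writable log path of your choosing:

```php
<?php
// Route PHP errors to a log file with set_error_handler().
// The log path is an assumption; pick one your PHP user can write to.

set_error_handler(function ($errno, $errstr, $errfile, $errline) {
    $line = sprintf("[%s] error %d: %s in %s on line %d\n",
                    date('c'), $errno, $errstr, $errfile, $errline);
    error_log($line, 3, '/tmp/report-errors.log');   // type 3 = append to a file
    return true;   // don't also run PHP's internal handler
});

// Example: this notice now lands in the log instead of the output.
trigger_error('something went wrong', E_USER_NOTICE);
```

Note that set_error_handler() does not catch fatal errors; for those you'd also want a register_shutdown_function() that checks error_get_last().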
While this is the PHP forum, I will recommend that you look for a language better suited to the task. Both Python and Perl work wonders with manipulating files, for example.
In addition, I would recommend basing most of the actual generation of the information on stored procedures, if it is not already set up this way. The stored procedures would generate the actual report into a table created for this specific report. Please note this is not a real temporary table, but one you create for this report and delete after everything is completed.
Then all the outer language needs to do is compile the CSV files from that table and apply any magic if something needs to be altered or combined.
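The flow described above might look roughly like this in PHP, assuming a live MySQL server; the procedure and table names are invented for illustration:

```php
<?php
// Hedged sketch: a stored procedure fills a per-report table, then PHP
// just streams that table out as CSV. Credentials, procedure name, and
// table name are placeholders, not real objects.

$db = new mysqli('localhost', 'user', 'pass', 'reports');

// 1. Let the database do the heavy lifting.
$db->query('CALL build_sales_report()');   // fills report_20240101

// 2. Dump the finished table to CSV, streaming rows as they arrive.
$out    = fopen('report.csv', 'wb');
$result = $db->query('SELECT * FROM report_20240101', MYSQLI_USE_RESULT);
while ($row = $result->fetch_row()) {
    fputcsv($out, $row);
}
$result->free();
fclose($out);

// 3. Drop the per-report table once the export is complete.
$db->query('DROP TABLE report_20240101');
```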
Remember that it is usually much faster to run several smaller queries than one large query, so make sure you test this (the larger the data set you work with, the more true this statement is).
If you decide to stick with PHP, write this as a CLI script.
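A skeleton for that, assuming the script lives in a file like report.php. Run from the CLI, there is no web-server timeout in play, and max_execution_time defaults to 0 (unlimited):

```php
<?php
// report.php -- run from the command line, e.g.:
//   php report.php
// The SAPI check guards against the script being hit through the web server.

if (PHP_SAPI !== 'cli') {
    fwrite(STDERR, "Run this script from the command line.\n");
    exit(1);
}

// ... export work goes here ...
fwrite(STDOUT, "done\n");
```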
Using an unbuffered query is a great option when you need to get a larger data set, since it does not pre-fetch all the data, but instead feeds it to you on the fly. The problem, of course, is that you can't do anything else with the connection while the result is open. That said, I would not use the mysql_ functions; consider switching to mysqli or PDO if you stick with PHP.
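With mysqli, the unbuffered equivalent of mysql_unbuffered_query() is the MYSQLI_USE_RESULT mode. A sketch, with invented connection details and table names:

```php
<?php
// Hedged sketch: stream a large result set with mysqli in unbuffered mode.
// MYSQLI_USE_RESULT fetches rows from the server on the fly instead of
// pre-fetching the whole set into memory; the connection stays busy until
// the result is freed.

$db  = new mysqli('localhost', 'user', 'pass', 'reports');
$out = fopen('export.csv', 'wb');

$result = $db->query('SELECT id, name, total FROM sales', MYSQLI_USE_RESULT);
while ($row = $result->fetch_row()) {
    fputcsv($out, $row);
}
$result->free();   // required before issuing another query on this connection
fclose($out);
```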
You are actually working against multiple time limits: PHP has one, the web server has one, and the database server has one too. In reality each has various timeout values for various connection types. The main ones I would worry about here are the web server and database timeouts.
On another note, a good question is whether you will be able to change these limits, and if you are, whether doing so is a good idea, i.e. is the data accessed by other people, etc. Increasing these timeouts also makes it much easier to mount a successful denial-of-service attack with limited resources, i.e. by capturing all of the connections and keeping them alive.
This topic is now closed. New replies are no longer allowed.