Is retrieving from a database faster or from a file?

I would say that in theory file storage is faster as you can optimize it perfectly for your specific task. In practice however, it will be hard to optimize better than your database system, and it will probably save you a lot of implementation time to use a database (which will also scale better).

Many database systems use a server, which means interprocess communication and some delay compared to reading straight from a file (again, this will most likely be a theoretical difference). One exception is SQLite, which doesn’t involve a server process and stores each database in a single file.
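For example, opening an SQLite database from PHP is just a file open, with no socket round trip to a server (a quick sketch using PDO; the menu.db file and pages table are made up):

<?php
// hypothetical menu.db file; PDO's sqlite driver reads it directly,
// with no server process in between
$pdo = new PDO('sqlite:./menu.db');
foreach ($pdo->query('SELECT title, path FROM pages') as $row) {
	echo $row['title'] . ' ~ ' . $row['path'] . "<br/>\r\n";
}
?>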

My point was that indexing would negate this. If all the fields specified in the SELECT and WHERE clauses are in the index, then the query won’t touch the table at all (a so-called covering index).
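For example (hypothetical table and index names, just to illustrate):

-- a composite index holding every column the query touches
CREATE INDEX idx_pages_title_path ON pages (title, path);

-- both the selected and the filtered columns live in the index, so
-- MySQL answers this from the index alone (EXPLAIN shows "Using index")
SELECT path FROM pages WHERE title = 'Home';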

Cheers,
D.

I was going to post a comment, based on my gut feeling, that any difference in getting small result sets would be insignificant. But I decided to run an admittedly unprofessional benchmark test just to see. I created an XML file

menu.xml

<?xml version="1.0" encoding="utf-8" ?>
<menu>
	<page>
		<title>First</title>
		<path>first/index.php</path>
	</page>
	<page>
		<title>Second</title>
		<path>second/index.php</path>
	</page>
	<page>
		<title>Third</title>
		<path>third/index.php</path>
	</page>
	<page>
		<title>Fourth</title>
		<path>fourth/index.php</path>
	</page>
	<page>
		<title>Fifth</title>
		<path>fifth/index.php</path>
	</page>
	<page>
		<title>Sixth</title>
		<path>sixth/index.php</path>
	</page>
</menu>

a CSV file

menu.csv

First|first/index.php
Second|second/index.php
Third|third/index.php
Fourth|fourth/index.php
Fifth|fifth/index.php
Sixth|sixth/index.php

and a database table “menu” with varchar(10) “title” and varchar(20) “path” fields containing the same values as the files.
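In SQL terms, roughly this (a reconstruction from the description above; engine and charset are whatever MySQL defaults to):

CREATE TABLE menu (
	title VARCHAR(10),
	path VARCHAR(20)
);

INSERT INTO menu (title, path) VALUES
	('First', 'first/index.php'),
	('Second', 'second/index.php'),
	('Third', 'third/index.php'),
	('Fourth', 'fourth/index.php'),
	('Fifth', 'fifth/index.php'),
	('Sixth', 'sixth/index.php');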

Then I threw together this file to test

menu.php

<?php
$iter = 100;

set_time_limit(120);
$fs_xml_start_time = microtime(true);
for ($i = 0; $i < $iter; $i++)
{
	//get xml stuff here
	$xml = simplexml_load_file('./menu.xml');
	foreach ($xml->children() as $page){
		$page_title = $page->title;
		$page_path = $page->path;
		if ($i == 0){
			echo $page_title . " ~ " . $page_path . "<br/>\r\n";
		}
	}
}
$fs_xml_stop_time = microtime(true);
$fs_xml_run_time = $fs_xml_stop_time - $fs_xml_start_time;
echo '<p>filesystem XML: ' . $fs_xml_run_time . '</p>';


set_time_limit(120);
$fs_csv_start_time = microtime(true);
for ($j = 0; $j < $iter; $j++)
{
	//get csv stuff here
	$csv_f = fopen("./menu.csv","r");
	while (($data = fgetcsv($csv_f, 110, "|")) !== FALSE) {
		$page_title = $data[0];
		$page_path = $data[1];
		if ($j == 0){
			echo $page_title . " ~ " . $page_path . "<br/>\r\n";
		}
	}
	fclose($csv_f);
}
$fs_csv_stop_time = microtime(true);
$fs_csv_run_time = $fs_csv_stop_time - $fs_csv_start_time;
echo '<p>filesystem CSV: ' . $fs_csv_run_time . '</p>';


set_time_limit(120);
$db_tbl_start_time = microtime(true);
for ($k = 0; $k < $iter; $k++)
{
	//get db stuff here
	$host = 'localhost';
	$user = '*****'; //user name
	$pswd = '*****'; //password
	$db = 'test';
	$connection = mysql_connect($host,$user,$pswd) or die('Could not connect: ' . mysql_error());
	mysql_select_db($db) or die('Could not select database');
	$query = "SELECT title, path FROM menu";
	$result = mysql_query($query, $connection) or exit(mysql_error());
	while ($row = mysql_fetch_assoc($result)) {
		$page_title = $row["title"];
		$page_path = $row["path"];
		if ($k == 0){
			echo $page_title . " ~ " . $page_path . "<br/>\r\n";
		}
	}
	mysql_free_result($result);
	mysql_close($connection);
}
$db_tbl_stop_time = microtime(true);
$db_tbl_run_time = $db_tbl_stop_time - $db_tbl_start_time;
echo '<p>Database table: ' . $db_tbl_run_time . '</p>';

?>

Several runs (100 iterations each) consistently gave around 0.2 seconds for XML, 0.08 seconds for CSV and 0.9 seconds for MySQL.
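One caveat with my own numbers: the connection is opened and closed inside the loop, so the MySQL figure includes 100 connect/disconnect cycles. A variation like this (same mysql_* calls as above, connection hoisted out of the loop) would isolate the query cost from the connection overhead:

set_time_limit(120);
$connection = mysql_connect($host, $user, $pswd) or die('Could not connect: ' . mysql_error());
mysql_select_db($db) or die('Could not select database');
$db_tbl_start_time = microtime(true);
for ($k = 0; $k < $iter; $k++)
{
	// the connection is reused, so only the query and fetch are timed
	$result = mysql_query("SELECT title, path FROM menu", $connection) or exit(mysql_error());
	while ($row = mysql_fetch_assoc($result)) {
		$page_title = $row["title"];
		$page_path = $row["path"];
	}
	mysql_free_result($result);
}
$db_tbl_stop_time = microtime(true);
mysql_close($connection);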

So I guess it depends on whether you want it to scale and what other things you might want to do with the data.

If anyone sees anything I’m missing, please be politely harsh, I’m interested in learning more.

Just looking at three open source apps (MediaWiki, phpBB and SMF), they all store their images (uploaded user avatars) in the file system. Does anyone know how vBulletin stores avatars (in the database or the file system)?

Most forums I’ve used don’t allow any file uploads apart from avatars (MediaWiki does, but it stores the images in the file system). Most make you provide a link to where the image resides (generally on sites like Flickr, ImageShack or Photobucket).

The ones who would really know which way is most efficient are sites like Flickr, ImageShack and Photobucket, given the huge number of image files they host between them.

EDIT: What might be more efficient stored in the database is thumbnails of larger images.

Personally, I’d make the database the ultimate backing store for the blobs. That doesn’t mean I wouldn’t cache them out to disk if need be, but for backup/restore/disaster-recovery purposes, having all the application data living in one place is the vastly superior option.

I’d also argue that, for most cases, scalability is a bit of a red herring. Any modern setup should be able to handle the traffic for most sites we discuss here, and sustainability and maintainability will pay off far more than performance improvements.

Oh, and thirding the “databases are far, far more than glorified filesystem abstraction” sentiment.

I’d say that having the files in the file system makes them easier to recover and restore, as there are a number of recovery programs you can use. Of course, if your database and files are on the same disk and that disk fails, you’ll lose everything no matter which way you store the files.

Also, depending on the size and number of files, storing them in the database limits your choice of backup media to hard drives. If you store them in the file system, you also have the option of optical discs (DVD, Blu-ray).

And what would you do with database-stored files if some hacker got into the database and deleted it all? You would lose the files. If they are stored in the file system, you would just need to run the appropriate script to “re-index” them.
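By “re-index” I mean something like this sketch (directory and table names are made up): walk the upload directory and rebuild the rows from what is actually on disk.

<?php
// hypothetical: rebuild a files table from the upload directory,
// assuming a database connection is already open
$dir = './uploads/';
foreach (scandir($dir) as $file) {
	if ($file == '.' || $file == '..') continue;
	$name = mysql_real_escape_string($file);
	mysql_query("INSERT INTO files (filename) VALUES ('$name')");
}
?>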

Yes, file recovery is a nice option. Then again, this stuff should be living on SANs and/or RAID 5+ storage, where the risk of hardware failure is minimal, presuming operators aren’t sleeping on the job. I’d also argue that a single-file recovery scenario (e.g. one database backup) is easier to deal with than grabbing hundreds or thousands of files and making sure you have them all, in the right place, with the right permissions.

Blu-ray holds plenty of data, but if you are kicking around 50+ GB databases, you probably aren’t backing up to optical media anyway due to speed constraints.

If your box got that thoroughly compromised, you should really be rebuilding it from scratch anyway. Also, a database restore is much more reliable than file system recovery. With proper transaction logging, you can often even restore things to a specified point in time.

I just found this thread about it. A good test, if someone has the time, would be to take 2,000 image files, store 1,000 of them in a BLOB field in one table and the other 1,000 in the file system (with filename, MIME type etc. stored in a table), and then time how long it takes to display each set of 1,000 files. The display should be repeated 15-20 times for each method to get an average.
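For the BLOB half, the script being timed would look something like this sketch (the images table and its columns are made up):

<?php
// hypothetical images table: id, mime, data (BLOB),
// assuming a database connection is already open
$id = (int) $_GET['id'];
$result = mysql_query("SELECT mime, data FROM images WHERE id = $id");
$row = mysql_fetch_assoc($result);
header('Content-Type: ' . $row['mime']);
echo $row['data'];
?>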

When you google around, it seems like storing files in BLOBs is one of those situations where you either absolutely love it or absolutely hate it; there doesn’t seem to be much middle ground.

Yeah, this is definitely a war of religion. Things seem to be civil here, so we might as well continue.

Interesting experiment, but I’d ask “does it matter?” as well as “are you measuring the important part?”

For most scenarios, pulling from either source will be fast enough; the real bottleneck is usually the WAN connection between client and web server, not anything that might or might not be happening on the server. Now, the trouble you can run into is that if you are pulling large blobs out of your DB all the time, either the web server or the DB server can choke on it. The trick is, if you are managing this in your application, you can employ a variety of caching strategies to ease that bottleneck and still keep the management advantages of database-based storage.
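For example, a crude disk cache in front of the database (a sketch, with made-up table and path names): serve the blob from disk when it is already there, otherwise pull it from the database once and write it out.

<?php
// hypothetical: the DB is the backing store, disk is just a cache;
// assumes a database connection is already open
$id = (int) $_GET['id'];
$cache_path = './cache/' . $id;

if (!file_exists($cache_path)) {
	// cache miss: fetch the blob from the database once...
	$result = mysql_query("SELECT data FROM images WHERE id = $id");
	$row = mysql_fetch_assoc($result);
	// ...and park it on disk for every later request
	file_put_contents($cache_path, $row['data']);
}
readfile($cache_path);
?>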

Sorry for the late reply, I have a lot of work these days.

Thanks for all the help, I learned a lot from the replies. I guess it comes down to what I am storing and what my application does with it, so I think I’ll have to test it with some sample data.

I’ll make sure to post the results when I am done.

And thanks again for all the help; please feel free to add anything else.

Thank you.