Separation of database access between objects and their subordinates

Can you elaborate on why it is strange to pull all of the data you need upfront and not have to go back time and time again for more data? So ultimately you make 1 connection, do your query (or queries), disconnect and never connect again until the next page request/submission. I'm just curious as to why you consider it strange (call it my curiosity).

Well yes, however, writing an ORM as a learning exercise is not a bad idea at all. When you hit those brick walls it gives you an understanding of why things are done how they are.

When I looked into Doctrine (admittedly, 2 years ago) the learning curve was huge and it wasn’t obvious why they’d made certain design decisions. It also took a hell of a lot of configuration to do something simple. I want something that simply works without needing to configure everything up front.

For instance, in my own mapper I can run echo $datamapper->user->findById(123)->name; without defining a single class and only configuring the data mapper to default to the database and the database connection.
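
Something along these lines is all it takes (a simplified sketch, not the actual mapper code - the class names and connection details are purely illustrative):

// Sketch: $datamapper->user resolves to a per-table mapper via __get(),
// and findById() runs a parameterised query, returning the row as an object.
class DataMapper
{
    public function __construct(private PDO $pdo) {}

    public function __get(string $table): TableMapper
    {
        // No per-entity class needs to be defined; the table name comes from the property.
        return new TableMapper($this->pdo, $table);
    }
}

class TableMapper
{
    public function __construct(private PDO $pdo, private string $table) {}

    public function findById(int|string $id): ?object
    {
        // In real code the table name should be whitelisted before being interpolated.
        $stmt = $this->pdo->prepare('SELECT * FROM `' . $this->table . '` WHERE id = ?');
        $stmt->execute([$id]);
        return $stmt->fetch(PDO::FETCH_OBJ) ?: null;
    }
}

$datamapper = new DataMapper(new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'));
echo $datamapper->user->findById(123)->name;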

edit: I just looked at the manual for Doctrine 2… metadata in comments?! From a design perspective that is insanity. The idea of code which can break if comments are changed is, frankly, shocking. There's no way to syntax check it, and other programmers looking at the code have a whole new syntax to learn; at least XML is vaguely readable by anyone. Not only that, storing metadata with data is always bad practice as it tightly couples them, preventing them from being interchangeable as well as greatly reducing portability. It genuinely worries me that they're teaching people that this is a good practice and encouraging others to do it.

I think I have possibly misunderstood your previous sentence, what I assumed from it was that the database connection was opened and then closed per separated query. I.e. that it would set up the connection to the database several times per load, each time you run a separate query. If that is not the case, then it is just a misunderstanding from my side, and please ignore my comment.

Claiming that using an ORM will get you more effective queries than writing them yourself just shows ignorance. The only case where that will be true is if the programmer does not know SQL.

Have you tried using it on a table with a larger record set?
Have you tried using it in an environment requiring database sharding?
Have you tried using it in a master/slave database server setup?

An ORM can have its place, and I have used one many times when creating Windows applications, but I still believe using one in a PHP application is a bad idea. It can work on smaller websites, but the second you need to expand and scale, you have problems, not to mention you would need to scale earlier as well.

Thanks. In our prior company, we were required to re-connect (but we would run multiple queries within a single connection to try and only use 1 connection for the entire page), as leaving a SQL connection open ended up causing us to max out connections (we had fairly high traffic at all hours). In PHP, I’d either pass in the connection or make it a base variable so it can be persisted across queries.
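
For example, a rough sketch of what I mean by making the connection a base variable so one connection serves the whole page (names here are illustrative only):

// Sketch: open one PDO connection per request, lazily, and share it everywhere.
abstract class DbAware
{
    private static ?PDO $pdo = null;

    protected function db(): PDO
    {
        // Connect once per request and reuse that connection for every query.
        if (self::$pdo === null) {
            self::$pdo = new PDO('mysql:host=db.example;dbname=app', 'user', 'pass', [
                PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
            ]);
        }
        return self::$pdo;
    }
}

class UserRepository extends DbAware
{
    public function find(int $id): ?object
    {
        $stmt = $this->db()->prepare('SELECT * FROM users WHERE id = ?');
        $stmt->execute([$id]);
        return $stmt->fetch(PDO::FETCH_OBJ) ?: null;
    }
}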

Considering there are many more cases apart from pulling the first item from the data source, you need a very intelligent translator of object property access and method calls to SQL for it to remain efficient. Still, even the most intelligent translator cannot optimize everything if you don't tell it up front what data you will need to request later, because often it's much faster to get more data in bigger chunks than to request many smaller ones. Therefore, from a performance point of view, what cpradio suggests makes more sense - fetch all data beforehand in one go.

It looks like you have built a very flexible system with almost ideal OO data access. However, I don't think it comes without sacrificing some portion of performance, which can be important for any large system. For anything small it's sweet to have so much data source abstraction and independence, but when a site gets large and datasets expand then it becomes important to tweak or rewrite individual SQL queries, and at that stage this abstraction becomes a hindrance. And I don't think being able to swap a relational database for XML files or whatever else you may think of is important (unless you have a specific requirement for this in a project). In a small system you can play with it, but with a large database this would be almost insane.

This is an ongoing question of how far we are willing to go with implementing good and flexible OOP at the expense of practical usefulness (performance). I don’t think in PHP I would go as far as you but certainly that would be a good exercise in learning OOP techniques. Everyone chooses their own balance.

Yes, I agree, it’s worth the effort. I did this some 3 years ago and it wasn’t just a learning exercise because now I use my ORM in almost all of my projects and it seems to work very well and fast. I still make small improvements to it from time to time but generally it’s done. This way I am able to work with an ORM that suits my requirements:

  • simple - as few magical things happening under the hood as possible, support only for the simplest relations, I don’t want an ORM to do everything for me because then I spend more time maintaining my ORM (or learning a third-party ORM) than actually using it.
  • fast even by sacrificing database independence - when I choose a database for a project I don’t intend to change it ever.
  • provide convenience for basic CRUD operations and provide easy ways to execute my own SQL - I like to write SQL and I like having total control of optimizing queries, I just need my ORM to put the result set into objects.
  • no need to define metadata, no XML or other model/relationship configuration files, no other maintenance chores (in my case it’s just a question of running one script that will automatically reverse-engineer all my database structure into proper objects)
  • the major goal of ORM for me is to provide me with structure of objects which I can use to manipulate data and add my convenience methods, etc.

I have achieved all these goals knowing that I have sacrificed a few things like database independence and a few good OOP practices, but the system works, it's fairly easy to use and my sites load very fast. I understand this is not a full-featured ORM but I don't mind.
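
As an example of the "write my own SQL, let the ORM put the result set into objects" point above, even plain PDO covers the simplest case (Product here is a made-up entity class and $pdo an existing connection):

// Sketch: hand-written SQL, with the rows hydrated straight into entity objects.
class Product
{
    public $id;
    public $name;
    public $price;
}

$stmt = $pdo->prepare('SELECT id, name, price FROM products WHERE hidden = 0 ORDER BY price LIMIT 50');
$stmt->execute();
$products = $stmt->fetchAll(PDO::FETCH_CLASS, Product::class);

foreach ($products as $product) {
    echo $product->name, ': ', $product->price, "\n";
}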

Everyone will have different priorities but if you have time to play with writing your own ORM then by all means go for it. However, I think that novices in OOP should first get some experience with an existing system just to gain some basic knowledge.

Yes, I also don't like systems where I have to define all kinds of configuration in XML, comments, etc. There is a point where learning to use a system and configuring it takes so much time that the development time saved by actually using the system is not enough for it to be profitable.

Considering there are many more cases apart from pulling the first item from the data source, you need a very intelligent translator of object property access and method calls to SQL for it to remain efficient. Still, even the most intelligent translator cannot optimize everything if you don't tell it up front what data you will need to request later, because often it's much faster to get more data in bigger chunks than to request many smaller ones. Therefore, from a performance point of view, what cpradio suggests makes more sense - fetch all data beforehand in one go.

No ORM can optimise everything. You don’t want to tell it what data to fetch. Part of the job of the ORM is to ensure consistency. From an OOP perspective objects should always be complete. Admittedly, in the name of performance it’s usually better to specify certain columns (such as blob ones) which are lazy loaded, but as a rule, a record is a data set and needs to have integrity. By fetching part of the data, you lose this.

Whether you’re using joins or separate queries, the DB is going to be the slowest part of the system.

Have you benchmarked that? Most of the time additional queries are faster than joins, especially when sorting is involved or you take prepared statements into account. The number of queries has little effect: if you're running 10 queries that each do a PK lookup and run in a matter of milliseconds, that's better than running a query which does 10 joins and a sort and takes 2 seconds to run.

If you are indeed “fetching all the data beforehand” that is potentially far worse performance! You’re fetching data you may never need or want.

However, I don't think it comes without sacrificing some portion of performance, which can be important for any large system. For anything small it's sweet to have so much data source abstraction and independence, but when a site gets large and datasets expand then it becomes important to tweak or rewrite individual SQL queries, and at that stage this abstraction becomes a hindrance. And I don't think being able to swap a relational database for XML files or whatever else you may think of is important (unless you have a specific requirement for this in a project). In a small system you can play with it, but with a large database this would be almost insane.

Again, you’re making assumptions about performance problems that don’t exist because additional queries are often faster!

Funnily enough, it's those larger systems where being able to join from multiple sources has the largest benefit and practical use, because they often do need to connect to external data sources, multiple databases, etc. In fact, it's only in small self-contained systems where you'd want to use a DB-specific data mapper, because you can be fairly sure no other data sources will be needed. The larger the system, the more features there are and it becomes increasingly more likely that external/unknown data sources are needed.

Consider:


//Show user's latest tweet
echo $this->mapper->user->findById(123)->twitter[0]->content;

Which would use a mapper that connected to the users twitter feed and found the tweets. That is, of course, the tip of the iceberg.
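
As a rough sketch (nothing here is the real implementation, just illustrative names and URLs), the twitter mapper only needs to honour the same find interface while talking to the web API instead of SQL:

// Sketch: a non-database mapper that can sit behind the same property chain.
interface Mapper
{
    public function findByUserId(int $userId): array;
}

class TwitterMapper implements Mapper
{
    public function findByUserId(int $userId): array
    {
        // Illustrative only: fetch and decode the user's feed from a web service
        // instead of running SQL; the calling code neither knows nor cares.
        $json = file_get_contents('https://api.example.com/tweets?user=' . $userId);

        $tweets = [];
        foreach (json_decode($json, true) ?: [] as $item) {
            $tweet = new stdClass();
            $tweet->content = $item['text'] ?? '';
            $tweets[] = $tweet;
        }
        return $tweets;
    }
}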

This is an ongoing question of how far we are willing to go with implementing good and flexible OOP at the expense of practical usefulness (performance). I don’t think in PHP I would go as far as you but certainly that would be a good exercise in learning OOP techniques. Everyone chooses their own balance.

Again though, this is a strawman argument. There is no performance issue. Using inner joins can be slower!

  • no need to define metadata, no XML or other model/relationship configuration files, no other maintenance chores (in my case it’s just a question of running one script that will automatically reverse-engineer all my database structure into proper objects)

You do need some metadata somewhere. The fact that you have expressed it inside queries doesn’t mean that it isn’t there. Those joins in your queries ARE the metadata. They are defining the relationships between your entities. Whether they’re stored in an SQL query, an XML file or wherever is irrelevant, any ORM which handles relationships will need some knowledge of it.

You are too hung up on a performance issue (which doesn't even exist!). Ignoring that, there is always a cost-to-benefit ratio. The cost to performance is nil or almost nil whereas the benefit is huge. Consider testing: being able to quickly and easily substitute any mapper to use an XML file instead of the database immediately makes testing far easier by removing the database dependency.

A practical example: on one site I worked on, a client had a stock control system used at their physical store which already contained almost all the information we'd ever need about the products. This was a desktop application running on their network that could be connected to as a web service to retrieve the data. By simply altering a couple of lines in the mapper definition, all the data could be used in real time and work with our existing shopping cart system with no changes and no need to store a copy of the data in the database, potentially creating syncing issues. The data could then simply be saved back into the stock control system transparently - all using the existing code that we have used on several entirely database-driven sites.

Yes, I agree, objects should always be complete. That's why I said that loading data in advance is a compromise if we want to achieve better performance - whether better performance actually can be achieved that way is, as far as I can see, debatable right now :).

I can't agree with that statement entirely. It all depends on what kind of query is run and we can't generalise it like that. In my experience it sometimes happens that separate queries are faster than joins, but I've had very few cases like that. I can see that most of my joins perform very fast. Just for the sake of curiosity I've just made a benchmark of a common join, a simple case where I make a listing of products and need to fetch the manufacturer name from a separate table:


SELECT p.*, m.*
FROM products p
LEFT JOIN manufacturer m
    ON m.id = p.id
WHERE p.hidden = 0
ORDER BY p.price
LIMIT 1000;

The table has about 6000 records, with no indexes except on the ids. I did this without any ORMs; I just loaded the entire result set into an array and timed it. Then I did the same without a join, loading each manufacturer's data from the database in a loop. The whole operation was about 2.5 times faster with the join. The interesting thing was that when I repeated the benchmark loading the data into objects through my ORM, the join performed about 2.5 times slower - but that was due to the way I fetch data: when I fetch data into a single object (without joins) I load the whole row with a single statement, whereas for joins there is a loop that traverses all fetched fields so that the appropriate ones are populated into the related objects. So taking this into consideration, the join will be slower in my case because I can't optimize fetching the data (I can't find a way to load the related row without looping over each column and doing some logic to determine where the data should go).

So it's not so much a question of joins vs separate queries but rather the overall overhead of the mapper/ORM. Loading everything properly into objects can be 2 or 3 times slower compared to a more direct method - this may not be much for most sites, but for busy ones it may mean an earlier need for a dedicated server. If I allow myself some inconsistency in my objects then I can, for example, load the manufacturer name with a JOIN into the Product object and the speed will be better than any other solution. Sure, inconsistency will have its price.
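
To illustrate, the kind of hydration loop I mean looks roughly like this - one common variant aliases the joined columns with a prefix and splits on it when populating the objects (the aliases and the manufacturer_id join column here are assumptions for the example; $pdo is an existing PDO connection):

// Sketch: alias joined columns (m_*) so a single pass can route each column
// into either the Product or its related Manufacturer object.
$sql = 'SELECT p.*, m.id AS m_id, m.name AS m_name
        FROM products p
        LEFT JOIN manufacturer m ON m.id = p.manufacturer_id
        WHERE p.hidden = 0
        ORDER BY p.price
        LIMIT 1000';

$products = [];
foreach ($pdo->query($sql, PDO::FETCH_ASSOC) as $row) {
    $product = new stdClass();
    $manufacturer = new stdClass();
    foreach ($row as $column => $value) {
        if (strncmp($column, 'm_', 2) === 0) {
            $manufacturer->{substr($column, 2)} = $value;   // goes to the related object
        } else {
            $product->$column = $value;                      // goes to the product itself
        }
    }
    $product->manufacturer = $manufacturer;
    $products[] = $product;
}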

You may have a point here with the larger systems, but I don't think the Twitter example is the best one, because even in a DB-specific system I can easily do the above if I write my own method for connecting to Twitter. But then yes, I need to do it all properly and not try to fetch Twitter content with SQL; I need to stick to the objects. The problem might be if I first had the Twitter data in the DB and used SQL, and then switched to some remote feed.

Yes, I fully agree that I need metadata somewhere. I meant that I don't have to define it anywhere because it is generated automatically. The point is I don't need to spend time on it.

Yep, I can see benefit in that, point taken!

BTW, do you implement some form of caching? When you do a chain like $this->mapper->user->findById(123)->twitter[0]->content a few times then does your mapper fetch the data again from the DB?

Just wanted to point this out, as this is where you can gain significant performance improvements. If you ever deal with a SQL Server or MySQL server or whatever database that is installed on a separate server or controlled/monitored by third parties, you can get significant performance savings by only needing to go to that server once. It is important to consider the fact that you will have network latency in your connections and fetching of results; doing that repeatedly is going to add up in cost.

Let me put that into a real-life example. I worked on a project where the average page load time was 5-7 seconds (not terrible for the amount of data they were loading, but still slow). Ultimately, they were performing 600+ queries/fetches for the data needed for the page. Changing those 600 queries (it was in a loop) down to 2 queries dropped the entire page load time to under 1 second. Their SQL Server was located on a separate server, so they did have network latency to consider, but they never realized that as their product grew, the loop was going to run more and more queries.
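
To sketch the kind of change that was (table and column names invented for the example; $pdo is an existing PDO connection and $orderIds a list of ids):

// Before: one query per item inside the loop - N round trips to the server.
foreach ($orderIds as $id) {
    $stmt = $pdo->prepare('SELECT * FROM order_lines WHERE order_id = ?');
    $stmt->execute([$id]);
    $lines[$id] = $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// After: one query for the whole batch - a single round trip, grouped in PHP.
$placeholders = implode(',', array_fill(0, count($orderIds), '?'));
$stmt = $pdo->prepare("SELECT * FROM order_lines WHERE order_id IN ($placeholders)");
$stmt->execute($orderIds);
$lines = [];
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $lines[$row['order_id']][] = $row;
}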

The problem I see is that some ORMs could have this effect if not closely watched, especially with remote sources. I'm not saying ORMs are bad, just do your due diligence and test them in a variety of situations before going all in on one.

How many queries did you run? You should have run 1 query to fetch the products and one for the manufacturer of each product, the result being N+1 queries. Imagine you want to sort the initial product list: add ORDER BY product.name to that, or ORDER BY RAND(), and your single query will be immensely slower because it's sorting the entire data set. Move the sort to shuffle($products) in PHP and it's suddenly much faster. This is scaled by the number of tables you are joining.

So it's not so much a question of joins vs separate queries but rather the overall overhead of the mapper/ORM. Loading everything properly into objects can be 2 or 3 times slower compared to a more direct method - this may not be much for most sites, but for busy ones it may mean an earlier need for a dedicated server. If I allow myself some inconsistency in my objects then I can, for example, load the manufacturer name with a JOIN into the Product object and the speed will be better than any other solution. Sure, inconsistency will have its price.

But it depends what your query is doing. Once you start adding more joins, sorts and grouping (which you'll need if you're using joins this way), the single-query method very quickly loses any performance benefit it had. Simple is always faster, and once you start using prepared statements to fetch the manufacturers the difference will become even smaller in your very simplistic example.

That's fair enough, but we recently had an interesting topic on premature optimisation and that's exactly what it sounds like to me. The problem is you're sacrificing flexibility in the name of performance. Fixing performance problems can easily be done after the fact with a host of different methods. Fixing flexibility problems generally involves re-writing a lot of code, because the code isn't flexible enough to handle anything outside what it was initially designed to do - which is exactly what bites you once the client moves the goalposts.

Yes, I fully agree that I need metadata somewhere. I meant that I don’t have to define them anywhere because they are generated automatically. The point is I don’t need to spend time on them.

How are your joins generated automatically? Somewhere you need to store data about the primary key/foreign key relationships. The query itself can be generated, but which fields to join on needs to be stored somewhere. You'll need to spend time defining those somewhere if your ORM is to support joins. It will also need to know the kind of join: 1:1, 1:n. This is all metadata and needs to be configured to enable your queries to be generated.
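
Put another way, whether it is hand-written or reverse-engineered from foreign keys, something like the following has to exist somewhere before a join can be generated (the array layout below is purely illustrative):

// Sketch: the minimum an ORM needs to know before it can build a join.
$relations = [
    'product' => [
        'manufacturer' => [
            'type'       => 'belongsTo',      // 1:1 from the product's side
            'localKey'   => 'manufacturer_id',
            'foreignKey' => 'id',
        ],
        'reviews' => [
            'type'       => 'hasMany',        // 1:n
            'localKey'   => 'id',
            'foreignKey' => 'product_id',
        ],
    ],
];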

BTW, do you implement some form of caching? When you do a chain like $this->mapper->user->findById(123)->twitter[0]->content a few times then does your mapper fetch the data again from the DB?

Yes. Any time an individual record is fetched, it’s stored in the mapper. Individual mappers can turn this off but it’s something I rarely need to use. Instead I have it so that certain fields (e.g. longblob) can be lazy loaded.

I did play around with caching results from queries that returned multiple results, but I found that in the real world these are only used once 99% of the time, whereas fetching by ID, in addition to being a lot less data to store in memory, is often something that ends up being used multiple times. If I want a product's manufacturer name, it's likely I'll also want that product's manufacturer logo. So in your example above, the user would be cached in memory for the duration of the script.
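
Conceptually that record cache is just an identity map, something along these lines (a simplified sketch with made-up names, not the actual mapper):

// Sketch: findById() keeps each fetched record for the rest of the request,
// so repeated chains like ->user->findById(123) only hit the DB once.
class CachingMapper
{
    private array $identityMap = [];

    public function __construct(private PDO $pdo, private string $table) {}

    public function findById(int|string $id): ?object
    {
        if (!array_key_exists($id, $this->identityMap)) {
            $stmt = $this->pdo->prepare('SELECT * FROM `' . $this->table . '` WHERE id = ?');
            $stmt->execute([$id]);
            $this->identityMap[$id] = $stmt->fetch(PDO::FETCH_OBJ) ?: null;
        }
        return $this->identityMap[$id];
    }
}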

Just wanted to point this out, as this is where you can gain significant performance improvements. If you ever deal with a SQL Server or MySQL server or whatever database that is installed on a separate server or controlled/monitored by third parties, you can get significant performance savings by only needing to go to that server once. It is important to consider the fact that you will have network latency in your connections and fetching of results; doing that repeatedly is going to add up in cost.

Let me put that into a real-life example. I worked on a project where the average page load time was 5-7 seconds (not terrible for the amount of data they were loading, but still slow). Ultimately, they were performing 600+ queries/fetches for the data needed for the page. Changing those 600 queries (it was in a loop) down to 2 queries dropped the entire page load time to under 1 second. Their SQL Server was located on a separate server, so they did have network latency to consider, but they never realized that as their product grew, the loop was going to run more and more queries.

The problem I see is that some ORMs could have this effect if not closely watched, especially with remote sources. I'm not saying ORMs are bad, just do your due diligence and test them in a variety of situations before going all in on one.

Of course 600+ queries is too many; at the same time, and equally anecdotally, I've fixed a very slow page by breaking it up into separate queries. It was a reporting query dealing with a hell of a lot of data, joining around 15 tables, sorting and grouping, and it was taking around 45 seconds to run. In both our examples, the problem was the original developer neglecting to think about consequences.

However, slow single queries are substantially worse than a page that is slow because it's executing too many queries. That reporting query that took 45+ seconds to run (actually that's generous, some of them were more than double that!) joined a table that stored questionnaire answers. Questionnaires were being filled in constantly (probably a write would happen to one of the answers tables every 5-10 seconds). When this happened, because the SELECT was still running when the UPDATE/INSERT was issued, the answers table got locked and suddenly nobody across the entire site could even view a questionnaire or any kind of results page - all as the result of someone viewing the results of a questionnaire from the past! Joyous! Once the reporting pages were changed to use separate queries, they still took a while to run (significantly less though, roughly half the time), but whether they were running or not, it never broke functionality across other parts of the site.

How many queries did you run? You should have run 1 query to fetch the products and one for the manufacturer of each product, the result being N+1 queries. Imagine you want to sort the initial product list: add ORDER BY product.name to that, or ORDER BY RAND(), and your single query will be immensely slower because it's sorting the entire data set. Move the sort to shuffle($products) in PHP and it's suddenly much faster. This is scaled by the number of tables you are joining.

What DB engine was in use at the time? Assuming it was a MySQL server, my gut feeling is that it was a MyISAM table. The table(s) could be migrated over to InnoDB, but you'd have to keep in mind that InnoDB has no support for full-text search in versions of MySQL prior to 5.6.4; it has only been available from 5.6.4 onwards (http://blogs.innodb.com/wp/2011/12/innodb-full-text-search-in-mysql-5-6-4/).

In this thread various methods of getting the results from a large result set in a random order were compared, and the most efficient turned out to be the one where all the sorting was offloaded to the language being used to process the dataset (PHP).
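
That approach boils down to something like this (just a sketch of the general idea, not the code from that thread; $pdo is an existing PDO connection):

// Sketch: avoid ORDER BY RAND() over a big table by shuffling ids in PHP
// and then fetching only the rows that were actually picked.
$ids = $pdo->query('SELECT id FROM products WHERE hidden = 0')->fetchAll(PDO::FETCH_COLUMN);
shuffle($ids);                              // randomise in PHP, not in MySQL
$picked = array_slice($ids, 0, 20);

$placeholders = implode(',', array_fill(0, count($picked), '?'));
$stmt = $pdo->prepare("SELECT * FROM products WHERE id IN ($placeholders)");
$stmt->execute($picked);
$randomProducts = $stmt->fetchAll(PDO::FETCH_ASSOC);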


SELECT p.*, m.*
FROM products p
LEFT JOIN manufacturer m
    ON m.id = p.id
WHERE p.hidden = 0
ORDER BY p.price
LIMIT 1000;

You're selecting all the fields from both tables; how many of those fields do you actually need? You should list only the required ones in the SELECT clause, otherwise you're transferring the data from however many un-needed fields, which is a waste of bandwidth, especially if MySQL and PHP aren't on the same server box.

It would be interesting to hear @r937's take on this thread from a database POV.

Yes, in my benchmark I did N+1 queries, which was 1001 queries. The 1001 simple queries were 2.5x slower than a single LEFT JOIN fetching the same data from two tables. This was on my local installation, so there were no network latencies.

Yes, ordering by product name was slightly slower, but that's understandable because it's a text field and not a number (price). In this case the 1001 separate queries were 2x slower than the 1 LEFT JOIN. Still, the JOIN was much better.

Okay, just out of curiosity I added another LEFT JOIN that fetched the product's availability - a row from a small lookup table joined by a numeric availability_id:


SELECT p.*, m.*, a.*
FROM products p
LEFT JOIN manufacturer m
    ON m.id = p.id
LEFT JOIN availability a
    ON a.availability_id = p.availability_id
WHERE p.hidden = 0
ORDER BY p.name
LIMIT 1000;

The time for the single JOIN query became very marginally slower (almost the same as with 1 JOIN), but fetching the same data in separate queries took much longer - understandable, since now there are 2001 queries to run. This time the single query performed 3.5x faster than the separate queries.

Okay, your scaling argument holds true for INNER JOINs, though. But still the single query was 2x faster. However, with a little bit of tweaking I was able to run the INNER JOINs as fast as LEFT JOINs - by adding an index to product.name and adding FORCE INDEX(prod_name) to the query.

I don't see that joins are as slow as you make them out to be. Sure, there will be cases where a join will be inferior, but it's not that often, and when that happens there's still room for improvement by optimising, adding indexes, etc. And when a large query takes too much time it's often enough to split 1 or 2 joins off into separate queries to get rid of the performance problems. I'm not trying to be argumentative to prove joins are the way to go - I was curious myself how they perform, that's why I did the above benchmarks. Another thing - recent versions of MySQL have had many performance improvements, and the same joins you found slow many years ago might run much faster now.

It's very subjective where premature optimisation starts. Personally I prefer not to optimise too much, but still to keep performance in mind when developing a web site. In other words, use the convenience of objects whenever possible but still try to implement solutions that will perform fairly fast. Depending on the project, the flexibility of being able to replace the whole data source may not be needed at all.

If you develop the whole system using a perfectly object-oriented approach with an ORM or a mapper for data access, then how are you going to optimise it if you stumble across a problem like cpradio described - the network latency is too high and you need to reduce the number of queries to 1 or 2? Your mapper can't do it because it doesn't have enough knowledge for that, so you need to go out of your way and introduce ugly hacks, giving up the consistency and flexibility that your objects provide - because now you need to run a single big query instead of hundreds of small ones. If the system is prepared for this from the ground up then it's a very easy task.

It looks like you have one kind of flexibility at the expense of another - the flexibility to optimise database access. There's nothing wrong with that, but it only proves that there's no ideal system that can be 100% flexible in every aspect.

Ok, thanks for the description!

The engine was InnoDB on MySQL 5.5.19.

Yes, ORDER BY RAND() is known to be slow in MySQL. In this case it was not that bad - almost the same as ordering by product.name - but that's probably because the product table is not that big.

Haha, it's funny that you mention that! I am selecting all fields from both of the tables because selecting all fields is necessary when you are using an ORM - you need to fill the objects with all fields to keep them complete, that's the whole point. This is at odds with the optimisation practices you mention above. From my observation, fetching all columns is only marginally less performant, unless you fetch some big texts or blobs unnecessarily. But I had to benchmark selecting all fields for the results to be relevant to the topic we are discussing.

r937 is welcome to comment; it would be interesting to hear what he thinks about running tens or hundreds of separate queries instead of joins, not worrying about what queries your system runs under the hood, and the like… :). Mapping DB relations to objects is not easy and always has some drawbacks - some people claim it's not even possible, or doesn't make sense, because the two paradigms are so different!

The time for the single JOIN query became very marginally slower (almost the same as with 1 JOIN), but fetching the same data in separate queries took much longer - understandable, since now there are 2001 queries to run. This time the single query performed 3.5x faster than the separate queries.

Okay, your scaling argument holds true for INNER JOINs, though. But still the single query was 2x faster. However, with a little bit of tweaking I was able to run the INNER JOINs as fast as LEFT JOINs - by adding an index to product.name and adding FORCE INDEX(prod_name) to the query.

And this is always going to be the case. In your 1001 queries vs 1 query example, you're comparing apples and oranges, because you need to account for the time PHP takes to populate the object graph. Add caching to the multiple queries so that each manufacturer is only loaded once and the number of queries will be (number of distinct manufacturers used by those products)+1, which in a lot of real-life scenarios will be significantly lower than n+1. You'll probably find the database does something similar internally when you run your query with the join. With the addition of prepared statements you'll find the difference in speed even smaller (and make sure they're really prepared, not emulated via PDO::ATTR_EMULATE_PREPARES).
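
To make that concrete, a rough sketch of "prepared once, cached per manufacturer" (connection details and column names are assumed for the example; $products is the list of product rows fetched earlier):

// Sketch: real (non-emulated) prepared statement, prepared once, plus a small
// cache so each distinct manufacturer is only fetched from the DB once.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass', [
    PDO::ATTR_EMULATE_PREPARES => false,   // make sure the driver really prepares
    PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare('SELECT * FROM manufacturer WHERE id = ?');
$manufacturers = [];

foreach ($products as $i => $product) {
    $mid = $product['manufacturer_id'];
    if (!isset($manufacturers[$mid])) {
        $stmt->execute([$mid]);            // re-executed, never re-parsed
        $manufacturers[$mid] = $stmt->fetch(PDO::FETCH_ASSOC);
    }
    $products[$i]['manufacturer'] = $manufacturers[$mid];
}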

More importantly, with separate queries data is only loaded when it’s needed. So although one page may be marginally slower because it does more queries, another may be faster because it never needs to load manufacturer info for a product. It depends how efficient your ORM is at only fetching the exact data it needs. Mine will only ever fetch records it’s going to use because it’s all done JIT.

I don’t see that joins are as slow as you’re making them to be.

I never said they were slow; I said that there are enough cases where they are, and where they are faster it's not enough of a difference to sacrifice the flexibility gained by allowing data from multiple sources.

the network latency is too high and you need to reduce the number of queries to 1 or 2? Your mapper can’t do it because it hasn’t enough knowledge for that, then you need to get out of your way and introduce ugly hacks giving up consistency and flexibility that your objects provide - because now you need to run a single big query instead of hundreds of small ones. If the system is prepared for this from the ground up then it’s very easy task.

Actually, the mapper has all the information it needs and would be able to generate queries which fetched an entire object graph. My first ORM did exactly this and fetched data only when it got to an endpoint where data was actually used. It worked well and allowed $product->manufacturer->logo type chaining, which eventually ran a query that joined products/manufacturers, but it simply wasn't flexible enough.

i have nothing to contribute, sorry

i don’t do php, i have no idea what classes are, and object-oriented anything is way over my head

@r937, I think they were more interested in your thoughts on using p.*, m.* in a query instead of specifically stating which columns you want (from a performance/paging perspective).

oh, you mean the dreaded, evil “select star”?

:smiley:

Yup :smiley: