Won’t this query result in the entire database having to be scanned each time?
I’d do this by storing a count column in the database table (so the column is an int that holds the number of characters in the article), and then query according to that.
Agreed, that would be the optimum way of handling it so a table scan doesn’t occur. Or removing testimonials under 50 characters and removing the WHERE clause.
This has to do with how mysql will treat your query. See, when you do the LEN() query, it will work the way you want (in that it produces the expected results), but you see, mysql has no way of knowing how many characters are in each testimonial without first checking each row in the database to see how many characters are there. This means for every single row, it will read the entire thing and do a count on the number of characters, so if you had 1000 testimonials, each time you do this query, it will have to scan all 1000 rows.
If instead of doing this, you cache the results of the number of characters and store this in a new column, mysql can tell by querying this number, which rows to return, without having to scan all of them. So say you have 1000 rows and say 50 of them should be turned, mysql will work out instantly which rows they are by querying this number, which has already done the work of the LEN() method in the original query. So in this instance, you would have 1000 rows, but mysql won’t have to scan them all and will simply return 50 without going through every row.
In order to do this, you will need to insert the cached number in the new column (call it something like number_of_characters or something like that) each time you insert a testimonial in the database. For the existing records you have, you will have to write a one-off query to go through each row, do a count and insert the value, but this will only need to be done once.
Basically the job of doing the count is already done with query count column, so mysql can use that data to do the comparison, whereas the LEN() method requires both, meaning each row has to be scanned every time the query runs.
No, if you do the len() method in the query, it will do the full table scan again. If you just use the second query (WHERE character_count_column > 7), then you solve the problem.
Basically, you are correct, but just thought I would explain a table scan a bit more.
Whenever you use a function around a column name in your WHERE clause, GROUP BY, or HAVING statements you will invoke a table scan. Depending on the rest of your WHERE clause, GROUP BY, or HAVING statements it may have to look at EVERY row in your table or it may have been able to filter it a bit further before scanning the rest of the rows.
Since this would be the ONLY condition to your WHERE clause it would have to scan every row in your table, count the number of characters and then use that in the condition. The additional column will allow your table to use a more optimized routine for getting the data you want. As the count is no longer needing to be calculated, that part was already done. So now you can technically index the count column and the database will perform an INDEX SEEK instead of a TABLE SCAN which is much much faster.
Granted with 200 records you won’t see a huge performance loss going either way. Once you start approaching more records, then it will become more apparent.