Help: "not equals !=" is very slow

peppy · April 8, 2014, 7:23pm

Greetings,

I have a script where I would like to show 5 random items from a particular seller, and 5 random items from other sellers:

$sql = "SELECT * FROM items WHERE seller = 'johndoe' LIMIT 5";
$sql = "SELECT * FROM items WHERE seller != 'johndoe' LIMIT 5";

The ‘seller’ column is indexed and I have over 300,000 items.
The first query runs through super quick (0.01 seconds).
The second query takes 10 seconds.

I see that the Not Equals or “!=” does not work on an index. Is there a way to speed this up somehow?

Thanks
Kind regards

Rubble · April 8, 2014, 7:45pm

I think you have a logic fault as your code is not going to show 5 random items as it will stop after the first 5 items it finds.

Do you just have the one table with 300,00 items and all the details in the one table?

DaveMaxwell · April 8, 2014, 7:50pm

peppy:

Greetings,

I have a script where I would like to show 5 random items from a particular seller, and 5 random items from other sellers:
$sql = "SELECT * FROM items WHERE seller = 'johndoe' LIMIT 5";
$sql = "SELECT * FROM items WHERE seller != 'johndoe' LIMIT 5";
The ‘seller’ column is indexed and I have over 300,000 items.
The first query runs through super quick (0.01 seconds).
The second query takes 10 seconds.

I see that the Not Equals or “!=” does not work on an index. Is there a way to speed this up somehow?

Thanks
Kind regards

The reason for the performance difference is the quantity of result sets in the two queries - if there are 500 johndoes there, the first query will process those 500 records to get the first five, and the second will be working with 299,500 records.

I don’t think your 2nd query will do exactly what you want - without any sort of sort order in it, it should return the same five records each time.

peppy · April 8, 2014, 7:51pm

The actual queries are much larger and complex, which includes the randomization part. The only difference between the two queries is the “!=” part on the seller column, which is the difference between 0.01 seconds and 10 seconds.

I just provided the smallest example above for simplicity.

mawburn · April 8, 2014, 7:54pm

As Rubble said, this won’t find random items. Finding random items in SQL is very hard, because you first have to pull all the rows in, assign each row a unique number, then pick out randomly numbered rows, and check . It’s much more than just a simple select statement because this is going way out of bounds of what SQL is really designed for. Even if you have a numeric ID column, it’s still not going to be easy to do in SQL.

You’re probably better off returning all results to a php array, then randomly pulling the array elements by number.

As far as efficiency of the query, you could try NOT EXISTS:


$sql = "SELECT DISTINCT * FROM items WHERE seller = 'johndoe'"; 
$sql = "SELECT DISTINCT * FROM items WHERE NOT EXISTS(" . $sql . ");";

peppy · April 8, 2014, 8:19pm

Is there a fast way to join the table to itself where the seller is not equal to “johndoe” and provide a new table without “johndoe” within it? or will that be just as slow?

peppy · April 8, 2014, 8:52pm

What if I use the “NOT EXISTS”, how do I add conditions to it? for example:

$sql = "SELECT * FROM items WHERE NOT EXISTS (SELECT * FROM items WHERE seller = 'johndoe') AND category = '21'

In words, I would like to select items from other sellers, but they must be contained within specific categories.

mawburn · April 8, 2014, 11:00pm

That is syntactically correct.

peppy · April 8, 2014, 11:18pm

Whenever I do the query, I end up with 0 rows returned. There are multiple sellers with products within this category though.

EDIT:
I used the “EXPLAIN” on this query and got this:

id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
2 SUBQUERY items ref seller seller 92 180206 Using where; Using index

This says the Primary query has: “Impossible WHERE” . I’m not sure why…

mawburn · April 9, 2014, 12:24am

Whoops. My fault. That wasn’t correct. I’ll admit that I have only used NOT EXISTS a few times.

http://sqlfiddle.com/#!2/6d449/1

There’s 3 options for you above. NOT IN, LEFT JOIN, and NOT EXISTS.

CREATE TABLE fish (
    id INT(11) NOT NULL AUTO_INCREMENT,
    name VARCHAR(20) NOT NULL,
    length INT(11) NOT NULL,
    weight INT(11) NOT NULL,

    PRIMARY KEY pk (id)
) ENGINE=InnoDb;

INSERT INTO fish VALUES (1, 'catfish', 7, 22);
INSERT INTO fish VALUES (2, 'pike', 5, 2);
INSERT INTO fish VALUES (3, 'bass', 8, 22);
INSERT INTO fish VALUES (4, 'tuna', 24, 200);
INSERT INTO fish VALUES (5, 'shark', 60, 956);
INSERT INTO fish VALUES (6, 'catfish', 9, 22);


SELECT f.* FROM fish f
WHERE id NOT IN (
                  SELECT id FROM fish f2
                  WHERE f2.name='catfish'
                 )
      AND length > 8;


	
SELECT f.* FROM fish f 
LEFT JOIN fish f2 ON f2.id = f.id AND f2.name = 'catfish'
WHERE f2.id IS NULL
      AND f.length > 8;



SELECT f.* FROM fish f
WHERE NOT EXISTS (
                  SELECT null FROM fish f2 
                  WHERE f2.id = f.id AND f2.name = 'catfish'
                 )
      AND f.length > 8;

peppy · April 9, 2014, 2:37am

Thanks for the providing the options.

Unfortunately, I’ve tried all three methods and it’s still taking just as long to perform the query. Even if I strip out any random generators and use LIMIT, it still takes just as long. You would think MySQL would have something for dealing with this.

peppy · April 9, 2014, 3:08am

Modified Question:

I modified the slow equation that takes 10 seconds and needs improvement:

$sql = "SELECT * FROM items WHERE seller != 'johndoe' AND category = '21' LIMIT 100";

There has to be a faster query: Select any 100 items listed in category ‘21’ from any seller except “johndoe”… The table contains 300,000 items total.

Mittineague · April 9, 2014, 3:24am

I may be wrong, but I have a feeling that the problems you are having may be a result of less than optimal database design.

I’m far from a “normalization” expert, and because I had little experience crafting complex queries, I set up complex tables.
They were OK as long as the amount of data was minimal, but they didn’t scale well.

Short of breaking your table into smaller cross-referenced relational tables, you could try explicity specifying the fields you’re interested in getting instead of using the * wildcard. That might help some.

peppy · April 9, 2014, 3:31am

Unfortunately, this doesn’t make too much of a difference on my end. I’ve simply selected the “id” column and this takes 10 seconds or so.

Mittineague · April 9, 2014, 4:21am

Can you use seller_id = and seller_id != (presumed to be an Int) instead of searching for the text (presumed to be Varchar) ?
AFAIK text queries are more resource intensive than int queries.

Rubble · April 9, 2014, 7:24am

You could run the search code say every night at midnight and generate an array just with the row ID - this would negate any time taken - the array will be written to a file on the server.
When the user visits the page 5 random values are taken from the two arrays, the data is fetched from the database and displayed.

John_Betong · April 9, 2014, 8:07am

peppy:

Greetings,

I have a script where I would like to show 5 random items from a particular seller, and 5 random items from other sellers:
$sql = "SELECT * FROM items WHERE seller = 'johndoe' LIMIT 5";
$sql = "SELECT * FROM items WHERE seller != 'johndoe' LIMIT 5";
The ‘seller’ column is indexed and I have over 300,000 items.
The first query runs through super quick (0.01 seconds).
The second query takes 10 seconds.

I see that the Not Equals or “!=” does not work on an index. Is there a way to speed this up somehow?

Thanks
Kind regards

Try this:



"SELECT * FROM items WHERE seller  > 'johndoe'  LIMIT 5";

mawburn · April 9, 2014, 1:28pm

Well, remember he’s also doing some randomization in his SQL statement and isn’t giving us his fully query. I think there is something else going on. Like I said earlier, doing the random selection in SQL is probably the least preferred method.

peppy · April 9, 2014, 6:36pm

I modified the question a bit to remove any randomization. All I want is to get this simple function to work in a fast manner:

$sql = "SELECT * FROM items WHERE seller != 'johndoe' AND category = '21' LIMIT 100";

I tried this and unfortunately there is no difference. I also tried this and it doesn’t make a differnce either:

$sql = "SELECT * FROM items WHERE seller <> 'johndoe' AND category = '21' LIMIT 100";

This could work, although our site has over 10,000 categories and subcategories with many items within them. I’m guessing this would end up being impractical.

There must be another way, I know I’m not the only one on the internet who shows a list of “related items from other sellers” on their sites.

TheRedDevil · April 9, 2014, 7:40pm

You most probably have a database model problem and possible index problems as well.

First off, are you using composite keys? If, why is it “seller, category”?

If you add a composite key that is: “category, seller” and then update your query to equal this:

$sql = "SELECT * FROM items WHERE category = '21' AND seller<>'johndoe' LIMIT 100";

I am certain you will get a massive speed increase, since now we first limit the query to any records in category 21 that is not made by seller johndoe, instead of limiting the query to any records not made by seller johndoe that is also in category 21.

However, the problem you will see when this works is that you will always pull the same 100 records (unless some is deleted) i.e. ASC order. In addition the RAND() function in SQL should be avoided like the plague since it quite honestly is a “DB killer” on larger tables.

Instead what you can do is this:

$range_from = mt_rand(1,$total)
$sql = "SELECT * FROM items WHERE category = '21' AND seller<>'johndoe' LIMIT $range_from, 100";

However, please note that LIMIT X,Y is also very slow. The reason for this is due to if you have a table with a million records and write: LIMIT 997000, 100 the SQL engine actually read all 997000 records to get to the ones you want to pull. This means if you want to speed this up more than the suggested query above, you need to alter your database models with this in mind, allowing you to base queries on primary keys, and being certain that you can actually use it internally in the software after.