llnitoll — 2013-03-03T14:55:47-05:00 — #1
Duplicate millions of username to faster search or use LOWER queries? Or in other words how slow is LOWER(username)=LOWER('$username').
I guess use LOWER is the way but I'm concerned about the performance of using LOWER.
I'm using utf8_bin, so my MySQL is case sensitive, I should stop using it, at least for that table?
r937 — 2013-03-03T16:01:11-05:00 — #2
if you have an index on that column, it will be ignored because your query applies a function to it
stop using case sensitivity for that column, is my suggestion
alternatively, go through (once) and fix all values in that column to lower case, and then never ever put anything into that column again unless you have first converted to lower case in php, and then your case sensitivie column can use the index
llnitoll — 2013-03-03T18:55:51-05:00 — #3
Now I'm doing the "duplicated" way. I start doing it because phpbb forums work like that for usernames, so I think it must be good.
Is just I'm trying to remove all duplicated data I can, but this is a critical column, I can't have two users with the same name, so the collation I use must be 100% accurate, when searching if a username alredy exist.
I read utf8_unicode_ci has good accuracy, but I'm afraid is not as accurate as utf8_bin is.
r937 — 2013-03-03T19:24:51-05:00 — #4
i'm sorry, i have no idea what this means, and i don't know if i can help you any further
llnitoll — 2013-03-03T22:35:21-05:00 — #5
Perhaps I'm confused, reading about the difference of unicode and general:
Accuracy of sorting:
utf8_unicode_ci is based on the Unicode standard for sorting, and sorts accurately in a wide range of languages.
utf8_general_ci comes very close to correct Unicode sorting in many languages, but has a number of inaccuracies in some languages.
The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.
Perhaps this has nothing to do when searching matches :-/
llnitoll — 2013-03-04T14:55:47-05:00 — #6
Here is more clear:
It's MySQL trying to be confusing.
The difference lies in comparisions. If you did
CODE: SELECT ALL
SELECT username FROM users WHERE username < 'Cheese';
It might return different results based on the encoding. For example, utf8_unicode_ci changes the german "double-s" character (ß) into "ss" for comparison. But the bin variation uses the strings "as they are" with their binary value.
So a better example would be these two queries:
CODE: SELECT ALL
SELECT username FROM users WHERE username < 'ß';
SELECT username FROM users WHERE username < 'ss';
On the utf8_unicode_ci, they should give the same result. They should not give the same result in utf8_bin -- assuming that your colletion of usernames is large enough.
I think this does not means that utf8_unicode_ci is less accurate (well it is but it finds all matches that have to) when comparing. Then I read this:
The utf8_bin collation compares strings based purely on their Unicode codepoint values. If all of the codepoints have the same values, then the strings are equal. However, this falls apart when you have strings with different composition for combining marks (composed vs. decomposed) or characters that are canonically equivalent but don't have the same codepoint value. In some cases, using utf8_bin will result in strings not matching when you expect them to. Theoretically, utf8_bin is the fastest because no Unicode normalisation is applied to the strings, but it may not be what you want.
If I undestand it right utf8_unicode_ci will find even more matches than utf8_bin, with utf8_bin two usernames that look the same, will be different for the bin variation so they can coexist.
I'm gonna do some tests with utf8_unicode_ci, to see if it works correctly, and how slow it is.