SQL and RDBMS Database Design DO's and DON'Ts

Since this is a general do’s and don’ts thread, I throw in one regarding a dimensionally modeled reporting enviornment, replacated from the source system.

When building a star or snowflake schema in a dimensional model for reporting purposes, DO NOT reuse the surragate keys from the source system as the join path from the fact tables to the dimensions. If something about a record changes in the source system, it’s the job of the reporting enviornment to capture the record as it was originally, as well as what it was updated to. This will be impossible if you reuse the source system’s surragate keys.

wow… some heated debate.
Personally when developing apps using mysql, i’ve found myself almost always using auto increment columns, the only exception being when there is a good single natural key column of a numeric or short fixed length data type. I always add an alternate key for the natural key of the table though (if i didnt use it as th pk). Something like last name, first name, phone number i would never consider using as a pk… not only would it be extremely slow to join but the coding would be quite a bit more complicated. Also, update cascade doesnt work in all dbms’s (ORACLE!!!) and its annoying to have to write triggers in every case. So far i’ve found it easiest to code with auto increment’s, and i’ll stick to it.
Web forms typically have things like checkboxes and select dropdowns, if you need to get at a row its nice to have a single value key so you don’t have to do extra work parsing the data after the fact.

Remember queries which use wild cards such as SELECT * WHERE myCharColumn LIKE ‘%anything’ will always result in a table scan, so only use leading wild cards when absolutely necessary.
Are you guys sure using * always leads to a table scan? What if a developer needs to return every column and is sure of it?

not what he said :slight_smile:

LIKE ‘%anything’ gives a table scan

LIKE ‘anything%’ doesn’t

the dreaded, evil “select star” should never be used, full stop

actually, there are a few instances where it can be used, but these are obscure outliers

It depends upon the WHERE condition, and whether an index can be used.

In the case of the LIKE operator, an index can be used if the wildcard (%) appears at the end of the character string. Why? Because the index is sorted in ascending or descending order. The RDBMS can then scan the index (which has a hierarchical tree design), looking for a match for the first few characters of the string. It can quickly find the index pages that it needs to search, because of the index structure. It will only search the index pages that qualify and will only access the data pages that are needed.

However, if the wildcard is at the beginning of the string, then there is no way for the the optimizer to know how many qualifying occurrences exist in the database (no statistics, therefore cannot use the index to refine the search), it chooses to gut it out and simply scan the whole table, as its only possible strategy which it defaults to. Different optimizers work differently, and this explanation only applies to relational databases. Obviously, text search databases have entirely different strategies for parsing queries and scanning their databases.

Rich

rich, read the question again :slight_smile:

What if a developer needs to return every column and is sure of it?

he’s asking if SELECT * requires a table scan

Yes, it someone wants to return every row and every column in that row then it will require a table scan. I thought, by your answer, that he was asking the question relative to the LIKE operator and how it uses indexes when there are wildcards involved.

Rich

Ok looks like I’m not being clear. Let me do this by example:

Suppose a table has 5 columns, a/b/c/d/e.

There are two ways to do a query:

  1. “select * from table where b=5”

  2. “select a,b,c,d,e from table where b=5”

Which will execute faster? Will there be a performance difference?

i believe i understood you the first time :wink:

  1. will always be as fast or faster than 1.

you should never use 1.

If this is true, that there is a perceptible performance difference between these two statements, then someone has really written a poor optimizer somewhere, and * should be removed from the syntax.

Here is a good article on how a sargable predicate should be analyzed, and in practice a good optimizer should make the syntax of the SQL transparent to the RDBMS engine. In fact, that is the whole purpose of the optimized and the relational language. Whether or not MySQL has a good optimizer or not, I do not know from practical experience. However, I do have practical experience with many other RDBMS optimizers and using a * should be acceptable, especially if end-users are going to access the RDBMS.

http://www-128.ibm.com/developerworks/db2/library/techarticle/0303kuznetsov/0303kuznetsov.html

Rich

No.

The reason why select * is slightly slower then select a,b,c is exactly because the optimizer has to work out what * actually means.

Yes, I did say perceptible. My guess is that it is totally insignifcant since the table defintions should be cached as should the SQL parser logic. But if someone has some hard numbers, I would be very interested. I would estimate the difference can be measured in nanoseconds. If someone really believes there is a meaningful difference, then I will go over to the MySQL support forum to verify. I would be pretty surprised indeed.

Rich

asterix!! where have you been, man, what a long hiatus!!! nice to see you back again

actually, i believe the optimizer also checks each column name when you write SELECT a,b,c,…, because otherwise you could write SELECT a,x,c,… where x is not a column in the table

that the dreaded, evil “select star” should be removed from the syntax is wishful thinking, that’s not gonna happen any sooner than microsoft gets rid of support for =* and *= or oracle gets rid of support for (+)

meanwhile, we can all do our part and advise anyone and everyone, at each opportunity, not to fall victim to the dreaded, evil “select star”

:slight_smile:

There is certainly nothing at all evil about a *, however, for logical data independence, there may be reasons to enumerate the columns as opposed to using the *. However, this has nothing to do with performance.

Here is one good article that explains the issues fairly well:

As a matter of good programming practice, I think it is well advised to enumerate columns, but if you need all the columns, there is no difference in performance. I think that this article is a bit over the top in calling the * “evil”, and note that in the context of the question that is being responded to in this thread, the article specifically states:

“That said, if you do actually need all columns, then SELECT * is fine, provided that you never change the table. Or the query.”

Rich

thank you, thank you very much

i think i should change that picture, i’m definitely a lot older than that now :wink:

i’m gonna stand behind this as strongly as ever

IF you do actually need all columns, and
IF you never change the table, and
IF you never change the query…

… then i guess it’s okay :cool:

Hi Rudi, yeah, I took a break from spf, it was getting a bit, hmm, time consuming :slight_smile:

Rich: well, I think that (for me personally) the problem with select * is that you are not controlling what happens when you join (many) tables:

select * from a
left join a using aid
left join b using bid
left join c using cid

gives you absolutely no idea about how many columns will be returned. It could be thousands, when all you really wanted were two or three. I have seen so many errors in applications because the developers used “select *”, and then later added a column or joined a table.

So I tend to agree that select * is not so much a performance issue (usually) as it is a quality issue. If we don’t care what columns the tables hold then really we should be asking if we need to use an RDBMS at all. Just some big object tree dumped as XML would probably be as useful and conceptually less messy :slight_smile:

Good article. :slight_smile: This issue is a good one to understand, but I think it should be better couched as a “logical data independence” issue, as opposed to a performance issue. Certainly, for ad-hoc user queries, there is little reason to be concerned, and therefore will probably remain in the syntax for very good reason. If an end-user or developer wants to see all of the columns in a table (as I often do), why have to enumerate every column?

Rich

Better: if you want to list all the columns, why enumerate them using something horrible like

select * from mytable limit 1

?

Wouldn’t it be better to do something like

show columns from mytable

(sort of assuming this is a mysql talk…)

Sorry, what I meant to say was to show all data in all columns. Clearly, to show just the column names, you would have to either interrogate the catalog or use a catalog command command as you suggested.

I wonder if we have answered the question yet? :slight_smile:

Rich

I think you miss the point. The question was what query would guarantee that all data in all columns would be returned. The answer is SELECT * FROM table_name. This works, even if the underlying table structure has changed (e.g. more columns have been added), and the developer was not aware of this. The only other way to do this, would be to first read the catalog, get the column names, and then dynamically build the query based upon the result set.

In regard to your other statement that 2 will always be as fast or faster than 1, this may or may not be true depending upon whether the program first has to read the catalog (as many ad hoc query tools do), in order to satisfy the SELECT * FROM table_name requirement.

Rich