SQL and RDBMS Database Design DO's and DON'Ts

This is a document I’ve been slowly adding to as I come across things ‘in the wild’. Maybe it can be of some use?

Generic DOs:

  1. Develop a backup strategy before you encounter a catastrophic database failure, and TEST it regularly so you will know what to do when disaster strikes.
  2. Always sanitize user input: escape quotes in string values and reject non-numeric data for number types (better still, use parameterized queries where your database API supports them). (http://www.sitepointforums.com/showthread.php?s=&threadid=60643)
  3. Perform table normalization in your DESIGN phase. It is much easier to change your tables ‘on paper’ than when it is in production use. (http://www.sitepointforums.com/showthread.php?s=&threadid=63582)
  4. Choose datatypes which are logical and fit your model. Designating a US Social Security number as a CHAR( 255 ) is very wasteful since they will not exceed 11 characters (XXX-XX-XXXX). CHAR( 11 ) is a perfect match.
  5. Run every query through your RDBMS’s query-tuning tool and ensure that the correct indexes are being used and that a table scan or Cartesian product (in a join) is not occurring when you do not want it to. MS SQL, <snip/> MySQL, Sybase (http://dbforums.com/showthread.php?s=&postid=212692#post212692)
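Point 2 above is nowadays best handled with parameterized queries rather than hand-stripping characters. A minimal sketch using Python’s sqlite3 module (the users table and its columns are invented for illustration):

```python
import sqlite3

# Instead of stripping quotes by hand, let the driver bind values safely
# via placeholders. Table and column names here are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, ssn CHAR(11))")

unsafe_input = "O'Brien; DROP TABLE users"  # would break naive string-building

# The ? placeholder sends the value out-of-band, so quotes are harmless.
conn.execute("INSERT INTO users (username, ssn) VALUES (?, ?)",
             (unsafe_input, "123-45-6789"))

row = conn.execute("SELECT username FROM users WHERE username = ?",
                   (unsafe_input,)).fetchone()
result = row[0]
```

Because bound values never become part of the SQL text, quotes in the input cannot break out of the statement.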

Generic DON’Ts:

  1. Do not fall into the trap of what I like to call ‘Auto_Increment Induced Insanity’. Not every table needs a sequential integer as the primary key. Design your primary key logically based upon your data, and if need be then consider a sequential integer if traditional primary key design fails.
  2. Do not try to join mismatched datatypes. For instance, if you have an INT as a primary key in one table and a CHAR( 10 ) as the foreign key in another, problems will occur when you try to join the two columns, usually manifesting as a table scan on one of the mismatched tables. Or to put it another way – ensure datatypes match across tables.
  3. Remember that queries which use leading wildcards, such as SELECT * FROM myTable WHERE myCharColumn LIKE ‘%anything’, will always result in a table scan, so only use leading wildcards when absolutely necessary.
  4. Don’t create redundant indexes. In most, if not all, RDBMSs, designating something as a PRIMARY KEY also creates a unique index on the column(s), so an additional index on the same column(s) is redundant and a waste of space. Likewise, a composite index on Col1, Col2 will be used for queries of the type WHERE Col1 = 34 AND Col2 = ‘Something’, and also for queries such as WHERE Col1 = 123, so a separate index on Col1 alone is redundant.
  5. Don’t use the $dbms_seek() functions in PHP to simulate pagination of a result set. Use LIMIT in MySQL, TOP n in MS SQL, etc. to achieve true pagination.
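Point 5 can be sketched like this, assuming a Python/SQLite setup (table and data are invented; other RDBMSs use TOP n or similar instead of LIMIT/OFFSET):

```python
import sqlite3

# True server-side pagination: fetch only the page you need with
# LIMIT/OFFSET rather than seeking through a full result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts (title) VALUES (?)",
                 [("post %d" % i,) for i in range(1, 101)])

page_size, page = 10, 3  # third page of ten rows
rows = conn.execute(
    "SELECT post_id FROM posts ORDER BY post_id LIMIT ? OFFSET ?",
    (page_size, (page - 1) * page_size)).fetchall()
page_ids = [r[0] for r in rows]
```

Only ten rows ever leave the database, instead of the whole table being fetched and skipped through on the application side.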

MySQL Specific DON’Ts:

  1. Do not enclose numeric values in quotes. This is very non-standard and ONLY works on MySQL. For example, WHERE someIntegerColumn = ‘1’;. This also pertains to integer values in CREATE TABLE statements such as CREATE TABLE bob ( bobID INT DEFAULT ‘0’ ) <– bad. DEFAULT 0 is the correct method.
  2. <snip/> Do not use the INT( M ) syntax unless you are using the ZEROFILL MySQL proprietary SQL extension.
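The DEFAULT 0 point can be checked quickly; this sketch uses Python’s sqlite3 module and the bob table from the example above:

```python
import sqlite3

# Declare numeric defaults as numbers, not quoted strings. SQLite is used
# here only to demonstrate; 'bob'/'bobID' follow the example in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bob (name TEXT, bobID INT DEFAULT 0)")
conn.execute("INSERT INTO bob (name) VALUES ('test')")
default_value = conn.execute("SELECT bobID FROM bob").fetchone()[0]
```

With an unquoted DEFAULT 0 the column comes back as a genuine integer, not a string that happens to look like one.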

edit: added another Don’t

Originally posted by MattR

Generic DON’Ts:

  1. Do not fall into the trap of what I like to call ‘Auto_Increment Induced Insanity’. Not every table needs a sequential integer as the primary key. Design your primary key logically based upon your data, and if need be then consider a sequential integer if traditional primary key design fails.

I guess for people programming smaller systems this would be the case, but if you program anything that has potential to be somewhat large, then I think using auto incremented ints for primary keys should be the default action. The reason I say this is because of their size.

For instance, an average username of 10 characters is 10 bytes, while a nice int is only 4 bytes (MS SQL Server). If the users table was the only place that used the key, then it’s no big deal (even with hundreds of thousands of users, the size wasted wouldn’t be a big deal). However, the nature of the primary key is to use it to link to other data in other tables, which means that the extra 6 bytes end up in just about every record (for a nice user/content management system, forum, or something along those lines). 10 million records means 60 megs of extra storage (with 120 gig drives selling for $100, this might not be a big deal, but why waste space if you don’t need to?). The other reason you might want to use integers instead of strings is that most servers can perform operations on integers much faster than on strings. Again, for a small system no big deal, but for a larger system it might be something to keep in mind.

Just thought I’d share my two cents,
Goof

Goof, that makes sense and is often how I design. It wasn’t what I had in mind with that comment, though.

Something along the lines of this I have seen a lot:


CREATE TABLE relationship(
  relationship_id INT AUTO_INCREMENT NOT NULL PRIMARY KEY,
  table1_id       INT                NOT NULL,
  table2_id       INT                NOT NULL
) 

Here there is no need for the relationship_id tacked on there.
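A sketch of the same table with the two FKs promoted to a composite primary key (Python/SQLite is used only to demonstrate; the table is the one from the example above):

```python
import sqlite3

# The two foreign keys serve as a composite primary key, so the surrogate
# relationship_id is unnecessary and uniqueness of each pair is enforced
# by the key itself.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE relationship (
        table1_id INT NOT NULL,
        table2_id INT NOT NULL,
        PRIMARY KEY (table1_id, table2_id)
    )""")
conn.execute("INSERT INTO relationship VALUES (1, 2)")
try:
    conn.execute("INSERT INTO relationship VALUES (1, 2)")  # duplicate pair
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With the surrogate relationship_id version, the same duplicate pair would happily be accepted unless you remembered to add a separate UNIQUE constraint on (table1_id, table2_id).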

For a user’s table the userid is nice as well since you don’t have to worry about the user editing his or her username and then being forced to cascade it throughout many tables (big performance hit).

However ‘by default’ is not a good habit to get into. What would you do if you had a state table, or a country table?

Hi Matt, thanks for your great article, but I have to admit that I don’t entirely agree with you here.

What you are saying is that the FKs here are used in an identifying relationship with the master tables, and that using the two FKs as the PK is the ideal way since it avoids an essentially redundant auto_increment PK.

Your assumption is that the parties to an n:m relationship will never change, since if they do there is no alternative but to delete the PK and add the two FKs, one of which has changed, as the replacement PK.

Now this may be true most of the time, but some times it is necessary to update a n:m relationship. For example, when one company replaces another company in the n part, on the m side you might have companies who are the customers. If you can reference the tuple using a single ID value, you can change the FK references without destroying the semantics of your schema.

So I would suggest keeping it simple. For this reason I long ago stopped worrying about compound PKs and decided to add an ID (long integer) as the PK to every entity. It certainly makes OO coding easier, every Data Object (Value Object for Java folks) can be instantiated and persisted with knowledge of a single ID. This keeps the coding simple and consistent for every persistable object.

Just my tuppence.
Richard

without destroying the semantics? huh?

even if your database does not support ON UPDATE CASCADE you can still accomplish the same yourself – just update the FK, which also happens to be part of the PK

i don’t understand how you would even get around to referencing the surrogate id this case, because in order to find the rows you need to update the keys of, you have to follow a join from one of the company tables, and this join fer shure ain’t gonna be using the surrogate id

the only reason for wanting a surrogate PK in a many-to-many table that i can think of is when that table is the parent of another table and the child table wants, for some reason, to minimize the size of its FK to the parent

for example,

team_player ( teamid, playerid )
team_player_roster ( teamid, playerid, startdate, enddate )

here you could use a surrogate id for the team_player table, and thus also for the team_player_roster table’s FK (and thereby remove both teamid and playerid from the team_player_roster table), but there’s a major drawback to this design – now you cannot search just the team_player_roster table by itself to find out which players have played more than once for the cubbies (teamid=37), without joining to the team_player table on the surrogate key just in order to find teamid=37

all in all i don’t like surrogate keys, especially for many-to-many tables, although i do use them in other situations
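the composite-key point above can be sketched like so (python/sqlite, with invented data; team 37 stands in for the cubbies):

```python
import sqlite3

# Because team_player_roster carries (teamid, playerid) directly, the
# "played more than once for team 37" question needs no join at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team_player (teamid INT, playerid INT,
                              PRIMARY KEY (teamid, playerid));
    CREATE TABLE team_player_roster (teamid INT, playerid INT,
                                     startdate TEXT, enddate TEXT);
    INSERT INTO team_player_roster VALUES
        (37, 1, '2002-04-01', '2002-09-30'),
        (37, 1, '2003-04-01', '2003-09-30'),
        (37, 2, '2003-04-01', '2003-09-30');
""")
repeat_players = [r[0] for r in conn.execute(
    "SELECT playerid FROM team_player_roster WHERE teamid = 37 "
    "GROUP BY playerid HAVING COUNT(*) > 1")]
```

with a surrogate key in team_player, the same question would force a join back to team_player just to translate the surrogate id into teamid=37.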

Nope, won’t work. You are assuming that the new relationship doesn’t already exist, if it does you’ll violate a constraint.

Yep, coming from either side you’ll need to use the FK. But that’s less than challenging for an RDBMS! The point is that when you need to update the PK, you don’t need to search on two fields, since you’ll likely already have retrieved your autoID previously.

I’m not saying that this method is pretty, I’m just saying that programmers can deal with n:m tables in the same way as any other persistent data. A consistent autoID field in every entity makes life a lot easier for programmers, even if it does take your DB from 3NF to 2NF.

Good example. But I prefer a different reason: primary keys should not have any intrinsic meaning (semantics) whatsoever. If you say “the PK is built from the FKs of the two participating entities”, then you could have something like this:

Employee (SocSecNumber, Name…)
ProjectTeam (ProjectName, Customer…)
EmployeeToTeam (SocSecNumber,ProjectName)

Meaning that your PK now contains meaningful information, which violates (IMHO) the basic tenet that a PK’s only purpose is to identify a record. Now the PK carries meaning over and above its original purpose.

OK, if the PK is built from 2 other autoID fields, then the meaning is somewhat reduced, but it nevertheless carries meaning over and above being a means of identifying a record (actually the entire record is the PK).

I guess that this is a topic which we will disagree about all day :slight_smile:

I can see your point: you don’t like surrogate (technical) primary keys, you prefer identifying some existing attribute(s) which will confer uniqueness to your record. The thinking being that this is “pure”, “minimalistic”, “using the data”, and not adding any level of data which is of no use to anyone except maybe some DBAs or programmers.

I’ve been there and done that, spent many painful hours rejuggling a logical schema when a new requirement killed my previous assumptions about what made a good PK way back when.

I give some leeway about using compound PKs on intersection tables. At least until I have to explain to a programming team why their elaborate, custom built code generation tools will fail when modelling n:m relationships using compound keys. Then I take the easy route: I give them an autoID, they are happy, I am happy, the database has suffered by forcing it to manage a further 4 bytes per record.

In summing up, I would like to suggest that issuing generalities “do…, don’t…” whilst certainly being very useful in some cases, can lead to the creation of dogmas, stifling originality at the expense of “received wisdom”.

Cheers,
Richard

hey richard, i agree, let’s agree to disagree

Nope, won’t work. You are assuming that the new relationship doesn’t already exist, if it does you’ll violate a constraint.
this is true whether or not the many-to-many table has a surrogate PK
The point is, that when you need to update the PK, you don’t need to search on two fields, since you’ll likely already have retrieved your autoID previously.
that does not follow – if you need to update one of the two columns in the PK, then just do it: UPDATE M_N_TABLE SET FK2 = new value WHERE FK2 = old value (with the added proviso that the new value does not exist already – but your app takes care of that, right?)
A consistent autoID field in every entity makes life a lot easy for programmers, even if does take your DB from 3NF to 2NF.
i’d love to hear your explanation of how the presence of a surrogate PK changes a table from 3NF to 2NF
…Meaning that your PK now contains meiningful information, it violates (IMHO) the basic tenet that a PK’s meaning is only to identify a record. Now the PK contains meaning over and above its original purpose.
not at all, it has meaningful information only if the PKs that the two FKs refer to do
… (actually the entire record is the PK)
and the problem with that is…??

In summing up, I would like to suggest that issuing generalities “do…, don’t…” whilst certainly being very useful in some cases, can lead to the creation of dogmas, stifling originality at the expense of “received wisdom”.
on this last point, i totally agree

therefore, in keeping with the topic of this thread, let me summarize:

DO use natural keys instead of surrogate keys, but DON’T do it if it will get you into trouble

:wink:

>> DO use natural keys instead of surrogate keys, but DON’T do it if it will get you into trouble

But don’t natural keys violate 3nf? SSN would on the surface seem to be a good natural key, but we all know in reality that is not true. In fact, many developers agree that there is no place for natural keys in any application of serious consequence.

How do natural keys violate 3NF?

developers on mars, maybe

here on earth there are plenty of serious database developers who understand the difference between natural and surrogate keys

any developer who uses surrogates exclusively, by rote or by conviction, is, in my books, not serious

and no, to answer your question, natural keys do not violate 3NF, developers violate 3NF

:wink:

I understand the difference between natural and surrogate keys. I am from the school of thought that a good key is one whose sole purpose is identification of the row. Any other column is inappropriate for unique identification purposes, as it could potentially change. Take the SSN example: not only does the US recycle SSNs after someone dies, but what if they type it in incorrectly? A column that is used as an identifier should never be seen, edited, or heard from by a user of the consuming application. It is an artifact of the RDBMS. Even industry-standard EIN numbers aren’t always used as primary keys in the database. Agree or disagree, but it is not that difficult to always use a surrogate key – either a GUID or an autoincrement – and in my opinion the benefits outweigh the costs.

natural keys do not violate 3NF, developers violate 3NF

Wrong. Developers who choose natural keys run the risk of violating 3NF.

I should clarify, though, that my dogmatic position contrary to the use of natural keys is only my opinion, and only for tables that are updated frequently.

“Natural” keys fulfill this requirement.

Any other column is innapropriate for unique identification purposes, as it could potentially change.

So what?

Wrong. Developers who choose natural keys run the risk of violating 3NF.

You still haven’t explained this. Please do.

yeah, chacha, i’m interested in your theory too

just how does a natural key violate 3nf?

please, do reveal your thought process

oh, and by the way, if you cannot design a data model without surrogate keys, don’t take this personally, but i think you are missing some basic theory

“Natural” keys fulfill this requirement.

No, they don’t, because they were originally created to identify the data in the real world and were generated outside your system. As far as your system is concerned they should have no special meaning, because they exist independently of your schema. Is it possible to find a natural key that will work? Sure – but you are hoping that some unforeseen event never happens.

So what?

So you can’t guarantee your key is immutable – that’s not a big deal? Where did the data for the natural key come from? Users? You? Your boss? If the answer to any of these is yes, then it is untrustworthy data. What if the user typed his/her Social Security number wrong? What if the supplier typed the EIN number wrong – but it was unique at the time of insert? Now, conceptually, you have a problem. You have a unique identifier that is transient.

You still haven’t explained this. Please do.

In my mind, data in a row can’t possibly be dependent on a transient value – which, as I stated earlier, is what a natural key is when the data is not generated inside your system. Once I identify a row of data in my database, I don’t ever want that identifier to change. If I were to script out a SQL query into a text file the moment after insertion that looked like…

select * from Reliable where pk = SCOPE_IDENTITY()

then in 50 years if that row is still there and the USA is dead and gone I want to be able to find it. There is no guarantee that will happen with natural keys. That’s why.

oh, and by the way, if you cannot design a data model without surrogate keys, don’t take this personally, but i think you are missing some basic theory

Sure, I COULD design a data model without using surrogate keys, but why would I want to? Don’t tell me performance, because no query is going to suffer appreciably from adding 1 column. Plus you can still index the other column. It’s this whole stupid DBA bravado thing – like, look at my database, I understand the domain so well that I can use natural keys. Why would you waste your time on such minutiae? It is laughable. Why risk using a natural key that might change in the future when you can use one that will never change – ever?

well, sure, that makes a lot of sense – if you’ve never heard of ON UPDATE CASCADE

i could give you lots of examples of where i don’t waste my time on declaring surrogate keys i don’t need

but let’s agree to disagree on that point

what i want to know is where’s your explanation of how natural keys violate 3nf?
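a sketch of the ON UPDATE CASCADE mechanism, for anyone following along (python/sqlite with foreign keys switched on; the company and orders tables are invented names):

```python
import sqlite3

# Change the natural key in the parent table and the database propagates
# it to every child row automatically.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on
conn.executescript("""
    CREATE TABLE company (company_code TEXT PRIMARY KEY);
    CREATE TABLE orders (
        order_no     INT PRIMARY KEY,
        company_code TEXT REFERENCES company(company_code)
                     ON UPDATE CASCADE);
    INSERT INTO company VALUES ('ACME');
    INSERT INTO orders VALUES (1, 'ACME');
""")
conn.execute("UPDATE company SET company_code = 'ACME-NEW' "
             "WHERE company_code = 'ACME'")
child_value = conn.execute("SELECT company_code FROM orders").fetchone()[0]
```

one UPDATE on the parent, and the FK in orders follows along – no application-side cascading required.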

if you’ve never heard of ON UPDATE CASCADE

I don’t doubt that you can easily cascade the update throughout the database if the natural key’s value has changed. The point is that a key should not be transient. Like I said, the natural key has special meaning outside the context of your database. That special status can be revoked by the powers that be without your consent, rendering your key useless.

i could give you lots of examples of where i don’t waste my time on declaring surrogate keys i don’t need

And for every one I could probably show you a remotely possible event that would render that key not unique.

what i want to know is where’s your explanation of how natural keys violate 3nf?

If a column value is transient then, simply put, the other columns in the row can’t be ‘functionally dependent’ on that column to provide identification for the tuple.

Why insist on using natural keys when the overhead for the alternative is so low, and gives more reliable identification?

Thanks, I will. I don’t have an axe to grind, but I do thank you for resting the discussion. I think this will be instructive for people learning to design data models. They can consider both sides and do as they please. However, I still am waiting to hear why, given the argument I outlined, anyone would ever insist on using a natural key whose value is dependent on a strategy that lies outside your system. That is the fundamental issue at heart. Whether the uniqueness of the key is dependent on something that lies outside your control. Thanks for the discussion.

Be careful about modifying the user data on the fly prior to insert/update in the database. Changing the user’s data without them being aware should be discouraged. If quotes are not allowed, then have the application inform the user rather than silently fixing the data. Or, be sure to put it into the documentation. Most users don’t read documentation so it may be better to provide a confirmation message of how the data was changed.