Foreign Keys and Field Value Population, What's the More Efficient Approach?

GG09 · July 17, 2012, 6:05pm

Hi everyone,

I’m trying to design a database and I want to know what’s the most efficient way to populate fields with referenced values.

I have the following database design:

For the country table, my unique ID is not a number but rather an ISO two-letter country code, i.e. JP, US, DE, etc. This leads to my first question:
Q1) Is this an efficient approach, using a non-numerical ID, or is numerical ID’s always the best?

For the vehicles table, I have the vehicleBrand field referencing the vehicleBrandID field of the vehicleBrands table. When I do this though of course, the dropdown to select what vehicleBrand entry is just a list of the vehicleBrands IDs, like so:

This leads me to my second question:
Is it best to use a numerical ID for internal database operations, and then when you generate some kind of view, i.e. a dropdown, add extra code to just populate those dropdown options with another field from that row

Thanks in advance!

r937 · July 17, 2012, 8:53pm

yes, it is

using a standard 2-character code instead of a meaningless number has several advantages

you will hear people say that indexing a 4-byte integer is better, but it isn’t

even if you use a 2-byte SMALLINT, you’re still faced with the need to do a join just to pull the country name or code

i don’t understand why you would want to do that

by the way, your diagram was quite confusing – is there any chance you could show us the actual CREATE TABLE statements including the actula PK/FK declarations?

p.s. shout out from the t.dot

GG09 · July 17, 2012, 9:16pm

Hey Rudy (making the assumption that’s your name from your site!) thanks for your response, and a shout out back to a fellow Torontonian! I’m really new to any database system; I think your mention of a join from my first question was what I was referring to in my second question, I think… Does a join basically say something like this:

“Oh, you want to refer to record/row with the ID of 123, but use something the more meaningful brand name field for that record/row instead”?

My appologies for the diagram, I was using MySQL Workbench and that’s what the ER diagram was; here’s the SQL statements. Please let me know if my relationships are correct! Thanks

Vehicles Table:
CREATE TABLE IF NOT EXISTS driverr.vehicles (
vehicleID INT NOT NULL ,
vehicleBrand INT NULL ,
PRIMARY KEY (vehicleID) ,
INDEX veicleBrandID (vehicleBrand ASC) ,
CONSTRAINT veicleBrandID
FOREIGN KEY (vehicleBrand )
REFERENCES driverr.vehicleBrands (vehicleBrandID )
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB

vehicleBrands Table:
CREATE TABLE IF NOT EXISTS driverr.vehicleBrands (
vehicleBrandID INT NOT NULL ,
vehicleBrandName VARCHAR(45) NULL ,
vehicleCountryOrigin VARCHAR(2) NOT NULL ,
PRIMARY KEY (vehicleBrandID, vehicleCountryOrigin) ,
INDEX countryID (vehicleCountryOrigin ASC) ,
CONSTRAINT countryID
FOREIGN KEY (vehicleCountryOrigin )
REFERENCES driverr.countries (countryID )
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB

countries Table:
CREATE TABLE IF NOT EXISTS driverr.countries (
countryID VARCHAR(2) NOT NULL ,
countryName VARCHAR(45) NULL ,
PRIMARY KEY (countryID) )
ENGINE = InnoDB

r937 · July 17, 2012, 10:52pm

no, the relationships aren’t right

one good thing about using mysql workbench is that you can “forward engineer” the diagram to produce the actual CREATE TABLE statements, and then – here’s the good part – actually run them to see if they’ll work

i suggest you do this repetitively as you build up your schema, dropping tables and re-creating them often

your vehiclebrands PK is { vehicleBrandID,vehicleCountryOrigin } – why? do you allow the same vehicleBrandID to be registered for multiple countries?

if so, and you want to reference a row in this table using a foreign key in the vehicles table, then your foreign key must also have a vehicleCountryOrigin column as part of the FK

system · July 18, 2012, 4:52am

The one problem I see with using actual information for the IDs is the fact that, in my experience, you should never change values for the ID columns.

In the event a country changes international two-letter country code, you are forced to do that, change the PK value. Using a “mindless” “disconnected” numeric ID saves you from doing that.

And since you are going to join data to find out the country details, numeric IDs are better.

r937 · July 18, 2012, 6:36am

gg, itmitică makes a great point

you will therefore have to decide what the risk is, should a country change its two-letter international code

see http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 which will give you an indication of how often this happens

by the way, in case should you decide to use the 2-letter country code, and one of your countries does happen to change its code, you can easily accommodate this with the ON UPDATE CASCADE option of the foreign key, although some purists will say that this goes against the spirit of primary keys never being allowed to change

now, as to the point about always having to perform a join, this is ~definitely~ true if you record the country code as a number (e.g. country 42)

however, i’m sure there will be instances where you simply want to look at data, say in the vehiclebrands table, for example, where you actually don’t have to perform the join because you instantly recognize what US or UK or CA stands for

so now you have sufficient information to make the choice

system · July 18, 2012, 9:37am

A few more thoughts.

Surrogate keys versus natural primary keys has always been surrounded by controversy.
From personal experience, surrogate keys are the long term answer.

If you’re guaranteed to have unique two-letter code values, why bother making it an ID?
Normal indexing should be enough.

Natural keys fit the scenario when you absolutely control their creation and when they’re “narrow” and they’re absolutely unique.
You only satisfy the “narrow” condition.

Official interpretations:
AL could mean Albania or Alabama
BE could mean Belgium or Berne
NL could mean Netherlands or Newfoundland and Labrador
SV could mean El Salvador or Suceava

So, when you need to retrieve full names, you’ll have to join data.
Databases will transform text values in join conditions to numeric values before performing the join.
Hence, joining with conditions based on text values are slower than joining with conditions based on numeric values.

In the event you want to switch to Alpha-3 or to Numeric codes, using the actual Alpha-2 codes as identity/foreign keys will stand in you way.

Here’s an idea, why not use the standard all together, with Numeric codes: http://en.wikipedia.org/wiki/ISO_3166-1 ?

GG09 · July 18, 2012, 3:52pm

Hi r937 and itmitică, thanks for your replies!

It seems as though a good practice is to just use the numeric IDs, though in the discussion of this topic I’ve learned a few things which I thank you guys for (i.e. the ON UPDATE CASCADE)!

r937, I didn’t mean for that to be the case you’re right, I just meant for the VehicleBrandID to be the primary key.

To r937 and itmitică, is a join something that is always executed when the database is quiried, i.e., the results of a join are not stored anywhere are they? Also, just to confirm I fully understand what I join does, please review my understanding:

I have a table of countries all with unique numeric IDs and a field for the country name. I then have a table of countries I’ve visitied, with a date field and a country ID field.

I want to call up all the places I’ve visited and print the countries’ names on a web page. Since I only have the country’s ID in my visited table, I need to “join” the the country name field to the ID and then print the country name. Is this correct?

Thanks a lot!

r937 · July 18, 2012, 4:13pm

no, it’s not necessarily “good practice”

as you have seen, it is a contentious issue, and it is usually contested by purists

pragmatists have no trouble using surrogate keys (which is what they’re called) when appropriate, and natural keys when appropriate, while purists often insist upon using one or the other at all times (therre are purists on both sides)

no

a join is something that you decide to do when you write the query, and the results of any query (join or not) are not stored

yes, you need to do a join in this case

system · July 18, 2012, 4:17pm

Queries may be cached: http://dev.mysql.com/doc/refman/5.1/en/query-cache.html

The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again. The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.

[…]

The query cache does not return stale data. When tables are modified, any relevant entries in the query cache are flushed.

You could also look at materialized views. Views are stored queries. Materialized views are stored results, generally in a table, for a query. Materialized views have to be refreshed once in a while.

MySQL doesn’t support materialized views, as far as I know, but you can implement the mechanism.

You’d have to perform a join on the traveling table ( date field and a country ID) to get meaningful data ( date field and a country name ) out of countries table.

GG09 · July 18, 2012, 6:05pm

You guys are great, thanks! itmitică cahcing would be great! I find databases so interesting, I really have a lot of respect for your work. I’m looking at this book to understand data modelling, relationships and things of the like:
Database Modeling and Design, Fifth Edition: Logical Design
http://www.amazon.com/dp/0123820200/ref=cm_sw_su_dp

I’m hoping I can get a better understanding of architecting a database solution.

r937, thanks for highlighting the Surrogate/Natural key thing; just to ensure 100% understanding, Natural Key = a number, or, generally, something that’s its sole purpose is to identify a record, where as a Surrogate Key is something that is used to identify a record, but is actually part of the data itself?

r937 · July 18, 2012, 6:07pm

nope, it’s the other way around

system · July 18, 2012, 7:34pm

On of the most important aspects is to identify uniqueness constraints.

Practice has told me, in particular, over and over, I’ll never have all the data beforehand.
Also, there will always be changes that need to be accounted for later on.

Whatever natural key you may consider for a primary key, it will almost always fall pray to the above real world challenges: missing data, changing data.
From a pragmatic point of view, surrogate keys will save you a lot of pain in the long run.

Also.
The first thing you have to ask yourself before deciding upon a natural key as primary key is if it’s immutable.
There’s no if’s and’s or but’s for me on this one.

A primary key you have to update compares to you changing your DNA halfway through your life.
It’s called primary key for a reason, and it’s selected from the heap of candidate keys a table exposes to you.
If that.

I’d also say that surrogate keys help me keep a sane level of normalization within my database.

r937 · July 18, 2012, 7:46pm

a wise man once said it was important to “identify uniqueness constraints”

to distinguish between surrogate and natural keys is important, in this regard – no relation has a natural “auto_increment” attribute, so a primary key is not always chosen from a relation’s candidate keys, which are by definition natural keys

if you choose an auto_increment or similar surrogate key for your table, you must take steps to identify one additional key, a natural key, which you will declare with a UNIQUE index or constraint

and upon this natural key you must perform due diligence, specifically with regard to missing data, changing data

woe be unto those who forget the need to “identify uniqueness constraints”

as a followup, ask yourself, does a change to this natural key constitute a new identity?

system · July 18, 2012, 8:01pm

The most important distinction between surrogate keys and natural keys is that surrogate keys are designed to be primary keys while natural keys are the result of normalization: they are unique (candidate keys involving attributes only from the relation) but not all of them will qualify as valid primary keys.

Most importantly, the future data transformation can easily invalidate a natural key that was chosen as a primary key.

EDIT: And seeing that the primary key concept never existed in the relations theory (superkeys) but was introduced specifically for developers, surrogate keys fit the profile.