Can anyone suggest a reason why I may want an ID field?

MatthewBOnline · April 12, 2011, 1:30pm

Hi,

I will soon be building a MySQL Database. I have read in many tutorials and web sites that there should be an ID field that auto increments.

I intend to use the WHERE function in MySQL to select an entry in the database. Each WHERE function will request a unique product name. In addition it will be more search engine friendly. Rather than having www.mysite.co.uk/products.php?id=23 I will have www.mysite.co.uk/products.php?name=productname.

Furthermore, I intend to store transactions within another database. There will be an entry for time and date of purchase. Again, this is more useful that an ID that just auto increments. Much easier to remember a date than remember an ID?!

Can anyone suggest a reason why I may want an ID field??? Am I missing the point with using an ID field?

Matt.

wwb_99 · April 12, 2011, 1:48pm

This makes sense for most cases, but starts to break around the edges. Problems with using product name are:

IDs should generally be immutable, products change names. How are you going to cascade changes to the other database?
IDs need to be unique. Product names are not always unique.
IDs tend to get stored in multiple places. Product names are longish so they won’t be as efficent.
IDs get used alot in debugging, do you really want to depend on typing ‘fuzzy bunny slippers, LARGE PINK’ correctly all over the place?

For transaction dates, you’ve got one small issue – there most definitely can be simultaneous transactions in the system. You also have lots of interesting issues around dates and times as they translate between layers in the system.

It has been pretty well and proven that the only things that really work as IDs in most cases are:

a) Integers, typically AUTOINCRIMENT style
b) GUIDs, which are really big long funny integers
c) Short system generated strings

And (c) is still a risky option at times.

DaveMaxwell · April 12, 2011, 1:50pm

I’m going to respond to these in reverse order:

I use IDs for a number of reasons:

Foreign keys
[LIST]
if a record is going to be referenced in another table (a product is referenced within an order for example), then I find using ID fields to be the most efficient use for me. Plus, comparing one field on a record is easier than comparing two or three.
Speed - again, this goes back to the foreign key bit, but numeric lookups are more efficient than string lookups unless you really, REALLY know what you’re doing with indexing.
Ease of maintenance
[/LIST]
Uniqueness - say I had two companies called Maxwells Music Shop, I don’t want to have to arbitrarily add something to a companies name just to make it unique - I’ll use a ID field on the record so I don’t have to make an overly complex key just to get uniqueness.
Normalization - database normalization rules have been in place for many, many years for a reason - yes, hardware today can overcome many things, but why make anything harder than it needs to be.

I’d argue that it’s easier for a customer to remember a transaction id vs a date. You’ve also got the issue of duplication - unless you go down to milliseconds, the chances of duplicate order dates are VERY high and confusion will reign, both with you, your staff and your customers.

With htaccess rewrite rules, this has nothing to do with your database setup. A better URL would be www.mysite.co.uk/products/productname or even [URL=“http://www.mysite.co.uk/productname”]www.mysite.co.uk/productname and your URL rewrite rules will parse it and send the user to the correct page.

Doug_G · April 13, 2011, 1:21am

This google search will point you at lots of discussions/opinions on this subject.

natural vs artificial keys - Google Search

oddz · April 13, 2011, 1:42am

For reasons outlined by both wwb_99 and DaveMaxwell 90% of the time I find surrogate keys to be the better/hassle free option. You can still use SEO garbage URLs anyway and literally ignore the SEO garbage: /23456/My-Really-Stupid-SEO-Friendly-Title-That-Will-Probably-Change-Anyway-DOH.

dportas · April 17, 2011, 10:17am

It’s important to separate the two different concerns raised by your question. The first, the one you are directly asking about, is what identifier to use in your URLs. The second concern is the design of your database. Since this is the Database forum I’ll answer the second of these first, just as other people already have done.

In database design it is quite common for things to have multiple identifiers. In a relational database table the identifiers are keys and there’s nothing wrong with having several different keys in a table. It is quite common for a table to have keys both for the business key (the identifier familiar to end users) as well as a “technical” or “surrogate” key (a key seen and used by database developers and DBAs but without business meaning and not exposed to the end users).

So although the points made by wwb_99 and DaveMaxwell are potentially relevant concerns when implementing keys, none of those things in themselves should stop you using any appropriate key value in your URLs - because there is no reason why you have to choose only one type of key or the other. Interface design should not dictate database design.

The most important criteria for choosing keys are Familiarity, Stability and Simplicity. A user-visible product code, stock number or possibly even a unique product name are good candidates for keys but in all cases business requirements should be the deciding factor. For instance, stability is desirable from the database designer’s point of view but doesn’t necessarily match what is really required from the user’s point of view since key values sometimes need to change.

What I suggest you do not do is expose any surrogate key in the URL. That would undermine some of the advantages of having a surrogate key in the first place. Of course, once you expose the key to users it inevitably acquires some business meaning and is no longer a surrogate.

Hope this helps.

r937 · April 17, 2011, 11:49am

nice one, david

your advice (not to expose a surrogate key) is completely opposite to common practice, innit

heck, even vbulletin uses them in its urls

(we had a thread on here a while back, which i don’t want to re-open, where someone suggested that since tables like threads and posts don’t have a useful natural key, the surrogate number assigned to them actually ~becomes~ the natural key!)

yes, i do realize “common practice” does not make it “best practice” but you’re definitely swimming against the current on this one…

steveob200653 · April 17, 2011, 2:39pm

I think the reasons for using an Id rather than a date or description have been pretty well expressed on here.

I would just like to add that although an autoincrement can be convenient, it is not as predictable as one may wish. There may be times when you want to know in advance what the key will be, eg. when preparing a script to add data with the Id as a foreign key on other tables. at times like this I use my own function for generating a key, or returning a key that is supplied in advance.
You obviously have to make sure the ranges are well apart. I would also suggest having the ID as a character rather than an integer. Don’t forget to add a unique index.

rpkamp · April 17, 2011, 2:54pm

This doesn’t fly. On systems with a lot transactions you can never know for sure what the ID of a row you’re about to insert will be. Between the time you query the database to find out what it probably will be and the time you actually insert the row another process might have just inserted a row with the ID you just got, causing all sorts of havoc.

steveob200653 · April 17, 2011, 3:50pm

This does fly. There are times when you do want to know in advance eg. scripts to set up standing data that will be tested in a separate environment, and the subsequent data should be recognisable with the same test scripts whatever the environment. I may not have made it clear but you make sure that you never insert a row with a value that could be inserted by another process. This is why I propose that the values from separate processes use different ranges, I favour character based prefixes. If that is not suitable, then use entirely different ranges.

rpkamp · April 17, 2011, 4:01pm

I assumed you were talking about fetching the next auto_increment (or indeed even MAX(id)+1) before. Now that I see you weren’t I take back what I said. You’re right, that does fly

StarLion · April 18, 2011, 4:07pm

The simple test i use is “Will I be storing this ID in another table?” If yes, I use a synthetic key. If no, a natural key is more than sufficient.

Mostly, it’s a way of reducing redundant overhead - it takes a lot less space to store the number 14 1000 times than it does to store “Daves Magical Carpet Ride” 1000 times.

rpkamp · April 18, 2011, 4:10pm

Plus integer comparison is a lot faster than string comparison (for JOINing).

r937 · April 18, 2011, 4:22pm

whenever i see a statement like this, i always ask the person who made it to provide a definitive and authoritative reference

so, you gots one?

in actual fact, i suspect you would have difficulty measuring the difference

besides, basing a table design choice on the difference in CPU comparison of strings or integers is the tail wagging the dog if it leads you to select a design which ~requires~ additional joins just to return a string for an unknown integer

“integers are faster” (despite the fact that they probably are) is just one of those urban database myths that people get hung up on

StarLion · April 18, 2011, 5:03pm

r937 - the only statement i’d make there is that it would be faster because the table isnt holding as much data, which would reduce check time? Smaller Table = Faster Search, Int rather than string = Smaller Table, ergo Int rather than String = Faster Search ?

(I’m not going to say ‘a lot’, because I believe my reasoning is subject to relative sizes as to it’s magnitude).

rpkamp · April 18, 2011, 5:10pm

I do. In MySQL (and most DBMSs, if not all) an integer is stored in 4 bytes, whereas strings take up as many bytes as they have characters (or a multiple thereof if you use any of the utf-8 or utf-16 encodings) plus I’m pretty sure VARCHARs add a 0 byte at the end to indicate the end of the string, so that’s one byte extra. So unless your strings are all 3 characters or shorter, integer comparison will be faster. Indeed if your strings are 7 characters long integer comparison will be twice as fast, etc.

It’s something that’s used throughout computer science and is well established, but I agree that the difference you’ll notice in databases is probably (close to) negligible in the grand scheme of things.

r937 · April 18, 2011, 5:17pm

so where is it? i mean, the link

specifically, the link that actually ~quantifies~ the difference, and provides support for your use of “a lot”

rpkamp · April 18, 2011, 5:55pm

MySQL :: Re: Primary key performance int vs. char and [url=http://www.gamedev.net/topic/598968-string-vs-int-comparison-which-is-faster/]string vs. int comparison which is faster? - GameDev.net (post #8)

samanime · April 18, 2011, 6:10pm

The other reason for an integer key instead of a string key would be just the size itself.

As Rémon said, an integer is only 4 bytes whereas a string is at least 1 byte per character + 1 byte. Say your strings average only about 10 characters, that’s 11 bytes per row, which is 7 more than an integer. Multiply that by 100, you’re talking 1.1KB extra, by 100,000, you’re talking an extra meg… just for storing the key.

Say you use the key to reference it in 5 other tables, you’re now up to 5 extra megs.

Include into that all the extra processing it takes to store those, and the extra bandwidth to transfer those few extra bytes, and things can start compounding quite rapidly.

I think the main reason for using an auto-incremented immutable integer ID is so you have something short and sweet use as a foreign key and link your tables together. The immutable part is just as important (if not more so) than the auto-incremented or integer point. It becomes very tricky to change a primary key and have it reflected everywhere. It can be done, but it’s a lot of extra hassle.

r937 · April 18, 2011, 6:19pm

your first link is a nice explanation of the performance of retrieving longer rows as opposed to shorter rows off the disk (but does not otherwise talk about actual comparisons) – a most important consideration, to be sure

your second link talks about the actual comparisons, and by inference, suggests that string comparisons might execute more slowly simply due to the number of machine code operations involved

but again, nothing to quantify that integer comparisons are “a lot” faster

it’s the “a lot” part that i had the problem with