Tag table structure

everactive · February 20, 2010, 11:38pm

Hi,

I want to have the tags feature in my PHP application, just like this forum (“Separate tags using a comma.” when posting).

Should I have a separate table or just a column? How it’s ideally done?

Please advise. Thanks.

Dan_Grossman · February 21, 2010, 12:35am

Separate table, a one-to-many relationship to the object being tagged. That allows you to efficiently query the tags, so that you can do things like create a tag cloud, a tag page, top tags list, tag suggestions, etc.

everactive · February 21, 2010, 12:50am

Hi Dan,

You mean like this, right?

video

-> videoid
-> title
-> url

tag

-> tagid
-> videoid
-> word

Dan_Grossman · February 21, 2010, 1:01am

Yes. Though you don’t need a tagid, the (videoid, word) pairs are unique already.

everactive · February 21, 2010, 1:06am

Thanks, much appreciated.

Mike_Borozdin · February 22, 2010, 5:31pm

Have you thought about data redundancy this schema creates, as well as updates anomalities?

Here is an example of data stored in the suggested schema?


post#1 | tag#1
post#2 | tag#2
post#3 | tag#1
post#1 | tag#1

See a problem?

I would create two tables. One will contain tags (tagId, tagName) and the other one will link posts with tags (postId, tagId) and yes the latter uses many-to-many.

Dan_Grossman · February 22, 2010, 5:47pm

The primary key (post_id, tag) must be unique, so row #1 and row #4 could not both exist. Nor does the schema create update anomalies. If the author of post #1 wants to change his tags, that doesn’t mean the tag post #3 previously had in common should change. As long as tags are user defined and not a prewritten safe list maintained by someone else, your schema requires more queries and more maintenance with no practical benefit.

r937 · February 22, 2010, 5:47pm

yes, you’ve repeated post#1|tag#1 and that wouldn’t be allowed

but i think i know what you were trying to suggest, and i think it’s wrong

the issue of “redundancy” is not going to go away if you substitute a numeric id for the tag name

as often as the tag name would have repeated, the numeric id will also repeat just as much

“redundancy” is a poor excuse to use a numeric id

furthermore, note that with a numeric id, an extra join is required

Mike_Borozdin · February 22, 2010, 7:14pm

I made a type, of course they couldn’t both exist.

As for update anomalies, I mean if you have many posts tagged with “tag A” and suddenly you want to rename this tag to “tag AA” you have to update a whole bunch of entries.

Moreover, if we have a schema similar to this one: tags (tagId, tagTitle, numPosts) and providing we’re updating numPosts after tagging a post with a corresponding tag, we are pretty much better off when we need to generate a tag cloud, because we don’t have to count the number of posts tagged with all the tags.

r937 · February 22, 2010, 8:13pm

this is a rational argument for other many-to-many structures, but it hardly applies to tags

‘ms jones’ might have to become ‘mrs smith’, but when does the ‘mysql’ tag have to change to some other name? more to the point, would it still be the mysql tag? this question goes to the root of identity, and of course identity is what candidate and primary keys are all about

this too is a rational argument, but again, it hardly supports the need for a numeric id, it also works just as well for a tags table where the numeric id is missing

Mike_Borozdin · February 25, 2010, 11:54am

There can be a type in a tag that you only notice having published a few posts with it. Also, I remember making a huge tag revamp and I think I changed the names of some tags.

Okay, we can actually go without a numeric id, I just think it’s better to join tables rather than perform counting when displaying a tag cloud. Although, it’s just words, I didn’t evaluate the performance in both cases.

r937 · February 25, 2010, 2:25pm

fixing typos and making mass changes can both be accomplished easily with simple UPDATE statements

you do not need a numeric id to make occasional updates

and there’s the key – you don’t update tags regularly, just occasionally

meanwhile, during normal operations, you are saved the unnecessary overhead that the extra join requires