Help me normalize this database

I am trying to help a coworker optimize a database that was created many moons ago. It’s a list of documents, and along with a document title and other relevant info, each document has a category and a subcategory, and in fact can have more than one of each. The original person building this database decided that it would be a good idea to put comma-delimited lists of categories and subcategories in columns (one column for each) in the same table as the documents. This has led to some issues that you would probably expect from such a setup.

I’ve decided to “clean up” this database by moving all the categories and subcategories into their own tables, where they will have ID numbers (this is done), then add a relational table to join each document ID with all the category and subcategory IDs it belongs to.

My problem is that I can’t join the comma-delimited columns in the documents table to the category or subcategory tables, so I’m having trouble adding all the values to the new table. Other than making the changes by hand, is there an easier way to insert the appropriate values into my new table?

Use whatever language you’re using to write a one-off script that takes that comma separated value and splits it up into an array, then process the array and insert the values as appropriate.

You could also do it using SQL, but a scripting language (VBscript/PHP/etc) would probably be easiest and fastest.

Thanks Dave. So, do you supposed I could loop through all the categories, then within that loop, do a query to insert a record into my new table where the value of the comma-delimited column is LIKE the current category?

I ended up doing two query loops - one that looped over my documents, and another loop inside that one that looped over my categories. I compared the columns of each query using a listcontains(), and when true did an insert. Lather, rinse, repeat three times and now all my tables are populated. :slight_smile:

Ok, I have something else I need to get my head around.

In this application, we have the following hierarchy:

Category -> Subcategory -> Document.

Normally, each subcategory would have one parent category, but in this case, each subcategory has the potential to appear in more than one category. For instance, the subcategory Recreational Trails appears under both the Design and the Planning categories.

In the database I have two options. Option 1 is to have a parent category columns in the subcategories table, and list each category more than once. Option 2 is to have each subcategory once in its table, then have a relational table to join the subcategories table to a categories table (which is how I have it now).

Normally, I would use option 2, except that as I build the front end for this thing (a web-based form used to add documents and assign them to categories and subcategories) I’m noticing that duplicates are showing up, which is not surprising. For instance, when I look at the above document, the Recreational Trails subcategory appears twice - once for each category it’s in.

I suppose I could just group the subcategories, but I’m wondering if I’m going about this the right way. Currently I have a table that joins subcategories to their parent categories, but maybe I don’t need that at all, and I should be joining the documents themselves to their main categories?

I’m not sure which way I should be doing this!

my own preference is for each subcategory to have exactly one parent category

for the situation where you have what appears to be the same subcategory in two different parent categories, the solution you mentioned as option 1 is what i would choose – two different subcategories, but having the same name

as for separate category and subcategory tables, with a strict hierarchy (each subcategory has exactly one parent category) it’s a lot easier to put them both into the same table – see Categories and Subcategories

an additional benefit of the single table is that you can then associate a document (or whatever) with either a category or a subcategory, whichever makes sense, and this is a ~lot~ harder to accomplish when the categories and subcategories are in separate tables

Mine too! But since this is a front end for a preexisting application, I don’t have much of a choice.

That link looks helpful. I’m going to read through it and hopefully I’ll have a better idea of what to do!

maybe you can explain this a bit more?

you said earlier…

For instance, the subcategory Recreational Trails appears under both the Design and the Planning categories.
for me, this would be two separate subcategories

they would each have a different primary key, but by happenstance, they would have the same name –

[B]cat_id   cat_parent   cat_name[/B]
   9       NULL       Design
  37       NULL       Planning
  42         9        Recreational Trails 
  83        37        Recreational Trails 

Ok, let me try to explain (and probably fail!)

This is basically a list of government documents that are public info. There is no actual content in the database. It’s just titles, hyperlinks, and keywords. The purpose of the database is to provide a search engine so that people can find the locations of these documents on our website without having to browse the entire site.

The user end of this is basically a form with <select> dropdowns for category, subcategory, the type of document it is, and the office name. The latter two are easy, because they’re many-to-one relationships, and I have those done. The problem is with the category and subcategory, because a document can be considered to be a child of more than one of either of those. The same document can easily be found under two or three categories, and two or three subcategories.

In fact I’ve even entertained the idea of just having a table to join the documents to the categories without going through the subcategories. The search would still find them, and I think the only reason the categories and subcategories need to be related at all is to limit the available options in the dropdown menus to what’s selected in the previous dropdown.

The part I’m working on right now is a sort of back end to this thing. A form to allow the administrator of this app to maintain it without having to directly edit the Access database, because doing that won’t be an option once we move it to MSSQL. But I realize that I’ll have to help with the public-facing side of the app, because the database is now quite different, so I want to get the database set up in the best possible way given the weirdness of this many-to-many-to-many setup.

I woke up this morning with a fresh mind, and decided to put all the categories and subcategories into a single table, with parent IDs. It didn’t take much time, and I was able to build a nested list to show the hierarchy. Then I wrote a script to populate a relational table to join document IDs to category IDs, and that’s when I noticed my biggest problem.

In the current app, subcategories aren’t related to categories at all, but to the documents themselves! The documents table has a column for category (comma-delimited list) and subcategory (also a comma-delimited list), and depending on the document, the subcategories and categories appear quite random. For instance, subcategory 34 will be a child of category 10 for one document, but that same subcategory will be a child of parent 12 for another document. If I pick category 10 on the drop-down, only the first doc will appear and not the second, despite being in the same subcategory.

This thing is making my head spin! I’m not sure how to handle this thing at all. :confused:

you need two subcategories

one (keep it as id 34) will belong to category 10

the new one (new id) will belong to category 12, likely with the same name as 34

then, hook up the documents for the appropriate subcategory

does the document really and truly belong to both categories?

rudy, you have the unique ability to ask me a question that makes me realize the answer.

I actually DO have multiple copies of some of the subcategories, each with their own parent category. The problem is getting the document assigned to the proper one!

I think I have to embellish my import script to get the subcategories assigned to all their parents. Right now I just loop through the documents table, then within each record I run a second query to loop over the categories table. I use a listfind() and when I find a match I add the ID of that document to the relational table, along with the category ID. I need to modify my script to also consider the category when I do this, so I can pick the right version of the subcategory.

As for whether the document belongs to two categories, the answer I got was that while a document really belongs to ONE category and ONE subcategory, people need to be able to find it in more than one place, because the document can cover multiple categories and subcategories. The goal is to not make the public have to spend lots of time looking for the document.

(in the real world, 99% of the public probably uses the Google appliance and keywords, and not the drop-down list, but that’s another battle) :slight_smile:

Ok, so every time I think I have this figured out, up pops a new situation in which it won’t work.

Using rudy’s previous example:


cat_id   cat_parent   cat_name
   9       NULL       Design
  37       NULL       Planning
  42         9        Recreational Trails
  83        37        Recreational Trails

In my search form, I only want one Recreational Trails option in the <select> element. I can do this by grouping my query results, but the problem is with the search form using ID numbers, I won’t be able to get to the selection that doesn’t appear in the <select>. So if I have <option value=“37”>Rec Trails</option>, then I won’t be able to pull up the ones assigned to 9.

And all this is beside the fact that I only really want ONE Rec Trails. It just has to have multiple parents. I’m at my wit’s end trying to figure out how to set this up!

You’re looking at it from the wrong angle - you wouldn’t want to display this to the client like that. You’d want to show them to the client something like this:


cat_id   cat_parent   cat_name
   9       NULL       Design
  42         9        Recreational Trails 
  37       NULL       Planning
  83        37        Recreational Trails

or in html


<select id="category">
<option value="9">Design</option>
<option value="42">--Recreation Trails</option>
<option value="37">Planning</option>
<option value="83">--Recreation Trails</option>
</select>

This allows you to show the “same” category, but only show the appropriate ones.

gorgeous :award:

Hmmm, yanno, I actually don’t even have to assign the category to the document, just the subcategory, because the category will automatically be assigned by virtue of being the parent, no?

Correct.

Yeah, this is the way I’m going to go.

Originally I was trying to figure out a way to just assign a document to one Rec Trails subcategory and have it show up in all the parents, but I don’t think that’s very realistic. :slight_smile:

What I might do is have a single <select> that looks like this:

<select id="category">
<option value="42">Design, Recreation Trails</option>
<option value="83">Planning, Recreation Trails</option>
</select>

And have the user add all that apply. I’ll organize the categories and subcategories in a single table, like rudy’s previous example, and when I want to display everything that’s in a certain category, I’ll look for that parent_id.

Thanks both of you for helping me get my head around this!!