I intend to create a MySQL database of a discontinued magazine that was issued for about 50 years with 4-6 issues per year. So the amount of data is set once and for all; when I’ve fed it all in there will be no more writing to the database. I could invent all the wheels on my own, but I’d rather hear some more experienced opinions before I go too far. So here are my thoughts on it. I’d be happy to get some feedback on them.
There will be a fairly simple web interface with two possibilities - browsing by selecting year and issue or searching in a single search field. I need a fulltext index for the articles as well as the titles and subtitles, and ordinary single-word indices for the keywords and authors’ first and last names.
The main table articles should contain one article per row and I started out by creating the columns year, issue, pages, title, subtitle and text. Then the tables keywords and authors as well as tables for connecting these to the main table.
First of all, I suppose most people would agree that I should keep authors and keywords in separate tables. But that will give me queries like the following:
SELECT year, issue, title, subtitle
FROM articles AS a
INNER JOIN articles_keywords AS ak
ON a.id = ak.articleid
INNER JOIN keywords AS k
ON k.id = ak.keywordid
INNER JOIN articles_authors AS aau
ON a.id = aau.articleid
INNER JOIN authors AS au
ON au.id = aau.authorid
WHERE MATCH(articletext,title,subtitle) AGAINST(:ft_1 IN BOOLEAN MODE) OR keyword = :x_1 OR au.firstName = :x_1 OR au.lastName = :x_1)
– in its most simple form, i.e. when there’s only one word in the search field and no boolean operators (some version of AND, OR and NOT should be possible as well).
Are there any acceptable ways to get rid of the JOINs and simplify queries in cases like this? (Or maybe the joins won’t affect speed as much as I imagine?) I see two possibilities: to move keywords and authors to the main table (which I guess is out of the question) or to make sure that the keywords and authors end up in the fulltext index for the articles themselves. Hidden paragraphs/divs/whatever? Meta tags? Other solutions?
The answer to that question is perhaps dependent on the answer to my main concern – how to store the articles. Some of them are extremely short and consists in a single paragraph, but others will need a lot more markup such as various levels of headers, plenty of paragraph tags, lists, tables and images. Should I store the articles with HTML and all in the database (or XML, which I’ve discarded as an unnecessary extra step) or in a file system? I definitely need a version with markup somewhere, and if I keep it outside the database I’m going to need two copies of the articles - one with markup, one without markup for the index. The second solution will waste more space than the first, but if the speed is affected positively by keeping markup out of the index, I suppose it’s worth it. Otherwise I don’t see any real problems with having the markup in the index. The chances of false positives (where the query happens to be identical to a tag name or attribute name/value) are pretty slim, especially since the articles aren’t written in English. Any other considerations or solutions?
As for the images, I intend to keep those outside the db entirely.
Ideas? Suggestions? Thanks.