atoroqo — 2013-04-17T04:45:43-04:00 — #1
I intend to create a MySQL database of a discontinued magazine that was issued for about 50 years with 4-6 issues per year. So the amount of data is set once and for all; when I've fed it all in there will be no more writing to the database. I could invent all the wheels on my own, but I'd rather hear some more experienced opinions before I go too far. So here are my thoughts on it. I'd be happy to get some feedback on them.
There will be a fairly simple web interface with two possibilities - browsing by selecting year and issue or searching in a single search field. I need a fulltext index for the articles as well as the titles and subtitles, and ordinary single-word indices for the keywords and authors' first and last names.
The main table articles should contain one article per row and I started out by creating the columns year, issue, pages, title, subtitle and text. Then the tables keywords and authors as well as tables for connecting these to the main table.
First of all, I suppose most people would agree that I should keep authors and keywords in separate tables. But that will give me queries like the following:
SELECT year, issue, title, subtitle
FROM articles AS a
INNER JOIN articles_keywords AS ak
ON a.id = ak.articleid
INNER JOIN keywords AS k
ON k.id = ak.keywordid
INNER JOIN articles_authors AS aau
ON a.id = aau.articleid
INNER JOIN authors AS au
ON au.id = aau.authorid
WHERE MATCH(articletext,title,subtitle) AGAINST(:ft_1 IN BOOLEAN MODE) OR keyword = :x_1 OR au.firstName = :x_1 OR au.lastName = :x_1)
– in its most simple form, i.e. when there's only one word in the search field and no boolean operators (some version of AND, OR and NOT should be possible as well).
Are there any acceptable ways to get rid of the JOINs and simplify queries in cases like this? (Or maybe the joins won't affect speed as much as I imagine?) I see two possibilities: to move keywords and authors to the main table (which I guess is out of the question) or to make sure that the keywords and authors end up in the fulltext index for the articles themselves. Hidden paragraphs/divs/whatever? Meta tags? Other solutions?
The answer to that question is perhaps dependent on the answer to my main concern – how to store the articles. Some of them are extremely short and consists in a single paragraph, but others will need a lot more markup such as various levels of headers, plenty of paragraph tags, lists, tables and images. Should I store the articles with HTML and all in the database (or XML, which I've discarded as an unnecessary extra step) or in a file system? I definitely need a version with markup somewhere, and if I keep it outside the database I'm going to need two copies of the articles - one with markup, one without markup for the index. The second solution will waste more space than the first, but if the speed is affected positively by keeping markup out of the index, I suppose it's worth it. Otherwise I don't see any real problems with having the markup in the index. The chances of false positives (where the query happens to be identical to a tag name or attribute name/value) are pretty slim, especially since the articles aren't written in English. Any other considerations or solutions?
As for the images, I intend to keep those outside the db entirely.
Ideas? Suggestions? Thanks.
r937 — 2013-04-17T07:26:53-04:00 — #2
keywords and authors should remain in their own separate tables, with many-to-many relationship tables as you have anticipated in the sample query you posted
the sample query will not work exactly as is, you would need subqueries to return GROUP_CONCAT strings for keywords and authors
as for storing the articles, use two TEXT columns, one without markup for searching and one with markup for display (space is not really an issue because you won't have bazillions of articles)
atoroqo — 2013-04-17T15:52:33-04:00 — #3
Thanks. I'm not sure how to implement the subqueries, though, but I'll get there eventually.
As for the text version with markup, are there any obvious pros and cons putting it into a TEXT column compared to having it in a file system and just store the paths in the db?
r937 — 2013-04-17T16:48:38-04:00 — #4
yeah, i guess that's an option too
scallioxtx — 2013-04-17T17:38:59-04:00 — #5
You can't use MySQL's fulltext functionality on the files, you'd have to add some sort of search functionality in your code, and they're mostly not trivial to write. Unless you have a very strong reason not to, I'd stick them in the database in a TEXT field.
Also, you want to have a look at Spinx Search: http://www.ibm.com/developerworks/library/os-php-sphinxsearch/
Takes some time to set up, but works well (better than MySQL fulltext IMHO, which is a tad too simple for any real searching).
r937 — 2013-04-17T18:06:46-04:00 — #6
no, but my suggestion was to have a non-marked-up copy of the article in the database, the only issue then being whether to retrieve the marked-up version from an adjacent column on the row, or from an external file
plus, if you do the fulltext searching outside mysql, how would you combine ose results with additional conditions on other attributes, like category, or date_published, or author, etc.
scallioxtx — 2013-04-18T02:13:42-04:00 — #7
Sorry, I must have missed that.
My point exactly
atoroqo — 2013-04-18T13:23:32-04:00 — #8
Thanks, it sounds like something worth looking into. But I would either have to convince my present host to install it or switch to another web hotel. So I'll go with the MySQL fulltext for now and see how far it'll take me.
(I'm glad the two of you sorted out the other question without me. :))