Right structure for multiple language DB

I made a start setting up a DB structure for multiple language websites, but I have the feeling that I miss something and I can’t put my finger on what it is. This are the tables I have so far:


CREATE TABLE IF NOT EXISTS `languages` (
  `language_id` TINYINT(2) NOT NULL AUTO_INCREMENT,
  `language` VARCHAR(12) DEFAULT NULL,
  `country_flag` CHAR(8) DEFAULT NULL,
  `isActive` TINYINT(1) NOT NULL DEFAULT '0',
  PRIMARY KEY  (`language_id`),
  UNIQUE KEY `language` (`language`)
) ENGINE=INNODB  DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `pages` (
  `page_id` TINYINT(2) NOT NULL AUTO_INCREMENT,
  `page_address` VARCHAR(128) DEFAULT NULL,
  PRIMARY KEY  (`page_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
 
CREATE TABLE IF NOT EXISTS `content` (
  `content_id` TINYINT(2) NOT NULL AUTO_INCREMENT,
  `page_id` TINYINT(2) NOT NULL DEFAULT '0',
  `title` VARCHAR(40) DEFAULT NULL,
  `heading` VARCHAR(255) DEFAULT NULL,
  `content` TEXT,
  PRIMARY KEY  (`content_id`),
  KEY `page_id` (`page_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 ;

Those tables are self explaining I guess. table languages holds the different languages with the requested number of languages set to active. table pages are the actual pages in the website: about_us.html etc. table content holds the actual content for the different pages. And at this point I have a block! I think I need at least 1 if not more relational tables. Or should I add another field language_id to the content table? Can anyone tell me If I miss something here or that I’m on the right track

Is the language relative to the page or the content?

I would suggest that it relates to the content and so language should be a 1:1 relationship between each content record in the db. that means a ‘title’ and ‘content’ written in French, will show in French whereas a ‘title’ and ‘content’ written in Greek, will show in Greek. 1:1

in my CMS, I have these table structures


languages:
| language_abbr | language_full | flag |
| ENG   (pk)    | English       | eng.gif |
| GRE   (pk)    | Greek         |  gre.gif |  
file_sequencing
| file_id | file_name_home_language |  sequence_number |

*sequence_number determines the order they display in the nav menu
** home language is the home language of the business/user not the web viewer

file_data
| id | file_id | language_abbr | title | heading | content | 

I hope that gives some assistance

You may want to consider something more standard like PO language files. I personally don’t know much about them, though it’s on my to do list.

If you use something more standard, you can find more support (whereas if you make something proprietary, it’s on you).

content_translations

  • content_id ( fk content(content_id) )
  • language_id ( fk languages(language_id) )
  • title
  • heading
  • content
  • pk(content_id,language_id)

You may also want to add language_id to the content table to determine the contents origin language.

bazz, good explanation

by the way, it’s one-to-one only as seen from the content side – from the language side, it’s many-to-one, as there can (hopefully) be more than one content for each language

i would use the international 2-char language code for the language PK, but the 3-char one is okay too, i guess

but please, not a auto_increment id, okay oddz? :slight_smile:

Good explanation indeed IBazz. :tup: This helps me a lot indeed. One question though! Is the a particular reason why you use the language_abbr as PK instead of a number?

@Rudy: can you explain what you mean with this:

i would use the international 2-char language code for the language PK, but the 3-char one is okay too, i guess

The auto_increment was indeed the thing that made me confused I guess

this –> ISO 3166-1 alpha-2

the reason you would use it instead of a number is because it’s so much better than a number!!!

Is the a particular reason why you use the language_abbr as PK instead of a number?

in simple terms it’s because its unique to a language. (OK, so there is a variety of English such as US English and International english but, fundamentally, its all English). and if there is a natural PK in your table content, there is little benefit in using a surrogate key.

However, on a technical point, I am not sure if (in other circumstances), a huge PK value would be as efficient in terms of bytes, as an auto_increment number. Personally, I value readability quite highly when moving from table to table as part of the development process.

bazz

ISO 3166 are country codes, not language codes…
ISO 639 are language codes

BUT

Having built multiple multi-lingual databases, I would suggest you use BOTH… and establish a LOCALE table instead of a LANGUAGE table.

Locale is the combination of language and country, e.g. en_US, or en_GB. It gives you more ability to localize your application not only on language, but currency, number formatting, metric-vs-kings measurements, etc.

I typically structure my ML DB as such:


-- Fill this with all ISO-639 languages
CREATE TABLE language (
    id CHAR(2) NOT NULL PRIMARY KEY, 
    name VARCHAR(255) NOT NULL
) TYPE=InnoDB;

-- Fill this with all ISO-3166 Countries
CREATE TABLE country (
    id CHAR(2) NOT NULL PRIMARY KEY,
    name VARCHAR(255) NOT NULL
) TYPE=InnoDB;

CREATE TABLE locale (
    language_id CHAR(2) NOT NULL REFERENCES language(id),
    country_id CHAR(2) NULL REFERENCES country(id),
    name VARCHAR(255),
    active TINYINT NULL
    PRIMARY KEY (language_id, country_id)
) TYPE=InnoDB;

You can either prefill your entire locale table and use an active/inactive flag (as shown) or only fill it with locales you support.

Country should be nullable so you can create a default locale with no country (e.g. “en” or “es”). Typically you should have your application use a particular country’s locale act as the default. This can be achieved by adding a “default” flag to your country table and making “country_id” not null… then any non-country-specific locale or unsupported country’s locale would fall back to the default country’s locale.

Extending my above example and applying to your DB:

Languages: http://www.loc.gov/standards/iso639-2/php/code_list.php
Countries: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
Locales: http://www.roseindia.net/tutorials/I18N/locales-list.shtml


-- Fill this with all ISO-639-1 languages
CREATE TABLE `languages` (
    `language_id` CHAR(2) NOT NULL PRIMARY KEY, 
    `name` VARCHAR(255) NOT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;

-- Fill this with all ISO-3166 Countries
CREATE TABLE `countries` (
    `country_id` CHAR(2) NOT NULL PRIMARY KEY,
    `name` VARCHAR(255) NOT NULL,
    `default` TINYINT NOT NULL DEFAULT 0
) ENGINE=INNODB DEFAULT CHARSET=utf8;

-- Fill this table with supported locales
CREATE TABLE `locales` (
    `locale_id` VARCHAR(5) NOT NULL PRIMARY KEY, -- This would be like 'en_US', etc.
    `language_id` CHAR(2) NOT NULL REFERENCES `languages`(`language_id`),
    `country_id` CHAR(2) NULL REFERENCES `countries`(`country_id`),
    `name` VARCHAR(255),
    `active` TINYINT NOT NULL DEFAULT 1
    PRIMARY KEY (`language_id`, `country_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `pages` (
  `page_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `page_address` VARCHAR(128) DEFAULT NULL,
  PRIMARY KEY  (`page_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
 
CREATE TABLE IF NOT EXISTS `page_contents` (
  `page_id` INT UNSIGNED NOT NULL REFERENCES pages(page_id),
  `locale_id` VARCHAR UNSIGNED NOT NULL REFERENCES locales(locale_id),
  `title` VARCHAR(40) DEFAULT NULL,
  `heading` VARCHAR(255) DEFAULT NULL,
  `content` TEXT,
  PRIMARY KEY  (`page_id`, `locale_id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 ;

Yeah, i would agree with that. On the application side of things two or three alphabetic characters isn’t going to be an issue passing via the URL if its even needed.

transio, that new design doesn’t support translations. How would you determine the available translations for an article that originated in English?

oddz,

That design most certainly DOES support translations. Is the originating language important?

the page_contents table contains the contents of all pages in multiple languages. The page_id and locale_id combination define which page and what language the contents are in.

Cheers.

Well, I can’t profess on the issue but, I wonder if locales is a bit ‘PC gone too far’. Surely its about communication? so if the different dialects don’t cause a breakage in communication then the does the adage ‘if it ain’t broke, don’t fix it’, apply?

Maybe it’s useful for multi-national entities but the cost of translation is quite significant for smaller businesses. I digress :oops:

Locale implies more than just language, IBazz… For example, en_US, en_GB, en_CA, en_ZA, en_AU, en_NZ - all imply “English”, and can fall back on the default “English” dialect you choose for your site… but which ones use mm/dd/yyyy vs. dd/mm/yyyy? Which use meters and kilos vs. feet and pounds? These are important “locale” settings that have nothing to do with language. If your mission is to localize your website, why restrict your DB to language when it’s not much more work to account for full locale, and allow future localization of other non-linguistic information?

Don’t think that’s important? If you print your date for an upcoming event as 04/05/2011 - half your audience may believe it’s April 5th, the other half May 4th… think that’s not an important miscommunication? Think again.

[indent]patient: doctor, it hurts when i do this

doctor: well, don’t do that then :)[/indent]

no confusion anywhere if you print 2011-04-05

but your point about locales and other attributes besides language is very well made

Heh… “equally confusing EVERYONE” doesn’t equate to “no confusion anywhere”. I’ve used dates in YYYY-MM-DD format in websites before… the specific case I’m thinking of was a form with a jQuery datepicker that filled the field with a Y-m-d date. I never saw so much negative user feedback in my life!!! People called us at least every other day confused about the dates!!! Don’t assume people are smart… that’s a sure way to develop poor usability apps.