Charset for Spanish language site

vanishdesign · May 19, 2009, 3:16pm

I have been assigned to do a site in Spanish only. We want to do away with special characters like &iacute; and just use í, for example. I’ve read up on charsets and I don’t fully understand the difference between iso-8859-1 and UTF-8. Which should I use to get this result?
Is it possible not to use &iacute;?

I’m not sure of the origin of the text, but it’ll likely be coming from Microsoft Word and be pasted into my text editor (since I don’t speak Spanish and won’t be writing the copy anyway).

Up until this point, my text editor (Notepad++) was saving in ANSI, and I just switched it to UTF-8. Will it still work properly with pages encoded in ISO 8859-1?

This is what is on es.yahoo.com


<html lang="es">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

Is this the correct way of doing things?

AutisticCuckoo · May 19, 2009, 5:00pm

First of all, you shouldn’t use the term ‘charset’ at all, because it’s ambiguous.

There are two concepts of importance here: the character repertoire and the encoding. The term ‘charset’ has been used for both of these, which can lead to some confusion.

The character repertoire is the total set of available characters. For HTML this is defined to be ISO/IEC 10646, which to all intents and purposes is equivalent to Unicode. You cannot change this; it’s built into HTML.

So what you can vary is the encoding, which is how the characters in a given repertoire are represented numerically, i.e., in a form that computers can understand.

ISO 8859-1 is actually both a repertoire and an encoding. As a repertoire it’s a small subset of ISO/IEC 10646, so we can regard it as an encoding capable of representing a small part of Unicode. UTF-8 can represent any Unicode character.

In ISO 8859-1, each character is encoded using a single octet (an octet is an 8-bit number, i.e., an integer between 0 and 255, inclusive). In UTF-8 characters are encoded using a variable number of octets. The first 128 positions, equivalent to US-ASCII, are encoded as a single octet. Most additional characters used in European languages, Hebrew, Arabic and others use two octets. Most characters in Eastern writing systems like Japanese and Chinese require three octets, and characters outside the Basic Multilingual Plane require four.
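To make the octet counts concrete, here is a minimal sketch in Python 3 (my choice of language, not something from the thread) that counts how many octets a few sample characters need in each encoding. The helper name `octets` is made up for the illustration.

```python
# Count the octets each encoding needs for a few sample characters.
samples = ["A", "í", "¿", "日"]

def octets(ch, encoding):
    """Return how many octets `ch` takes in `encoding`, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    # Columns: character, octets in ISO 8859-1, octets in UTF-8
    print(ch, octets(ch, "iso-8859-1"), octets(ch, "utf-8"))
```

The Spanish characters take one octet in ISO 8859-1 and two in UTF-8, while the Japanese character cannot be represented in ISO 8859-1 at all.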

The important thing to understand is that the encoding you declare for your web page must match the encoding under which you saved your files! Browsers don’t automatically convert anything; they trust what you tell them.

In the case of Spanish you can choose either ISO 8859-1 or UTF-8. Both will let you use a literal ‘í’ (and the other letters with diacritical marks used in Spanish, plus the ‘¿’ and ‘¡’ punctuation characters).

If you choose UTF-8, make sure you save the files without a BOM (byte order mark). A BOM is completely unnecessary in UTF-8, and will cause problems with some browsers.
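If you are unsure whether your editor wrote a BOM, a quick check along these lines may help. This is only a sketch in Python 3; the helper names `has_utf8_bom` and `strip_utf8_bom` are hypothetical, and `utf-8-sig` is simply Python's name for "UTF-8 with BOM".

```python
import codecs

text = "¿Cómo es posible?"

with_bom = text.encode("utf-8-sig")  # UTF-8 preceded by the three BOM octets
without_bom = text.encode("utf-8")   # plain UTF-8

def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if there is one."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
# What the BOM looks like when the octets are read as ISO 8859-1:
print(codecs.BOM_UTF8.decode("iso-8859-1"))
```

To check a real file, read it in binary mode and pass the bytes to `has_utf8_bom`.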

Be careful when pasting from Word, because the encoding used for the original text may come into play as well. Your editor may convert the pasted text automatically, or it may not.

As I said, the declared encoding must match the encoding used in the file. If you save as UTF-8 and declare as ISO 8859-1 – or vice versa – you’ll run into problems with all characters outside the US-ASCII range.
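The mismatch is easy to reproduce. The following Python 3 sketch (an illustration only, using a word from the Yahoo page as sample text) encodes a string one way and decodes it the other way, which is roughly what a browser does when the declaration is wrong.

```python
word = "España"

utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("iso-8859-1")

# Saved as UTF-8 but declared as ISO 8859-1:
# each octet of the two-octet 'ñ' is shown as a separate character.
print(utf8_bytes.decode("iso-8859-1"))

# Saved as ISO 8859-1 but declared as UTF-8:
# the single octet for 'ñ' is not valid UTF-8, so a replacement character appears.
print(latin1_bytes.decode("utf-8", errors="replace"))

# Characters in the US-ASCII range are identical in both encodings.
print("Yahoo".encode("utf-8") == "Yahoo".encode("iso-8859-1"))
```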

As for the markup from es.yahoo.com: yes and no.
The <meta> element is good to have there, but it will be ignored if your web server sends encoding information in the real Content-Type HTTP header. (Many web servers do, by default.)

If you cannot affect the server-side setting, then you have to choose the encoding declared by your server, unless you use a server-side scripting language like PHP, which lets you override the headers.
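The post mentions PHP; the same idea can be sketched in Python as a tiny WSGI application that sets the real Content-Type header itself. This is only an illustration under my own assumptions: the names `PAGE` and `app` are made up, and your hosting setup may work quite differently.

```python
# A minimal WSGI application that declares its own encoding in the HTTP header.
PAGE = (
    '<html lang="es"><head>'
    '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
    "<title>Vídeos</title></head>"
    "<body>¿Cómo es posible?</body></html>"
)

def app(environ, start_response):
    # Encode the page with the same encoding that the header declares.
    body = PAGE.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a local try-out, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library will serve it. Note that the header and the `<meta>` element say the same thing.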

Mittineague · May 19, 2009, 5:09pm

I like utf-8 because it’s more portable, e.g., when page content gets reused in a feed, compared to using something like Windows-1252.

But if your content will only ever appear in web pages, I suppose you could use whatever you like (as long as it’s a commonly supported charset). Is your example of ES Yahoo what you want?

<title>Yahoo! Espa&ntilde;a</title>
...
... Im&aacute;genes</a>
... V&iacute;deos</a>
<label for="v11">en espa&ntilde;ol</label>
Una mujer da a luz a dos beb&eacute;s y resulta que son hijos de padres distintos
... &#187; &iquest;C&oacute;mo es posible?</a>

vanishdesign · May 19, 2009, 5:54pm

Thanks Tommy for your detailed explanation. It is very helpful. I’ve read that ANSI is a superset of ISO8859-1, and learned that the server I’m going to use is transmitting in ISO 8859-1. Should I assume I’m safe using ANSI in my text editor and ISO on the server?

Actually mittineague, I noticed that Yahoo was using html entities as you posted, which spurred me to start the thread. I’d prefer not to use html entities.

Mittineague · May 19, 2009, 8:20pm

Yes, Tommy, thanks for that. No matter how many times I read about it I still have an uneasy feeling that I’m over my head.

I think the main thing is to be consistent. The most common problems involving “weird characters” are almost always a result of using something different in the text editor, stated encoding, database, etc. So the best thing is to pick something and stick with it across the board.

The other common problem is the BOM. AFAIK Notepad++ refers to this as the “signature” rather than BOM. Don’t use it for UTF-8 or you may see “weird characters” at the beginning of your files.

gary_turner · May 19, 2009, 10:19pm

Some reading:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Subject: UTF-8 history ⇦ Absotively, posilutely fascinating reading, and the best description of utf-8 anywhere.

If you don’t have a Spanish keyboard, use this Free Online Unicode Character Map.

If you configure Tidy to output utf-8, it will convert character entities to the character its ownself.
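If Tidy isn't at hand, the same kind of conversion can be sketched with Python's standard library. This is not how Tidy works internally, just an illustration using text fragments from the Yahoo source quoted earlier in the thread.

```python
import html

# Text fragments (not whole documents) taken from the Yahoo markup.
snippets = [
    "Yahoo! Espa&ntilde;a",
    "Im&aacute;genes",
    "&#187; &iquest;C&oacute;mo es posible?",
]

for snippet in snippets:
    # html.unescape replaces named and numeric character references
    # with the literal characters they stand for.
    print(html.unescape(snippet))
```

One caveat: `html.unescape` also converts `&lt;` and `&amp;`, so it should be applied to text content only and not blindly to a whole HTML file.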

cheers,

gary

AutisticCuckoo · May 20, 2009, 5:50am

I think that ‘ANSI’ is what Microsoft sometimes call their proprietary encoding Windows-1252. That is similar to ISO 8859-1, except that it adds a number of characters in the 0x80-0x9F range. In the ISO encodings that range is reserved for C1 control characters.

Yes, you can save as ‘ANSI’ and serve as ISO 8859-1, if you’re careful. You must not use any of the characters that Microsoft put in the 0x80-0x9F range, because those code positions are not allowed in ISO 8859-1. Browsers often assume that the encoding is Windows-1252 if it’s declared as ISO 8859-1, because many non-savvy authors don’t understand encoding concepts and believe that Microsoft complies with standards. But your page may fail validation – and display incorrectly in some browsers – if you use literal representations for characters in this range (such as dashes, ellipses or typographically correct quotation marks).
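One way to spot the troublesome characters before publishing is to try encoding the text as ISO 8859-1 and see what fails. A minimal Python 3 sketch follows; the helper name `not_in_latin1` and the sample text are made up for the illustration.

```python
def not_in_latin1(text):
    """Return the characters in `text` that ISO 8859-1 cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

# Curly quotes and an ellipsis, the kind of thing Word tends to produce.
pasted = "\u201cHola\u201d\u2026 \u00bfqu\u00e9 tal?"

print(not_in_latin1(pasted))         # the two curly quotes and the ellipsis
print(pasted.encode("cp1252"))       # Windows-1252 places them at 0x93, 0x94 and 0x85
print(repr(b"\x85".decode("iso-8859-1")))  # in ISO 8859-1, 0x85 is a C1 control character
```

The Spanish letters and the inverted question mark pass, because they sit at the same positions in both encodings.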

It can be confusing, but once the penny drops it becomes fairly clear.

- Computers can only deal with (binary) numbers.
- Characters must therefore be represented by numeric values.
- There are a lot of different characters used in writing.
- Many different characters require large numbers to represent them all.
- Most authors use only a limited subset of the total character repertoire.
- Using large numbers to represent few characters is a waste of space.
- Thus various encodings try to represent such subsets as efficiently as possible.
- The Writer and the Reader must agree on which representation to use.
- The Writer chooses an encoding and declares what it is.
- The Reader trusts the declaration and interprets the numbers accordingly.
- If the Writer is lying about the encoding, chaos and mayhem may follow.
- The encoding declaration should be sent by the Writer’s server for HTML.
- The encoding declaration should be in the XML declaration for XHTML.
- Using a <meta> equivalent in HTML is good practice, to ensure correct interpretation in the absence of a server.
- A <meta> equivalent must match the server’s declaration.
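The Writer/Reader agreement can be sketched in a few lines of Python 3. The function names `writer` and `reader` are hypothetical, the ISO 8859-1 fallback is my own assumption, and real browsers use considerably more elaborate rules; the point is only that the Reader decodes with whatever the declaration says.

```python
from email.message import Message

def writer(text, encoding):
    """Encode the text and return (Content-Type header value, body octets)."""
    return f"text/html; charset={encoding}", text.encode(encoding)

def reader(content_type, body):
    """Trust the declared charset and interpret the octets accordingly."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Fallback when nothing is declared (an assumption made for this sketch).
    charset = msg.get_content_charset() or "iso-8859-1"
    return body.decode(charset)

text = "¿Cómo es posible?"
header, body = writer(text, "utf-8")

print(reader(header, body))  # honest declaration: the text survives

# A Writer that lies: UTF-8 octets declared as ISO 8859-1.
print(reader("text/html; charset=ISO-8859-1", body))
```

With the honest declaration the text round-trips unchanged; with the false one, every character outside the US-ASCII range comes out as two wrong characters.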