Character set problem with MS Word HTML document

I’m having a problem with the character encoding of HTML files produced using Microsoft Word. When I view the local files in a web browser they display fine. However, when I post a file and view it on the web, both Firefox and Internet Explorer use the wrong character encoding. The page loads with the encoding set to Unicode (UTF-8), which makes it display incorrectly. If you manually change the encoding to Western, it displays fine.

See, for example:
http://www.cjc.ca/uroproject/guide/BEG-general/zz-sworn-statement-en.htm

What I can’t understand is that the character set is clearly specified in the head of the HTML as follows:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

Why are the browsers ignoring this? I’ve tried changing this line to other encodings such as charset=iso-8859-1 but the browser still displays in Unicode when the page loads.

Please help! I have no idea why this is happening or what can be done about it.

Thanks,
Avi

Hi,

the HTTP header of your page looks like this:


HTTP/1.0 200 OK
Connection: close
Content-Length: 28830
Content-Type: text/html; charset=UTF-8
Date: Thu, 28 Jul 2005 15:27:35 GMT
Server: Apache/2.0.50 (Fedora)
Last-Modified: Tue, 26 Jul 2005 14:34:32 GMT
ETag: "154677-709e-6e2c6e00"
Accept-Ranges: bytes
Keep-Alive: timeout=15, max=100

So most probably your Apache server sends its own charset in the Content-Type header. A charset sent in the HTTP header takes precedence over the one in the meta tag, which is why the browser ignores the meta tag. If you can’t change the server configuration, the following W3C document may show you a way around it: FAQ: Setting ‘charset’ information in .htaccess
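To illustrate the precedence described above, here is a minimal Python sketch. It is not from the thread: the function names are made up for illustration, the windows-1252 fallback is an assumption, and real browsers follow a more detailed detection algorithm. It only shows the simplified rule (HTTP header first, then meta tag) and what happens to a Windows-1252 apostrophe when the bytes are read as UTF-8.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def meta_charset(html_bytes):
    """Return a charset declared near the top of the HTML, or None."""
    match = re.search(
        rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", html_bytes[:1024], re.IGNORECASE
    )
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """Simplified rule: the HTTP header wins, then the meta tag.

    The final fallback to windows-1252 is an assumption of this sketch.
    """
    return header_charset(content_type) or meta_charset(html_bytes) or "windows-1252"


# A tiny stand-in for a Word-generated page, saved as Windows-1252.
page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
    "<p>I\u2019m</p>"
).encode("cp1252")

served = effective_charset("text/html; charset=UTF-8", page)  # header overrides meta
local = effective_charset(None, page)                         # no header: meta is used

garbled = page.decode(served, errors="replace")  # curly apostrophe becomes U+FFFD
correct = page.decode(local)
```

With no HTTP header (as when opening the file from disk) the meta tag decides, which matches the page displaying fine locally but not online.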

Another solution would be to just change the encoding of the document to UTF-8.

I don’t know if you can make MS Word generate UTF-8. If not, you must make the web server send the character encoding that Word produces (Windows-1252, also known as cp1252). You can do this either by editing the .htaccess or httpd.conf files (for Apache) or by using a server-side scripting language like PHP to send the header for you.
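If Word itself won’t produce usable UTF-8, the files can also be re-encoded after the fact. The following is only a rough Python sketch under some assumptions: the input really is Windows-1252, the meta tag names the charset as windows-1252, and the function names and file paths are hypothetical.

```python
import re
from pathlib import Path


def cp1252_html_to_utf8(raw):
    """Re-encode Windows-1252 HTML bytes as UTF-8 and update the meta charset.

    Raises UnicodeDecodeError if the input contains bytes that are
    undefined in Windows-1252 (for example 0x81).
    """
    text = raw.decode("cp1252")
    # Rewrite only the first charset declaration so it matches the new bytes.
    text = re.sub(
        r"(charset\s*=\s*[\"']?)windows-1252",
        r"\1utf-8",
        text,
        count=1,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")


def convert_file(source, destination):
    """Read source as Windows-1252 HTML and write a UTF-8 copy to destination."""
    Path(destination).write_bytes(cp1252_html_to_utf8(Path(source).read_bytes()))
```

For example, convert_file("page.htm", "page-utf8.htm") would write a UTF-8 copy next to the original (both file names are placeholders). Because only the bytes and the charset declaration change, the rest of Word’s markup is left as it was.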

Thanks to kleineme for that info about the server. It clears up the mystery of why the page displays well locally but not online. Where does one view that info (the HTTP header that you pasted)? It doesn’t show up when you just view source from a browser. I spoke to the server guy and he was surprised that the server character set would override the HTML document and he said there was nothing he could do about it (grrrr).

Thanks to zcorpan for the idea of saving it in UTF-8 in Word. I tried that and it works for the document I linked to. (If you click the link above you’ll see that it now displays properly.) Unfortunately some of my documents have their formatting screwed up when I save them as Unicode in Word (alas, nothing is ever simple).

~Avi

Hi,

Oops, I forgot to add the URL to the above-mentioned W3C document, sorry. I’ve added it now.

I retrieved the HTTP header of your document with cURL via PHP, but there are other ways to get at it. Here’s another link to the W3C, this time even including the URL ;)

FAQ: Checking HTTP Headers
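As one more way to look at the headers, here is a small Python sketch using only the standard library. It is not what was used above (that was cURL via PHP); the function names are hypothetical, and it assumes the server answers HEAD requests, which not every server does.

```python
import urllib.request


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return (status, reason, [(name, value), ...])."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.reason, list(response.headers.items())


def format_headers(status, reason, headers):
    """Format a status and header list as plain text, one header per line."""
    lines = [f"{status} {reason}"]
    lines += [f"{name}: {value}" for name, value in headers]
    return "\n".join(lines)
```

Calling print(format_headers(*fetch_headers(url))) with the address of a page prints its response headers; the Content-Type line is the one that shows which charset the server announces.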