Regex accented characters

digitalecartoons · November 4, 2007, 8:09pm

Well, I did in this case, because I’m using a Flash form to post values, and even if my site is set to charset iso 8859-1, Flash posts its values as utf-8. I still have to convert them to iso 8859-1 with utf8_decode before processing further. Otherwise, if I submit a name like “René”, it displays as “Ren%08” (or something like that) in the sent mail.

But anyway, for hexadecimal values it should always be set to iso8859-1 charset? (that’s what you mean by page encoding isn’t it?)

digitalecartoons · November 4, 2007, 8:32pm

I’ve tested it with:

<?php
echo "\xE9";
?>

which should display the é character.
The browser opens it as iso 8859-1 by default. When I change the browser to the utf8 charset, it displays the question mark character. Refreshing the page changes the browser’s charset to iso 8859-1 again. So even when not set specifically, I assume that php indeed sees strings as iso 8859-1? So even though I didn’t, would I even have to set the page encoding for that purpose?

Anyway, Flash for some reason always posts its values as utf8 encoded. That is why I had to add an extra step of using utf8_decode to switch it back to iso 8859-1. Even when php sees strings as iso 8859-1. Otherwise, like I said, names like René would arrive displaying with utf8 encoded characters.

kyberfabrikken · November 4, 2007, 9:05pm

I reckon Flash sends data as UTF-8, regardless of the charset of the page. As does Javascript. If you submit a form element, the browser will use the same encoding as the page was served as.
What happens if you don’t specify the charset is undocumented, but most browsers would default to ISO-8859-1. Normally, Apache will be set up to send a charset header by default if you don’t specify one from PHP. You can see if this is the case using the LiveHttpHeaders plugin for Firefox, or similar.
PHP is blissfully ignorant of any other charset, than ISO-8859-1. It expects everything to be that way. That’s why you need to explicitly decode, when the client sends data in UTF-8.
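The decoding step described here can be sketched as follows. This is only an illustration, not the poster's actual script: the variable names are made up, and it uses mb_convert_encoding (which assumes the mbstring extension is loaded) rather than the utf8_decode call mentioned in the thread, since utf8_decode is deprecated as of PHP 8.2.

```php
<?php
// "René" as UTF-8 bytes, the way Flash would post it (é = 0xC3 0xA9).
$posted = "Ren\xC3\xA9";

// Convert to ISO-8859-1, where é is the single byte 0xE9.
// Assumption: the mbstring extension is available.
$latin1 = mb_convert_encoding($posted, 'ISO-8859-1', 'UTF-8');

echo strlen($posted), "\n";  // 5 (bytes)
echo strlen($latin1), "\n";  // 4 (bytes)
echo bin2hex($latin1), "\n"; // 52656ee9
```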

stereofrog · November 5, 2007, 9:11am

Well, this is something you should have said from the start… POSIX classes like [:alpha:] don’t work with utf8. Either convert everything back to iso, as already suggested, or use utf8 expressions with \p classes and the ‘u’ modifier:


$only_letters = preg_match('/^\\pL+$/u', $utf_string);
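To make the behaviour of that one-liner easier to see, here is a small sketch that wraps it in a function and feeds it a few inputs. The function name only_letters is hypothetical; the pattern is the one from the post.

```php
<?php
// Hypothetical helper: true when $utf8 is non-empty and contains only letters.
function only_letters($utf8)
{
    // \pL matches any Unicode letter; the "u" modifier makes PCRE treat
    // both the pattern and the subject as UTF-8.
    return preg_match('/^\pL+$/u', $utf8) === 1;
}

var_dump(only_letters("Ren\xC3\xA9")); // UTF-8 "René": bool(true)
var_dump(only_letters("Ren\xE9"));     // ISO-8859-1 bytes, not valid UTF-8: bool(false)
var_dump(only_letters("Rene2"));       // contains a digit: bool(false)
```

Note that a string which is not valid UTF-8 makes preg_match fail outright with the ‘u’ modifier, which is why the ISO-8859-1 input is rejected rather than partially matched.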

digitalecartoons · November 5, 2007, 2:30pm

Well, in my test with alpha I didn’t use flash, but it didn’t work anyway. I guess it has to do with locale settings.

But where does it say on php.net that php always uses iso 8859-1 for strings? I can’t find anything about it.

stereofrog · November 5, 2007, 2:40pm

Actually, php core and most functions are completely unaware of the encoding. For the php interpreter, a “string” is just a sequence of bytes; it doesn’t make any assumptions about how the string is encoded. However, some functions do use encoding info; this is usually documented on the function’s man page, e.g. htmlentities.
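A short sketch of what “just a sequence of bytes” means in practice. The variable names are made up for illustration; strlen and htmlentities are the standard functions.

```php
<?php
// The same character, é, stored as two different byte sequences.
$latin1 = "\xE9";     // ISO-8859-1: one byte
$utf8   = "\xC3\xA9"; // UTF-8: two bytes

// strlen() counts bytes, not characters.
echo strlen($latin1), "\n"; // 1
echo strlen($utf8), "\n";   // 2

// htmlentities() is one of the functions that takes the encoding as an
// argument, so it can read either byte sequence as the same character.
$fromLatin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');
$fromUtf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
var_dump($fromLatin1 === $fromUtf8); // bool(true)
```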

For more info on charsets, encodings and all that stuff I’d recommend this excellent article

http://www.phpwact.org/php/i18n/charsets

digitalecartoons · November 5, 2007, 3:00pm

You’re right, I did find stuff on utf8 / iso8859 in functions like htmlentities or utf8_decode, specifying that you can convert from iso8859 to utf8 or other encodings. But nothing specifically saying that php sees strings as iso8859 by default.

I was just wondering if I could use hex codes for e.g. accented letters which are on the standard iso 8859 list (http://www.ascii.cl/htmlcodes.htm), without having set the charset in the php script itself. I wouldn’t want them to fail in someone’s browser for some reason, e.g. because their browser was set to a different charset.

I did notice that with a php script with only a echo “é” my browser automatically switched to iso 8859 even when I set it first to unicode.

Perhaps, even though php doesn’t see strings as iso8859-1 encoded by default (just that it sees it as a sequence of bytes, not specifically as a certain encoding), it would be safer to just set it in the html page by having this charset=iso8859-1 line? Or otherwise with such a charset header line in php?

stereofrog · November 5, 2007, 3:22pm

Yes, immediate byte values like “\xc3\xbc” will be interpreted by the browser according to the encoding being used, for example the above will be displayed as “Ã¼” in ISO-8859, as “УМ” in Cyrillic and as “ü” in utf8. You can use html entities to force encoding-independent display, e.g. “&uuml;” will be rendered as “ü” no matter which encoding is used, but the more general and better approach is always to specify the intended charset with a content-type header, as you suggested. The page I linked above has an example on this.
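As a rough sketch of both options (this is not the example from the linked page, and the variable names are made up): the header call declares the charset, the raw byte depends on that declaration, and the entity does not.

```php
<?php
// Declare the charset before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=ISO-8859-1');
}

$raw    = "\xFC";   // ü as a single ISO-8859-1 byte; relies on the declared charset
$entity = '&uuml;'; // ü as an HTML entity; plain ASCII, independent of the charset

echo "<p>raw byte: $raw</p>\n";
echo "<p>entity: $entity</p>\n";
```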

digitalecartoons · November 5, 2007, 3:38pm

So I have my content type header already set as iso 8859-1. I’m using this regular expression range to check for the special characters in that range (192-255) À-ÿ:
\xC0-\xFF
Which I tested is the same as checking for À-ÿ.
Having set the iso 8859 content type header I guess I would be safe to use both ways for checking this range?
It’s probably not possible to use html entities in regular expressions? Like I did with hexadecimal values above?
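For illustration, a byte-range check of this kind might look like the sketch below. The function name is_latin1_name is hypothetical, and the input is assumed to be ISO-8859-1 already. One thing to be aware of: the range 0xC0-0xFF also contains × (0xD7) and ÷ (0xF7), which are not letters.

```php
<?php
// Hypothetical helper: accepts ASCII letters plus the bytes 0xC0-0xFF.
function is_latin1_name($latin1)
{
    // No "u" modifier: PCRE compares raw bytes, so \xC0-\xFF is a byte range.
    return preg_match('/^[a-zA-Z\xC0-\xFF]+$/', $latin1) === 1;
}

var_dump(is_latin1_name("Ren\xE9"));     // ISO-8859-1 "René": bool(true)
var_dump(is_latin1_name("Ren\xC3\xA9")); // UTF-8 bytes, 0xA9 is out of range: bool(false)
var_dump(is_latin1_name("a\xD7b"));      // 0xD7 is the × sign, but still passes: bool(true)
```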

stereofrog · November 5, 2007, 3:59pm

The “in” encoding is generally not the same as the “out” one: while for html pages this is mostly the case, flash and javascript (ajax) requests are always in utf-8. Unfortunately, there’s no way to tell in which encoding a specific http request was sent (there’s no “encoding” field for requests), therefore all you can do is hope that the client’s encoding is the same as yours. If you have to support multiple charsets (e.g. iso for html and utf for flash) this can end up in a huge mess; that’s why some people recommend using utf8 exclusively, even though php lacks proper support for it (this is going to change in php6, btw).
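Since the request itself carries no encoding information, any server-side check can only be a guess. The sketch below shows one possible heuristic, under the assumption that the mbstring extension is available; to_latin1 is a hypothetical name. It is not foolproof: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well and would be converted by mistake.

```php
<?php
// Hypothetical helper: return $value as ISO-8859-1 bytes.
function to_latin1($value)
{
    // Bytes that form valid UTF-8 are assumed to be UTF-8 and converted.
    // Pure ASCII passes this check too and comes out unchanged.
    if (mb_check_encoding($value, 'UTF-8')) {
        return mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    }
    // Otherwise assume the bytes are already ISO-8859-1.
    return $value;
}

echo bin2hex(to_latin1("Ren\xC3\xA9")), "\n"; // UTF-8 input:      52656ee9
echo bin2hex(to_latin1("Ren\xE9")), "\n";     // ISO-8859-1 input: 52656ee9
```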

digitalecartoons · November 5, 2007, 4:03pm

Yes, I’ve noticed that flash sends its post values as utf.
So what I do is always set the content type header as iso 8859-1 and convert any values posted by flash to iso 8859-1 by using utf8_decode.
So this basically makes everything iso 8859 for the php script I’m using.
I’m right in doing it that way, aren’t I?

stereofrog · November 5, 2007, 4:17pm

digitalecartoons, it really depends on many factors. Sometimes it’s better to convert, sometimes it’s better to keep everything in utf. You should really read the page I linked. It explains the stuff pretty well.

digitalecartoons · November 5, 2007, 6:09pm

It’s a little too complicated for a novice like me. Especially why you should write a page in utf-8 instead of a more default iso 8859-1 page. Isn’t that more for when you write a more international webpage? I want to make a typical Dutch page with the normal 0-191 ascii characters and the accented ones which fall in the 192-255 category. Is iso 8859-1 good enough in my case?

digitalecartoons · November 5, 2007, 6:55pm

I guess I’m a little confused about when to use utf-8. Programs such as dreamweaver start a default html page as iso 8859-1, and most pages I visit are iso 8859 too. So I thought iso 8859 is mostly used by default, and only when you need more characters in a site would you switch to utf-8. Like I said, I would use nothing more than a-z and some accented ones like é ë á ë ö ó.

stereofrog · November 5, 2007, 7:50pm

Let me just quote