Regex accented characters

digitalecartoons · November 4, 2007, 10:58am

For a name regular expression which also should allow accented characters like in René, I made this regex:
!preg_match('~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i'
Is my use of À-ÿ good for that? I got that from the html entity list. Saw that all accented characters fell into that range.

But I’ve also just read about the alpha class, which should take care of that? So could I use this instead? And did I write it correctly?
!preg_match('~^(<alpha|a-z>)([\' \-]<alpha|a-z>)*$~i'


Jake_Arkinstall · November 4, 2007, 10:59am

Well, have you tried it?

digitalecartoons · November 4, 2007, 11:22am

Just did. Think alpha is just the same as A-Z…
But I just discovered that you can also use hex values in a PHP regex with \x. I looked up the hex values for accented characters and this worked:
^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$
But is there anything wrong with using À-ÿ? Or is it safer to use hex values in that case?


Jake_Arkinstall · November 4, 2007, 11:24am

I don’t think there’s anything wrong with Á-ÿ, otherwise there would be an error. Watch out that you don’t change the page encoding, though.

digitalecartoons · November 4, 2007, 11:25am

If I were to change the page encoding, would that also mean that my hex codes wouldn’t be the same?

Kieran_in · November 4, 2007, 11:28am

You need to use \pL as a modifier after each set of characters.

Then use \u

Or

~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i

Should work.

Search for a regex tutorial. Should help you:)


Jake_Arkinstall · November 4, 2007, 11:28am

Edit:

Beat me to it^

No, but in my experience, characters like Á could be completely different. If you don’t change your encoding, it’ll be fine
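
To make the encoding point concrete, here is a minimal sketch (not from the thread) that runs the hex-range pattern against the same name stored as ISO-8859-1 bytes and as UTF-8 bytes. The variable names are illustrative, and the strings use \x escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// The thread's byte-range pattern: the first character must be a letter,
// the rest may also include apostrophe, space, and hyphen.
$pattern = '~^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$~i';

$latin1 = "Ren\xE9";      // "René" as ISO-8859-1: é is the single byte E9
$utf8   = "Ren\xC3\xA9";  // "René" as UTF-8: é is the two bytes C3 A9

var_dump(preg_match($pattern, $latin1)); // int(1)
var_dump(preg_match($pattern, $utf8));   // int(0): byte A9 falls outside C0-FF
```

The hex codes themselves do not change with the page encoding, but the bytes that arrive do. Note also that in ISO-8859-1 the range C0-FF contains × (D7) and ÷ (F7), so the class accepts two characters that are not letters.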

Kieran_in · November 4, 2007, 11:32am

xD

digitalecartoons · November 4, 2007, 12:04pm

You mean when I use hex values like in
^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$

?


stereofrog · November 4, 2007, 12:07pm

The “alpha” class (correct syntax is [:alpha:]) matches any “letter”. The meaning of “letter” depends on the locale; in the en-US locale this includes A-Z and accented characters in ISO-8859 encoding.

You can easily find out what it does on your system using a test code like this:


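// Build a string of every single-byte character from "!" up to byte 0xFF
$all = implode(' ', range("!","\xFF"));
// Highlight in red whatever [:alpha:] matches under the current locale
echo preg_replace('/[[:alpha:]]+/',
	'<font color=red>$0</font>', $all);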

digitalecartoons · November 4, 2007, 12:37pm

I’ve tried this:
^[[:alpha:]a-zA-Z]

But it says ‘false’ when entering René or Rene?

And what’s this about having to use “\pL as a modifier” ?
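
For context on that question: \pL (also written \p{L}) is a PCRE Unicode property escape for "any letter", not a modifier, and in PHP it is normally paired with the /u pattern modifier. The following is only a sketch under the assumption that the input is UTF-8, which is not the setup used in this thread.

```php
<?php
// Assumption: the input is UTF-8. \p{L} matches any Unicode letter,
// and the /u modifier makes PCRE read pattern and subject as UTF-8.
$pattern = '~^\p{L}[\p{L}\' \-]*$~u';

$utf8   = "Ren\xC3\xA9";  // "René" encoded as UTF-8
$latin1 = "Ren\xE9";      // "René" encoded as ISO-8859-1 (not valid UTF-8)

var_dump(preg_match($pattern, $utf8));   // int(1)
var_dump(preg_match($pattern, $latin1)); // bool(false): invalid UTF-8 subject
```

With /u, an ISO-8859-1 subject is rejected outright rather than simply failing to match, so this approach only fits if the whole form is served and submitted as UTF-8.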


stereofrog · November 4, 2007, 12:42pm

^[[:alpha:]a-zA-Z] doesn’t make any sense (nor does “\pL”)

What does this do for you


echo preg_match('/^[[:alpha:]]+$/', 'René');


digitalecartoons · November 4, 2007, 1:06pm

Tested ^[[:alpha:]]+$ on:

Though it accepts normal letters like in ‘rene’
it doesn’t allow accented ones like in ‘rené’
why not?


digitalecartoons · November 4, 2007, 3:12pm

What should I do to have [[:alpha:]] also allow accented letters besides A-Za-z?
I’ve already set the browser’s charset to ISO 8859-1…

kyberfabrikken · November 4, 2007, 4:27pm

Because of your locale setting. Try setting it to something that contains those characters, using setlocale().

bokehman · November 4, 2007, 4:33pm

Either set the correct locale with setlocale() (not always possible) or include the individual characters in the regex character class.

[a-zA-ZáéíóúÁÉÍÓÚ]
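
A rough sketch of the setlocale() route follows. The locale names in $candidates are guesses, since which locales are installed differs from server to server, and the match result is deliberately not stated because it depends on the locale that actually gets applied.

```php
<?php
// Hypothetical locale names: which ones exist depends on the server.
$candidates = ['en_US.ISO-8859-1', 'en_US.ISO8859-1', 'en_US'];

// setlocale() tries each name in turn and returns the one it applied,
// or false if none of them is available.
$applied = setlocale(LC_CTYPE, $candidates);

if ($applied === false) {
    echo "No matching locale installed; fall back to an explicit character class.\n";
} else {
    // "Ren\xE9" is "René" in ISO-8859-1.
    $result = preg_match('/^[[:alpha:]]+$/', "Ren\xE9");
    echo "Locale: $applied, match: $result\n";
}
```

If the first branch is taken, listing the accented characters explicitly (or using the \xC0-\xFF range) is the remaining option.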

digitalecartoons · November 4, 2007, 6:06pm

Or otherwise just use hexcodes like in this?

^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$

(the main page which calls this php script is set by default with a charset of iso-8859-1 anyway)

digitalecartoons · November 4, 2007, 6:16pm

Does a php server have a standard charset setting? The hex codes work even if I haven’t set the charset in the php script specifically. Or is that because my browser has a default setting of iso8859?

digitalecartoons · November 4, 2007, 7:24pm

Should this encoding always be ISO 8859-1?
I’ve already set the index.php as iso 8859-1 and all posted values are first converted to iso 8859-1 with utf8_decode, so names like René display as René in sent email (and not with funny codes).

kyberfabrikken · November 4, 2007, 7:41pm

Encoding and locale are different issues. PHP assumes that strings are ISO-8859-1.
Browsers will send data back, using the same charset as the page was served in. Thus you shouldn’t use utf8_decode on incoming data.
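
One way to see which encoding actually arrives, before deciding whether any conversion is needed, is to test the submitted bytes for UTF-8 validity. This is only a sketch: looks_like_utf8 is a hypothetical helper name, and the sample strings stand in for posted form values.

```php
<?php
// Hypothetical helper: true when $value is valid UTF-8.
// An empty pattern with /u makes preg_match() do nothing but validate the subject.
function looks_like_utf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

var_dump(looks_like_utf8("Ren\xC3\xA9")); // bool(true): UTF-8 bytes for "René"
var_dump(looks_like_utf8("Ren\xE9"));     // bool(false): ISO-8859-1 bytes for "René"
var_dump(looks_like_utf8("Rene"));        // bool(true): plain ASCII is valid in both
```

If values from a page served as ISO-8859-1 come back failing this check, they are already single-byte data and running utf8_decode on them would damage the accented characters.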