Regex accented characters

For a regular expression that validates names and should also allow accented characters, as in René, I made this regex:
!preg_match('~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i'
Is my use of À-ÿ good for that? I took the range from the HTML entity list and saw that all accented characters fall inside it.

But I’ve also just read about the alpha class, which should take care of that? So could I use this instead? And did I write it correctly?
!preg_match('~^(<alpha|a-z>)([\' \-]<alpha|a-z>)*$~i'

Well, have you tried it?

Just did. Think alpha is just the same as A-Z…
But I just discovered that you can also use hex values in a PHP regex with \x. I looked up the hex values for the accented characters and this worked:
^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$
But is there anything wrong with using À-ÿ? Or is it safer to use hex values in that case?

I don’t think there’s anything wrong with À-ÿ, otherwise there would be an error. Watch out that you don’t change the page encoding, though.

If I were to change the page encoding, would that also mean that my hex codes wouldn’t be the same?
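To make the encoding question concrete, here is a minimal sketch (the variable names are illustrative, not from the thread). It assumes the pattern is used without the u modifier, so PCRE compares single bytes: \xC0-\xFF only lines up with accented letters when the input really is ISO-8859-1.

```php
<?php
// Byte-oriented pattern from the thread (no u modifier).
$pattern = '~^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$~i';

// "René" as ISO-8859-1: é is the single byte 0xE9, inside \xC0-\xFF.
$latin1 = "Ren\xE9";

// "René" as UTF-8: é is the two bytes 0xC3 0xA9, and 0xA9 is outside the range.
$utf8 = "Ren\xC3\xA9";

$latin1Ok = preg_match($pattern, $latin1); // 1
$utf8Ok   = preg_match($pattern, $utf8);   // 0

var_dump($latin1Ok, $utf8Ok);
```

So the hex codes themselves do not change, but they stop matching once the submitted data arrives in a different encoding. Note also that in ISO-8859-1 the range \xC0-\xFF includes × (0xD7) and ÷ (0xF7), which are not letters.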

You need to use \pL as a modifier after each set of characters.

Then use the u modifier.

Or

~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i

Should work.

Search for a regex tutorial. Should help you :)

Edit:

Beat me to it^ :smiley:

No, but in my experience, characters like Á could be completely different. If you don’t change your encoding, it’ll be fine :slight_smile:

xD

You mean when I use hex values like in
^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$

?

The “alpha” class (correct syntax is [:alpha:], used inside a bracket expression as [[:alpha:]]) matches any “letter”. The meaning of “letter” depends on the locale; in an en-US locale it includes A-Z and the accented characters of the ISO-8859 encoding.

You can easily find out what it does on your system using a test code like this:


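// Build a string of every single-byte character from "!" (0x21) to 0xFF
$all = implode(' ', range("!", "\xFF"));
// Highlight whatever [[:alpha:]] matches under the current locale
echo preg_replace('/[[:alpha:]]+/',
	'<font color=red>$0</font>', $all);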

I’ve tried this:
^[[:alpha:]a-zA-Z]

But it says ‘false’ when entering René or Rene?

And what’s this about having to use “\pL as a modifier”?
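For reference, \pL is not a modifier but a Unicode property escape meaning "any letter"; it is normally combined with the u pattern modifier. The following is a minimal sketch, assuming the name arrives as UTF-8 (the variable names are illustrative only):

```php
<?php
// \pL matches any Unicode letter; the u modifier makes PCRE treat
// both the pattern and the subject as UTF-8.
$pattern = '~^\pL[\pL\' \-]*$~u';

$utf8Name   = "Ren\xC3\xA9"; // "René" encoded as UTF-8
$latin1Name = "Ren\xE9";     // "René" encoded as ISO-8859-1

$utf8Result   = preg_match($pattern, $utf8Name);   // 1
$latin1Result = preg_match($pattern, $latin1Name); // false: subject is not valid UTF-8

var_dump($utf8Result, $latin1Result);
```

With the u modifier, preg_match() returns false rather than 0 when the subject is not well-formed UTF-8, so this approach only fits pages that are actually served and submitted as UTF-8.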

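^[[:alpha:]a-zA-Z] doesn’t make much sense: [:alpha:] already covers a-zA-Z. (And “\pL” is not a modifier; it is the Unicode letter property, which needs UTF-8 input and the u modifier.)

What does this do for you

echo preg_match('/^[[:alpha:]]+$/', 'René');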

Tested ^[[:alpha:]]+$ on:

Though it accepts normal letters like in ‘rene’,
it doesn’t allow accented ones like in ‘rené’.
Why not?

What should I do to have [[:alpha:]] also allow accented letters besides A-Za-z?
I’ve already set the browser’s character set to ISO 8859-1…

Because of your locale setting. Try setting it to a locale that contains those characters, using setlocale().

Either set the correct locale with setlocale() (not always possible) or include the individual characters in the regex character class.

[a-zA-ZáéíóúÁÉÍÓÚ]
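The setlocale() route can be sketched roughly as follows. This is only an illustration: matchesAlphaInLocale() is a hypothetical helper, the locale names are guesses that differ between systems, and whether [[:alpha:]] then accepts é depends on what is installed on the server.

```php
<?php
/**
 * Hypothetical helper: test $name against [[:alpha:]] under the first
 * available locale from $locales. Returns null when none is installed.
 */
function matchesAlphaInLocale(string $name, array $locales): ?bool
{
    $previous = setlocale(LC_CTYPE, '0');   // '0' only reads the current setting
    if (setlocale(LC_CTYPE, $locales) === false) {
        return null;                        // none of the locales exists here
    }
    $matched = preg_match('/^[[:alpha:]]+$/', $name) === 1;
    setlocale(LC_CTYPE, $previous);         // restore the original locale
    return $matched;
}

// Assumed locale names; check `locale -a` (or the Windows equivalent) first.
var_dump(matchesAlphaInLocale("Ren\xE9", ['en_US.ISO-8859-1', 'en_US.iso88591', 'en_US']));
```

If the call prints NULL, none of the assumed locales is available, and listing the characters explicitly in the character class is the more portable option.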

Or otherwise just use hex codes, like this?

^[a-z\xC0-\xFF][\'a-z\xC0-\xFF \-]*$

(the main page which calls this php script is set by default with a charset of iso-8859-1 anyway)

Does a PHP server have a standard charset setting? The hex codes work even though I haven’t set the charset in the PHP script specifically. Or is that because my browser has a default setting of ISO 8859-1?

Should this encoding always be ISO 8859-1?
I’ve already set index.php to ISO 8859-1, and all posted values are first converted to ISO 8859-1 with utf8_decode, so names like René display as René in the sent email (and not with funny codes).

Encoding and locale are different issues. PHP treats strings as plain byte sequences, which in practice means it behaves as if they were in a single-byte encoding such as ISO-8859-1.
Browsers send data back using the same charset the page was served in, so you shouldn’t run utf8_decode on incoming data.
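If it is unclear which encoding a form actually submits, one way to check before converting anything is sketched below. looksLikeUtf8() is a hypothetical helper name; it relies on preg_match() returning false when the u modifier meets a subject that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true when $value is well-formed UTF-8.
// The empty pattern matches anything, so the only way this fails
// is PCRE rejecting the subject as invalid UTF-8.
function looksLikeUtf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

var_dump(looksLikeUtf8("Ren\xC3\xA9")); // true:  "René" as UTF-8
var_dump(looksLikeUtf8("Ren\xE9"));     // false: "René" as ISO-8859-1
```

Plain ASCII such as "Rene" also passes, since ASCII is valid UTF-8, so the check only tells the two encodings apart when accented characters are present.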