Accented characters into SQL table

Continuing the discussion from Best way to sanitize Email input for SQL:

I’m working on another form script. In this one I do need some fields to accept accented characters. I have a regex in preg replace that seems to work, when I test the regex.

$owners = preg_replace('#[^A-Za-zÀ-ÿ 0-9,.?$@&!]#i', '', $_POST['owners']);

The problem arises when I put the data into my SQL table, any accented characters get replaced with à in the table. I thought maybe I need to encode the characters, but when I try to use htmlspecialchars I just get a blank string. How do I fix this?

What character encoding is in use in the database table and for the connection to the database?

I’m not sure. Where would I check that?

I left this before as I was getting nowhere with it, but trying again now.
To simplify things and take it a step at a time, I’m just looking at doing the conversion in php before bringing sql into the picture.
I thought this should be very simple, but I just can’t get it to work.
Say I have a string with various character in it, some accented and suchlike. Eg:-
"Guòrun Blöndal & 1897"
and what I want to end up with is:-
"Guòrun Blöndal & 1897"
or
Guòrun & Blöndal 1897
Either will do.
I have found that htmlspecialchars() takes care of the ampersand, but not the accented characters.
Maybe I’m not using the correct flags or encoding, I’m not sure. Or should I be using something else?

I recall having this trouble a few years ago and found the problem was the database collation.

To eliminate that problem for now, I’m just doing a pure php/html script (very simple test) that doesn’t even think about sql. Once I have that working, I will pass the encoded strings to sql.

I’m just having one of those “Duh” moments. I was looking at the html output of the test in Inspector mode in FF, thinking what I’m looking at is the source. When I do look at the actual source, I see that htmlentities() is doing the trick perfectly. But the inspector was showing ò as ò.

1 Like

Taking it one step at a time, to identify where a problem occurs, I am a little further, but not yet done.

$encoded = htmlentities($input, ENT_COMPAT,'UTF-8', true);

on its own works, the next step is to pass it through a form.
So if the form has;

<input type="text" name="input">

and the processing script has:

$ecoded = htmlentities($_POST['input'], ENT_COMPAT,'UTF-8', true);

then I get what I want, the characters become &ograve; and &amp; etc…
But this has no sanitization. If I try to clean it up with:

$output = preg_replace('#[^A-Za-zÀ-ÿ 0-9,.?&$-@!]#i', '', $_POST['input']);

The result is that the characters get swapped for &Atilde; (Ã).
Is this due the the character set that php is using?
So how do I sanitize this and get the right result?

PHP language support sure has grown since the last time I worked with it

http://php.net/manual/en/refs.international.php

So there might be something better now, but I have used iconv before and it should help you get a step further

http://php.net/manual/en/book.iconv.php

Just realised, my experience with international characters in a MySQL db goes back to the FIRST edition of Kevin Yank’s Build your own database driver website using PHP & MySQL!

I can’t seem to make any sense of any of this. I have experimented with iconv, but don’t know what I’m doing. It has an input set and an output set, but I don’t know what they should be, though I guess the output should be UTF-8. On some settings the ò comes out as ò and the ö as ö. So I think the preg_replace gives me à for everything because it strips off the second character.
But I’m still clueless as to how to sanitize and get the result I want.

I think you are making this too complicated. You do NOT need htmlentities if you are going to insert the value into a html form of a page encoded in UTF-8. You need entities like &ograve;, etc. only if your html page is in an encoding that does not support the characters you want to use. UTF-8 can represent characters from any encoding so if your page is in UTF-8 then you don’t need entities and you can simply output ò and other characters directly.

Therefore htmlspecialchars() is enough - it will convert ampersands and quotes to entities, which is important because those characters have special meaning in html attributes.

This may not work when your input is in UTF-8 because regex in PHP does not work with single-byte encodings by default. You might try using the u (Unicode) modifier for your regex.

1 Like

I worked out that the preg_replace was my stumbling point, but didn’t understand why.

I added the u modifer and the string comes through the regex OK now. A real step forward. Thank you.

I think I headed down that route because there was clearly an problem with those characters which had to be addressed. But I had not yet identified where in the script that problem was occurring. After breaking the process down step by step, I now know the problem was in the regex and now know how to fix it with the u modifier.
So the next step is to enter it into the database then finally, retrieve it and display it on page.

I wrongly thought that this had all suddenly become very easy. But with further testing I have found another problem after adding the u modifier.
Perhaps I’m adding it in the wrong way, I was a bit unsure how to do it. Since adding it, it appears that the preg_replace is not removing characters, I can put whatever I want in the form, and the unwanted characters don’t get stripped out.

$output = preg_replace('#[^A-Za-zÀ-ÿ 0-9,.?&$@!]/u#i', '', $_POST['input']);

That’s what I have. What is wrong? It appears to do nothing now.

This is wrong, all modifiers should go at the end of the string after the pattern, just like your i modifier:

$output = preg_replace('#[^A-Za-zÀ-ÿ 0-9,.?&$@!]#iu', '', $_POST['input']);

Sorry, my mistake, that should be “regex in PHP does not work with multi-byte encodings by default” - of course!

Tried this, and it appears to be giving a null output.

I have found another way of getting regex to keep accented characters:-

$output = preg_replace('#[^\p{L}\s-]#u', '', $_POST['input']);

That works. I also added some ranges for numbers and some other characters I may want to keep:-

$output = preg_replace('#[^\p{L}\s- 0-9 $-\']#u', '', $_POST['input']);

And that works. But I want to add a few more characters for some fields, pound (£) to name one, but I can’t get that to work.

I’ve just tried running this:

echo preg_replace('#[^A-Za-zÀ-ÿ 0-9,.?&$@!]#iu', '', 'some.name@example.com');

and it works fine. Probably you have some other error not related to preg_replace.

Strange, I just copy/pasted that exact line into my script and it did nothing. :confused:
Any ideas?