I’m working on another form script. In this one I do need some fields to accept accented characters. I have a regex in preg_replace that seems to work when I test the regex on its own.
The problem arises when I put the data into my SQL table: any accented characters get replaced with Ã in the table. I thought maybe I needed to encode the characters, but when I try to use htmlspecialchars I just get a blank string. How do I fix this?
I left this before as I was getting nowhere with it, but trying again now.
To simplify things and take it a step at a time, I’m just looking at doing the conversion in php before bringing sql into the picture.
I thought this should be very simple, but I just can’t get it to work.
Say I have a string with various characters in it, some accented and suchlike. E.g. "Guòrun Blöndal & 1897"
and what I want to end up with is "Gu&ograve;run Bl&ouml;ndal &amp; 1897"
or "Gu&#242;run &#38; Bl&#246;ndal 1897"
Either will do.
I have found that htmlspecialchars() takes care of the ampersand, but not the accented characters.
Maybe I’m not using the correct flags or encoding, I’m not sure. Or should I be using something else?
To eliminate that problem for now, I’m just doing a pure php/html script (very simple test) that doesn’t even think about sql. Once I have that working, I will pass the encoded strings to sql.
I’m just having one of those “Duh” moments. I was looking at the HTML output of the test in Inspector mode in FF, thinking that what I was looking at was the source. When I look at the actual source, I see that htmlentities() is doing the trick perfectly. But the inspector was showing &ograve; as ò.
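For anyone following along, here is a minimal sketch of the difference between the two functions on the sample string. The accented letters are written as explicit UTF-8 bytes, which is my own choice for the example, so the result does not depend on how the script file is saved.

```php
<?php
// The sample string with explicit UTF-8 bytes (\xC3\xB2 = ò, \xC3\xB6 = ö).
$name = "Gu\xC3\xB2run Bl\xC3\xB6ndal & 1897";

// htmlspecialchars() only converts the HTML-special characters (& < > and quotes).
$special = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts every character that has a named entity.
$entities = htmlentities($name, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // accented letters untouched, & becomes &amp;
echo $entities, "\n";  // Gu&ograve;run Bl&ouml;ndal &amp; 1897
```

Viewing the page source (not the inspector) shows these strings exactly as PHP produced them.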
The result is that the characters get swapped for Ã (&Atilde;).
Is this due to the character set that PHP is using?
So how do I sanitize this and get the right result?
Just realised, my experience with international characters in a MySQL db goes back to the FIRST edition of Kevin Yank’s Build Your Own Database Driven Website Using PHP & MySQL!
I can’t seem to make any sense of any of this. I have experimented with iconv, but don’t know what I’m doing. It has an input set and an output set, but I don’t know what they should be, though I guess the output should be UTF-8. On some settings the ò comes out as Ã² and the ö as Ã¶. So I think the preg_replace gives me Ã for everything because it strips off the second character.
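A small sketch of where the Ã² comes from, assuming the data is already UTF-8 and is being converted a second time as if it were ISO-8859-1. The variable names are made up for the illustration; iconv() takes the source encoding first and the target encoding second.

```php
<?php
$utf8   = "\xC3\xB2";  // ò encoded as UTF-8 (two bytes)
$latin1 = "\xF2";      // ò encoded as ISO-8859-1 (one byte)

// Correct use: the input really is ISO-8859-1.
$correct = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Wrong use: the input is already UTF-8, so each of its two bytes is
// treated as a separate Latin-1 character (Ã and ²) and encoded again.
$mojibake = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($correct), "\n";   // c3b2
echo bin2hex($mojibake), "\n";  // c383c2b2, which displays as Ã²

// A quick validity check: an empty pattern with /u fails on invalid UTF-8.
$isUtf8 = preg_match('//u', $utf8) === 1;
```

If the form page is served as UTF-8, the submitted data should already be UTF-8, in which case no iconv() step is needed at all.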
But I’m still clueless as to how to sanitize and get the result I want.
I think you are making this too complicated. You do NOT need htmlentities if you are going to insert the value into an HTML form on a page encoded in UTF-8. You need entities like &ograve;, etc. only if your HTML page is in an encoding that does not support the characters you want to use. UTF-8 can represent the characters of any other encoding, so if your page is in UTF-8 you don’t need entities and can simply output ò and other characters directly.
Therefore htmlspecialchars() is enough - it will convert ampersands and quotes to entities, which is important because those characters have special meaning in html attributes.
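As a rough sketch, this is how the value could go into a form field. The field name is invented for the example. The second half shows one possible explanation for the blank string mentioned earlier: when the flags are passed explicitly without ENT_SUBSTITUTE or ENT_IGNORE, htmlspecialchars() returns an empty string if the input is not valid in the given encoding, for example after a regex has removed half of a two-byte character.

```php
<?php
// Sample input with explicit UTF-8 bytes, plus quotes and an ampersand.
$name = "Gu\xC3\xB2run \"Bl\xC3\xB6ndal\" & 1897";

// Escape for use inside a double-quoted HTML attribute.
$value = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
$field = '<input type="text" name="name" value="' . $value . '">';

// A damaged string: the lead byte \xC3 has lost its second byte.
$damaged = "Gu\xC3run";

// Invalid UTF-8 and no ENT_SUBSTITUTE: the whole result is an empty string.
$blank = htmlspecialchars($damaged, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the bad byte is replaced and the rest survives.
$visible = htmlspecialchars($damaged, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
```

So a blank result is usually a sign that the string was already broken before it reached htmlspecialchars().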
Your preg_replace may not work when the input is in UTF-8 because regex in PHP treats the subject as single-byte characters by default, so a two-byte character such as ò is seen as two separate characters. Try adding the u (Unicode) modifier to your regex.
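To illustrate, here is a minimal sketch of the byte versus character behaviour. The whitelist pattern is only an assumption about what the form should accept (letters, digits, spaces and ampersands), not the pattern from the original script.

```php
<?php
$o = "\xC3\xB2";  // ò as UTF-8: two bytes, one character

// Without /u the dot matches single bytes, so ò counts as two matches.
$withoutU = preg_match_all('/./', $o);

// With /u the subject is read as UTF-8, so ò counts as one match.
$withU = preg_match_all('/./u', $o);

// Hypothetical whitelist: keep Unicode letters, digits, spaces and &.
$input = "Gu\xC3\xB2run <Bl\xC3\xB6ndal> & 1897!";
$clean = preg_replace('/[^\p{L}\p{N} &]/u', '', $input);

echo $withoutU, ' ', $withU, "\n";  // 2 1
echo $clean, "\n";                  // angle brackets and ! removed
```

The u goes after the closing delimiter of the pattern, alongside any other modifiers.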
I worked out that the preg_replace was my stumbling point, but didn’t understand why.
I added the u modifier and the string comes through the regex OK now. A real step forward. Thank you.
I think I headed down that route because there was clearly a problem with those characters which had to be addressed. But I had not yet identified where in the script that problem was occurring. After breaking the process down step by step, I now know the problem was in the regex and now know how to fix it with the u modifier.
So the next step is to enter it into the database then finally, retrieve it and display it on page.
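For the database step, a rough sketch using PDO, on the assumption that the table and connection are both meant to be UTF-8. The host, database, table and column names here are all hypothetical. The important part is the charset in the DSN, since a mismatch between the connection charset and the data is a common cause of Ã appearing in the table.

```php
<?php
// Build a MySQL DSN that asks for a UTF-8 connection (utf8mb4).
function buildDsn(string $host, string $db): string
{
    return "mysql:host={$host};dbname={$db};charset=utf8mb4";
}

// Insert the raw (not entity-encoded) name using a prepared statement.
// The table "people" and column "name" are made up for this example.
function insertName(PDO $pdo, string $name): void
{
    $stmt = $pdo->prepare('INSERT INTO people (name) VALUES (:name)');
    $stmt->execute([':name' => $name]);
}

echo buildDsn('localhost', 'test'), "\n";
```

Usage would be along the lines of `$pdo = new PDO(buildDsn('localhost', 'test'), $user, $pass);` followed by `insertName($pdo, $clean);`, with htmlspecialchars() applied only later, when the value is printed on a page.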
I wrongly thought that this had all suddenly become very easy. But with further testing I have found another problem after adding the u modifier.
Perhaps I’m adding it in the wrong way; I was a bit unsure how to do it. Since adding it, it appears that the preg_replace is not removing characters: I can put whatever I want in the form, and the unwanted characters don’t get stripped out.
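Two things worth checking here, sketched below under the assumption that the pattern is a negated whitelist. First, the u has to sit after the closing delimiter; inside the brackets it is just the letter u. Second, with the u modifier preg_replace returns null when the input is not valid UTF-8, so the result should be checked. The function name and the pattern are hypothetical.

```php
<?php
// Strip everything except Unicode letters, digits, spaces and ampersands.
function stripUnwanted(string $input): string
{
    // Right: '/[^...]/u'. Wrong: '/[^...u]/', which only adds "u" to the class.
    $result = preg_replace('/[^\p{L}\p{N} &]/u', '', $input);

    if ($result === null) {
        // With /u, invalid UTF-8 in $input makes preg_replace() return null.
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }

    return $result;
}

echo stripUnwanted("Bl\xC3\xB6ndal!?"), "\n";  // punctuation removed
```

If the function throws, the input was not valid UTF-8 to begin with, which points back at the encoding of the form page or of the test data.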