UTF-8 Character Problems With MySQL

I have never had so much trouble with anything in my life as I have with this UTF-8 character set.

I have a site and a phpBB forum. I run a query on the phpBB topics table to pull the most recently updated topics for display elsewhere on my site. I basically pull the topic title.

The problem is that Russian characters (from spammers) show up just as question marks elsewhere on my site but display fine as Russian characters in the forum.

I have this line in my HTML for both the forum and my website:

<meta http-equiv=“content-type” content=“text/html; charset=UTF-8” />

On the Home page for phpMyAdmin it displays this information about the character set:

MySQL charset: UTF-8 Unicode (utf8)

phpMyAdmin lists the collation on the phpBB topics table as:

utf8_bin

The collation on the topic_title field inside the phpBB topics table is:

utf8_unicode_ci

When I pull the topic_title to display on the rest of my site, I do a basic mysqli query, loop through the result set and then display the field. $row[‘topic_title’].

So, what could possibly be going wrong to cause my query to display question marks when phpBB displays the Russian characters just fine?

Do I need to put something in my PHP code to force the MySQL server to always return UTF-8? Or is it a problem in my web page where I have to manually set a UTF-8 header for each and every page view?

Thanks for your help.

Try to run this before any query:
“SET NAMES UTF8”;

Thanks for the advice, Steve. It seems a pain in the butt to have to run an additional query every time I do something. But if is the way to do it, then I’ll do it.

I ran show variables on MySQL and this is what it returned:

character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
collation_connection utf8_general_ci
collation_database latin1_swedish_ci
collation_server latin1_swedish_ci

Could this be causing my problem? And could I just run a query to permanently change the character set (collation) to utf8_general_ci or something? Perhaps change the collation in phpMyAdmin?

But would I risk screwing up anything else? If the database returns UTF-8, PHP will echo out UTF-8 without any additional handling, won’t it?

This query is fast, I do it on each page load for every site that has non latin chars. Right now you should not
be trying to change collation in the db as it might break something. Just try it ans see if it works.

OK, thanks. :slight_smile:

Are you using the MySQL or MySQLi interface? Both have a mysqli_set_charset function which should be used right after the connection to MySQL has been established, this sets the charset for the connection that has been established to use whatever charset you have set it to use.

I’m using mysqli. I was wondering how phpBB does it. Russian characters come out right in phpBB. I formatting the output of the topic_title using htmlspecialchars(), and I’m using UTF-8 as the $charset parameter. So I don’t think that would be causing me any problems.

As a side note, using mysqli::set_charset() is the recommended way of setting the character set for communication with the MySQL server over a query using SET NAMES.

This is the preferred way to change the charset. Using mysqli::query() to execute SET NAMES … is not recommended.

I guess going forward I’ll incorporate a call to mysqli::set_charset() every time I make a connection to the DB just as a precaution to prevent any problems. Live and learn! :slight_smile:

Thanks for the suggestions, folks.

Stupid question – you have the meta set, but how about your mime-type? Meta is mostly lip service and ignored by 99% of user agents. Is your server sending a iso-8859-1 mime-type by default?

Are you using the php “header” command thus in your script?

header(‘Content-Type: text/html; charset=UTF-8’);

If you aren’t, the server is outputting whatever it’s default is, and on 99.99% of servers that’s iso-8859-1.

If you have opera installed, go to your offending page, open up the “info” pane, and see what it says is being used for encoding and mime-type. Dimes to dollars this is your problem.

Thanks for the advice. I did read elsewhere that it is a good idea to set the header to UTF-8 explicitly. But I also read somewhere that Apache will examine the HTML for the charset and then send it to the browser based on whatever is specified in the code.

The HTML is generated by PHP. UTF-8 is specified in the HTML head as mentioned above and according to Firefox’s Page Info (Tools -> Page Info) the page is encoded as UTF-8. It’s reaching the browser as UTF-8.

My hunch is that the database collation is what is screwing things up and the data connection is not transmitting in UTF-8.

It is a little confusing. We have the charset in the HTML, the possibility of a browser header specifying the charset, then we have the MySQL server character set, the database collation, and finally the field collation. There’s a lot of places where something get go amiss.

I’d be interested in hearing where that came from – I’ve never ONCE seen apache by default send UTF-8 just because you output it in the meta. In fact i’ve never heard of Apache parsing the files it’s sending unless it was for SHTML. That’s REALLY none of it’s business in a way… It’s also why on all of my servers I put in this .htaccess when doing utf-8 templates for people:

<FilesMatch "\\.(htm|html)$">
   AddDefaultCharset UTF-8
   DefaultLanguage en-US
</FilesMatch>

Hell, by the time Apache HAS the markup it’s too late to send the connection header. Part of why in PHP you HAVE to call the header function before you output anything else.

I also suggested Opera because it will show you the ACTUAL mime-type in the content header… the view->character encoding is pretty useless in FF… and with the web developer toolbar being broken on reporting header info on anything newer than FF 3.5… Though it’s Information > response headers should tell you what’s going on…

You are correct though, there are SO MANY possible points of failure along the chain it’s ridiculous.

Long back in my case it was encoding of the add file that was causing that problem…
and
The problem was there when i use to copy and paste sql but when i use to import sql file using utf-8 setting it use to work fine

There is not as much documentation on how to use Opera as I would like. I went into View -> Developer Tools -> Page Information and that said the encoding used by Opera was UTF-8 and that the MIME type was text/html.


Encoding (used by Opera): utf-8 (utf-8)

Is that what you are looking for? Or should I be looking for something else and if so, where do I find it? I’ve looked all over and can’t find anything else.

Another site on the same server is served in ISO-8859-1 as specified in the HTML meta according to Opera’s Page Information.

If I’m not looking at the right thing, let me know.

I expect that this:


<FilesMatch "\\.(htm|html|php|css|js)$">

will work to serve PHP, CSS, and JS as UTF-8, too?

Let me know where.

I checked a couple HTTP Header checker websites and it looks like there is no content-type header being sent by the server I am on.

That was what I was suggesting looking at… do you have the web dev toolbar installed for FF? What does it report?

NEVER heard of Apache much less PHP sending content-type other than iso-8859-1 as the default unless you edit it to do so. The header checkers saying none is effectively the same thing… if “information>Response headers” in the web dev toolbar for FF doesn’t list content-type, you’re sending none, so the browser is best guessing… Smarter browsers like Opera and Chrome may actually switch to UTF-8 because of the Meta – FF it’s probably a coin toss given it’s still Netscape 4’s sweetly retarded cousin (life is like a box of open source) … but if there’s no header IE will default to ISO-8859-1 on IE6/newer assuming not in quirks mode, and IE5/earlier will actually default to windows-1252.

Do you have a link to the page in question?

Oh, and yes, that change will cover all those files – NOT that I’d bother with .CSS since by the specification it’s only supposed to have ASCII7 in it in the first place even if you change the encoding.