Htmlspecialchars() outputs null value if accented characters in string

I’ve got a head scratcher that I’m not sure how to deal with. I’m using htmlspecialchars to display user-supplied values on a page. On my local server (PHP version 5.1.4) it works just as expected, but on the live site (PHP version 5.2.9), if there are any accented characters in the string, I get an empty value back after running it through the htmlspecialchars function.

Here is some sample code I’ve been working with, if it’s at all helpful.


$text = "canapés";
echo 'My text is '. $text;
$text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
echo 'The encoded text is ' . $text;

Output:

My text is canapésThe encoded text is

Help?

Try htmlentities() instead?

I’m unable to reproduce your problem (PHP 5.3, Apache 2, Windows). However, you shouldn’t need to specify the charset, since, in your case, the characters affected by the function are in the same positions as in ISO-8859-1 (htmlspecialchars() docs).
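To illustrate the point about ISO-8859-1, here is a minimal sketch. The sample string and variable names are my own assumptions, not the original poster's code. The characters htmlspecialchars() rewrites (&, <, >, " and ') are all plain ASCII, so with a single-byte charset the accented byte should pass through untouched.

```php
<?php
// Hypothetical sample: "canapés aren't" as ISO-8859-1 bytes (é is the single byte 0xE9).
$latin1 = "canap\xE9s aren't";

// Only the ASCII apostrophe is rewritten; the 0xE9 byte is left as it is.
$encoded = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');

// Print the result as hex so the raw bytes are visible regardless of terminal encoding.
echo bin2hex($encoded), "\n";
```

The hex dump should still contain a lone e9 byte, followed later by the bytes of the apostrophe entity.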

Did you post all the code?

Yep, that was the whole code that I was using for testing. After the first suggestion to use htmlentities instead (which works fine) I started poking at it some more, and what I’ve found is that if I don’t specify the charset, then it works.


$text = 'canapés aren\'t popular';
echo 'My text is '. $text;
$text = stripslashes(htmlentities($text, ENT_QUOTES));
echo 'The htmlentities text is ' . $text;

$text = 'canapés aren\'t popular';
echo 'My text is '. $text;
$text = htmlspecialchars($text, ENT_QUOTES);
$text=str_replace('&amp;','&',$text);
echo 'The encoded text is ' . $text;

Output:


My text is canapés aren't popular
The html entities text is canapés aren& #039;t popular
My text is canapés aren't popular
The encoded text is canapés aren&# 039;t popular

I’m the first to admit that what I know about character encoding would fit on the head of a pin, but I’m particularly concerned about it for this project because the site serves an international audience and needs to be able to accurately reproduce character sets from multiple languages.

P.S. No matter what I do, the forum software is converting my apostrophes so I stuck spaces in there to try to force it.

I’m afraid I’m at a loss, since I’m still unable to reproduce the problem. Just out of curiosity, though, what is the output on your live server for the following:

<?php
header('Content-Type: text/plain; charset=UTF-8');
var_dump(htmlspecialchars('canapés', ENT_QUOTES, 'UTF-8'));

You might also try searching for PHP bugs related to htmlspecialchars().

I can’t really explain it either, since neither of the hex pairs that make up e-acute is a quote character… perhaps it’s being screwed up by the character being read as e’ or e` or some other quote-containing sequence? Have you tried parsing the string without the ENT_QUOTES flag?
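For reference, the bytes in question can be dumped directly. This is only a sketch with made-up variable names; it writes the two common encodings of e-acute as explicit byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// e-acute written out byte by byte in two encodings.
$utf8   = "\xC3\xA9"; // UTF-8: two bytes
$latin1 = "\xE9";     // ISO-8859-1: one byte

echo bin2hex($utf8), ' (', strlen($utf8), " bytes)\n";
echo bin2hex($latin1), ' (', strlen($latin1), " byte)\n";

// The quote characters that ENT_QUOTES deals with, for comparison.
echo bin2hex('"'), ' ', bin2hex("'"), "\n";
```

None of the e-acute bytes matches 22 or 27, the double and single quote.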


header('Content-Type: text/plain; charset=UTF-8');
var_dump(htmlspecialchars('canapés', ENT_QUOTES, 'UTF-8'));

Output:
string(0) ""


header('Content-Type: text/plain; charset=UTF-8');
var_dump(htmlspecialchars('canapés'));
echo 'canapés';

Outputs:
string(7) "canap�s"
canap�s

So, it appears the accented character is not being interpreted properly. Is it possible the server itself can’t handle UTF-8 encoding? Do character sets have to be enabled somehow on a server?

I really appreciate your help on this so far.
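One thing worth noting about the string(7) above: "canapés" is 8 bytes in UTF-8 but 7 bytes in ISO-8859-1. The sketch below uses explicit byte escapes (my own test strings, not the real script) to show how htmlspecialchars() treats each version when told the input is UTF-8. On current PHP versions an invalid UTF-8 sequence yields an empty string unless ENT_IGNORE or ENT_SUBSTITUTE is set, and the output above suggests 5.2.9 does the same.

```php
<?php
// Hypothetical test strings: the same word in two encodings.
$latin1 = "canap\xE9s";     // ISO-8859-1, 7 bytes
$utf8   = "canap\xC3\xA9s"; // UTF-8, 8 bytes

var_dump(strlen($latin1)); // int(7)
var_dump(strlen($utf8));   // int(8)

// Ask for UTF-8 handling on both inputs.
$fromLatin1 = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');
$fromUtf8   = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

var_dump($fromLatin1 === '');  // bool(true): 0xE9 followed by "s" is not valid UTF-8
var_dump($fromUtf8 === $utf8); // bool(true): valid input with nothing to escape
```

If that holds, the server may be handling UTF-8 fine and the input bytes are the thing to look at.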

Have you checked what headers the server is sending to your browser prior to the script output? Do you have an external URL you can post or PM?

If Apache/IIS is sending conflicting headers, this would produce the problem you’re having.

This is the information from $_SERVER

Array
(
[PATH] => /usr/bin:/bin
[DOCUMENT_ROOT] => /home/mydomainroot/public_html
[HTTP_ACCEPT] => text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
[HTTP_ACCEPT_CHARSET] => ISO-8859-1,utf-8;q=0.7,*;q=0.7
[HTTP_ACCEPT_ENCODING] => gzip,deflate
[HTTP_ACCEPT_LANGUAGE] => en-us,en;q=0.5
[HTTP_CONNECTION] => keep-alive
[HTTP_HOST] => mydomain.com
[HTTP_KEEP_ALIVE] => 300
[HTTP_USER_AGENT] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)
[REMOTE_ADDR] => xx.xx.xxx.xx
[REMOTE_PORT] => 1564
[SCRIPT_FILENAME] => /home/mydomainroot/public_html/info.php
[SERVER_ADDR] => xx.xxx.xx.xx
[SERVER_ADMIN] => webmaster@mydomain.com
[SERVER_NAME] => mydomain.com
[SERVER_PORT] => 80
[SERVER_SOFTWARE] => Apache/1.3.41 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 PHP-CGI/0.5
[PHPHANDLER] => /usr/local/php52/bin/php
[GATEWAY_INTERFACE] => CGI/1.1
[SERVER_PROTOCOL] => HTTP/1.1
[REQUEST_METHOD] => GET
[QUERY_STRING] =>
[REQUEST_URI] => /info.php
[SCRIPT_NAME] => /info.php
[PHP_SELF] => /info.php
[REQUEST_TIME] => 1271001105
[argv] => Array
(
)

[argc] => 0

)

Oh, I think you’ve misunderstood me. You need to look at the HTTP headers sent by Apache when you request the script/page in question.

You can quite easily do this using Firefox and the LiveHTTPHeaders extension (http://livehttpheaders.mozdev.org/installation.html#).

For instance, here’s my request for www.google.co.uk .


GET / HTTP/1.1 
Host: www.google.co.uk 
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100214 Linux Mint/8 (Helena) Firefox/3.5.8 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 
Accept-Language: en-gb,en;q=0.5 
Accept-Encoding: gzip,deflate 
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 
Keep-Alive: 300 
Connection: keep-alive 

HTTP/1.0 200 OK 
Date: Sun, 11 Apr 2010 16:47:03 GMT 
Expires: -1 
Cache-Control: private, max-age=0 
Content-Type: text/html; charset=UTF-8 
Content-Encoding: gzip 
Server: gws 
Content-Length: 4515 
X-Cache: MISS from Zeus 
X-Cache-Lookup: MISS from Zeus:3128 
Via: 1.0 Zeus:3128 (squid/2.7.STABLE3) 
Connection: keep-alive 

Ah, gotcha.

Here’s the info using the LiveHTTPHeaders extension.

GET /info.php HTTP/1.1
Host: mydomain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cache-Control: max-age=0

HTTP/1.1 200 OK
Date: Sun, 11 Apr 2010 17:15:45 GMT
Server: Apache/1.3.41 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 PHP-CGI/0.5
X-Powered-By: PHP/5.2.9
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

Great, can you throw the following in a stand-alone script and post both the headers and output?


<?php
header('Content-Type: text/plain; charset=UTF-8');
echo htmlspecialchars("Anthony's canap&#233;s aren't popular at all, in fact, they suck.", ENT_QUOTES, 'UTF-8');
exit;

There is no output, either in the browser or by viewing source. Here are the headers:

GET /info.php HTTP/1.1
Host: mydomain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.1 200 OK
Date: Sun, 11 Apr 2010 19:18:35 GMT
Server: Apache/1.3.41 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 PHP-CGI/0.5
X-Powered-By: PHP/5.2.9
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=UTF-8

Interesting, there’s a bug listed which may apply.

http://bugs.php.net/bug.php?id=43896

I’ll come back to you.

Try this… ;)


<?php
header('Content-Type: text/plain; charset=UTF-8');
echo htmlspecialchars(
    utf8_encode("Anthony's canapés aren't popular at all, in fact, they suck."),
    ENT_QUOTES,
    'UTF-8'
);
exit;

If it works, your editor is saving the PHP script as ISO-8859-1 rather than UTF-8.
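One caveat: utf8_encode() assumes its input is ISO-8859-1, so running it on text that is already UTF-8 would double-encode the accented characters. A rough sketch of a guard follows. The helper name to_utf8() is hypothetical, and it assumes that anything failing the UTF-8 check really is ISO-8859-1, which may not be true for every input. It uses iconv() for the conversion, which is only one of several options.

```php
<?php
// Hypothetical helper: return $s as UTF-8, treating any non-UTF-8 input as ISO-8859-1.
function to_utf8($s)
{
    // preg_match() with the /u modifier returns false when $s is not valid UTF-8.
    if (preg_match('//u', $s) === 1) {
        return $s; // already valid UTF-8, leave it alone
    }
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

$fromLatin1File = "canap\xE9s";     // bytes an ISO-8859-1 editor would save
$fromUtf8File   = "canap\xC3\xA9s"; // bytes a UTF-8 editor would save

$a = htmlspecialchars(to_utf8($fromLatin1File), ENT_QUOTES, 'UTF-8');
$b = htmlspecialchars(to_utf8($fromUtf8File), ENT_QUOTES, 'UTF-8');

var_dump($a === $b); // bool(true): both end up as the same UTF-8 string
```

Re-saving the script as UTF-8 in the editor is probably the cleaner long-term fix; the guard mostly matters for data whose encoding you don't control.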

I get this for output:

Browser:
Anthony's canapés aren't popular at all, in fact, they suck.

Source code:
Anthony's canapés aren't popular at all, in fact, they suck.

Yay! Progress!

So, we’re good?

Not quite :) The forum software converted my apostrophes. In the real output the apostrophe entities are not decoded, so they show up literally as & #039;. Would it be a bad thing to change the header to the following?

header('Content-Type: text/html; charset=UTF-8');
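As far as I understand it, a text/plain response is displayed as-is, so the character references appear literally, whereas a text/html response is parsed and the references are turned back into apostrophes for display. The sketch below is only an approximation: it uses html_entity_decode() as a stand-in for what a browser's HTML parser does, and the sample string is written with explicit UTF-8 bytes.

```php
<?php
// Sample text as UTF-8 bytes (é is 0xC3 0xA9).
$original = "Anthony's canap\xC3\xA9s aren't popular";

// What the script sends: apostrophes become &#039; because of ENT_QUOTES.
$escaped = htmlspecialchars($original, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // roughly what a text/plain response displays

// Stand-in for an HTML parser resolving the references under text/html.
$displayed = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
echo $displayed, "\n";
```

So sending text/html here should not be a bad thing for a normal page; the escaped form is what protects the markup, and the reader still sees ordinary apostrophes.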