Replacing Ampersand & in XML documents

I am trying to replace ampersands (&) with &amp; in my string. It seems easy, but it must be smart enough not to turn strings like &nbsp; into &amp;nbsp;, and likewise for other entities such as &Iuml;.

Currently I have:

$message = ereg_replace("&", "&amp;", $message);

I want text to be like:

tom & jerry = tom &amp; jerry
& = &amp; (& on its own with nothing before or after it)
&nbsp; = &nbsp; (stays unchanged, as do other HTML entities that start with &)

Any ideas anyone??

I once did it by adding a space after both the search and replace strings; that way it only replaces a standalone & and not the & that starts an entity such as &nbsp;.

I know it's not the best solution; some sort of regular expression would probably be best…


$message = ereg_replace("& ", "&amp; ", $message);

It did the job for me; hope it can help you as well.

How about:

htmlentities($string);

Or… if you just wanted to replace the ampersand, you could probably go with something like this:

$newstring = preg_replace("/^[\\&]$/i", "", $old_string);

I haven’t tried it. My regex also isn’t up to standard. :smiley: …so you might want to play around with it.
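For what it's worth, the pattern above only matches a string that consists of a single & and nothing else. A rough sketch of a regex that does what the original post asks could use a negative lookahead, so that an & is replaced only when it does not already start an entity. The function name escape_bare_ampersands is made up for this example, and the pattern only checks that something looks like an entity (&name;, &#123; or &#x1F;), not that the name is a real one, so text such as "AT&T;" would be left alone.

```php
<?php
// Hypothetical helper name; it does not come from the thread.
function escape_bare_ampersands($text)
{
    // Match "&" unless it is followed by something shaped like a named
    // (&nbsp;), decimal (&#207;) or hexadecimal (&#xCF;) character reference.
    $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/';

    return preg_replace($pattern, '&amp;', $text);
}

echo escape_bare_ampersands('tom & jerry'), "\n";             // tom &amp; jerry
echo escape_bare_ampersands('a&nbsp;b &Iuml; &#207; &'), "\n"; // a&nbsp;b &Iuml; &#207; &amp;
```

This uses preg_replace() rather than ereg_replace(), since the POSIX ereg functions do not support lookahead.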

If you are lucky enough to be running PHP 5.2.3+, you can use htmlspecialchars() or htmlentities() with double_encode set to false. You might also want to try running html_entity_decode() on the string first and then applying htmlspecialchars() for a somewhat similar effect. Or you could just write a suitable regex.

“When double_encode is turned off PHP will not encode existing html entities. The default is to convert everything.”
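A small sketch of what that looks like in practice, assuming PHP 5.2.3 or later and a UTF-8 string. The sample text is made up for illustration.

```php
<?php
// Sample input: one bare ampersand plus two entities that are already encoded.
$raw = 'tom & jerry&nbsp;&Iuml;';

// Fourth argument is double_encode. With false, existing entities are left
// alone and only the bare "&" is converted.
$safe = htmlspecialchars($raw, ENT_NOQUOTES, 'UTF-8', false);

// With the default (true), every "&" is converted, including the ones that
// start an existing entity.
$doubled = htmlspecialchars($raw, ENT_NOQUOTES, 'UTF-8', true);

echo $safe, "\n";    // tom &amp; jerry&nbsp;&Iuml;
echo $doubled, "\n"; // tom &amp; jerry&amp;nbsp;&amp;Iuml;
```

Note that htmlspecialchars() still converts < and >, so on its own it will not preserve markup you want to keep.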

Cool! But my PHP version on my web host is 5.2.0. Only version 5.2.3 and up support double_encode. Might be worth the upgrade just for this function to work properly.

$message = htmlentities(trim($message), ENT_NOQUOTES, "UTF-8", false);

I installed PHP 5.2.3 on my development server and tried the above code. However, I still get crappy code like:

&amp;nbsp;

Any idea why htmlentities() is still replacing existing HTML codes? I tried double_encode = true and false. Setting it to true was much worse.

$message = html_entity_decode(trim($message));

I fixed my previous bug by making the string purely HTML first. Next bug is allowing certain tags to remain untouched…

When I use htmlentities()… it turns all my hyperlinks into crappy code like:

&lt;a target="_blank" href="http://www.youtube.com/profile?user=oasisvideos"&gt;Oasis Fanatic Youtube account&lt;/a&gt;

That is why I tried not to use this htmlentities function. Any alternatives?

The root of your problem is that you encoded the data too early. You should never need the functionality you’re describing. Where do you get your data from?

My data is from my web site’s forum posts and the posts may have URLs and other tags that I want to preserve.

I see. You could try to prevent posters from posting invalid markup then, e.g. validate it and give an error message. I’m not sure if that’s feasible – it probably depends on your audience.

Otherwise, you can use htmltidy, which is a tool for cleaning up malformed HTML.

The user’s input is validated. I just want to allow URLs, images and bullet tags without converting the < > characters into HTML entities.

You’re accepting HTML (or a subset thereof) as input. Thus you should validate that this input is valid HTML. That includes encoding ampersands as entities. As it stands, you really have no way of knowing whether the user wanted to write an & or the literal text &amp; if the input text is &amp;. It’s not a major thing, but it’s just bad practice to mix different levels of abstraction like that.

In the forum posts, there isn’t any HTML code, just BBCode like [ U R L ] http://www.whatever.com [/ U R L]. I convert some of the BBCode into HTML code. Actually, it is not the forum posts that are giving me problems but the ampersand character.
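If the stored posts really are raw text plus BBCode, one way to avoid the double-encoding problem altogether is to escape everything once at output time and only then turn the BBCode into markup. The following is just a sketch under that assumption: render_post is a made-up name, only a [url] tag is handled, and the thread does not show the poster's actual converter.

```php
<?php
// Hypothetical helper; assumes $raw is unescaped text containing BBCode.
function render_post($raw)
{
    // 1. Escape everything the user typed, so &, <, > and quotes become entities.
    $html = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // 2. Convert the allowed BBCode tag into markup. Only http(s) URLs without
    //    whitespace or square brackets are accepted.
    $html = preg_replace(
        '~\[url\](https?://[^\s\[\]]+)\[/url\]~i',
        '<a href="$1">$1</a>',
        $html
    );

    return $html;
}

echo render_post('tom & jerry <b> [url]http://www.whatever.com/?a=1&b=2[/url]'), "\n";
// tom &amp; jerry &lt;b&gt; <a href="http://www.whatever.com/?a=1&amp;b=2">http://www.whatever.com/?a=1&amp;b=2</a>
```

Because the escaping happens before the tags are generated, the < and > of the generated links never need to be restored afterwards.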

Do you guys know a regular expression that could solve the problem I pointed out at the top of this topic?

$message = ereg_replace("[\\&]{2,}", "", $message);
$message = str_replace(array(" &", "& ", " & "), array(" &amp;", "&amp; ", " &amp; "), trim($message));
$message = trim($message);

I have fixed the ampersand problem temporarily with the code above.


#FORMAT STRING INTO PURE HTML FIRST
$message = trim(html_entity_decode($message));
        
#REPLACE HTML ENTITIES WITH HTML CODES
$message = htmlentities($message, ENT_NOQUOTES);

#REPLACE &lt; AND &gt; HTML CODES WITH THE ACTUAL CHARACTERS
$message = str_replace(array("&lt;", "&gt;"), array("<", ">"), $message);

I think the above code works way better than anything I have tried so far.