Replacing strange characters to display on XML feed

Hi Guys!

I’m using PHP to generate an XML feed on the fly. All of my data is stored in a MySQL database. From time to time, I’m getting some strange characters showing up in the XML output, for example:


HR Advisor role – 9m contract – London
Part-time HR Manager Role – Digital Marketing

Here’s a snippet of my PHP script that parses the data from the database and TRIES to get it in a readable format…


foreach($jobs as $key=>$array){
	$rss_title = htmlspecialchars($jobs[$key]['job_title']); // Job title
	$rss_title = html_entity_decode($rss_title, ENT_COMPAT,'UTF-8');

	$rss_description = strip_tags($jobs[$key]['job_description']); // Description
	$rss_description = html_entity_decode($rss_description, ENT_COMPAT,'UTF-8');
	if(strlen($rss_description) > 400){
		$rss_description = substr($rss_description, 0, 400).'...'; // Shorten description
	}

	$rss_date = $jobs[$key]['date_posted']; // Date posted
	$rss_link = SITEURL.'/'.$this->settings['company_directory'].'/'.$jobs[$key]['company_url'].'/'.$jobs[$key]['job_url']; // Link

	$date = date("D, d M Y G:i:s", strtotime($rss_date));
	$date = $date.' +0000';
	$result .= '<item>';
	$result .= '<title><![CDATA['.$rss_title.']]></title>';
	$result .= '<description><![CDATA['.$rss_description.']]></description>';
	$result .= '<link><![CDATA['.$rss_link.']]></link>';
	$result .= '<guid>'.$rss_link.'</guid>';
	$result .= '<pubDate>'.$date.'</pubDate>';
	$result .= '</item>';
}

Any ideas what’s wrong?

Thanks in advance!

Could it be an encoding problem before it gets to your database?

How would I check that out? The characters causing the problem are also stored in the database like this: –

I ask the question because in my experience the encoding anomalies mostly come from the text files they originate from.

I have no idea how they get into your db: pasted in, scraped from somewhere, etc.

If you have someone typing the data in then perhaps this will not be your case.
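
As for how to check: one rough way, assuming the feed is meant to be UTF-8, is to look at the raw bytes of a stored title. The sketch below is only an illustration. The function name describe_encoding is made up, and the sample strings are built by hand from the titles in the first post rather than read from a real database. An en dash should be the three bytes E2 80 93 in UTF-8; a lone 96 byte points to Windows-1252 text, and the sequence "â€" points to UTF-8 that was misread and encoded a second time.

```php
<?php
// Hypothetical helper: guesses how the text in $value was encoded.
function describe_encoding(string $value): string
{
    // Bytes that cannot be decoded as UTF-8 usually mean a legacy
    // single-byte encoding such as Windows-1252 or ISO-8859-1.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return 'not valid UTF-8';
    }
    // "â€" (C3 A2 E2 82 AC) is what the first two bytes of a UTF-8 dash or
    // curly quote turn into after being read as Windows-1252 and then
    // converted to UTF-8 again.
    if (strpos($value, "\xC3\xA2\xE2\x82\xAC") !== false) {
        return 'double-encoded UTF-8';
    }
    return 'valid UTF-8';
}

// Hand-built samples: the same title with the dash stored three ways.
$samples = array(
    "HR Advisor role \xE2\x80\x93 9m contract",             // proper UTF-8
    "HR Advisor role \x96 9m contract",                     // Windows-1252 byte
    "HR Advisor role \xC3\xA2\xE2\x82\xAC\xE2\x80\x9C 9m",  // double-encoded
);
foreach ($samples as $sample) {
    echo bin2hex($sample), ' => ', describe_encoding($sample), "\n";
}
```

If the stored bytes turn out to be fine, the next thing I would look at is the connection character set (for example mysqli's set_charset), since a mismatch there can garble text on the way in or out. That part is a guess, as I don't know how your script connects.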

I’m not 100% sure of what you mean by ‘strange characters’, but if you mean ‘bad’ ASCII characters, then I’ve treated those before in one of my projects.


// Remove all ASCII control characters below 32, except 9, 10 and 13 (tab, newline and carriage return).
$bad_characters = array_diff(range(chr(0), chr(31)), array(chr(9), chr(10), chr(13)));
$text = str_replace($bad_characters, '', $text);
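
To show what that does, here is the same idea wrapped in a function with a small test string. This is only a sketch: the name strip_control_chars and the sample text are made up for illustration, and I build the list with array_map('chr', range(0, 31)), which should give the same 32 characters. XML 1.0 does not allow those control characters at all, so a feed reader may reject the whole document if one slips through.

```php
<?php
// Hypothetical wrapper around the snippet above.
function strip_control_chars(string $text): string
{
    // Characters 0-31, minus tab (9), newline (10) and carriage return (13).
    $bad_characters = array_diff(
        array_map('chr', range(0, 31)),
        array(chr(9), chr(10), chr(13))
    );
    return str_replace($bad_characters, '', $text);
}

// A vertical tab (0B) and a NUL byte (00) are removed; the newline stays.
echo strip_control_chars("Part-time\x0B HR Manager\x00 Role\n");
```

Note that this only removes control characters. Multi-byte UTF-8 characters such as an en dash pass through untouched, so it will not help if the problem is an encoding mismatch.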

I hope that is useful for you.
Thanks.