Parse HTML

Hi guys,

I am developing a mobile phone e-commerce store, and I want to be able to copy and paste a phone’s specification from a well-known review site.

I have already done the styling for this and it works well on my product pages. On other pages, however, I only want to extract certain information from the HTML below, which I will already have in the DB from the product page.

I only need certain things, like the CPU and memory, but as you can see they are nested in tables with no distinctive markers to tell one row from another (every label cell has class "ttl" and every value cell has class "nfo"), so I can’t pick them out by class or id.

Here is the code I will have stored. Can someone tell me the best way to parse this with PHP?

Thanks in advance :wink:


<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="4" scope="row">
				General</th>
			<td class="ttl">
				<a href="network-bands.php3">2G Network</a></td>
			<td class="nfo">
				GSM 850 / 900 / 1800 / 1900</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="network-bands.php3">3G Network</a></td>
			<td class="nfo">
				HSDPA 850 / 1900 / 2100 /800</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="#" onclick="helpW('h_year.htm');">Announced</a></td>
			<td class="nfo">
				2010, August</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="#" onclick="helpW('h_status.htm');">Status</a></td>
			<td class="nfo">
				Available. Released 2010, August</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="3" scope="row">
				Body</th>
			<td class="ttl">
				<a href="#" onclick="helpW('h_dimens.htm');">Dimensions</a></td>
			<td class="nfo">
				111 x 62 x 14.6 mm</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="#" onclick="helpW('h_weight.htm');">Weight</a></td>
			<td class="nfo">
				161 g</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=keyboard">Keyboard</a></td>
			<td class="nfo">
				QWERTY</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="4" scope="row">
				Display</th>
			<td class="ttl">
				<a href="glossary.php3?term=display-type">Type</a></td>
			<td class="nfo">
				TFT capacitive touchscreen, 16M colors</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="#" onclick="helpW('h_dsize.htm');">Size</a></td>
			<td class="nfo">
				360 x 480 pixels, 3.2 inches (~188 ppi pixel density)</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=multitouch">Multitouch</a></td>
			<td class="nfo">
				Yes</td>
		</tr>
		<tr>
			<td class="ttl">
				&nbsp;</td>
			<td class="nfo">
				- Optical trackpad</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="3" scope="row">
				Sound</th>
			<td class="ttl">
				<a href="glossary.php3?term=call-alerts">Alert types</a></td>
			<td class="nfo">
				Vibration, MP3 ringtones</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=loudspeaker">Loudspeaker</a></td>
			<td class="nfo">
				Yes</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=audio-jack">3.5mm jack</a></td>
			<td class="nfo">
				Yes, <a href="blackberry_torch_9800-review-516p6.php#aq">check quality</a></td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="2" scope="row">
				Memory</th>
			<td class="ttl">
				<a href="glossary.php3?term=memory-card-slot">Card slot</a></td>
			<td class="nfo">
				microSD, up to 32GB, 4GB card included</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=dynamic-memory">Internal</a></td>
			<td class="nfo">
				4 GB storage, 512 MB RAM, 512 MB ROM</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="8" scope="row">
				Data</th>
			<td class="ttl">
				<a href="glossary.php3?term=gprs">GPRS</a></td>
			<td class="nfo">
				Class 10 (4+1/3+2 slots), 32 - 48 kbps</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=edge">EDGE</a></td>
			<td class="nfo">
				Class 10, 236.8 kbps</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=3g">Speed</a></td>
			<td class="nfo">
				HSDPA; HSUPA</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=wi-fi">WLAN</a></td>
			<td class="nfo">
				Wi-Fi 802.11 b/g/n, UMA (carrier-dependent)</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=bluetooth">Bluetooth</a></td>
			<td class="nfo">
				Yes, v2.1 with A2DP</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=usb">USB</a></td>
			<td class="nfo">
				Yes, microUSB v2.0</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="4" scope="row">
				Camera</th>
			<td class="ttl">
				<a href="glossary.php3?term=camera">Primary</a></td>
			<td class="nfo">
				5 MP, 2592&#1093;1944 pixels, autofocus, LED flash, <a href="piccmp.php3?idType=1&amp;idPhone1=3203&amp;nSuggest=1">check quality</a></td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=camera">Features</a></td>
			<td class="nfo">
				Geo-tagging, continuous auto-focus, image stabilization</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=camera">Video</a></td>
			<td class="nfo">
				Yes, VGA@24fps</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=video-call">Secondary</a></td>
			<td class="nfo">
				No</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="12" scope="row">
				Features</th>
			<td class="ttl">
				<a href="glossary.php3?term=os">OS</a></td>
			<td class="nfo">
				BlackBerry OS 6.0</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=cpu">CPU</a></td>
			<td class="nfo">
				624 MHz</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=sensors">Sensors</a></td>
			<td class="nfo">
				Proximity</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=messaging">Messaging</a></td>
			<td class="nfo">
				SMS, MMS, Email, Push Email, IM</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=browser">Browser</a></td>
			<td class="nfo">
				HTML</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=fm-radio">Radio</a></td>
			<td class="nfo">
				No</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=gps">GPS</a></td>
			<td class="nfo">
				Yes, with A-GPS support</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=java">Java</a></td>
			<td class="nfo">
				Yes, MIDP 2.0</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="#" onclick="helpW('h_colors.htm');">Colors</a></td>
			<td class="nfo">
				Black, White, Dark Orange</td>
		</tr>
		<tr>
			<td class="ttl">
				&nbsp;</td>
			<td class="nfo">
				- Social feeds<br />
				- BlackBerry maps<br />
				- Document viewer (Word, Excel, PowerPoint)<br />
				- Media player MP3/WMA/eAAC+/FlAC/OGG player<br />
				- Video player DivX/XviD/MP4/WMV/H.263/H.264<br />
				- Organizer<br />
				- Voice memo/dial</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="4" scope="row">
				Battery</th>
			<td class="ttl">
				&nbsp;</td>
			<td class="nfo">
				Standard battery, Li-Ion 1300 mAh</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=stand-by-time">Stand-by</a></td>
			<td class="nfo">
				Up to 432 h (2G) / Up to 336 h (3G)</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=talk-time">Talk time</a></td>
			<td class="nfo">
				Up to 5 h 30 min (2G) / Up to 5 h 40 min (3G)</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=music-playback-time">Music play</a></td>
			<td class="nfo">
				Up to 30 h</td>
		</tr>
	</tbody>
</table>
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="3" scope="row">
				Misc</th>
			<td class="ttl">
				<a href="glossary.php3?term=sar">SAR US</a></td>
			<td class="nfo">
				0.91 W/kg (head) &nbsp; &nbsp; 0.68 W/kg (body) &nbsp; &nbsp;</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=sar">SAR EU</a></td>
			<td class="nfo">
				0.86 W/kg (head) &nbsp; &nbsp; 0.81 W/kg (body) &nbsp; &nbsp;</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="#" onclick="helpW('h_price.htm');">Price group</a></td>
			<td class="nfo">
				<img src="http://st2.gsmarena.com/vv/price/pg5.gif" title="About 240 EUR" /></td>
		</tr>
	</tbody>
</table>


http://www.php.net/manual/en/book.dom.php

Hi there,
Thanks for this. Is this built into PHP 5.2+? Could you give me a simple example of its use? The ones in the manual are difficult to follow.

Thanks

See the first comment on that page; it has an HTML example.
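
For anyone who finds the manual examples hard to follow, here is a minimal sketch of the DOM approach applied to markup shaped like the spec tables above. It assumes every row of interest has one td with class "ttl" (the label) and one td with class "nfo" (the value). The helper names cleanText() and extractSpecs() are made up for this example, and the sample string is a shortened copy of the "Features" table. As far as I know the DOM extension ships with PHP 5 and is enabled by default, but check your own install.

```php
<?php
// Shortened sample in the same shape as the stored spec tables.
$spec = <<<HTML
<table cellspacing="0">
	<tbody>
		<tr>
			<th rowspan="2" scope="row">
				Features</th>
			<td class="ttl">
				<a href="glossary.php3?term=os">OS</a></td>
			<td class="nfo">
				BlackBerry OS 6.0</td>
		</tr>
		<tr>
			<td class="ttl">
				<a href="glossary.php3?term=cpu">CPU</a></td>
			<td class="nfo">
				624 MHz</td>
		</tr>
	</tbody>
</table>
HTML;

// Hypothetical helper: collapse tabs/newlines and non-breaking spaces.
function cleanText($text)
{
    $text = str_replace("\xC2\xA0", ' ', $text); // UTF-8 encoded &nbsp;
    return trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
}

// Hypothetical helper: returns array(label => value) for each spec row.
function extractSpecs($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep markup warnings quiet
    // The XML prefix is a common trick to make loadHTML() assume UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $specs = array();
    // Only rows that contain both a label cell and a value cell.
    foreach ($xpath->query('//tr[td[@class="ttl"] and td[@class="nfo"]]') as $row) {
        $label = cleanText($xpath->evaluate('string(td[@class="ttl"])', $row));
        $value = cleanText($xpath->evaluate('string(td[@class="nfo"])', $row));
        if ($label !== '') {            // skip rows whose label is only &nbsp;
            $specs[$label] = $value;
        }
    }
    return $specs;
}

$specs = extractSpecs($spec);
echo $specs['CPU'], "\n"; // 624 MHz
```

Keying the result by label keeps the lookup simple (for example $specs['CPU']), but a label that appears twice would be overwritten by the later row. If that matters, the th text could be used as a prefix.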

Hi guys, I am trying to parse the string above using the simple_html_dom library, but I keep getting a "call to a member function on a non-object" error. My code is below:


    if (!empty($result['spec'])) {
        $html[$result['product_id']] = str_get_html($result['spec']);
        $ret[$result['product_id']]  = $html[$result['product_id']]->find('th', 0)->innertext;
        //echo $ret[$result['product_id']];
        var_dump($ret[$result['product_id']]);
    }

This works with the code below, where I pass in part of the above table as a literal string, so I am guessing it has something to do with the whitespace, tabs, line breaks, etc. Is there any way to remove them all and give me the format below?

    $html[$result['product_id']] = str_get_html('<tr><th rowspan="4" scope="row">General</th><td class="ttl"><a href="network-bands.php3">2G Network</a><a href="network-bands.php3">2G Bogworth</a></td><td class="nfo">GSM 850 / 900 / 1800 / 1900</td></tr>');

For anyone else trying to do this with a string from a database: remember to run it through htmlspecialchars_decode() first. Otherwise simple_html_dom is handed entity-encoded text such as &lt;p&gt; instead of a real <p> tag, so find() has nothing to match.
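
To illustrate the tip above, here is a small sketch that uses PHP’s built-in DOMDocument as a stand-in for simple_html_dom, so it runs without any third-party file. The $stored value and the firstHeading() helper are made up for this example; it assumes the CMS ran the markup through htmlspecialchars() before saving it.

```php
<?php
// Hypothetical stored value: markup assumed to have been saved entity-encoded.
$stored = '&lt;table&gt;&lt;tr&gt;&lt;th rowspan=&quot;4&quot; scope=&quot;row&quot;&gt;'
        . 'General&lt;/th&gt;&lt;/tr&gt;&lt;/table&gt;';

// Hypothetical helper: text of the first <th>, or null if there is none.
function firstHeading($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep markup warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $th = $doc->getElementsByTagName('th')->item(0);
    // item() returns null when nothing matches, so check before using it.
    return $th === null ? null : trim($th->textContent);
}

// Encoded string: the parser sees plain text, not tags, so there is no <th>.
var_dump(firstHeading($stored));                          // NULL

// Decoded string: real tags again, so the lookup succeeds.
var_dump(firstHeading(htmlspecialchars_decode($stored))); // string(7) "General"
```

The null check is the part worth copying: chaining straight onto a lookup that found nothing is what produces the non-object error.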

Pretty obvious really, but if you’re reading this because of a call on a non-object error, then there’s a good chance you made the same mistake as me :wink:

Haha good tip, I do this ALL the time…I’ll probably do it in 15 minutes from now too.

Hahaha,

Easily done, isn’t it, Carlos? It wasn’t until I wrote the string out to a file that I realized what form it was being stored in in the DB. Again, it’s one of the drawbacks of developing on top of someone else’s CMS rather than customizing with Zend or something similar: you don’t know what’s going on behind the scenes in your own backyard!