HTML 2 RSS dilemma (the best way to approach this)

netsearchmonkey · February 8, 2010, 7:12pm

Hi all,

I have one page in particular on my website for which I would like to generate an RSS feed on a daily basis.

I’ve been searching online for PHP scripts to make an RSS feed and they all seem to assume a MySQL database. At present my website doesn’t use a DB and so this is out of the equation.

I was thinking of integrating the script that builds the .xml file into the web page itself (which just so happens to be PHP anyway), but then this would create the .xml file every single time the PHP page is loaded. Besides, it would be messy (why glue the two together, right? It’s just not natural).

I’ve reached the conclusion that I would have an external script that gets loaded each time the PHP web page is refreshed. It would check whether it’s a given time of day (say 12:00) and, if so, generate the XML file from the web page content. At all other times it would just exit(). To create the .xml file, it would look for tags inside the web page to determine what is the title, what is the description, what is the link, etc. I would add these tags to the page myself for the script to work.

…or I could just run the script manually each time I update the corresponding web page (this option would eat the least server resources).

What do you guys and gals think?

Yup, I know CMSs do this automatically, but like I said, this site isn’t using a CMS or a DB.

I know there are websites out there that do this for you but then the path to the xml file has their domain name in it and not mine.

Thanks,

Mittineague · February 8, 2010, 8:07pm

You don’t need a database for this. I have one that uses a CSV file. But you could parse a page instead too. IMHO it would be better to use filemtime() rather than an arbitrary time of day. Or do it manually.
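
A minimal sketch of the filemtime() idea, assuming the page and the feed are both plain files on the same server. The function name feedIsStale() and both paths are made up for illustration; nothing here comes from an existing script.

```php
<?php
// Hypothetical helper: true when the feed is missing or older than the page.
function feedIsStale(string $pagePath, string $feedPath): bool
{
    if (!file_exists($feedPath)) {
        return true; // no feed yet, so it has to be built
    }
    // filemtime() returns a file's last-modified time as a Unix timestamp.
    return filemtime($pagePath) > filemtime($feedPath);
}
```

Something like `if (feedIsStale('news.php', 'feed.xml')) { ... }` would then rebuild the feed only after the page has actually changed, rather than at a fixed time of day.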

netsearchmonkey · February 8, 2010, 10:14pm

Thanks for the reply.

So how does yours work? You output everything you have in your HTML/PHP web page to a CSV file and then run the script, which takes the CSV file and extracts the data to make up the RSS-compliant XML file?

Is this two step approach necessary? Any reason you chose not to take your PHP web page and have that be the feeder for the script directly?

As for how I’m picturing it on mine, I would put markers in the PHP web page to define sections for the parser script to run through and extract the data needed to make up a valid RSS 2.0 file.

Mittineague · February 8, 2010, 11:05pm

The feed that uses CSV isn’t from a web page, I use that instead of a database.
If your page is XHTML, I think you might be able to use simplexml to parse the DOM. In any case, the process would be something like

page containing a “check mtime” function is requested
if the page has been modified
parse out content you want
insert data into RSS XML “template”
create RSS XML file
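
The "insert data into RSS XML template" and "create RSS XML file" steps could be sketched roughly as below. This is only one way to do it, assuming the parsed content is already in an array of items with title, link and description keys; buildRss() is a hypothetical name, not the template script mentioned in this post.

```php
<?php
// Hypothetical helper: turns channel details and an array of items into an RSS 2.0 string.
function buildRss(string $title, string $link, string $description, array $items): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);

    $channel = $doc->createElement('channel');
    $rss->appendChild($channel);

    // Appends <name>text</name>; text nodes are escaped when the document is serialized.
    $add = function (DOMElement $parent, string $name, string $text) use ($doc): void {
        $el = $doc->createElement($name);
        $el->appendChild($doc->createTextNode($text));
        $parent->appendChild($el);
    };

    $add($channel, 'title', $title);
    $add($channel, 'link', $link);
    $add($channel, 'description', $description);

    foreach ($items as $item) {
        $node = $doc->createElement('item');
        $channel->appendChild($node);
        $add($node, 'title', $item['title']);
        $add($node, 'link', $item['link']);
        $add($node, 'description', $item['description']);
    }

    return $doc->saveXML();
}
```

Writing the result out is then a single call such as `file_put_contents('feed.xml', buildRss(...))`, where feed.xml is a placeholder path.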

I wrote my own RSS XML “template” script some time ago and it’s a bit kludgy, but it works. I imagine there may be some script available that could do this for you if you don’t want to “reinvent the wheel”, maybe SimplePie? AFAIK it works with feeds, but also can work with page files.

netsearchmonkey · February 9, 2010, 12:08am

Thanks. I’ll take a look at it tomorrow as my eyes are giving up today.

To be frank, I think I can just adapt the good old PHP MySQL RSS scripts: the RSS feed header stays the same, and so does the footer. The only part needing revamping is the for loop, which would parse the PHP web page line by line instead of accessing the database. For this to work I would need to add some markers/tags to the PHP web page so the parser could look for given tags within all the HTML to suss out what’s what (i.e. title, author, link etc.).
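
A rough sketch of the marker idea, assuming the markers are HTML comments so they stay invisible in the browser. The marker names (rss:item, rss:title, rss:link, rss:description) and the function extractItems() are invented for this example; any unambiguous marker would do.

```php
<?php
// Hypothetical marker format: <!--rss:item--> ... <!--/rss:item--> wraps one entry,
// and <!--rss:title-->, <!--rss:link-->, <!--rss:description--> wrap its fields.
function extractItems(string $html): array
{
    $items = [];
    // The s modifier lets "." match newlines; the lazy ".*?" stops at the first closing marker.
    preg_match_all('~<!--rss:item-->(.*?)<!--/rss:item-->~s', $html, $blocks);

    foreach ($blocks[1] as $block) {
        $item = [];
        foreach (['title', 'link', 'description'] as $field) {
            $pattern = '~<!--rss:' . $field . '-->(.*?)<!--/rss:' . $field . '-->~s';
            // strip_tags() drops any HTML left inside the marked span.
            $item[$field] = preg_match($pattern, $block, $m)
                ? trim(strip_tags($m[1]))
                : '';
        }
        $items[] = $item;
    }

    return $items;
}
```

This works on the whole page as one string rather than line by line, so a marked section can span several lines.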

Not complex I reckon, just messy to get it working without kinks.

I still can’t believe after searching 15 pages on Google for such a script there’s not much out there (all relate to MySQL). If there was I would probably just buy it to save time.

Mittineague · February 9, 2010, 1:03am

I guess I was wrong about SimplePie. I thought I saw something about it parsing HTML, but when I looked again the closest thing I could find to HTML was HTTP.

It does seem there would be something already out there. I’ve seen more than one member here looking for a way to get page content into a feed. Maybe some kind of “page scraper” app?

netsearchmonkey · February 10, 2010, 12:55am

I’ve looked into it some more today and it would appear that doing it the DOMDocument way is the easiest and the fastest too.

From what I see with LoadXML() one can search for custom tags within a string.

The only problem so far is that LoadXML() must have a string as it won’t take a path to a file (http or offline, doesn’t matter) or a string array from reading the file in via fgets().
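
For what it’s worth, a sketch of getting a whole file into loadXML() as a single string, assuming the file is well-formed XML or XHTML. The helper name loadXmlFile() is hypothetical. The file is read in one go because each line of a page is not a complete document on its own, so feeding it line by line is likely to trigger "tag mismatch" and "premature end of data" warnings.

```php
<?php
// Hypothetical helper: parse a well-formed XML/XHTML file as one document.
function loadXmlFile(string $path): ?DOMDocument
{
    $source = file_get_contents($path); // the whole file as a single string
    if ($source === false || $source === '') {
        return null;
    }

    // Collect parse errors internally instead of printing a warning for each one.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}
```

If the lines are already in an array from fgets(), `implode('', $lines)` gives the same single string. DOMDocument also has a load() method that takes a filename directly.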

If I do a for loop and copy each element of the array into a string variable as it requires (i.e. $example = $lines[$i]), it will run and give me a nice long list of errors. God only knows where it’s getting MySQL errors from if there’s not a DB in sight. In LoadHTML() mode it will of course load just fine, but then custom tags aren’t going to be on the menu.

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Opening and ending tag mismatch: link line 1 and head in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Opening and ending tag mismatch: img line 1 and h6 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Opening and ending tag mismatch: h6 line 1 and div in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ‘;’ in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag script line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag div line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag body line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag link line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag meta line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag head line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag html line 1 in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, ‘<’ not found in Entity, line: 1 in C:\wamp\www\get_data.php on line 10

Warning: mysql_connect() [function.mysql-connect]: Access denied for user ‘user’@‘localhost’ (using password: YES) in C:\wamp\www\create_rss.php on line 114

Warning: mysql_select_db() expects parameter 2 to be resource, boolean given in C:\wamp\www\create_rss.php on line 115

Warning: mysql_query() [function.mysql-query]: Access denied for user ‘SYSTEM’@‘localhost’ (using password: NO) in C:\wamp\www\create_rss.php on line 119

Warning: mysql_query() [function.mysql-query]: A link to the server could not be established in C:\wamp\www\create_rss.php on line 119

Warning: mysql_num_rows() expects parameter 1 to be resource, boolean given in C:\wamp\www\create_rss.php on line 119

Warning: mysql_query() [function.mysql-query]: Access denied for user ‘SYSTEM’@‘localhost’ (using password: NO) in C:\wamp\www\create_rss.php on line 125

Warning: mysql_query() [function.mysql-query]: A link to the server could not be established in C:\wamp\www\create_rss.php on line 125
Query failed

Mittineague · February 10, 2010, 3:39am

I never noticed before, but the DOMDocument class has HTML functions too. DOMDocument->loadXML() requires the string to be well formed (i.e. <br /> instead of <br>). Maybe you could use DOMDocument->loadHTMLFile(), which doesn’t require well-formedness? I couldn’t find any mention in the documentation, but I’m guessing you need to use a relative path to the file, no “http://”.

As for the MySQL errors, that’s a puzzle. Maybe the file has db stuff in it or includes a file that does?

netsearchmonkey · February 10, 2010, 2:26pm

Hmm. LoadHTML() won’t allow custom tags as far as I know so it’s got to be LoadXML().

If I can’t get it to work via the DOM method then I’ll replace it with the old-fashioned but slower PHP string functions. Yes, it will be slower, but given the RSS will only be re-created once a day, the difference between it taking 1.1s and say 1.8s to process isn’t going to matter (and the server won’t mind either).

My only fear is that it’s going to get messy looking through all these custom tags, stripping away the other text, and finding the position of the last > and the second-to-last < to pull out the inner text (for an RSS title, let’s say).

There are some DOM classes out there that work wonderfully for HTML tags: tell them to hunt for “a”, “img”, “h2” (whatever of that type) and they just work. Give them a custom tag such as “example123” and there are no errors but also no results, even if I add the given custom tag to the list of tags they should identify.
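
One thing that may be worth testing before falling back to string functions: as far as I know, the libxml HTML parser behind loadHTML() reports unknown tags as errors but still keeps them in the tree, so getElementsByTagName() can find them. The sketch below assumes that behaviour holds on your PHP build; textOfTag() is a made-up helper, and the tag name should be lowercase because the HTML parser lowercases element names.

```php
<?php
// Hypothetical helper: returns the text of every <$tag> element in an HTML string.
function textOfTag(string $html, string $tag): array
{
    // Unknown tags are reported as parse errors, so keep those errors internal.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}
```

Calling `textOfTag($pageHtml, 'example123')` would then return the inner text of each custom tag without any manual hunting for < and > positions.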