rickibarnes — 2012-10-25T10:04:15-04:00 — #1
I've got a bit of a situation. There's no code to speak of at this moment because I have zero experience with XML, so I'm hoping to just describe my situation and have someone point me in the right direction.
The basic set up is this:
- I have a website that is having XML files pushed to it via FTP, i.e. the files just show up on the server.
- The files all have the same basic structure (tree?) with differing content.
- There would, at any time, be hundreds of these files.
- I need to be able to output the content from these files onto the website.
- Some of these files are basically duplicates of each other, with minor edits. I would need to output onto the website ONLY the latest version of the file.
So far what I can think of is maybe combining all these files into one and then iterating over it to output the details on the site. I can't tell if this would be a resource-heavy way of doing this, and I also can't work out exactly how I would do it anyway.
I've been looking into XSLT, which seems theoretically able to do what I want, but I'm not sure whether it is the appropriate tool for the job. I also can't work out exactly how it would be done, as all the things I've seen are really about applying formatting to your XML with XSL and XSLT (this is actually what makes me think it might be able to combine the files, but also that this wouldn't be the best way).
Also just generally, the idea of combining the files into one seems a bit clunky and inefficient, so I assume there may be some other method that I'm completely missing because I don't know what to search for. Maybe it's something to do with web services? I'm not entirely sure if this is actually a web service or not . . .
Any opinions on this would be appreciated! Please let me know if I need to provide any more information.
rickibarnes — 2012-11-13T05:11:02-05:00 — #2
In case anyone comes across this in the future, this is what I ended up doing to sort this out:
I have a script that scans the directory for new files, reads the XML data and inserts it into a MySQL database. It then moves the file into another directory so it won't be processed again. The script runs as a cron job that I set up through the site's cPanel.
While the script is reading the XML file, it checks whether the file's unique ID is already in the database. If it is, it compares the datetime in the database to the XML file's datetime. If the XML file is newer, it updates the existing database record; if not, it does nothing.
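For anyone wanting to do something similar, here's a rough sketch of that import logic in Python. The element names (`record`, `updated`, `title`) and the directory layout are just assumptions for illustration, and SQLite stands in for MySQL so the example runs on its own:

```python
import os
import shutil
import sqlite3
import xml.etree.ElementTree as ET

def import_new_files(incoming_dir, processed_dir, conn):
    """Scan incoming_dir, upsert each XML file's record, then archive the file.

    Assumes files shaped like:
      <record id="r1"><updated>2012-10-05T00:00:00</updated><title>...</title></record>
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id TEXT PRIMARY KEY, updated TEXT, title TEXT)"
    )
    for name in sorted(os.listdir(incoming_dir)):
        if not name.endswith(".xml"):
            continue
        path = os.path.join(incoming_dir, name)
        root = ET.parse(path).getroot()
        rec_id = root.get("id")
        updated = root.findtext("updated")
        title = root.findtext("title")

        row = conn.execute(
            "SELECT updated FROM records WHERE id = ?", (rec_id,)
        ).fetchone()
        if row is None:
            # First time we've seen this ID: insert a new record.
            conn.execute(
                "INSERT INTO records (id, updated, title) VALUES (?, ?, ?)",
                (rec_id, updated, title),
            )
        elif updated > row[0]:
            # ISO-8601 strings compare chronologically as plain strings,
            # so a newer file replaces the stored record.
            conn.execute(
                "UPDATE records SET updated = ?, title = ? WHERE id = ?",
                (updated, title, rec_id),
            )
        # Otherwise the file is an older duplicate: leave the row alone.

        # Move the file out of the way so the next cron run skips it.
        shutil.move(path, os.path.join(processed_dir, name))
    conn.commit()
```

The string comparison on datetimes only works because the format sorts chronologically; if the CMS emits dates in another format you'd want to parse them properly first.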
I then use the database throughout the site as I usually would, for displaying records, search functions, etc.
I had already thought of this as a solution to the problem myself, but it seemed a bit strange to me: why use XML files at all if the information is just going to go into a database? Why not just skip the middleman? I still don't know the answer to that (any ideas, anyone?). However, I got confirmation from one of the developers who works on the CMS that produces the XML files, and he told me this method is how most people use their system, so I went ahead and implemented it with confidence.