SAX vs. XPath vs. DOM?

Hi all.

Just out of curiosity, could you shed some light on the advantages/disadvantages of each XML method and when to use/not use each one?

Thanks a ton.

SAX is generally not needed unless you have to parse badly formed XML. It is event based and not what I would call a “complete” parser, as you have to write an object with callback methods to deal with the different XML events such as open tags, close tags, etc. This means you end up writing some of the “parsing code” yourself, as a fair amount of the time you will need to hold a stack of the parent tags, etc. SAX is best suited to small, simple XML formats, or VERY large XML files where you can’t afford to fit everything in RAM.
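To make the callback object and the "stack of parent tags" concrete, here is a minimal sketch using Python's standard `xml.sax` module rather than PHP. The `PathHandler` class, the `collect_titles` helper and the sample document are made up for illustration only.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Records the text of every <title>, together with its path of parent tags."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the tags that are currently open
        self.titles = []   # (path, text) pairs collected so far
        self._text = []    # character data seen since the last tag event

    def startElement(self, name, attrs):
        self.stack.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "title":
            path = "/".join(self.stack)
            self.titles.append((path, "".join(self._text).strip()))
        self.stack.pop()
        self._text = []


def collect_titles(xml_text):
    handler = PathHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


# Hypothetical sample document, not taken from the thread.
SAMPLE = "<library><book><title>Dune</title></book><cd><title>Kind of Blue</title></cd></library>"
print(collect_titles(SAMPLE))
# [('library/book/title', 'Dune'), ('library/cd/title', 'Kind of Blue')]
```

The parser itself remembers nothing about where it is in the document; the stack is the only thing that tells the two `<title>` elements apart.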

DOM parses XML into a tree structure. It offers a lot of flexibility, as you can implement a lot of functionality by extending the Element classes and implementing something based on the composite pattern to process the data (though extending classes in PHP5 is an issue due to silly quirks). DOM offers the most flexibility as it’s easy to manipulate.

XPath is an extension that works with DOM, like XSL. These can be used to easily convert from one XML format to another, and they are generally the preferred way to transform XML files as they are platform/language independent (most languages have XSL/XPath support).

In most cases DOM is what you need.

Well, they are sort of different things. SAX is a parser style; DOM is an access method; XPath is a language. DOM and XPath implementations are often written using a SAX parser underneath, as it is a common, low-level XML parser style.

I beg to differ MiJaySung, it all depends on what problem you are trying to solve. It seems to me that “your” problems are mostly solved with DOM.

SAX is a top-down parser. It starts at the start and moves to the end, giving you a single pass over the XML in which to process the information. Excellent for template parsing (and bad in some situations). SAX parsing gives you the ability to do something at each node (or not, if you don’t want to). SAX is the more memory efficient of the two methods (SAX vs DOM, not taking your code into account, just the XML parser itself).

DOM allows you to manipulate the XML document itself. By loading it into a tree structure, you can move around that tree, add/remove nodes at will, and then put it all back into the document. This is at the expense of having to load the whole document into memory, whereas SAX only loads the part you are looking at, then “forgets” the rest (hence why MiJaySung said you might need to create a stack, although mostly this is still more efficient than holding the whole doc in memory like DOM does).
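As a rough illustration of "load the tree, add/remove nodes, write it back", here is a small sketch with Python's standard `xml.dom.minidom`. The element names (`draft`, `note`) and the `prune_and_annotate` helper are invented for the example; the same steps exist in other DOM implementations under similar method names.

```python
from xml.dom import minidom


def prune_and_annotate(xml_text):
    # The whole document is parsed into an in-memory tree first.
    doc = minidom.parseString(xml_text)
    root = doc.documentElement

    # Remove every <draft> element, wherever it sits in the tree.
    for node in list(doc.getElementsByTagName("draft")):
        node.parentNode.removeChild(node)

    # Add a new <note> child at the end of the root element.
    note = doc.createElement("note")
    note.appendChild(doc.createTextNode("checked"))
    root.appendChild(note)

    # Serialise the modified tree back to XML text.
    return root.toxml()


# Hypothetical sample document.
SAMPLE = "<doc><draft>old</draft><body>text</body></doc>"
print(prune_and_annotate(SAMPLE))
# <doc><body>text</body><note>checked</note></doc>
```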

SAX vs DOM = Speed/Efficiency vs Flexibility/Performance

That said, at times, using SAX can be like pulling teeth if you have to keep track of your path through the document. But … it’s the easier of the two to learn if you aren’t familiar with tree traversal techniques.

SAX is good if your document is always structured in the same order.

XPath is a different animal altogether. It’s just a technique for querying documents. For example, I can ask it to pull out just one node at a time without worrying about the rest of the document. From that point you have to decide what parser you are going to use to process the data, SAX or DOM.

And just for good measure, SimpleXML is very similar to DOM, but makes it a whole lot easier to use as you can traverse it using standard iteration techniques like you would with arrays, objects and iterators.

SAX parsing allows for serial access to an XML document. SAX works well for read-only access and not so well for updates. Typically SAX is not very resource intensive. If you want serial and/or read-only access, use SAX.

DOM, on the other hand, creates an in-memory data structure of the entire XML document and allows for random access to the XML document. DOM works well for updating an XML document. DOM can be resource intensive as the XML documents get larger. If you want random access and/or you want to update your XML document, use DOM.

XPath is a language typically used in conjunction with DOM to randomly access nodes from a DOM tree.
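A quick sketch of what "randomly accessing nodes" with a path expression looks like. It uses Python's standard `xml.etree.ElementTree`, which builds an in-memory tree and supports only a limited subset of XPath, so treat it as an approximation of a full XPath engine. The catalog document and the `title_for` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<catalog>
  <book id="1"><title>Dune</title><price>9</price></book>
  <book id="2"><title>Emma</title><price>5</price></book>
</catalog>
"""


def title_for(xml_text, book_id):
    root = ET.fromstring(xml_text)
    # One expression picks out the node; no manual walking of the tree.
    node = root.find(f".//book[@id='{book_id}']/title")
    return None if node is None else node.text


print(title_for(SAMPLE, "2"))
# Emma
```

Pasting `book_id` into the expression is fine for a toy example; with untrusted input you would want to validate it first.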

JT

I find SAX style is also useful in a streaming environment where the XML is being streamed to you in pieces. Using SAX you can parse each piece as it comes in instead of having to wait for the last part of the document before beginning.
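Parsing the pieces as they arrive can be sketched like this with Python's standard `xml.sax`, whose default parser accepts data incrementally through `feed()`. The chunks, the `ItemCounter` handler and the `count_items` helper are invented for the example, and the chunk boundaries deliberately fall in the middle of tags.

```python
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements as their closing tags are seen."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def endElement(self, name):
        if name == "item":
            self.count += 1


def count_items(chunks):
    handler = ItemCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    running = []
    for chunk in chunks:
        parser.feed(chunk)             # parse whatever has arrived so far
        running.append(handler.count)  # items completed after this chunk
    parser.close()                     # signal the end of the document
    return running


# Hypothetical stream, split at arbitrary points.
CHUNKS = ["<feed><item>a</item><it", "em>b</item><item>c</it", "em></feed>"]
print(count_items(CHUNKS))
```

The running count grows while the document is still incomplete, which is the point: the handler is called as soon as a complete tag has been received, not when the last chunk turns up.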

Thanks for the info guys. Really, all these posts have given me a much better idea of this subject. I have in fact chosen DOM since it seems to be the standard method to access XML (at least I think).

Thanks for all the help guys!

SAX is also very nice if you retrieve XML data from other sites, such as partner or affiliate sites. I receive gigs of XML a day from partner sites (we resell data). I have to use SAX to process data by node, otherwise I’m loading very large chunks of information into memory at once, which would use up all the RAM I have available on the server. SAX allows me to process the data by element and do what I wish with it and move on to the next.
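For this "process one element, then move on" pattern, here is a sketch in Python. Note that it swaps in `xml.etree.ElementTree.iterparse`, a pull-style streaming API, rather than raw SAX callbacks; the idea of never holding the whole feed in memory is the same. The `product`/`price` element names and the `total_price` helper are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def total_price(stream):
    total = 0
    # "end" events fire once an element and all of its children have been read.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "product":
            total += int(elem.findtext("price"))
            # Drop the record's children and text now that it has been handled.
            elem.clear()
    return total


# Hypothetical feed; a real one would be a file or network stream.
FEED = b"<products><product><price>3</price></product><product><price>4</price></product></products>"
print(total_price(io.BytesIO(FEED)))
# 7
```

One caveat: the emptied `product` elements stay attached to the root, so for truly huge feeds you would also want to detach them from the root as you go.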

I beg to differ MiJaySung, it all depends on what problem you are trying to solve. It seems to me that “your” problems are mostly solved with DOM.

Well, as I said, as a generalisation DOM is what most people use and prefer, and for most things it’s the best route, as with more complex formats you do need to build up some form of structure from the document (for instance, with SAX you quite often end up having at least a “tag stack”). Likewise, it’s more standardised as it is used in JavaScript for manipulations.

As I said, SAX to me is best for very small and very simple XML files, mainly because I don’t need the tree structure to see how a node affects/is affected by other nodes. Likewise, I said SAX was good for large things for the reasons feti suggested. In that case you probably also have a more complex format, so you often need specialised data structures built as you parse, in order to get round the issue that a full-on DOM would use too much RAM; at the same time some structure is needed, as nodes/tags are more interrelated and therefore dependent on the settings of other nodes in the document.

The other reason I hint at DOM is that the SAX parser built into PHP is cack and generally falls over at the slightest error. However, a self-coded SAX parser is the opposite, as it can be written with different approaches to dealing with invalid data.
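To show what "falls over at the slightest error" looks like in practice, here is a sketch using Python's standard `xml.sax` (not PHP's parser, whose behaviour may differ in detail). With the default error handler, a document that is not well formed raises an exception instead of being parsed leniently. The `try_parse` helper is made up for the example.

```python
import xml.sax


def try_parse(xml_text):
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
        return "ok"
    except xml.sax.SAXParseException as exc:
        # The default error handler raises on any well-formedness error.
        return f"failed at line {exc.getLineNumber()}"


print(try_parse("<a><b/></a>"))   # well formed
print(try_parse("<a><b></a>"))    # <b> is never closed
```

Anything more forgiving than this means writing or choosing a parser designed to recover from bad input.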