XML Semantics and XPath: Adjacent Text and Element Nodes

Hello all,

I am doing some experimenting with XPath via the PHP SimpleXMLElement class. I want to perform certain queries to retrieve various elements (clearly, since I’m using XPath queries to do it). However, I am running into an issue when I appear to have adjacent text and element nodes.

(X)HTML has no problem parsing this and allowing access via JavaScript:


<p>The color <span class="color">orange</span> has always been my favorite color.</p>

I have looked at the W3C specification for XML (just to verify that this is valid markup, even though I know it is) and found this definition:

3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]

Ok, so maybe the problem is with my XPath queries? This is what I tried ($p is an instance of SimpleXMLElement representing the p element):

$content = $p->xpath('child::*');   //get all children of p. returns a SimpleXML object containing the text 'orange'
$content = $p->xpath('child::text()');   //get all text nodes which are children of p. returns 2 SimpleXMLElement objects representing just the span element!

Any similar queries targeting the same elements return the same thing. So, in the first case (get all children), only a text node containing ‘orange’ seems to be recognized, but in the second (all children that are text nodes), 2 copies of the span element itself seem to be the only things recognized! The rest of the text, which I thought would be contained in two text nodes, is never recognized. I am way confused right now. Thoughts?

I’m not sure what you want to select… you could try

//p/child::node()

if you want to select both text nodes and elements?

jurn: I want access to all of the content. Ideally I would like to see a text node with the text up to the span, the span, then another text node with the rest of the text. I just need to be able to access all of the data. I tried your suggestion and after running that XPath query this is what was returned:


Array
(
    [0] => SimpleXMLElement Object
        (
            [span] => orange
        )

    [1] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => color
                )

            [0] => orange
        )

    [2] => SimpleXMLElement Object
        (
            [span] => orange
        )

)

It seems to have quite a bit of trouble recognizing the text nodes at all… :frowning:

Maybe ->nodeValue or ->textContent would do it?

Mittineague: I am not sure in what context to use your suggestion. Neither the SimpleXMLElement nor DOM (I checked just in case) include either of those methods. I also checked the XPath documentation at w3.org and couldn’t find them either. Am I missing something?

:d’oh:

Sorry about that, I’ve been working with XPath in JavaScript and forgot to shift gears. :blush:

I’ll put together a test case and get back ASAP

I tried with SimpleXML but couldn’t get it to work unless I added explicit <text> nodes around the text.

But I was closer than I thought. Try:

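<?php
// Sample markup: a <p> with mixed content (text, a <span>, then more text).
$xmlstr = <<<XML
<p>The color <span class="color">orange</span> has always been my favorite color.</p>
XML;

$doc = new DOMDocument;
$doc->loadXML($xmlstr);
$xpath = new DOMXPath($doc);
$query = '//p';
$ptags = $xpath->query($query);
foreach ($ptags as $ptag)
{
	// nodeValue concatenates the text of all descendant nodes, including the span.
	echo $ptag->nodeValue . "<br />\n";
}
?>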
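
If you need the pieces separately (text, then the span, then the rest of the text) rather than one concatenated string, something along these lines might work. It is only a sketch built on the same sample markup: it runs the child::node() query suggested earlier through DOMXPath and checks each result's class. The $parts array and the labels in it are just names I made up for illustration.

```php
<?php
$xmlstr = '<p>The color <span class="color">orange</span> has always been my favorite color.</p>';

$doc = new DOMDocument;
$doc->loadXML($xmlstr);
$xpath = new DOMXPath($doc);

// child::node() matches text nodes as well as elements, in document order.
$parts = [];
foreach ($xpath->query('/p/child::node()') as $node) {
    if ($node instanceof DOMText) {
        // Quote the value so leading/trailing spaces stay visible.
        $parts[] = 'text: "' . $node->nodeValue . '"';
    } elseif ($node instanceof DOMElement) {
        $parts[] = 'element <' . $node->nodeName . '>: "' . $node->textContent . '"';
    }
}

echo implode("\n", $parts) . "\n";
```

With the sample paragraph this should list three entries: the leading text, the span, and the trailing text.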

Thanks Mittineague! I really wanted to stick with SimpleXML but if it doesn’t work there’s not much I can do. I’ll just have to work with the solution you provided with DOM constructs. Thanks again for your time!
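
For anyone who would still like to keep a SimpleXMLElement as the starting point, one possible middle ground is dom_import_simplexml(), which hands back the DOM element behind a SimpleXML node. The sketch below assumes the same sample paragraph as above; $pDom and $pieces are illustrative names, and the 'text'/'element' labels only cover the two node types that occur in this input.

```php
<?php
$p = new SimpleXMLElement(
    '<p>The color <span class="color">orange</span> has always been my favorite color.</p>'
);

// dom_import_simplexml() returns the DOMElement for the same underlying node.
$pDom = dom_import_simplexml($p);

$pieces = [];
foreach ($pDom->childNodes as $child) {
    // XML_TEXT_NODE is a predefined DOM constant; for this input every
    // non-text child is an element, so the two labels are sufficient here.
    $kind = $child->nodeType === XML_TEXT_NODE ? 'text' : 'element';
    $pieces[] = [$kind, $child->textContent];
}

print_r($pieces);
```

The rest of the code can keep using $p for ordinary SimpleXML access and only drop down to DOM where the text nodes matter.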