Recursively iterating through XML document elements but only for certain tags...?

Wolf_22 · July 29, 2011, 3:57am

I have a HUGE XML document that I need to parse for only specific pieces of data. The XML has the standard structure any XML document has, but this one has a <page> element whose contents I need. What I mean is that I need some kind of recursive iteration that can load the entire document, then find and output everything between the opening and closing “page” tags throughout the file.

That said, how would one recursively extract the contents between a developer-specified opening and closing tag using SimpleXMLIterator? It’s probably something simple for most of you, but I’m a bit bewildered by this task right now, and I think it may be due to having had only 4 or 5 hours of rest over the past 3 days…

If anyone has any examples or starting points that they could provide that I could get started with, you would be in my best prayers. Have any of you even ever done this before?

Mittineague · July 29, 2011, 5:00am

Without seeing the XML I can’t say for certain, but you may be able to use XPATH to limit the results to only the <page> nodes without any need for recursion.

Got a short example? One typical “node group” should suffice.

Wolf_22 · July 29, 2011, 12:21pm

Hi, Mittineague.

I didn’t include any of the <front> portions simply because whether it’s some “<front>” element or “<item>” element, all I believe I need is a way to extract content between an opening and closing group of data. BUT, if you really need some code pertaining to those sections, let me know and I’ll throw some of it on here.

Below is the header portion of the XML code…:

<?xml version="1.0" encoding="ISO-8859-1"?>
<TEI.2>
  <teiHeader status="new" type="text">
    <fileDesc>
      <titleStmt>
        <title>Lorem Ipsum</title>
        <author>Lorem Ipsum</author>
        <sponsor>Lorem Ipsum</sponsor>
        <principal>Lorem Ipsum</principal>
        <respStmt>
          <resp>Lorem Ipsum</resp>
          <name>Lorem Ipsum</name>
          <name>Lorem Ipsum</name>
          <name>Lorem Ipsum</name>
          <name>Lorem Ipsum</name>
        </respStmt>
        <funder n="org:BLAH">Lorem Ipsum</funder>
      </titleStmt>
      <extent />
      <publicationStmt>
        <publisher>Lorem Ipsum</publisher>
        <pubPlace>Lorem Ipsum</pubPlace>
        <authority>Lorem Ipsum</authority>
        <availability status="free">
          <p>
            Lorem Ipsum, Lorem Ipsum, Lorem Ipsum...
          </p>
          <list>
            <item>
              Lorem Ipsum
              <quote>Lorem Ipsum, Lorem Ipsum, Lorem Ipsum, Lorem Ipsum, Lorem Ipsum, Lorem Ipsum, Lorem Ipsum.</quote>
            </item>
            <item>Lorem Ipsum</item>
            <item>Lorem Ipsum</item>
            <item>Lorem Ipsum</item>
          </list>
        </availability>
      </publicationStmt>
      <sourceDesc default="NO">
        <biblStruct default="NO">
          <monogr>
            <title>Lorem Ipsum</title>
            <author>Lorem Ipsum</author>
            <imprint>
              <pubPlace>Lorem Ipsum</pubPlace>
              <publisher>Lorem Ipsum</publisher>
              <date>1870</date>
            </imprint>
          </monogr>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <editorialDecl default="NO">
        <correction status="medium" method="silent" default="NO">
          <p>Lorem Ipsum</p>
        </correction>
      </editorialDecl>
      <refsDecl doctype="TEI.2" n="front">
        <state unit="section" n="chunk" />
      </refsDecl>
      <refsDecl doctype="TEI.2" n="body">
        <state unit="section" />
        <state unit="subsection" />
        <state unit="paragraph" n="chunk" />
      </refsDecl>
      <refsDecl doctype="TEI.2">
        <state unit="section" />
        <state unit="subsection" />
        <state unit="paragraph" n="chunk" />
      </refsDecl>
    </encodingDesc>
    <profileDesc>
      <langUsage default="NO">
        <language id="en">
          English
        </language>
        <language id="greek">
          Greek
        </language>
        <language id="la">
          Latin
        </language>
        <language id="de">
          German
        </language>
        <language id="fr">
          French
        </language>
        <language id="it">
          Italian
        </language>
      </langUsage>
    </profileDesc>
    <revisionDesc>
      <change>
        <date>June 25, 1819</date>
        <respStmt>
          <name>Lorem Ipsum</name>
          <resp>Lorem Ipsum</resp>
        </respStmt>
        <item>
	   Etiam in consequat est. Ut at mattis magna. Praesent quis metus in nibh lobortis egestas condimentum eu tortor. Nullam ut mi justo, nec scelerisque ante. Integer risus mauris, pretium eu laoreet eget, adipiscing sed ligula. In vitae congue lacus. Nulla a felis velit, et ultricies lacus. Vivamus volutpat imperdiet mauris, vitae aliquet augue tempor nec. Vivamus eget hendrerit dui. Donec id enim ut enim dignissim luctus. Pellentesque vel orci arcu, quis venenatis libero.
        </item>
      </change>
    </revisionDesc>
  </teiHeader>

So for this particular example, let’s say I’m trying to extract all the content between “<profileDesc>” and “</profileDesc>”, meaning I’m trying to extract only the content (not necessarily the tags themselves, though the option could be nice for possible conditions later on down the road; if I need to create any pertinent open-close tag conditions, for example).

What do you think?

Mittineague · July 30, 2011, 3:56am

With that, if you used the XPath
//profileDesc
it would return a collection of nodes you could loop through

      <langUsage default="NO">
        <language id="en">
          English
        </language>
        <language id="greek">
          Greek
        </language>
        <language id="la">
          Latin
        </language>
        <language id="de">
          German
        </language>
        <language id="fr">
          French
        </language>
        <language id="it">
          Italian
        </language>
      </langUsage>

You could then work with those, or use nodeValue to get

English Greek Latin German French Italian
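
A minimal sketch of the XPath approach described above, assuming the file is handled with SimpleXML as in the rest of the thread. The inline XML is a cut-down stand-in for the posted header, and the variable names are made up for illustration. Note that nodeValue is a DOM property; with SimpleXML, casting an element to a string plays that role, and it only returns the element's own text, so the sketch queries the <language> descendants explicitly. asXML() covers the case where the tags are wanted as well.

```php
<?php
// Cut-down stand-in for the posted header; a real file could be
// loaded with simplexml_load_file() instead.
$xmlString = <<<XML
<TEI.2>
  <teiHeader>
    <profileDesc>
      <langUsage default="NO">
        <language id="en">English</language>
        <language id="greek">Greek</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
</TEI.2>
XML;

$xml = simplexml_load_string($xmlString);

$languages = []; // text content only
$markup = [];    // matched nodes with their tags

foreach ($xml->xpath('//profileDesc') as $profileDesc) {
    // Serialize the matched node, tags included.
    $markup[] = $profileDesc->asXML();

    // Collect the text of each <language> descendant, trimmed.
    foreach ($profileDesc->xpath('.//language') as $language) {
        $languages[] = trim((string) $language);
    }
}

echo implode(' ', $languages), "\n";
```

Swapping '//profileDesc' for '//page' should cover the original question in the same way, though that has not been checked against the real file.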

Wolf_22 · August 1, 2011, 1:14am

Thanks for the insight, Mittineague.

One more thing: what if I would like to retrieve all the content contained in the “biblStruct”? I’m going to be trying to break the entire XML file down into individual pages. That said, I’d love to be able to use some form of wildcard to extract all the CONTENT of a given section of XML tags. So, how could I do this with the biblStruct if the following doesn’t seem to work?:

$output = $xml->xpath('//sourceDesc/biblStruct/*');

Mittineague · August 2, 2011, 1:35am

I don’t know what you mean by “doesn’t work”; for me it returns an array of objects (one member in this sample code; for a larger file you could loop), and the third member of that object (imprint) is itself an object with three members.

eg.

echo '<pre>';
var_dump($output);
echo '</pre>';

shows

array(1) {
 [0]=>  object(SimpleXMLElement)#2
 (3) {
	["title"]=>  string(11) "Lorem Ipsum"
	["author"]=>  string(11) "Lorem Ipsum"
	["imprint"]=>  object(SimpleXMLElement)#3
	(3) {
		["pubPlace"]=>  string(11) "Lorem Ipsum"
		["publisher"]=>  string(11) "Lorem Ipsum"
		["date"]=>  string(4) "1870" } } }

which can be accessed like

echo "title: " . $output[0]->title . "<br />\n";
echo "author: " . $output[0]->author . "<br />\n";
echo "pubPlace: " . $output[0]->imprint->pubPlace . "<br />\n";
echo "publisher: " . $output[0]->imprint->publisher . "<br />\n";
echo "date: " . $output[0]->imprint->date . "<br />\n";
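
If the element names are not known in advance, the same result can be walked generically instead of naming each field. The following is only a sketch: collectLeaves() is a hypothetical helper, not part of SimpleXML, and the inline XML is a shortened stand-in for the posted <biblStruct>. It recurses through children() and records the path and text of every element that has no child elements.

```php
<?php
// Shortened stand-in for the <biblStruct> fragment posted in the thread.
$xml = simplexml_load_string(
    '<sourceDesc><biblStruct><monogr>'
    . '<title>Lorem Ipsum</title>'
    . '<imprint><date>1870</date></imprint>'
    . '</monogr></biblStruct></sourceDesc>'
);

// Hypothetical helper: returns a list of [path, text] pairs, one for
// each descendant (or the node itself) that has no child elements.
function collectLeaves(SimpleXMLElement $node, string $path = ''): array
{
    $path = ltrim($path . '/' . $node->getName(), '/');

    if ($node->count() === 0) {
        return [[$path, trim((string) $node)]];
    }

    $leaves = [];
    foreach ($node->children() as $child) {
        $leaves = array_merge($leaves, collectLeaves($child, $path));
    }
    return $leaves;
}

$leaves = [];
foreach ($xml->xpath('//sourceDesc/biblStruct/*') as $element) {
    $leaves = array_merge($leaves, collectLeaves($element));
}

foreach ($leaves as [$path, $text]) {
    echo $path, ': ', $text, "\n";
}
```

Pairs are used rather than an associative array so that repeated siblings, such as the four <name> elements in the posted header, would not overwrite one another.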

wwb_99 · January 31, 2012, 7:39pm

If this is a truly large file, you might want to look at using the old PHP XML functions that were SAX based. Not the most elegant solution, but insanely efficient if you need a forward-only, read-only snapshot of the data.
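
A rough sketch of that SAX-style approach using PHP's XML Parser functions (xml_parser_create(), xml_set_element_handler(), xml_set_character_data_handler(), xml_parse()). The target tag, the variable names, and the tiny inline document are assumptions for illustration; the string is split into small chunks only to mimic reading a large file piece by piece, where a real script would feed fread() output instead. XMLReader is another forward-only option in PHP.

```php
<?php
// Assumed target tag; the thread's real goal was <page>, which the
// posted sample does not contain.
$targetTag = 'language';

$texts = [];  // collected text, one entry per matched element
$depth = 0;   // greater than 0 while inside a target element
$buffer = ''; // text gathered for the element currently open

$parser = xml_parser_create();
// Report element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start tag: begin (or continue) tracking when inside the target.
    function ($parser, $name, $attrs) use (&$depth, &$buffer, $targetTag) {
        if ($name === $targetTag || $depth > 0) {
            if ($depth === 0) {
                $buffer = '';
            }
            $depth++;
        }
    },
    // End tag: store the buffer once the target element closes.
    function ($parser, $name) use (&$depth, &$buffer, &$texts) {
        if ($depth > 0 && --$depth === 0) {
            $texts[] = trim($buffer);
        }
    }
);

xml_set_character_data_handler(
    $parser,
    // Text may arrive in several pieces, so append rather than assign.
    function ($parser, $data) use (&$depth, &$buffer) {
        if ($depth > 0) {
            $buffer .= $data;
        }
    }
);

$xmlString = '<langUsage><language id="en">English</language>'
    . '<language id="la">Latin</language></langUsage>';

// Feed the document in small chunks, flagging the last one as final.
$chunks = str_split($xmlString, 32);
foreach ($chunks as $i => $chunk) {
    $isFinal = ($i === count($chunks) - 1);
    if (!xml_parse($parser, $chunk, $isFinal)) {
        throw new RuntimeException(
            xml_error_string(xml_get_error_code($parser))
        );
    }
}

echo implode(', ', $texts), "\n";
```

Because only the current element's text is held in memory, this style should scale to files that are too large to load with SimpleXML, at the cost of managing state by hand.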