Hi all
I am writing a WP plugin that takes a Word 2007 document and extracts the content to post it as a page/post.
While adding new functionality for the next version, I came across this problem. Following is a snippet of the XML data extracted from the document:
<W:P>
<W:PPR>
<W:PSTYLE W:VAL='ListParagraph' />
<W:NUMPR>
<W:ILVL W:VAL='0' />
<W:NUMID W:VAL='1' />
</W:NUMPR>
</W:PPR>
<W:R>
<W:T value='the value of the item' />
</W:R>
</W:P>
This is all the markup the XML file gives me for a list item (the block is repeated for each item). Here is an explanation of the tags, as far as I could figure out:
- W:P - Start a normal Paragraph tag
- W:PSTYLE - What style of text will follow
- W:ILVL W:VAL = 0 - The level of indentation
- W:NUMID W:VAL = 1 - The type of list being used (1 = ul | 2 = ol)
- W:T - This is the text being displayed
I extracted the data with the xml_parse_into_struct function, which creates an array like this (snippet):
[496] => Array
(
[tag] => W:P
[type] => open
[level] => 3
[attributes] => Array
(
[W:RSIDR] => 00F775F1
[W:RSIDRDEFAULT] => 00F775F1
[W:RSIDP] => 00F775F1
)
)
[497] => Array
(
[tag] => W:PPR
[type] => open
[level] => 4
)
[498] => Array
(
[tag] => W:PSTYLE
[type] => complete
[level] => 5
[attributes] => Array
(
[W:VAL] => ListParagraph
)
)
[499] => Array
(
[tag] => W:NUMPR
[type] => open
[level] => 5
)
[500] => Array
(
[tag] => W:ILVL
[type] => complete
[level] => 6
[attributes] => Array
(
[W:VAL] => 0
)
)
[501] => Array
(
[tag] => W:NUMID
[type] => complete
[level] => 6
[attributes] => Array
(
[W:VAL] => 1
)
)
[502] => Array
(
[tag] => W:NUMPR
[type] => close
[level] => 5
)
[503] => Array
(
[tag] => W:PPR
[type] => close
[level] => 4
)
[504] => Array
(
[tag] => W:R
[type] => open
[level] => 4
)
[505] => Array
(
[tag] => W:T
[type] => complete
[level] => 5
[value] => The first item of a UL
)
[506] => Array
(
[tag] => W:R
[type] => close
[level] => 4
)
[507] => Array
(
[tag] => W:P
[type] => close
[level] => 3
)
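For reference, an array like the one above can be reproduced with a few lines. This is only a minimal sketch: the sample XML is a hand-written stand-in for the real document (with a namespace declaration added so it is well-formed on its own), not the actual file contents.

```php
<?php
// Hypothetical sample input: one list paragraph wrapped in a minimal
// document so that the nesting levels match the dump above.
$xml = <<<XML
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p>
<w:pPr>
<w:pStyle w:val="ListParagraph"/>
<w:numPr>
<w:ilvl w:val="0"/>
<w:numId w:val="1"/>
</w:numPr>
</w:pPr>
<w:r><w:t>The first item of a UL</w:t></w:r>
</w:p>
</w:body>
</w:document>
XML;

$parser = xml_parser_create();
// Case folding is on by default, which is why the tags show up as
// W:P, W:PPR and so on. Whitespace-only text is skipped to keep the
// resulting array compact.
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
xml_parse_into_struct($parser, $xml, $values, $index);

// Print one line per entry: tag, type and nesting level.
foreach ($values as $node) {
    echo $node['tag'], ' (', $node['type'], ', level ', $node['level'], ")\n";
}
```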
I then pass each array entry separately to a function that first checks the type of tag (open|complete|close) and then uses a switch statement to test for specific tags; when a tag is recognized, it appends the correct string to the output variable.
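To make the approach concrete, here is a heavily simplified sketch of that kind of function. The name render_node and the choice of tags are illustrative assumptions, not the plugin's actual code; only paragraphs and text runs are handled.

```php
<?php
// Hypothetical helper: appends the HTML for ONE entry of the
// xml_parse_into_struct() array to $out. It sees a single entry per call.
function render_node(array $node, string &$out): void
{
    switch ($node['type']) {
        case 'open':
            switch ($node['tag']) {
                case 'W:P':
                    $out .= '<p>';
                    break;
            }
            break;
        case 'complete':
            switch ($node['tag']) {
                case 'W:T':
                    // Text runs carry their content in the 'value' key.
                    $out .= htmlspecialchars($node['value'] ?? '');
                    break;
            }
            break;
        case 'close':
            switch ($node['tag']) {
                case 'W:P':
                    $out .= '</p>';
                    break;
            }
            break;
    }
}

// Hand-written input mirroring part of the dump above.
$nodes = [
    ['tag' => 'W:P', 'type' => 'open', 'level' => 3],
    ['tag' => 'W:R', 'type' => 'open', 'level' => 4],
    ['tag' => 'W:T', 'type' => 'complete', 'level' => 5, 'value' => 'The first item of a UL'],
    ['tag' => 'W:R', 'type' => 'close', 'level' => 4],
    ['tag' => 'W:P', 'type' => 'close', 'level' => 3],
];

$html = '';
foreach ($nodes as $node) {
    render_node($node, $html);
}
echo $html, "\n"; // prints <p>The first item of a UL</p>
```

Because each call only sees one entry, nothing in this shape remembers that the previous paragraph was also a list item, which is exactly where I get stuck.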
And now, finally, my problem: how do I generate an ordered or bulleted list (with multi-level support) from these tags, given that only one tag can be processed at a time?
Some more background information:
- An Office 2007 file is a .zip file containing XML data and the images as separate files (they are not embedded in the document XML itself).
- I was able to correctly extract Bold, Italics, Underlined, Super and Subscripts, Images (resized), Tables and even hyperlinks. All I still need is the list.
- The plugin can be downloaded Free at http://wordpress.org/extend/plugins/docx-to-html-free/ and bought Premium at http://wpplugins.com/plugin/305/docx-to-html-premium
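Regarding the first point above, this is roughly how the XML can be pulled out of the archive. It is a minimal sketch that assumes PHP's zip extension (ZipArchive) is available and that the main body is stored under word/document.xml; read_document_xml is an illustrative name, not a function from the plugin.

```php
<?php
// Hypothetical helper: returns the main document XML from a .docx file,
// or null if the archive or the entry cannot be read.
function read_document_xml(string $docxPath): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return null; // not a readable zip archive
    }
    // Assumption: the document body lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? null : $xml;
}
```

The returned string is what then gets passed to xml_parse_into_struct.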