PCDATA vs CDATA... just means to be parsed or not to be parsed as XHTML or HTML?

winterheat · April 21, 2009, 12:57am

I see mentions of PCDATA and CDATA around…

and after looking for info about them, it turns out that

PCDATA is just the characters that will be parsed by XML, XHTML, or HTML parser, and

CDATA is not to be parsed, and that’s it?

winterheat · April 21, 2009, 1:48am

and is it true that in the realm of PCDATA, only 3 characters need to get special attention:

< > &

and that’s it? thanks.

AutisticCuckoo · April 21, 2009, 5:45am

This is a bit more complicated than one might think. :-/

First of all, CDATA can refer to three different things: attribute values, element content and CDATA sections. From your question I’ll assume that you are talking about the element content aspect.

An element whose content model is declared as CDATA (e.g., script and style in HTML) can only contain text – no subordinate elements. Furthermore, the text will not be parsed for entity references, so you don’t have to escape ampersands or less-than signs. In fact, you mustn’t. Any occurrence of an ETAGO separator (</) followed by a name start character will be taken as the closing tag. No matter what identifier it is. So <script>...</foo> will be the same thing as <script>...</script>. This causes some problems for beginners trying to insert content with document.write() or .innerHTML until they learn to escape the ETAGO (or use DOM methods).
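The "not parsed for entity references" part can be seen with Python's standard html.parser. This is only a rough sketch under some assumptions: that parser follows browser-like rules rather than strict SGML (it ends a script at </script specifically, not at any ETAGO), and the TextCollector class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside each element type."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # names of the currently open elements
        self.text = {}    # element name -> concatenated text content

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.stack:
            name = self.stack[-1]
            self.text[name] = self.text.get(name, "") + data


collector = TextCollector()
collector.feed("<script>if (a &lt; b) {}</script><p>a &lt; b</p>")
collector.close()

script_text = collector.text["script"]  # CDATA content: reference left alone
p_text = collector.text["p"]            # PCDATA content: reference decoded
print(repr(script_text))
print(repr(p_text))
```

The same characters, &lt;, survive untouched inside script but turn into a less-than sign inside p, which is why escaping in a script element does more harm than good.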

An element whose content model is or contains (#PCDATA) is different. Here entity references are parsed and markup delimiters like STAGO (<), ETAGO (</) and TAGC (>) are recognised. Plus the ampersand, of course, which will be assumed to be an ERO delimiter. So we need to escape ‘<’ and ‘&’ characters. It’s usually not necessary to escape ‘>’ characters, but it’s commonly done for the sake of symmetry.
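As a small illustration of the escaping rule for (#PCDATA) content, here is a sketch using Python's standard XML tools. It assumes XML parsing (which is stricter than most HTML parsers); the sample string and the is_well_formed helper are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if (a < b && c > d)"

# escape() replaces '&', '<' and '>' with entity references.
escaped = escape(raw)
parsed = ET.fromstring("<p>" + escaped + "</p>").text


def is_well_formed(xml_text):
    """Hypothetical helper: True if the XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(escaped)
print(parsed == raw)                       # the parser restores the original text
print(is_well_formed("<p>a < b</p>"))      # literal '<' is rejected
print(is_well_formed("<p>a & b</p>"))      # literal '&' is rejected
print(is_well_formed("<p>a > b</p>"))      # literal '>' is tolerated
```

The last three checks mirror the point above: '<' and '&' have to be escaped, while '>' is normally fine as it is.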

The next problem is that the script and style element types are declared as CDATA in HTML, but as (#PCDATA) in XHTML. This causes insurmountable problems for people using pretend-XHTML, since it can be impossible to write JavaScript code in a way that works in both the HTML and the XML parsing mode. Therefore you should always use external scripts and style sheets with pretend-XHTML.

winterheat · April 21, 2009, 9:15am

thanks. it is great to know there are 3 places where data are parsed… wait… do you mean attribute value, element content are both parsed data? so aren’t they PCDATA?

what i was thinking was that sometimes we explain things like attribute value is PCDATA and it is parsed… i think it is not very explanatory. In a way, it is like if a person asks, why do we move forward when we are on a moving bus and the bus stops all of a sudden, one way to explain it (very popular in hong kong) is “it is all due to inertial property of any object – and this property is directly proportional to its mass and inversely proportional to the force that is exerted on the object.” i think just saying, “newton has 3 laws about object’s movement, and the first one is, any object that is moving likes to keep on moving, and any object that is not moving likes to keep on not moving. so when we are moving from Italy to France on a bus, and the bus stops all of a sudden, our body likes to keep on moving”, explains a lot more and makes the subject matter a lot more interesting.

AutisticCuckoo · April 21, 2009, 11:01am

An attribute value declared as CDATA can contain text and include entity references (which will be parsed). So it’s quite different from an element content model of CDATA. It’s a bit confusing.

Most attributes are CDATA in HTML. Even if they are distinguished by the use of various entities (%URI;, %Script;, etc.) this cannot be validated by the parsers. You could use onclick="http://example.com" in your markup and the validator wouldn’t mind.

Attributes cannot be declared as (#PCDATA), since this is a content model, not a value type. Attributes can only contain text, not markup.
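To make the attribute case concrete, here is a minimal sketch with Python's xml.etree.ElementTree. It assumes an XML parser with no DTD (so every attribute is treated as CDATA); the element names, attribute values and the parses helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Entity references inside a CDATA attribute value ARE parsed.
link = ET.fromstring('<a href="page?x=1&amp;y=2">next</a>')
href = link.get("href")


def parses(xml_text):
    """Hypothetical helper: True if the XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(href)
print(parses('<a title="a &lt; b"/>'))   # escaped '<' is fine
print(parses('<a title="a < b"/>'))      # a literal '<' is not allowed in XML
```

So a CDATA attribute value behaves more like (#PCDATA) element content than like CDATA element content, at least as far as references go.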

winterheat · April 21, 2009, 2:50pm

hm… i thought the general rule is that PCDATA is to be parsed and CDATA is not to be parsed… so this rule is not in general true? thanks.

AutisticCuckoo · April 21, 2009, 6:09pm

That is true when it comes to element content, but for attribute values only CDATA can be used.

And in a CDATA attribute value entity references are parsed, but not in CDATA element content.

I did say it’s confusing.

winterheat · April 21, 2009, 9:24pm

ah, thanks. will look into it some more.

will this be accurate about CDATA? (especially this line: [Definition: All text that is not markup constitutes the character data of the document.])

http://www.w3.org/TR/REC-xml/#syntax

2.4 Character Data and Markup

Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]

[Definition: All text that is not markup constitutes the character data of the document.]

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

Character Data

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

AutisticCuckoo · April 22, 2009, 5:50am

LOL, here you’re venturing into the third meaning of CDATA, viz. CDATA sections.

Those are sequences of character data delimited by <![CDATA[ and ]]>, which are not parsed for markup delimiters (except MSC+TAGC (‘]]>’)) or entity references.

You normally use CDATA sections when the text contains a lot of literal characters that have special meaning in the markup language: mainly ‘<’ and ‘&’. For instance, if you want to display a fragment of HTML markup on the page.

As far as I know, Opera is the only browser that supports CDATA sections in HTML. Other browsers support it in their XML parsers, so you can use it with XML – including real XHTML, but not pretend-XHTML.

Here’s an example of a CDATA section in use,

<p>Here's how to mark up user input:
<code><![CDATA[<kbd>* 21 * <var>number</var> #</kbd>]]></code>.</p>

Here’s the same example without using a CDATA section,

<p>Here's how to mark up user input:
<code>&lt;kbd&gt;* 21 * &lt;var&gt;number&lt;/var&gt; #&lt;/kbd&gt;</code>.</p>
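That a CDATA section and entity-escaped text are two spellings of the same content can be checked with an XML parser. A minimal sketch using Python's xml.etree.ElementTree, assuming XML parsing (not an HTML parser) and using only the code element from the example; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

# The CDATA-section form: nothing inside is treated as markup.
with_cdata = "<code><![CDATA[<kbd>* 21 * <var>number</var> #</kbd>]]></code>"

# The escaped form: every '<' and '>' is written as an entity reference.
with_entities = (
    "<code>&lt;kbd&gt;* 21 * &lt;var&gt;number&lt;/var&gt; #&lt;/kbd&gt;</code>"
)

elem_a = ET.fromstring(with_cdata)
elem_b = ET.fromstring(with_entities)

text_a = elem_a.text
text_b = elem_b.text
child_count = len(list(elem_a)) + len(list(elem_b))

print(text_a)
print(text_a == text_b)   # both forms yield the same character data
print(child_count)        # neither form creates kbd or var child elements
```

Either way the parser hands the application the literal text of the markup fragment; the CDATA section is just easier to read and write when there are many special characters.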

system · June 4, 2010, 6:03pm

Indeed, today, in 2010 AD, Opera is still the only one among the big boys to support that in HTML. It just adds an empty space at the end of the CDATA section. That’s minor compared to the funny things the other UAs do with it.

But you forgot to mention that Lynx, a text-only Web browser, also does this job right. In HTML. Perfectly. Better than Opera.

I find myself wondering, in a twisted way, what the benefits would be of switching back to a 386 and MS-DOS 7 just for browsing. Besides going back to faculty years.