Help me remove this invalid Unicode character!?

Stomme_poes · July 28, 2009, 8:18am

Hallo gurus,

I did a rebuild of a page, and decided I’d like a special bullet, one that was not an image… so I thought I’d be all sneaky and clever and use :before. But it’s become my doom. Before switching to yet another image bullet (ugh), maybe someone knows another way around this?

I have a menu, and in place of bullets I have this:


#menu {
  margin: 1em 0;
}
	#menu li {
	  margin-bottom: .3em;
	  padding-left: 1em;
	  font-size: 1em;
	}
	#menu li:before {
	  content: "\00bb" " "; /* raquo (U+00BB) followed by a space */
	  color: #d1b248;
	  font: .8em georgia, serif;
	}
	/* star-html hack: this rule is only picked up by IE6 and older */
	* html #menu li {display: block; width: 99%;}

To get me the >> right angled quote character. I should prolly also test this in JAWS… it’s possible that I’m still really adding content in which case, a decorative bullet shouldn’t be content. But I’ve seen this technique done in forms before for a decorative “hey look here” image for error messages… so, not sure about that.

As I understand it, CSS “content” requires special characters to be written in hex or in a code point??? And if I look here on Wikipedia I see the Unicode code point U+00BB. So I wrote it as you see above, and this is how I’ve seen it in the form I saw as well, since I can’t actually write it in hex with the x… Maybe there’s a way to do it that I don’t know, to actually make it just hex?

And this validates HTML4 no problem. But I wanted to check the page through the W3 semantic extractor for teh Lawlz. Apparently it uses this XML parser, Xerces, which I think is puking on that character (I’m not sure, but after some Googling other people with the problem with this parser were also using unicode code points instead of decimal character entities… so that’s why I think my >> is the issue).

Here is the error:

Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1d) was found in the comment.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1d) was found in the comment.

I don’t have THAT character (0x1d) anywhere that I can tell, but it seems to stand for any “control character” which, apparently, is everything in the range of 0000 to 000something… this includes x00bb : ( I’m definitely not using it as a control character.

So, before giving up and switching to an image (Yet Another GET Request is I guess my only reason for not doing the image…), is there some other equivalent hex code for this character? It has a very low ascii number.
» (187)

Or, better yet, a page that can tell me valid hex equivalents of the decimals? I once found, long ago, a few unicode sites who wanted me to type the character in and then it could give me some other versions… but usually I can’t type these characters in, lawlz, because they’re not in my keyboard. I’ve always used decimals written out to make characters… even for the Euro symbol (there’s a key on my keyboard, but it doesn’t seem to do anything).

Any Unicode gurus out there?

Thanks,
poes

AutisticCuckoo · July 28, 2009, 8:38am

That’s correct. You can shorten it to "\bb" if you like.

You can use a literal ‘»’ character as long as it’s correctly encoded. (The ‘»’ is available in most encodings you’re likely to use, e.g., UTF-8, ISO 8859-1 and Windows-1252).

You can also express it with a character escape as you’ve done.

Character escapes in CSS consist of a backslash (‘\’) followed by 1-6 hexadecimal digits. A blank space after an escape is ignored, which lets you write "\bb !" to produce ‘»!’ instead of having to write "\0000bb!".
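
For anyone who wants the hex form without hunting for a lookup page, here is a minimal Python sketch of the escape rule described above. The helper name css_escape and its pad parameter are made up for illustration; they are not part of any library.

```python
def css_escape(char: str, pad: bool = False) -> str:
    """Return a CSS character escape for a single character.

    pad=False gives the short form, which needs a trailing space if the
    next character could be read as another hex digit.
    pad=True gives the six-digit form, which never needs that space.
    """
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(char)
    if pad:
        return "\\{:06x}".format(code_point)
    return "\\{:x}".format(code_point)


print(css_escape("\u00bb"))            # \bb
print(css_escape("\u00bb", pad=True))  # \0000bb
print(hex(187))                        # 0xbb, the decimal 187 written in hex
```

The short form is handy when the escape is the last thing in the string; the padded form is the safer choice when more text follows directly.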

No, U+00BB is not a control character. The C0 range of control characters is U+0000 to U+001F, and the C1 range is U+0080 to U+009F.

This cannot have anything to do with your ‘»’ character (which should be encoded as C2 BB in UTF-8). You must have a U+001D character somewhere in your source, or it may be some oddity with the software you’re using.
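
The two claims above (U+00BB is outside both control ranges, and it encodes as C2 BB in UTF-8) are easy to check. A small Python sketch, where is_control is a hypothetical helper that only knows about the C0 and C1 ranges mentioned here:

```python
def is_control(char: str) -> bool:
    """True if char is in the C0 (U+0000-U+001F) or C1 (U+0080-U+009F) range."""
    code_point = ord(char)
    return code_point <= 0x1F or 0x80 <= code_point <= 0x9F


print(is_control("\u00bb"))            # False: the raquo is an ordinary character
print(is_control("\u001d"))            # True: this is the one the parser rejected
print("\u00bb".encode("utf-8").hex())  # c2bb
```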

Try searching for it (you have vim, don’t you?) in your source file(s). Since it’s a control character you won’t be able to see it (it’s unprintable), but you should be able to search for it.

Stomme_poes · July 28, 2009, 8:46am

You can use a literal ‘»’ character as long as it’s correctly encoded. (The ‘»’ is available in most encodings you’re likely to use, e.g., UTF-8, ISO 8859-1 and Windows-1252).

Since I can’t actually type it, I don’t really dare to copy-pasta it… though I suppose I could try it.

Actually, me being stupid, I should have commented that whole section out and then tried again with the Semantics Extractor to verify the problem was in that section… unfortunately I just tried that and got this:

is locally blacklisted

arg! This sucks, my entire domain is suddenly blacklisted. I cannot test any of my pages! The ways to contact them are all under the subject of developing the extractor more : ( I wonder if my illegal character had set it off : (

Try searching for it (you have vim, don’t you?) in your source file(s). Since it’s a control character you won’t be able to see it (it’s unprintable), but you should be able to search for it.

I do have vim but I’m not sure how I search for something unprintable… more importantly, I’m not sure how I could have created something unprintable… this is just a pure, static HTML file written in my text editor, which produces pages which have never made this problem before.

*edit I wonder if there’s another XML xerces parser out there I can use to check if I’ve gotten rid of it… but setting the list option in vi only shows my line ends ($)

No, U+00BB is not a control character. The C0 range of control characters is U+0000 to U+001F, and the C1 range is U+0080 to U+009F.
hm I forgot to thank you for this one, I had only run into a list of the c0 range on teh googles… good to know

…holy **** Google is fast picking these pages!!!

AutisticCuckoo · July 28, 2009, 11:00am

In vim you can enter characters by code position.

In insert mode, press Ctrl+V 187 to enter ‘»’ using the decimal value (187). This only works for code positions up to 255, I believe, and you need to type exactly three digits.

In insert mode, press Ctrl+V u 00bb to enter ‘»’ using the hexadecimal value (BB). This works for any Unicode character. You need to type exactly four hex digits.

You can use escapes in your search pattern. /\%d29 <CR> to search for a character with code position 29 decimal, or /\%x1d <CR> to search for a character with code position 1D hexadecimal. (<CR> means ‘press Enter’).

You’d probably spot it anyway, since such a character would show up as ‘^]’ in a colour that’s different from normal text.
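
Outside vim, the same hunt can be sketched in a few lines of Python. This is only an illustration: find_invalid_xml_chars and the sample string are invented for the example, and the pattern covers the characters XML 1.0 forbids below U+0020 plus U+FFFE and U+FFFF (it does not try to handle surrogates).

```python
import re

# Characters XML 1.0 does not allow: everything below U+0020 except tab,
# line feed and carriage return, plus U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def find_invalid_xml_chars(text: str):
    """Return (line, column, code point) for each invalid character, 1-based."""
    hits = []
    # split("\n") rather than splitlines(): splitlines() treats U+001D
    # itself as a line boundary and would swallow the character we want.
    for line_number, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            code_point = "U+{:04X}".format(ord(match.group()))
            hits.append((line_number, match.start() + 1, code_point))
    return hits


sample = "<!-- a comment with a stray \x1d in it -->\n<p>\u00bb is fine</p>\n"
print(find_invalid_xml_chars(sample))
```

To run it against a real page, read the file as text in its declared encoding and pass the contents to the function. Any linked stylesheet or script that the parser might also pull in would need the same check.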

You probably haven’t. I think there’s a glitch somewhere else.

The C1 range is just reserved in the ISO 8859 series (and Unicode). No standardised meaning is assigned to these characters, unlike those in the C0 range.

Windows-1252 uses the C1 range for a number of useful characters (like dashes and curly quotes). There should be problems if you use Windows-1252 and declare the encoding as ISO 8859-1, since those characters are actually invalid (reserved) in the ISO encoding. But since this is so common (because people blithely use Windows software without knowing what they’re doing) browsers actually assume Windows-1252 when you declare ISO 8859-1. The W3C validator will warn you, though, e.g., if you accidentally use code position 151 for an em dash (U+2014).
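
The code position 151 case can be reproduced with Python's built-in codecs. A short sketch, using the codec names cp1252 and latin-1 (Python's names for Windows-1252 and ISO 8859-1):

```python
raw = bytes([151])  # code position 151 (0x97)

as_windows_1252 = raw.decode("cp1252")
as_iso_8859_1 = raw.decode("latin-1")

print("U+{:04X}".format(ord(as_windows_1252)))  # U+2014, the em dash
print("U+{:04X}".format(ord(as_iso_8859_1)))    # U+0097, a C1 control character
```

Same byte, two different characters, which is why a page saved as Windows-1252 but declared as ISO 8859-1 draws a validator warning.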

gary_turner · July 28, 2009, 11:19am

I do have vim but I’m not sure how I search for something unprintable… more importantly, I’m not sure how I could have created something unprintable… this is just a pure, static HTML file written in my text editor, which produces pages which have never made this problem before.
Could you have typed ^] (ctl-]), trying for a closing curly brace?

See Unicode character map. Click the character you want, then click “make html” to get the numeric entity, which is of course decimal. Or you could copy/paste the character directly into your file.

cheers,

gary

AutisticCuckoo · July 28, 2009, 11:53am

Yes, that works too (in this case). Although I think you mean ‘square bracket’ (‘]’) rather than ‘curly brace’ (‘}’).

Stomme_poes · July 28, 2009, 3:16pm

I did :set list and saw for sure there are only line-ends $.

I’m 100% sure the HTML file is clean.

Since I’m not a Windows user (except for testing IE in VirtualBox) and I don’t import from Word files or anything I don’t have to worry about 1252 chars here. : ) I always suspect them when people’s quotes become ?'s

In any case, I can’t go back and check my page again until my server gets unblocked. Thanks also for the vim help because I’ve only used ^v for search before, and that was always for actual strings : )

gary_turner · July 28, 2009, 4:04pm

I was thinking of the original entry. Perhaps, were she trying to type “}”, and hit <ctl> instead of <shift>. 'Twas just a thought.

cheers,

gary

AutisticCuckoo · July 28, 2009, 4:21pm

Ah. I see what you mean. I don’t know what Dutch keyboards look like (or if she’s even using one :)), but on an American keyboard that could explain things. (On a Swedish keyboard ‘]’ is AltGr+9 and ‘}’ is AltGr+0, so it wouldn’t quite apply.)

Stomme_poes · July 29, 2009, 7:18am

Lord, I have no clue if my keyboard is Dutch or not. Maybe not, I don’t have ë keys, but nobody in the office has those, and my alt key is worthless due to Linux hijacking it. Means I miss out on some GIMP and Inkscape commands too, even trying to use altgr : ( But like US keyboards I do have }] on the same key.

AutisticCuckoo · July 29, 2009, 7:36am

There’s a good entry about keyboard layouts on Wikipedia. It seems as if you’re not using a Dutch layout, and the article says they are uncommon and that Dutch users normally use the US layout.