Unicode in HTML5 Input Text Pattern

Enver · April 28, 2013, 10:31am

This pattern works for the text input on ‘Name’ and ‘Surname’ fields in my HTML5 contact form (despite the ampersand throwing an error on the W3C Validator):

$clientsidepattern = "^[^0-9\\.,\\/#!?£$%\\^&\\*;:{}=\\_`~()]+$";

Nevertheless I would like the pattern to match as closely as possible the server-side validation in unicode which goes like this:

$serversidepattern = '/^[\'\p{L}\p{Pd}\p{Zs}]{1,35}+$/u';

However, both this

$clientsidepattern = "^[^\\u0000-\\u0080]+$";

and this

$clientsidepattern = "^[^\\u0021-\\u0026\\u0028-\\u002C\\u002E-\\u0060\\u007B-\\u007F\\u0080-\\u009F\\u00A1-\\u00BF\\u00D7\\u00F7\\u2014-\\u2018\\u2020-\\u206F]+$";

fail to work as patterns on the input.

I understand that javascript does not offer equivalent support for the PHP unicode properties but I would have thought that explicit code blocks should be recognisable. Am I missing something about the spec for javascript pattern validation in HTML5? Or are my regex solutions just wrong?
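A note for later readers: the limitation described above applied to the JavaScript engines of the time. Engines that implement ECMAScript 2018 accept Unicode property escapes such as \p{L} when the regex carries the u flag, so the server-side character class can be mirrored almost directly. The sketch below is an assumption about how that might look, not part of the original form: namePattern and isValidName are hypothetical names, and the possessive + from the PCRE version is dropped because JavaScript has no possessive quantifiers. Whether a given browser compiles the pattern attribute itself with Unicode support depends on the browser version, so that is worth testing before relying on it.

```javascript
// Hypothetical mirror of the server-side class: apostrophe, letters,
// dash punctuation and space separators, 1 to 35 code points.
// Needs an engine with ES2018 Unicode property escapes; the "u" flag is mandatory.
const namePattern = /^['\p{L}\p{Pd}\p{Zs}]{1,35}$/u;

// Hypothetical helper wrapping the test.
function isValidName(value) {
  return namePattern.test(value);
}

console.log(isValidName("O'Brien-Smith"));       // true
console.log(isValidName("Zo\u00EB M\u00FCller")); // true
console.log(isValidName("R2D2"));                // false
```

With the u flag the {1,35} quantifier counts code points rather than UTF-16 code units, which is closer to what the PHP /u modifier does.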

Enver · April 28, 2013, 2:12pm

Talking to myself here but on the off-chance that someone with a similar problem finds this post: strictly speaking, there is nothing wrong with the syntax of any of the patterns in my original post above. However, the last version invalidates the entire range of uppercase Latin characters (\u0041-\u005A) which of course is why any entry beginning with a bog-standard ASCII capital letter would not be accepted. The corrected – and more useful – pattern is:

$clientsidepattern = "^[^\\u0021-\\u0026\\u0028-\\u002C\\u002E-\\u0040\\u005B-\\u0060\\u007B-\\u007F\\u0080-\\u009F\\u00A1-\\u00BF\\u00D7\\u00F7\\u2014-\\u2018\\u2020-\\u206F]+$";

The second last version, which I picked up from another reputable site in a fit of desperation and frustration, will reject everything in the Basic Latin alphabet including C0 Controls. Heaven only knows why someone thought that was a good idea.

More fool me for even considering it, I guess.
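For anyone who wants to check these patterns outside a browser, here is a minimal sketch in JavaScript. It assumes the browser requires the pattern to match the whole value, which the hypothetical helper matchesPattern imitates by wrapping the source in ^(?: ... )$. The sample names are made up for illustration.

```javascript
// Hypothetical helper: imitates how the pattern attribute is applied,
// i.e. the regex has to match the entire value.
function matchesPattern(source, value) {
  return new RegExp("^(?:" + source + ")$").test(value);
}

// Rejects any value containing a character in U+0000 to U+0080.
const nonAscii = "^[^\\u0000-\\u0080]+$";

// Original blacklist: the range U+002E to U+0060 swallows A-Z.
const flawed = "^[^\\u0021-\\u0026\\u0028-\\u002C\\u002E-\\u0060\\u007B-\\u007F" +
  "\\u0080-\\u009F\\u00A1-\\u00BF\\u00D7\\u00F7\\u2014-\\u2018\\u2020-\\u206F]+$";

// Corrected blacklist: the range is split so that U+0041 to U+005A is allowed.
const corrected = "^[^\\u0021-\\u0026\\u0028-\\u002C\\u002E-\\u0040\\u005B-\\u0060" +
  "\\u007B-\\u007F\\u0080-\\u009F\\u00A1-\\u00BF\\u00D7\\u00F7\\u2014-\\u2018\\u2020-\\u206F]+$";

const samples = ["anne", "Anne", "O'Brien-Smith", "Zo\u00EB", "R2D2"];

samples.forEach(function (name) {
  console.log(name, {
    nonAscii: matchesPattern(nonAscii, name),
    flawed: matchesPattern(flawed, name),
    corrected: matchesPattern(corrected, name)
  });
});
```

In this sketch "Anne" fails the flawed pattern and passes the corrected one, while "R2D2" fails both because the digits are still excluded.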

Michael_Morris1 · April 28, 2013, 2:21pm

Why are you sending your js inline? The validator shouldn’t care if the js is in a separate file.

Enver · April 28, 2013, 3:17pm

The HTML of the form is written into the page with PHP. I arranged it that way in order to keep all of the validation code together in a single file.

I suppose I could include the pattern directly as a string but it is used twice so setting it as a variable at the top of the code block makes it easier to edit. It also seems cleaner. This is the printed input tag:

<input type="text" name="Form[First]" id="given" required="required" maxlength="35" pattern="^[^\u0021-\u0026\u0028-\u002C\u002E-\u0040\u005B-\u0060\u007B-\u007F\u0080-\u009F\u00A1-\u00BF\u00D7\u00F7\u2014-\u2018\u2020-\u206F]+$" value="" placeholder="Maximum of 35 characters" title="numbers, punctuation and symbols are not accepted" />

Apart from the occasional patch for legacy versions of the popular IE browser, I tend to avoid using javascript altogether. This ‘under-the-hood’ feature of javascript validation on form submissions is one of the things I admire most about HTML5.

That said, would you still advise including the pattern as a separate javascript file? It seems somewhat unnecessary to me although I am open to suggestions.

Enver · April 29, 2013, 8:58am

I’ve been thinking about this overnight because I used to mark up all my pages in XHTML 1.1 Strict where it is an absolute requirement to separate code from HTML mark-up. Doing so would certainly solve the problem of that first example above where the ampersand throws an error on the W3C Validator. However, I am confused now: being something of a newbie to HTML5, I didn’t think it was necessary to treat the pattern as a line of code even though I know it is a javascript flavour of regex. So I checked the specification where I still see no requirement to handle the pattern separately from the HTML. Of course I could still be missing something here – Heaven knows, I’m no expert.

All the same, the unicode solution has removed the issue of that pesky ampersand validation error so I’ll leave the pattern inline until I learn otherwise.

Thanks for getting me to think about the matter. Perhaps I should not be using PHP to write the form to the page and I will certainly look into that.
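On the ampersand: the validator's complaint about the first pattern most likely concerns HTML attribute syntax rather than the regex, since a raw & in an attribute value is expected to be written as &amp;. The browser decodes it back to & before the pattern is compiled. Below is a sketch of the escaping step, in JavaScript for illustration only (the form in this thread is generated with PHP, where htmlspecialchars() does the same job). escapeAttribute is a hypothetical helper name.

```javascript
// Hypothetical helper: escapes the characters that are significant
// inside a double-quoted HTML attribute value.
function escapeAttribute(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// The first client-side pattern from this thread, as a JavaScript string.
const firstPattern = "^[^0-9\\.,\\/#!?£$%\\^&\\*;:{}=\\_`~()]+$";

// The attribute now contains &amp; instead of a bare ampersand.
const tag = '<input type="text" pattern="' + escapeAttribute(firstPattern) + '">';
console.log(tag);
```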

Michael_Morris1 · April 29, 2013, 1:24pm

I thought you were using js because pattern is a rarely used html 5 attribute in my experience. Are you sure it is supported across all browsers you plan to support? If so you should be good - though note the following:

XHTML is a myth. Don’t use, don’t worry about it.

Specifically - IE never did, and is never going to, support XHTML. It borks up if you send XHTML files under their correct header - “application/xhtml+xml”. A ‘work around’ is to send the file as text/html, but this loses all the functionality of xhtml.

A lot of site authors pretend to use content type metatags in their html to “correct” the issue. They need to stop and think a moment - if we are parsing a tag, haven’t we already started the parser? The answer to this is - ‘yes’. That is, a content-type meta tag is ABSOLUTELY POINTLESS as a way of switching parsers, because by the time the browser parses the tag the page rendering engine has loaded, and it’s too late to go back.

So, unless you plan on blocking IE from your site entirely, XHTML is not an option.

pattern is an html 5 attribute to input. It hasn’t been and will not be added to XHTML. XHTML is a stillborn corpse - it isn’t going anywhere and there’s no point in using it.

Also, <input type="text"> is html. The /> business is invalid in html - always has been. It may be that which is throwing your W3C validation error, not your pattern, depending on whether you’re trying to validate as html 5. You’ll never get something with a pattern attribute to validate as xhtml.

Enver · April 29, 2013, 3:49pm

Most of the popular desktop browsers support these HTML5 features even on Windows XP – as do most of the handheld gadgets I’ve tested, which thankfully seem to accept HTML5 and CSS media queries as the default these days (or here in Europe at least). However, I am aware that HTML5 pattern validation certainly is not supported in IE<10 so I will be writing a javascript patch for older versions of IE and other browsers such as Safari (for Windows XP) since these will be common enough for another year or two. Also, it will be relatively easy to adapt the javascript unicode pattern to PHP. This way all three stages of the validation process will have a desirable match – with the exception of the in-built HTML5 email validation algorithm of course.
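A fallback along those lines might look like the sketch below. It is only an assumption about how such a patch could be structured, not the code actually used on this form: supportsPattern and checkPattern are hypothetical names, the document object is passed in so the functions can be exercised outside a browser, and var is used throughout for the sake of older engines.

```javascript
// Feature test: browsers with native support expose a "pattern"
// property on input elements.
function supportsPattern(doc) {
  return "pattern" in doc.createElement("input");
}

// Manual check for browsers without native support.
function checkPattern(field) {
  var source = field.getAttribute("pattern");
  // No pattern, or an empty value: nothing to check here
  // (emptiness is the job of the "required" attribute).
  if (!source || field.value === "") {
    return true;
  }
  // The pattern has to match the whole value, so wrap it in anchors.
  return new RegExp("^(?:" + source + ")$").test(field.value);
}
```

In a submit handler, checkPattern would be called on each input only when supportsPattern(document) returns false, with the submission cancelled if any field fails.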

I don’t plan to use XHTML: I only used XHTML 1.1 Strict when I started out because the validation tools were quite unforgiving by comparison to other flavours of HTML, which helped introduce a little discipline into my mark-up as a beginner. For example, I will continue to use quote marks on attribute values and other such XHTML requirements for consistency unless I find clear reasons not to. An exception, which you mention, is content-type metatags where I would now use:

<meta charset="UTF-8">

for instance, as opposed to:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Yep, I was guilty of that as you can see from the example immediately above: I think it might have been as late as 2010 before I actually read an in-depth exposition on the pointlessness of this exercise.

I understand this much and I should stress that, while I used to mark up pages in XHTML, I no longer do so. These days, I use the HTML5 doctype:

<!DOCTYPE html>

Thanks for pointing that out although the error was definitely due to the ampersand in the first pattern above: the page was declared as HTML5 and validated according to whatever currently passes for a standard. However, I have tried the mark-up without the ‘closed’ tag as you suggest and I’m pleased to note that it validates just as well. The parser would have accepted the XHTML mark-up for the purpose of backwards compatibility, which is another feature of HTML5. I will see about editing the mark-up for all ‘closed’ tags when I have had a careful look at the latest spec, which is, after all, an experimental feature.