I need a validator that catches high ascii chars

My office used to use the CSE validator for our web pages, but since our new template is mostly CF includes, CSE throws tons of errors on a perfectly good page, as it validates the source code.

We’ve switched to Tidy because it validates the rendered page, rather than the source, but it misses high ascii characters, and we’re getting a LOT of those now because much of our content starts off in Word 2007, which uses high ascii for things like curly quotes and apostrophes.

Some of the high ascii characters from Word appear as spaces in the code, so often you don’t see the problem until you view the page in a browser and see things like euro symbols scattered throughout the code.

Having to read every page to check for these is a bit tedious, especially when we have a 5000-word report, so I’m looking for a validator that validates the rendered page, but also looks for high ascii chars.
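
Just to illustrate what I mean by “catches high ascii chars”, here’s the kind of check I’m picturing, as a rough PHP sketch (not any real tool, and the URL is made up):

<?php
// Fetch the *rendered* page (after the includes have been processed)
// and report any bytes outside the plain 7-bit ASCII range.
$html = file_get_contents('http://www.example.com/somepage.cfm');

foreach (explode("\n", $html) as $num => $line) {
    if (preg_match_all('/[\x80-\xFF]/', $line, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as $hit) {
            // If the page is UTF-8, each curly quote shows up as 2-3 high bytes.
            printf("Line %d, col %d: high byte 0x%02X\n",
                   $num + 1, $hit[1] + 1, ord($hit[0]));
        }
    }
}
?>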

Any ideas?

Can I assume CF means coldfusion? If so, what is this, 1997?

As to CSE – I always thought that was a scam given the REAL validation services from the W3C are free.

So far as the import is concerned, it sounds like you have character set encoding differences to deal with – what character encoding are you deploying the websites in? Microsoft Turd tends to want everything in windows-1252, so just have whatever you are using to copy from Word translate windows-1252 to either UTF-8 or ISO-8859-1…

I’m not sure about coldfusion since I’ve never seen anyone actually use that for new code after 2002, but I imagine it must have a function similar to PHP’s iconv.

On one of the sites I maintain they cut/paste from Word all the time – I ended up giving the form accept-charset="windows-1252" and then running iconv('windows-1252', 'UTF-8', $test); on the input before dumping it into the database. I then reverse the process when they go to edit.
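
Stripped down it looks something like this (the field name is made up and I’ve left the actual database calls out):

<?php
// The form tag carries accept-charset="windows-1252", so the browser
// submits the pasted Word text as windows-1252.
$raw = $_POST['body'];

// Convert to UTF-8 before it goes anywhere near the database.
$utf8 = iconv('windows-1252', 'UTF-8', $raw);

// ... INSERT $utf8 into the database here ...

// When the record comes back up for editing, reverse the conversion so
// the windows-1252 form displays it correctly.
$forEdit = iconv('UTF-8', 'windows-1252', $utf8);
?>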

Though it really depends on how complex the site in question is and if you’re talking static pages or having a real CMS behind it.

Yes, contrary to not-so-popular belief, there are still new versions of ColdFusion coming out, and they’re indeed quite powerful. In fact, one of our long-time PHP people recently admitted that there wasn’t anything he could do in PHP that you couldn’t do (more quickly) in CF.

We use the W3C validator, but it doesn’t catch the high ascii characters. Some of the paid validators also let you batch-validate entire directories.

These are pages that grab contact info from a database, but no CMS. We paste content into Dreamweaver (yes, that’s still around too) and save it.

I’m about to prepare an email instructing people how to turn curly quotes off, but that will likely only stop a small percentage of them, since we get a lot of reports from outside the office.
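
In the meantime I’m tempted to just run a quick cleanup script over the files before they go up. Something like this rough sketch, assuming the pasted text ends up in the file as UTF-8 (the file name is made up):

<?php
// Swap the common Word "smart" punctuation for plain ASCII equivalents.
// If the files are actually saved as windows-1252, the byte values below
// would be different (or convert the file with iconv() first).
$file = 'report.html';
$html = file_get_contents($file);

$map = array(
    "\xE2\x80\x98" => "'",   // left single quote
    "\xE2\x80\x99" => "'",   // right single quote / apostrophe
    "\xE2\x80\x9C" => '"',   // left double quote
    "\xE2\x80\x9D" => '"',   // right double quote
    "\xE2\x80\x93" => '-',   // en dash
    "\xE2\x80\x94" => '--',  // em dash
    "\xE2\x80\xA6" => '...', // ellipsis
);

file_put_contents($file, strtr($html, $map));
?>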

Hello,

CSE HTML Validator Std/Pro can definitely check the source of a rendered page. There are several ways to do it. One of the easiest may be to use the integrated web browser which lets you browse the web while checking the source. You can also use server/path mappings to make this easier.

There’s also the Batch Wizard in the pro edition, which can check local files or make HTTP requests to get the “processed/rendered” page source (after the includes have been processed).

With a little more work, you could also copy and paste the source from a browser to CSE HTML Validator’s editor then hit F6 to validate.

As to CSE HTML Validator being a “scam” - I ask is Windows a scam because Linux is free? Of course not… it has features and other qualities that Linux just doesn’t have. The same applies to CSE HTML Validator vs other programs and services (more at http://www.htmlvalidator.com/htmlval/whycseisbetter.html).

I hope this helps.

Frankly, you shouldn’t just ignore the garbage getting pumped into your code from Word. Even today (!) I run into sites that, on my Linux machine, render characters as ?s when there’s ZERO reason for that.

Every document created in Windows that I have ever viewed in vi was filled with ^M and “smart quotes” and other crap.

HTMLvalidator may have given you some good ideas on how to validate the rendered source, but I support Jason’s idea of cleaning the garbage out before saving to the DB in the first place, if possible. On Windows machines the web page may seem fine. On other machines, those same browsers may not bother trying to ignore or change the funky chars.

(btw I think it’s nice when a software vendor can help a member out with particular software, thanks for posting HTMLvalidator and welcome to SitePoint. Just be sure not to cross the spam line or the mods will hunt you down and keep a trophy! : )

Metrolyrics claims to send out its pages as UTF-8. This is what I get:

Quién dice cuál es la bandera que sobre un pedazo de tierra ondea
quién decide quién tiene el poder de limitar mi caminar dime quién

Someone’s getting that text from a Windows program, likely.
I hit the back button and find another site with actually readable information.

I might take another look at CSE, but the version we currently have is pretty useless on our CF pages, where everything except the actual content (header, footer, navigation, etc.) is pulled in via includes. If it’s been updated to work on the rendered page, it might be a good option.

Great. I’m glad you might take another look (please try the latest trial version - v10.00 pro)… and I’ll monitor this thread in case you have any questions. You are also welcome to post any questions on CSE HTML Validator’s own support forum.

This should provide additional information on working with pages with scripts:
http://www.htmlvalidator.com/htmlval/v100/docs/validate_documents_that_use_server_side_scripting.htm

Thanks. I just installed the trial for 10, and posted a question on your forum on how to ignore CF tags. I still think we might have to use a second validator, because our pages basically look like this:

<cfinclude template="/includes/hep/header.cfm" />
<div id="pagecontents">
content here
</div>
<cfinclude template="/includes/moddate.txt" />
<cfinclude template="/includes/hep/footer.cfm" />

So I don’t think I could validate anything but the content in batch mode. Still, that might be OK, since the includes don’t ever change and I know they’re valid.

Now if I could run CSE in batch mode using the browser, that would rock.

Thanks. I saw your message there and have replied to it. For anyone interested in following, here is the link:
http://www.htmlvalidator.com/CSEForum/viewtopic.php?f=1&t=1081

As I believe you’ve found out already, you can check the pages directly (without running them through the server) and have CSE HTML Validator ignore all the “cf*” elements (by disabling a flag in the program) - or you could run the pages through the server (using http links) to get the HTML output and have it check that.