I've been doing a lot of research on 301 (permanent) redirects, and use of the rel=canonical tag, and I'm a little concerned my site's not doing a good job, and hence susceptible to duplicate content problems.
Check out this 301-redirect checker: http://www.ragepank.com/redirect-check/ and type in my site's URL: jobstr.com
All 22 of those URLs are returning a 200 (OK) message, when (unless I'm mistaken), only ONE of them should, right? Is this a problem?
But also weird is that of those 22 URLs, nearly all of them just link to "Page Not Found" pages on Jobstr, see: http://jobstr.com/index.htm
If they're going to Page Not Found pages, why don't those redirect checkers return a 404 code???
It looks like the CMS you are using is returning a 200 and a 'page' for a 404, rather than returning the actual 404 HTTP status code.
Try http://jobstr.com/timtest.html, for example, and use a tool like Firebug or Chrome's web dev tools to watch the traffic.
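If you would rather check from a script than from browser dev tools, here is a minimal sketch in Python, using only the standard library. The function name fetch_status and the User-Agent string are made up for illustration. It uses http.client because that module does not follow redirects, so a 301 shows up as a 301 instead of as the status of the page it points to.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Return (status_code, location_header) for a single GET request.

    Redirects are NOT followed, so a 301 or 302 is reported as-is,
    together with the Location header the server sent (or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the request target from the path and query string.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target, headers={"User-Agent": "status-check-sketch"})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_status("http://jobstr.com/timtest.html") returns the raw status code the server sends for that URL. A page that displays "Page Not Found" but comes back with 200 is the behaviour described above.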
The 404 response code is the server's response to a "file not found" error. However, if the CMS is redirecting to its own 404 script, then the server is finding that script (the CMS's index.php - surprise!) and serving it with the obligatory 200 response code. In other words, the CMS is handling the "file not found" error internally and the server is operating correctly.
I disagree. If there is no content interesting to humans and/or search engines on a given URI (i.e., it is a custom 404 page), the server must also return an HTTP status 404 to reflect this. The 404 status response is the main way to indicate to search engines that the content that could (maybe) be found previously on a URI is not there, or no longer there. Without the 404, a search engine would just keep every page it ever found in its index, regardless of whether the content is still there or not (okay, there is a concept called "soft 404", which does use HTTP status 200, but I don't want to get into that.)
The fact that there is a file to serve that content, and that the file was found, is completely irrelevant. So no, I don't think the server is working correctly; it must send 404 Not Found headers on those pages.
Also, as stated in the RFC, 404 is simply "not found", not "file not found", which also suggests it's not about finding files but about finding resources.
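To make the point concrete, here is a rough sketch of a front controller that serves its own "Page Not Found" page and still sends a 404 status. It is written as a small Python WSGI app purely for illustration. It is not the CMS discussed in this thread, and PAGES, NOT_FOUND_PAGE and app are hypothetical names.

```python
# Hypothetical content store standing in for the pages a CMS knows about.
PAGES = {
    "/": b"<h1>Home</h1>",
    "/about": b"<h1>About</h1>",
}

# The custom error page shown to visitors.
NOT_FOUND_PAGE = b"<h1>Page Not Found</h1>"


def app(environ, start_response):
    """WSGI front controller: every request lands here, like a CMS index script."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is not None:
        status = "200 OK"
    else:
        # The script itself was found and ran, but the requested resource
        # does not exist, so the status line has to say so.
        status = "404 Not Found"
        body = NOT_FOUND_PAGE
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

The visitor sees the same custom page either way; only the status line changes, and that is the part a crawler reads. To try it locally, the app can be served with wsgiref.simple_server.make_server from the standard library.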
The server may be operating correctly, according to the precise technical specification, but the system as a whole is not. If the CMS is handling 404s internally, such that requesting a page that doesn't exist returns a 200 A-OK, the CMS is not operating correctly. Any page on a live domain that does not exist should return a 404 error. A key part of a search robot's role is to be able to identify where there are live pages, and if it gets a 200 A-OK for absolutely any old rubbish on a domain then it's harder to figure out which pages are still there and which are either dead or never existed in the first place. That's why robots sometimes request completely fictitious (and very unlikely) pages, to check whether they can trust your site to correctly return a 404 error.
Thanks for the replies. The above bolded part is particularly concerning...I wasn't sure if this problem I originally posted about was (a) just a weird server quirk, or (b) materially detrimental to my site's SE profile. I'd gotten vague suggestions that it could be hurting me in that respect, but not anything quantifiable...this suggestion that SE's will ping you with fictitious pages to check the integrity of your 404 code...how big a deal is that?
It isn't a suggestion, it does definitely happen.
The reason they do that is so that they know whether they can trust 200 A-OK responses to be genuine pages. If they can then that's all well and good, and they know that any page giving a 200 A-OK is really there and any dead links or expired pages will come up as a 404. On the other hand, if you have a site that returns a 200 A-OK for any URL, the search engine needs to put a bit more effort into making sure that those URLs are returning genuine pages and not pseudo-error pages.
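You can run the same kind of check against your own site. The sketch below is only an illustration of the idea, not how any search engine actually implements it. The names get_status, random_probe_url and answers_200_for_nonsense are made up, and it only looks at the status of the first response, without following redirects.

```python
import http.client
import uuid
from urllib.parse import urlsplit


def get_status(url, timeout=10.0):
    """Return the raw status code for a GET request (redirects not followed)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()


def random_probe_url(base_url):
    """Build a URL that almost certainly does not exist on the site."""
    return base_url.rstrip("/") + "/" + uuid.uuid4().hex + ".html"


def answers_200_for_nonsense(base_url, fetch=get_status):
    """True if a made-up URL comes back as 200.

    If this is True, a 200 from the site cannot be taken at face value.
    The fetch parameter lets a different status checker be swapped in.
    """
    return fetch(random_probe_url(base_url)) == 200
```

If answers_200_for_nonsense("http://jobstr.com") returns True, the site is answering 200 for a page that was never there, which is exactly the situation described above.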
I don't know what effect that will have on your ranking, but it is unlikely to be good. The two possibilities are (i) that your pseudo-error page will make it into the search results (which of course wouldn't happen with a genuine 404), and I have definitely seen this happen, and (ii) that the extra checking needed to make sure each page is genuine would eat into the amount of crawl time that the search engine spends on your site, reducing the amount of time left for actually crawling and indexing the content.