Just because I’ve never seen it with SERVER_PORT doesn’t mean it’s incorrect; it most likely is.
RewriteRule ^robots.txt$ robots_ssl.txt
is saying
(implied) if the previous condition(s) is/are met,
rewrite the (URL) string “robots.txt” - anchored by ^ at the start and $ at the end, so it won’t match frobots.txt, robots.txte, obots.tx, etc. -
to robots_ssl.txt
Presumably Apache would then serve a different file for requests that met the condition(s).
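Put together with the port check discussed later in the thread, the complete fragment would look something like this (a sketch; robots_ssl.txt is assumed to exist alongside robots.txt):

```apache
RewriteEngine On
# Only when the request came in on the https port (443)...
RewriteCond %{SERVER_PORT} ^443$
# ...internally rewrite robots.txt to robots_ssl.txt.
# No [R] flag, so the requestor still sees /robots.txt as the address.
RewriteRule ^robots\.txt$ robots_ssl.txt [L]
```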
You wrote what it’s saying but it doesn’t look like you finished your thought.
It looks like it’s saying: if a request arrives on port 443 (a.k.a. https), then instead of sending the requestor to robots.txt, send the requestor to robots_ssl.txt. Is that right?
If that’s correct, then instead of sending the requestor to robots.txt, can you somehow incur a 404 (something tells me changing robots_ssl\.txt to 404\.html won’t work) in hopes of https requests getting a 404 response?
%{SERVER_PORT} is (IMHO) the preferred way to determine whether the connection is simply http (80) or https (443). The other way is to test the %{HTTPS} variable for “on” or NULL, but that can give strange results on some servers (meaning you need to be careful with the logic of matching “on” (without the quotes, of course) vs matching a NULL value OR non-existent variable).
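A sketch of the two approaches side by side (only one would be used in practice; the second form is commented out here because, as noted above, its behavior varies by server):

```apache
# Preferred: test the port number directly.
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

# Alternative: test the HTTPS variable, which is "on" for https
# requests and unset (NULL) otherwise on most - but not all - servers.
# RewriteCond %{HTTPS} ^on$
# RewriteRule ^robots\.txt$ robots_ssl.txt [L]
```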
C77,
Mittinaegue is quite correct about the use of ^ and $ and I’d advise you to use them for your 443 check.
As for your last post, it will generate a 500 error because the syntax does not provide the redirection.
Thanks for the info. I certainly don’t want a 500 error. I just can’t figure out this last part that would respond to a 443 request with a 404 error instead of a robots file. What code do you use to incur a 404 error? A 404.html page is just a page, not an actual 404 error. Can you incur a 404 error in Apache? Maybe that’s it - send the requestor to a nonexistent page; how about nohttpsatall.html?
is the solution. What it does, for my site, is turn the 70 https pages that don’t exist on my site but exist in google’s head into 18 https search results complaining about robots.txt use, the most ridiculous search result being this
Home https://xyzcom/
A description for this result is not available because of this site’s robots.txt – learn more.
where google still shows I have https duplicates and has changed the title of my home page to “Home”, and for each of the https duplicates it shows the same description about robots.txt use.
The tutorial linked in my signature has example code for both secure and non-secure redirections. Personally, I’d use both (remember to ONLY redirect your scripts, not your support pages), but double-check by using a PHP script in the header of the secure pages to redirect (via a header() statement) if not requested or redirected by mod_rewrite.
As for your secure pages, I’d list them (using ^(secure1|secure2|secure3|…)\.php$ ) to ensure that you’re not redirecting non-secure pages to https, too.
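As a sketch, with secure1/secure2/secure3 standing in for your actual script names and www.example.com for your domain (both are placeholders, not values from this thread):

```apache
# Force https for the listed scripts only...
RewriteCond %{SERVER_PORT} ^80$
RewriteRule ^(secure1|secure2|secure3)\.php$ https://www.example.com/$1.php [R=301,L]

# ...and send any other page requested over https back to http.
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{REQUEST_URI} !^/(secure1|secure2|secure3)\.php$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```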
Maybe I’m wrong, but you can’t tell the server something is a 404 error, the server tells you something is a 404 error.
If you have the Gettysburg Address on your 404.html error page and you redirect three different pages to the 404.html page, is google going to “think” those three pages no longer exist, or is it going to think those three pages lead to the same page containing the Gettysburg Address found on 404.html? I think it’s going to add the Gettysburg Address to its survey of your site - your site is about X and the Gettysburg Address.
Once I manage to get the 404 error triggered, then it can be redirected to 404.html. If you skip triggering the 404 error and go directly to the 404.html page, google is not going to know the page no longer exists. Don’t forget, 404.html (or .php) is not a special coding function. The file name of my 404 page on one site of mine is znewa.html and it works fine. It’s necessary to trigger the 404 error in order to tell google the page no longer exists; sending google directly to 404.html does not tell google the page doesn’t exist - what it does is create multiple addresses for your 404.html page.
is the way, using the nonexistent file/page doesnotexist.html to trigger the 404 error. Isn’t this what normally happens? A page is removed, each visitor to the removed page triggers the 404 error, google is a visitor, google triggers the 404 error, google sees the 404 error, and it removes the content - and the address that pointed to it - from its servers’ memory.
You can totally tell the server something is a 404 error… Or, more accurately, you can tell the server to send a 404 response code. You already had that above in post #8 — R=404 — and dklynn showed it again in post #11.
I was hoping I was on the right track with [R=404,L]
but DKs post after said
“As for your last post, it will generate a 500 error …”
So is
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots.txt$ [R=404,L]
at least syntactically correct?
And barring syntax errors, does it say, ‘If a request arrives on port 443 (a.k.a. https), then instead of sending the requestor to robots.txt, trigger a 404 error’?
In post #11 DK wrote
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ 404.php [R=404,L]
but what does it say with 404.php placed before [R=404,L]?
Does it say if there’s no 404.php trigger a 404 error?
If there’s no 404.php, then there’s no need for the [R=404,L] code right after, because if there’s no 404.php then 404.php is no different from doesnotexist.html - it will automatically trigger a 404 error without [R=404,L].
Yes. However, the phrase “trigger a 404 error” is vague. We’re at the point where we need to be clear about what’s actually happening.
A 404 error is nothing more than any HTTP response with a 404 status code. It’s conceivable, for example, that you could send a 404 status code and still send the content of the resource that you’re claiming is not found. When we use R=404, we achieve two important things: We set the response status to 404, and we prevent Apache from sending the content of robots.txt, because Apache won’t send the resource’s content if it thinks it’s redirecting.
Now, I admit, ideally that would be the end of it. A 404 response status and a blank response body is exactly what we want. But to get Apache to do those things, we had to trick it by telling it that we’re redirecting. So now we have to give it somewhere to redirect to. So we pick 404.php. That file can exist or not exist, probably doesn’t matter much.
It says to rewrite from robots.txt to 404.php. That, in conjunction with the R=404, should make the response headers look something like this:
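(An illustrative sketch, not captured output; the exact header set varies by server version and configuration.)

```http
HTTP/1.1 404 Not Found
Date: (date of the response)
Server: Apache
Content-Type: text/html; charset=iso-8859-1
```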
A 404 page is a page, and when it gets served it will return “found” (200 OK) headers - unless you send the “not found” headers with it.
Having a 404 page is a good idea because instead of the visitor seeing a generic error screen you can have a TOC or search or something helpful for them on it and so increase the possibility that they’ll stay at your site.
That’s why I was saying when 404.php doesn’t exist it’s the same as when doesnotexist.html is put in the apache htaccess code. Although this and the idea to redirect to a custom 404 page created confusion. It’s not that I don’t want or don’t have a custom 404 page, it’s that sending (redirecting) a robot (google robot) directly to a custom 404 page that exists doesn’t trigger the actual 404 status and so is useless in trying to get google to stop indexing https pages that don’t exist. All it does is get google to read your custom 404 page.
So with
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ 404.html [R=404,L]
assuming 404.html exists, this code says ‘If a request arrives on port 443 (https), then instead of sending the requestor to robots.txt, send the requestor to the custom 404 page AND be sure to set a 404 error status ([R=404,L])’.
And the same can be done with
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ nohttpsatall\.html
When nohttpsatall.html does not exist.
You say the second way works better? I’m worried that if I use the second way, google in its Webmaster Tools will be hounding me forever to fix nohttpsatall.html. What advantage do you see in the second way?