.htaccess Vocabulary

SitePoint Members,
What is the difference between
RewriteCond %{SERVER_PORT} 443
and
RewriteCond %{SERVER_PORT} ^443$

From what I can understand from
http://httpd.apache.org/docs/2.2/rewrite/intro.html

the second line means the condition matches when the server receives a request with a string that begins with 443 and ends with 443.

Is that right?

Does the first line mean the condition matches when the server receives a request with a string that merely contains 443?

Thanks,

Chris

.htaccess uses Perl-flavored regex syntax, so yes, the ^ signifies “beginning” and the $ signifies “ending”.

I’ve seen it used with URLs before but never with SERVER_PORT so I don’t know if it would be valid or even required for that line.
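To illustrate the difference (a sketch; the port values are just examples):

# unanchored: matches any value containing 443 - so 443, but also 8443 or 4430
RewriteCond %{SERVER_PORT} 443

# anchored: matches only when the value is exactly 443
RewriteCond %{SERVER_PORT} ^443$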

The second line, with the ^ and $, I got from step 3 on this page

and also from the second section on this page
http://www.seoworkers.com/seo-articles-tutorials/robots-and-https.html

Do you mean you haven’t seen that line at all or by itself?

If that is sorted out, which .txt file is it applied to in the next line
RewriteRule ^robots.txt$ robots_ssl.txt

I can’t see what RewriteRule ^robots.txt$ robots_ssl.txt is saying.

Thanks

Just because I’ve never seen it with SERVER_PORT doesn’t mean it’s incorrect; most likely it’s valid.

RewriteRule ^robots.txt$ robots_ssl.txt
is saying
(implied) if the previous condition(s) is/are met,
rewrite the (URL) string “robots.txt” that begins with “r” and ends with “t” - that is, it won’t match frobots.txt, robots.txte, or obots.tx, etc. -
to robots_ssl.txt

Presumably Apache would then serve a different file for requests that met the condition(s):

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots.txt$ robots_ssl.txt

You wrote what it’s saying but it doesn’t look like you finished your thought.
It looks like it’s saying: If a request string that has the sequence 443 (a.k.a. https) is received then instead of sending the requestor to robots.txt send the requestor to robots_ssl\.txt. Is that right?

Yes, except with the ^$ it would be “is”, not “has”, assuming the same regex syntax also works with SERVER_PORT.

If a request string is the sequence 443 (a.k.a. https) then instead of sending the requestor to robots.txt send the requestor to robots_ssl\.txt. Is that right?

If that’s correct, then instead of sending the requestor to robots.txt can you incur a 404 somehow (something tells me changing robots_ssl\.txt to 404\.html won’t work) in hopes of https requests getting a 404 response?

I found this page
http://stackoverflow.com/questions/21813039/filtering-pages-to-redirect-to-404-in-htaccess
Would you code it like this?

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots.txt$ [R=404,L]

Alan,

%{SERVER_PORT} is (IMHO) the preferred way to determine whether the connection is simply http (80) or https (443). The other way is to test the %{HTTPS} variable for “on” or NULL, but that can give strange results on some servers (meaning you need to be careful with the logic of matching “on” (without the quotes, of course) vs matching a NULL value OR non-existent variable).
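For comparison, the %{HTTPS} variant would look something like this (a sketch, subject to the server-to-server caveat above):

# alternative: test the HTTPS variable instead of the port
RewriteCond %{HTTPS} ^on$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]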

C77,

Mittineague is quite correct about the use of ^ and $ and I’d advise you to use them for your 443 check.

As for your last post, it will generate a 500 error because the syntax does not provide the redirection.
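For what it’s worth, mod_rewrite also accepts a lone dash as the substitution, meaning “no change”, so you can send the 404 status without naming any target file; a sketch:

RewriteCond %{SERVER_PORT} ^443$
# "-" means no substitution; note the escaped dot - an unescaped . matches any single character
RewriteRule ^robots\.txt$ - [R=404,L]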

Regards,

DK

Thanks for the info. I certainly don’t want a 500 error. I just can’t figure out this last part, which would respond to a 443 request with a 404 error instead of a robots file. What code do you use to incur a 404 error? A 404.html page is just a page, not an actual 404 error. Can you incur a 404 error in Apache? Maybe that’s it: send the requestor to a non-existent page. How about nohttpsatall.html?

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ nohttpsatall\.html

C77,

Don’t you have a 404 script you use?

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ 404.php [R=404,L]

Regards,

DK

The problem I’m having is google thinking I have https duplicates of all my http pages. A lot of programmers are saying

step 3 on this page

and the second section on this page
http://www.seoworkers.com/seo-articles-tutorials/robots-and-https.html

is the solution. What it does, for my site, is turn the 70 https pages that don’t exist on my site but exist in google’s head into 18 https search results complaining about robots.txt use, the most ridiculous search result being this

Home
https://xyz.com/
A description for this result is not available because of this site’s robots.txt – learn more.

where google still shows I have https duplicates, and it changed the title of my home page to “Home”, and for each of the https duplicates it has the same description about robots.txt use.

I’m wondering if this

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ nohttpsatall\.html

will solve the problem, since I have no page nohttpsatall.html and so it would seem to lead all https requests for robots.txt to a 404 error.

Thanks

C77,

The tutorial linked in my signature has example code for both secure and non-secure redirections. Personally, I’d use both (remember to ONLY redirect your scripts, not your support pages) but double-check by using a PHP script in the header of the secure pages to redirect (via a header() statement) if not requested or redirected by mod_rewrite.

As for your secure pages, I’d list them (using ^(secure1|secure2|secure3|…)\.php$ ) to ensure that you’re not redirecting non-secure pages to %{HTTPS}, too.
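A sketch of that shape (the page names are placeholders, not real files):

# hypothetical: force the listed script pages - and only those - onto https
RewriteCond %{SERVER_PORT} !^443$
RewriteRule ^(secure1|secure2|secure3)\.php$ https://%{HTTP_HOST}/$1.php [R=301,L]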

Regards,

DK

Maybe I’m wrong, but you can’t tell the server something is a 404 error; the server tells you something is a 404 error.

If you have the Gettysburg Address on your 404.html error page and you redirect three different pages to the 404.html page, is google going to “think” those three pages don’t exist any longer, or is it going to think those three pages lead to the same page that contains the Gettysburg Address found on 404.html? I think it’s going to add the Gettysburg Address to its survey of your site: your site is about X and the Gettysburg Address.

Once I manage to get the 404 error triggered, then it can be redirected to 404.html. If you skip triggering the 404 error and go directly to the 404.html page, google is not going to know the page no longer exists. Don’t forget, 404.html (or .php) is not a special coding function; the file name of my 404 page on one of my sites is znewa.html and it works fine. It’s necessary to trigger the 404 error in order to tell google the page no longer exists. Sending google directly to 404.html does not tell google the page doesn’t exist; what it does is create multiple addresses to your 404.html page.

So it looks like

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ doesnotexist\.html

is the way, using the non-existent file/page doesnotexist.html to trigger the 404 error. Isn’t this what normally happens? A page is removed; each visitor to the removed page triggers the 404 error; google is a visitor, so google triggers the 404 error, sees it, removes the content, and removes the address of that content from its servers’ memory.
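As an aside, mod_rewrite also has a G flag that answers 410 Gone instead of 404, which is sometimes suggested when content has been removed for good; a sketch:

RewriteCond %{SERVER_PORT} ^443$
# G = respond "410 Gone"; the dash again means no substitution
RewriteRule ^robots\.txt$ - [G]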

You can totally tell the server something is a 404 error… Or, more accurately, you can tell the server to send a 404 response code. You already had that above in post #8 with R=404, and dklynn showed it again in post #11.

I was hoping I was on the right track with [R=404,L]

but DK’s post after said
“As for your last post, it will generate a 500 error …”

So is
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots.txt$ [R=404,L]

at least syntactically correct?

And, barring syntax errors, does it say, ‘If a request string is the sequence 443 (a.k.a. https) then instead of sending the requestor to robots.txt trigger a 404 error’?

In post #11 DK wrote
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ 404.php [R=404,L]

but what does it say with 404.php placed before [R=404,L]?

Does it say, if there’s no 404.php, trigger a 404 error?

If there’s no 404.php then there’s no need for the [R=404,L] code right after, because if there’s no 404.php then 404.php is no different from doesnotexist\.html - it will automatically trigger a 404 error without [R=404,L].

So if that’s true then

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots.txt$ [R=404,L]

would seem to be right - trigger a 404 error for all https requests for robots.txt.

Thanks

DK was right about that, because rewrite rules require a substitution.

Not yet. You still have to rewrite to somewhere.

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots.txt$ 404.php [R=404,L]

Yes. However, the phrase “trigger a 404 error” is vague. We’re at the point where we need to be clear about what’s actually happening.

A 404 error is nothing more than any HTTP response with a 404 status code. It’s conceivable, for example, that you could send a 404 status code and still send the content of the resource that you’re claiming is not found. When we use R=404, we achieve two important things: We set the response status to 404, and we prevent Apache from sending the content of robots.txt, because Apache won’t send the resource’s content if it thinks it’s redirecting.

Now, I admit, ideally that would be the end of it. A 404 response status and a blank response body is exactly what we want. But to get Apache to do those things, we had to trick it by telling it that we’re redirecting. So now we have to give it somewhere to redirect to. So we pick 404.php. That file can exist or not exist, probably doesn’t matter much.

It says to rewrite from robots.txt to 404.php. That, in conjunction with the R=404, should make the response start with a status line like this:

HTTP/1.1 404 Not Found

(With a non-redirect code like 404, Apache doesn’t actually emit a Location header; it just answers with the 404 status.)

A 404 page is just a page, and when it gets served it will return “found” (200) headers unless you send the “not found” headers with it.

Having a 404 page is a good idea because instead of the visitor seeing a generic error screen you can have a TOC or search or something helpful for them on it and so increase the possibility that they’ll stay at your site.
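For completeness, the usual way to get both - the friendly page and the real “not found” status - is Apache’s ErrorDocument directive; a sketch, assuming the page lives at the document root:

# serve /404.html as the body of every 404 response, status code included
ErrorDocument 404 /404.html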

As I was writing that, it occurred to me that a solution you proposed earlier would probably work better. Just rewrite to a non-existent page.

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ nohttpsatall.html

Assuming nohttpsatall.html doesn’t actually exist, this should do exactly what you want.
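Spelled out with comments (same two lines as above; the annotations are mine):

# when the request came in over https (port 443)...
RewriteCond %{SERVER_PORT} ^443$
# ...internally rewrite robots.txt to a file that doesn't exist,
# so Apache itself answers with 404 Not Found
RewriteRule ^robots\.txt$ nohttpsatall.html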

That’s why I was saying that when 404.php doesn’t exist, it’s the same as when doesnotexist.html is put in the Apache htaccess code. Although this, and the idea of redirecting to a custom 404 page, created confusion. It’s not that I don’t want or don’t have a custom 404 page; it’s that sending (redirecting) a robot (the google robot) directly to a custom 404 page that exists doesn’t trigger the actual 404 status, and so is useless in trying to get google to stop indexing https pages that don’t exist. All it does is get google to read your custom 404 page.

So with
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ 404.html [R=404,L]

assuming 404.html exists, this code says ‘If a request string that has the sequence 443 (https) is received then instead of sending the requestor to robots.txt send the requestor to the custom 404 page AND be sure to trigger a 404 error status ([R=404,L])’.

And the same can be done with
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ nohttpsatall\.html

when nohttpsatall.html does not exist.

You say the second way works better? I’m worried that if I use the second way, google in its Webmaster Tools will be hounding me forever to fix nohttpsatall.html. What advantage do you see with the second way?