PHP For HTTPS Problem

Sitepoint Members,
Search engines view the HTTP and HTTPS versions of your site as two different websites, which creates duplicate content and lowers your search engine ranking. I found that my site appears to Google to have duplicate content in the form of HTTPS pages. Why, I don't know, because I don't have SSL installed on my account.

The most often written about way to deal with this is to serve a different robots.txt for HTTPS (a sketch of the pattern follows the links below):
http://blog.leonardchallis.com/seo/serve-a-different-robots-txt-for-https/
http://www.seoworkers.com/seo-articles-tutorials/robots-and-https.html
http://www.seosandwitch.com/2012/08/http-and-https-in-seo-what-to-do.html
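Reading those articles, the pattern seems to boil down to an .htaccess rewrite that answers requests for /robots.txt on port 443 with a separate, disallow-everything file; a rough sketch (robots_ssl.txt is just the file name those posts use):

# When a request arrives on port 443 (HTTPS), serve robots_ssl.txt
# instead of the normal robots.txt. robots_ssl.txt would contain:
#   User-agent: *
#   Disallow: /
RewriteEngine On
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]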

Another site said to use canonical links on every preferred page (a sketch of that idea follows the link):
http://www.creare.co.uk/http-vs-https-duplicate-content
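If I understand it, the canonical-link approach amounts to every page emitting a tag in its <head> that points at the plain-HTTP URL as the preferred version; a minimal sketch, with www.example.com standing in for the real domain:

<?php
// Declare the plain-HTTP URL as the canonical (preferred) version of this page.
// www.example.com is a placeholder for the real domain.
$canonical = 'http://www.example.com' . $_SERVER['REQUEST_URI'];
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '" />' . "\n";
?>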

and the same site also gave this PHP code:
<?php
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
    echo '<meta name="robots" content="noindex,follow" />' . "\n";
}
?>

I guess it goes in the <head>, just as the noindex/nofollow meta tags do.
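Something like this is how I picture it sitting in the page (the surrounding markup is just an example):

<!DOCTYPE html>
<html>
<head>
<title>Example page</title>
<?php
// Only emitted when the request came in over HTTPS.
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
    echo '<meta name="robots" content="noindex,follow" />' . "\n";
}
?>
</head>
<body>
<!-- page content -->
</body>
</html>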

Is there anything I should worry about with this code?

Thanks,

Chris

I put that code in. In view source, nothing from the code shows. Should it show?

It should only show if:

isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on'

Otherwise nothing will be displayed and you must assume one or the other is off or both are not on.
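(If in doubt, a quick way to see what the server actually reports is a throwaway test page along these lines; this is only a debugging sketch:)

<?php
// Dump what the server reports for this request.
// On plain HTTP, $_SERVER['HTTPS'] is usually not set at all
// (a few server setups set it to 'off' instead of leaving it unset).
var_dump(isset($_SERVER['HTTPS']) ? $_SERVER['HTTPS'] : null);
var_dump($_SERVER['SERVER_PORT']);
?>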

Are you saying the code should show in view source if HTTPS is turned on, and that if HTTPS is not turned on nothing will show in view source? After that I was completely lost at "one or the other is off". What are the two things that can be off: HTTPS and what else?

Yes, both need to be on to display the code; otherwise nothing is displayed.

What do you mean by your use of “both”? Secure http (httpS) and regular http?

The code you posted will display this in your page:

<meta name="robots" content="noindex,follow" />

If the URL is HTTPS.
So, if you go to your site in HTTPS mode: https://www.yoursite.com
and you check the source, you should see this:

<meta name="robots" content="noindex,follow" />

If you browse to your site without HTTPS: http://www.yoursite.com
The code shouldn’t appear.

It tells Google not to index the current page; the "follow" part means it may still follow the links inside the page.
This should do the trick; it will just take a couple of days or weeks for Google to remove the pages from its index.

I see. I don’t think the PHP will work because I don’t have SSL installed.
What I have in the .htaccess is:
#Options +FollowSymLinks
Options +SymLinksIfOwnerMatch
RewriteCond %{SERVER_PORT} 443
RewriteRule ^(.*)$ http://mysite\.com/404.html [R=301,L]
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl\.txt [L]

but this doesn't seem to work for my site, because for months now Google has listed 20 or more HTTPS pages for my non-HTTPS site.

When you go to your site with HTTPS in the URL with your browser, does it work or not?

I'm really not an expert in .htaccess, but it seems that it would redirect all HTTPS traffic to your 404 page. Is that it? With a 301 (permanent) redirect, which is… not that good, I guess?
Who is hosting your site? Can't you ask them to disable HTTPS? …
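(For what it's worth, the more usual alternative to sending HTTPS visitors to a 404 page is to 301-redirect each HTTPS URL to its plain-HTTP equivalent; a rough .htaccess sketch, with www.example.com standing in for the real host:)

RewriteEngine On
# If the request came in on port 443 (HTTPS)...
RewriteCond %{SERVER_PORT} ^443$
# ...permanently redirect it to the same path over plain HTTP.
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]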

When I type in https for a page on my site, I get a "This Connection is Untrusted" warning from Firefox. If I choose "I Understand the Risks", it takes me to the full non-HTTPS address. Since the site I'm working on is not the main site of my account (though it is the largest by a hundredfold), it takes me to http://thesiteimworkingonnow.mymainsiteofmyaccount.com.

The code
RewriteCond %{SERVER_PORT} 443
RewriteRule ^(.*)$ http://mysite\.com/404.html [R=301,L]

was written by my webhost.

To me it means anything coming through port 443 (443 is the SSL/HTTPS port) is sent to a 404 page.

The rest of the code

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl\.txt [L]

was also written by my webhost.

To me the ^ means everything else (e.g. coming through port 80) is sent, not to my regular robots.txt, but to that second robots_ssl.txt file, which is

User-agent: *
Disallow: /

Which is how it's done on this page.

but why would I want to send robots coming through port 80 to

User-agent: *
Disallow: /

(disallow all robots)? Wouldn't that stop all robots coming through the non-HTTPS ports (anything but 443), such as port 80, from crawling my site?

All I'm trying to do is stop robots from coming through the HTTPS port, because I don't want them reporting that I have HTTPS pages. If they can't get through port 443, they have nothing to report. I would think the second two lines of code should be removed.
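(A quick aside on the pattern itself: in mod_rewrite the condition is a regular expression, and ^443$ is anchored to match exactly the string "443", so both RewriteCond lines test for port 443, the HTTPS port, rather than for "everything else". A tiny PHP check illustrates the anchors:)

<?php
// ^ anchors the start and $ anchors the end, so ^443$ matches only "443".
var_dump(preg_match('/^443$/', '443')); // int(1) - matches
var_dump(preg_match('/^443$/', '80'));  // int(0) - no match
?>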

Well, if you are redirected to a URL without HTTPS when you go to an HTTPS address, then the same will happen to Google. It's probably just that Google hasn't cleaned its index yet. You could try to remove the pages yourself with Google Webmaster Tools.

Unfortunately the HTTPS pages don't exist, so there is nothing to remove from my site.

It doesn’t matter if it “exists” or not. If it’s in Google’s indexes, it existed somehow. The procedure to remove a page from Google’s indexes with webmaster tools is exactly for pages that don’t exist anymore. Did you try it?

Google Webmaster Tools won't take anything but HTTP. In Webmaster Tools, choose Index and then Remove URLs; where it says "Enter the URL that you'd like to remove (case-sensitive)", if you enter, for example, https://abc, what comes back is http://mysite.com/https://abc. Google's Webmaster Tools has been useless for this problem.

Well, I never tested it with https so I’ll take your word for it. That’s weird.
Anyway, if you can't access the HTTPS pages, Google should remove them from its index eventually. Maybe you could try to ask on an SEO forum. I think your problem isn't related to PHP anymore.

Where is the SEO forum? I couldn’t find it, unless you’re talking about CMS stuff. My site is not run on any sort of program.

There are a lot of SEO resources online (SEO means Search Engine Optimization). There are forums dedicated solely to SEO; not on SitePoint, but searching Google for "SEO forum" will give you a bunch of links.

I thought you meant an SEO forum on SitePoint. I'm surprised they don't have one.