Suggestions on robots.txt

What should I put in the robots.txt file?

Some say to leave it empty; others say to put things in it.

Does it even serve a purpose anymore?

https://www.google.com/search?q=what+to+use+robots.txt+for&ie=utf-8&oe=utf-8

The robots.txt file defines how a search engine spider like Googlebot should interact with the pages and files of your web site. If there are files and directories you do not want indexed by search engines, you can use a robots.txt file to define where the robots should not go.

A common example of this is test folders. Keep in mind, though, that search engines can choose to ignore robots.txt.
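For example, a minimal robots.txt that asks every well-behaved crawler to stay out of a test area might look something like this (the /test/ folder name is just an illustration, not a required name):

User-agent: *
Disallow: /test/

Anything under /test/ is then off limits to crawlers that respect the file.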

I use it.

It serves a purpose for legitimate search engines like Google. But there are search engines that ignore robots.txt.

I’ve got a robots.txt file that has the following:

User-agent: *
Disallow: /{folder to ignore}/

User-agent: SemrushBot
Disallow: /

User-agent: nlcrawler
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: FeedDemon
Disallow: /

User-agent: Awasu
Disallow: /

HTH,

:slight_smile:

@RyanReese and @WolfShade,

The point I was trying to make is this: if I want spiders to crawl everything in my public_html folder, do I need anything in the robots.txt file?

Do I have to tell everyone, “Crawl everything in the document root”?

And looking at things from the other way, is there anything which I would NOT want people to crawl? (I have a directory outside of the web root where I store things like passwords and config files…)

If you want every bot to crawl everything, then you don’t need a robots.txt file.

If you have a staging area (or developer sub-folder off the root) that shouldn’t be available to the public (a password-protected folder, or behind a secure login), then you want to include that in a robots.txt file as a “Disallow”. You don’t have to list every folder under that folder; the rule is automatically recursive, in a sense.
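As a rough sketch (the folder name is just a placeholder), a single Disallow entry for the staging folder covers everything beneath it:

User-agent: *
# Blocks /staging/, /staging/v2/, /staging/v2/anything.php, and so on
Disallow: /staging/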

HTH,

:slight_smile:

No :slight_smile: .

Still not getting what to allow and disallow.

When spiders crawl my website, they can only see the name of the file or folder, right? (Or can they see the contents?)

Should I block access to things like my “css” directory? Or my “images” directory?

To me, it seems the only thing you would really want indexed are finished pages (e.g. index.php, account.php, some-article.php, faq.php, etc), right?

Basically a spider starts at some kind of root home page (home.html; home.php; home.cfm, etc.) and follows every link, recursively, from that main page. Unless, of course, it’s a legit search engine and robots.txt prevents certain links from being followed. And, no, they don’t just get the filename, they get contents, too (else the meta tag would be pretty much useless.)

[quote=“mikey_w, post:7, topic:190097”]
Should I block access to things like my “css” directory? Or my “images” directory?
[/quote]

No. Images can be spidered for Google’s “images” section. CSS I’m not so sure about. But why bother trying to block that?

:slight_smile:

Sounds like you are saying spiders can see all of the HTML, but I was asking about the file contents (e.g. PHP code)…

I sure as hell would hope they can’t read my PHP, otherwise I’d lose all security!

[quote=“WolfShade, post:8, topic:190097, full:true”]

I’m big on security and trying to make sure I am not exposing PHP code or configuration settings or anything that would allow a hacker to do bad things to any of my websites.

I work in ColdFusion. AFAIK, the only way to get your PHP code would be to either A) hack the web server or FTP to the web server, or B) disable the PHP server portion so the web server (Apache, IIS) serves up the code. Spiders only see the on-the-fly generated HTML that the PHP (or CF) server sends.

:slight_smile:

So with that being said, then is there anything I wouldn’t want spiders to see in my web root?

Also, just out of curiosity, if a person put all of their files in a directory outside of the web root - except for an index.php file - then would that prevent people from seeing their code if the web server ever screwed up as you mention?

As I stated in the fifth post:
If you have a staging area (or developer sub-folder off the root) that shouldn’t be available to the public (a password-protected folder, or behind a secure login), then you want to include that in a robots.txt file as a “Disallow”.

Anything else should be left alone.

Keep in mind… this only applies to search engines that pay attention to robots.txt. Other search engines ignore robots.txt, so for them it won’t matter, anyway.

As far as docs out of webroot, I don’t know enough about how web servers work to answer that question. Hopefully, someone else who knows can answer that one. :smile:

:slight_smile:

Except that they may use it to decide where to look, hoping to find something the site doesn’t want them to see.

i.e. don’t use robots.txt for security purposes.
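To illustrate (the path here is made up), a rule like this keeps honest crawlers out, but it also advertises the location to anyone who bothers to read the file:

User-agent: *
# Well-behaved bots skip this, but a snooper now knows the folder exists
Disallow: /admin-backup/

Real protection for that kind of folder has to come from authentication or from keeping it outside the web root.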


So if I leave my robots.txt file blank then it won’t hurt my security, right?

That’s correct. Adding to it also won’t help your security. Following robots.txt is a rule… however, it’s more like a “guideline”, and your robots.txt can be ignored if people want.

I have a tremendous amount of junk on a particular site and make extensive use of robots.txt to tell bots and crawlers to omit particular folders, in the hope that they won’t show up under GWT → Crawl → Crawl errors.

Robots.txt is also used to disallow pages which are not Google Mobile Friendly in the hope that this will increase the mobile search ranking.

Basically, I use robots.txt as a way to keep some pages from being indexed, not for security, but because they would be of no interest to anyone who is not a guest actually at our physical location, where the links are posted.

Why have a search engine index something that isn’t useful? If it does get indexed anyway, it’s no big deal.

One thing I do include is an instruction telling the Wayback Machine to ignore the site, because the results are sometimes horrible.
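For what it’s worth, that kind of setup might look roughly like this; the page path is only a placeholder, and ia_archiver is the user agent the Wayback Machine has traditionally honored (assuming it still checks robots.txt at all):

User-agent: *
# Pages only useful to guests physically on site
Disallow: /guest-kiosk-page.php

# Ask the Wayback Machine not to archive the site
User-agent: ia_archiver
Disallow: /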


This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.