Robots.txt File

There is even a whole site about that file :smiley:
http://www.robotstxt.org/robotstxt.html


The important thing to remember is that not all search engine spiders respect the robots.txt file. Google does, I think Yahoo does, and most if not all of the top search engines will; however, there are entities on the internet (maybe a search engine, maybe just a script kiddie) that do NOT respect the robots.txt file and will spider everything possible.

V/r,

:slight_smile:

The robots.txt file is very important for SEO because with it we can restrict any page, file, image, etc. from being indexed by Google. Sometimes we do not want to show certain pages or files to search engines but do want to show them to users; then we can use the robots.txt file.

It helps us to save our site from spamming…

I hope you are satisfied with the answer; if so, please give me a THANKS or appreciate me so that I can help you… and other people…

Regards,

Singhi

Perhaps you could explain more clearly what you mean by that. It can help you to avoid issues caused by duplicate content on your site by preventing access to certain areas, but there are other ways to deal with that, such as using canonical links.

I’m just a bit confused, because your wording sounds as if using robots.txt can somehow protect your site from Spam, but that is in no way true.


I am really surprised: when somebody does not have knowledge, how can they question other people's answers? The robots.txt file does a simple job that prevents SPAM, but how? Let me explain…

Let's suppose you have copied content on your website, or you have a page similar to a competitor's site and copied their content as-is… If you do not disallow robots from indexing that page, then Google can flag your website as SPAM because you copied their content.

Second, if you are sure that you want to show that page to your users/visitors, then you have to use the robots.txt file and disallow that page from being indexed by Google. Your site will not be flagged as SPAM…

I hope @Technobear can understand this

I don’t know if TechnoBear understands, but I’m fairly sure I do.

What you’re suggesting is to use the robots.txt file in a way that violates Google’s policy and puts a site at risk of being removed from the SERPs.


So in other words, you have stolen content from another site to pass off as your own?

[quote=“rajivsinghi, post:8, topic:203889”]
If you do not disallow robots from indexing that page, then Google can flag your website as SPAM because you copied their content.
[/quote]So you use robots.txt in the hope that you can get away with using another site’s content? Apart from the fact that your stolen content will be of limited value if it’s not indexed, hiding it from Googlebot will not necessarily prevent it being found - and if Google discovers you are using copied content and trying to conceal the fact, your site will, quite rightly, be penalised.

SitePoint forums are a reputable resource, and we do not promote or condone such practices in any way. A quick look around here will find people posting for help, who have assumed they could by-pass Google’s guidelines, and are now desperately looking for help recovering from a penalty. So to anybody who may be thinking about taking this poor advice, I would say - don’t do it; it isn’t worth the risk.

[quote=“rajivsinghi, post:8, topic:203889”]
I am really surprised: when somebody does not have knowledge, how can they question other people's answers? The robots.txt file does a simple job that prevents SPAM, but how? Let me explain…
[/quote]My original confusion was caused by your incorrect use of the word Spam here. Spamming is the sending of unsolicited messages or irrelevant advertising. What you are referring to here is plagiarism - content theft - and the associated Google penalties.


Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit.

The robots.txt file is used to “tell” search engine robots which pages they should not visit and therefore should not index. It is important for SEO because your website can contain pages with service information or duplicated content that can hurt your SEO ratings, or pages that you don't want to show to users.
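
As a minimal sketch (the directory names here are hypothetical), a robots.txt file placed at the root of the site might look like this:

# Ask all robots to stay out of a private area and a duplicate print version
User-agent: *
Disallow: /admin/
Disallow: /print/

Crawlers only look for the file at the root of the domain (e.g. yourdomainname.com/robots.txt), so it won't be honoured anywhere else.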

Through robots.txt, Google decides how it will crawl a website or web page. The robots.txt file can allow or disallow Google from crawling particular web pages.

The robots.txt file is a set of (advisory) instructions for how a site may be crawled. It depends on the bots whether they follow the instructions or not.

It tells search engines which pages or directories can be crawled and which should not be crawled.

For example, you can see robots.txt of various popular sites https://www.facebook.com/robots.txt

or https://www.google.com/robots.txt

In robots.txt we can also include a list of sitemaps and feeds, so search engines get information about our links and our sitemap.
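
As a sketch of what that looks like (the URLs are placeholders), Sitemap directives take absolute URLs, can be listed more than once, and sit outside any User-agent group:

# Sitemap entries point crawlers at your sitemap files
Sitemap: http://www.yourdomainname.com/sitemap.xml
Sitemap: http://www.yourdomainname.com/feeds/news-sitemap.xml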

PURE SPAM. So first you write duplicate content (spam) and then try to defend that content by hiding it from the search engine's eyes (spam).

And you thought the SE (Google) would not read that content if you disallowed it in the robots.txt file?
Google has many ways to discover a page and can add it to the index and serve it in search results even if you disallow that page in the robots.txt file.
DO NOT use robots.txt if you really don't want to see a specific page in search results. The meta robots tag does a better job than the robots.txt file at controlling the indexing behaviour of a page.
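
For reference, a minimal sketch of the meta robots tag being recommended here; it goes in the page's own markup:

<!-- In the <head> of the page you want kept out of search results -->
<meta name="robots" content="noindex, follow">

Note the catch: if the same page is disallowed in robots.txt, crawlers may never fetch it and so never see the noindex instruction, which is exactly why the two shouldn't be combined.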

Again, I have to say I think this use of the term “Spam” is incorrect and confusing.

Short version: http://www.oxforddictionaries.com/definition/english/spam

Long version: https://en.wikipedia.org/wiki/Spamming

Actually, robots.txt does not help at all in SEO. I used robots.txt to save my resource bandwidth.

What is the general purpose of robots.txt?

To block bots from accessing your content in specific directories. It is a standard guideline that all polite bots, like Google, Bing, Yahoo, Baidu, etc., follow very strictly, while impolite bots, like dotbot (by Moz) or MJ12bot (by Majestic SEO), do not follow the robots.txt guidelines very strictly; you need to add their user-agent names in order to prevent them from crawling. Here is a tutorial that will talk more about impolite and polite bots. Hope @WolfShade likes it :smile:
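
A minimal sketch of what that looks like (dotbot and MJ12bot are the user-agent tokens those crawlers are commonly documented to use; whether a bot honours the rule is, again, up to the bot):

# Ask all robots to stay out of one directory
User-agent: *
Disallow: /private/

# Ask specific crawlers to stay away entirely
User-agent: dotbot
Disallow: /

User-agent: MJ12bot
Disallow: /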

What is robots.txt misused for?

  1. To prevent duplicate content - use the canonical link tag for duplicates instead (see the sketch after this list)
  2. To sculpt PageRank ("link juice") - many webmasters block directories, such as a demo directory, in robots.txt just to sculpt their PageRank
  3. To prevent indexing - use the noindex tag instead. @rajivsinghi please do not spread your unwanted knowledge here.
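
For the first point, a minimal sketch of a canonical link (the URLs are hypothetical); it goes in the <head> of the duplicate page:

<!-- Tell search engines which URL is the preferred version of this content -->
<link rel="canonical" href="http://www.yourdomainname.com/original-article/">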

I think it is important to note that crawlers have the ability to crawl all the site’s content.

Disallow is only a request not to crawl certain site content; it does not guarantee that content will be kept out of the index.

I have noticed that using the Google Webmaster Tools “Fetch as Google” option sometimes returns an incomplete page because I have disallowed certain folders.
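
That is expected behaviour when asset folders are blocked; a sketch (folder names hypothetical) of rules that would make a fetched page render incompletely:

# Googlebot can't fetch the CSS/JS it needs to render the page fully
User-agent: *
Disallow: /css/
Disallow: /js/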


Those are really cheap words you used about me, and I am really unhappy with your post. If you need any help, if you have any doubt, and anybody can advise you, then how can you blame that person?… This is a really bad thing…

I hope you can understand your mistake and will never do that again… Accept that if you do not have knowledge about something, people can help you; but if you blame them, then they will not…

Thanks

Can you please show me any example where you use the robots.txt file and disallow a page from search engines, yet they still know about the copied content… Don't misguide others; learning and teaching are two things that we need to accept…

Thanks

File: http://www.yourdomainname.com/robots.txt

# Remarks
#    User-agent: [the name of the robot the following rule applies to]
#    Disallow: [the URL path you want to block]
#    Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]

User-agent: *
Disallow: /_CACHE_JJ/
Disallow: /afiles/
Allow:    /afiles/images/
Disallow: /subs/
Disallow: /downloads/
Allow: /downloads/sp-e/test-jj/

Please also note that Google Webmaster Tools has a robots.txt testing option to ensure all the directives are valid.

Thanks for your help… Now anybody can understand this. In the remarks, read carefully: Disallow means "[the URL path you want to block]".

Thanks

I think I copied that from one of the numerous sites with detailed explanations of how to use a robots.txt file.

Personally, I think it is a bit confusing and should not mention "URL", which to me means the complete domain, path and/or web page.

I find it handy to be able to block a complete path but also be able to allow specific sub-paths under the blocked path.