How do I tell crawlers/bots not to index any URL that contains the /node/ pattern? The following rule has been in place since day one, but I've noticed that Google has still indexed a lot of URLs containing /node/, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there anything that says “do not index any URL that contains /node/”? Should I write something like the following instead?
Disallow: /node/*
Disallow: /node/ is the correct syntax to disallow crawling of a directory called “node”. Disallow: /node/* is incorrect. You can find full details of how to use a robots.txt file here.
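If you want to sanity-check that behaviour, Python’s standard-library parser (urllib.robotparser) implements the original, wildcard-free spec and does plain prefix matching, so it makes a handy test bench. A minimal sketch, assuming the single rule from the question and the example URL www.mywebsite.com/node/123/32:

# Check robots.txt rules with Python's standard-library parser,
# which follows the original (no-wildcard) robots.txt spec.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /node/",   # plain prefix rule, as in the question
])

# Any URL whose path starts with /node/ is disallowed...
print(rp.can_fetch("Googlebot", "http://www.mywebsite.com/node/123/32"))  # False
# ...while other paths stay allowed.
print(rp.can_fetch("Googlebot", "http://www.mywebsite.com/about"))        # True

Swap the rule for Disallow: /node/* and this parser reports the same URL as allowed, because the asterisk is treated as a literal character rather than a wildcard.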
I have read somewhere (I no longer have any idea where) that Google likes to be addressed personally, so you write one version for Googlebot and one for everybody else, e.g.:
# For Googlebot
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /scripts/
# For all bots
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
I’ve no idea how reliable that information is, but for the sake of a few bytes, it does no harm to include it. It works for me.
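If anyone wants to verify the per-agent behaviour, the same standard-library parser can be pointed at that file: Googlebot matches its own record and every other bot falls back to the * record. A quick sketch (the host name and the test paths are made up purely for illustration):

# Feed the per-agent robots.txt above to urllib.robotparser and confirm
# that Googlebot and all other bots are kept out of the same directories.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /cgi-bin/",
    "Disallow: /scripts/",
    "",
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /scripts/",
])

# Googlebot is matched by its own record; other bots use the "*" record.
print(rp.can_fetch("Googlebot", "http://www.example.com/cgi-bin/test"))     # False
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/scripts/x.js"))  # False
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/index.html"))    # True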
I’ve not seen anything like that. What you can do is to give Googlebot (or any other bot) different restrictions by specifying those first and then doing a catch-all * for ‘all others’.
/node/ is not a physical directory; that is just how Drupal 6 exposes its content. I guess that is my problem: node is not a directory, merely part of the URLs Drupal generates for the content.
You can, but don’t expect it to work. If you follow the link I gave above, you’ll find:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
I’m still not seeing any problem. Google couldn’t care less whether it’s a physical directory or not: all it will do is pattern-match the URL, and if the URL matches the pattern then bingo, it will not send Googlebot down that road.