Robots.txt - exclude any URL that contains “/node/”


How do I tell crawlers / bots not to index any URL that contains the /node/ pattern? The following rule has been in place since day one, but I've noticed that Google has still indexed a lot of URLs containing /node/, e.g. www.mywebsite.com/node/123/32

Disallow: /node/

Is there a directive that says "do not index any URL that contains /node/"? Should I write something like the following instead: Disallow: /node/*

Thanks

It's very simple with this rule:
User-agent: *
Disallow: /page-url

Write the URL of the page which you don't want Google to crawl.
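For example, to block two specific pages (the paths here are hypothetical, just to show the shape of the file):

User-agent: *
Disallow: /old-page.html
Disallow: /drafts/notes.html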

Hi Steev,

That's just not possible; there are far too many pages. Only a pattern-based restriction will do.

Regards

Disallow: /node/ is the correct syntax to disallow crawling of a directory called “node”. Disallow: /node/* is incorrect. You can find full details of how to use a robots.txt file here.
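Note that a Disallow line only takes effect as part of a record that begins with a User-agent line, so a minimal complete robots.txt for this case would be (a sketch, assuming the rule should apply to all bots):

User-agent: *
Disallow: /node/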

I have read somewhere - I no longer have any idea where - that Google likes to be addressed personally, so you write one version for Google and one for everybody else. e.g.

# For Googlebot
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /scripts/

# For all bots
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/

I've no idea how reliable that information is, but for the sake of a few bytes, it does no harm to include it. :-) It works for me.

That will work for any URL of the form example.com/node/whatever, but it won’t work for example.com/something/node/whatever … is that a problem?

I’ve not seen anything like that. What you can do is to give Googlebot (or any other bot) different restrictions by specifying those first and then doing a catch-all * for ‘all others’.
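For instance (a sketch; the extra path for Googlebot is hypothetical, just to illustrate the ordering):

# Stricter rules for Googlebot, listed first
User-agent: Googlebot
Disallow: /node/
Disallow: /scripts/

# Catch-all for every other bot
User-agent: *
Disallow: /scripts/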

The real problem is that despite having:
Disallow: /node/
in robots.txt, Google has indexed pages under this path, e.g. www.mywebsite.com/node/123/32

/node/ is not a physical directory; it is how Drupal 6 serves its content. I guess this is my problem: node is not a directory, merely part of the URLs Drupal generates for the content.

You can write the line using wildcards, like the one below:

Disallow: /node/*

Thanks topgrade, but that is a syntax error. Please check:
Disallow: /node/*
at:
http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php

You can, but don’t expect it to work. If you follow the link I gave above, you’ll find:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

I'm still not seeing any problem. Google couldn't care less whether it's a physical directory or not: all it will do is pattern-match the URL, and if the URL matches the pattern then bingo, it won't send Googlebot down that road.
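You can see that prefix matching in action with Python's standard-library robots.txt parser (a quick sketch, using the placeholder domain from this thread):

from urllib import robotparser

# Feed the parser the same rules the site uses.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /node/",
])

# Matching is purely on the URL path prefix; whether /node/ exists as a
# real directory on the server never enters into it.
print(rp.can_fetch("*", "http://www.mywebsite.com/node/123/32"))  # False
print(rp.can_fetch("*", "http://www.mywebsite.com/about"))        # True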

Hi Stevie,

I checked Google with:
site:www.mywebsite.com inurl:node
and it gives me hundreds of results, e.g. http://www.mywebsite.com/node/193

Does this mean Google is not respecting robots.txt? My robots.txt has existed since day one.
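One thing worth ruling out is a problem with the file as actually served. A quick sketch to fetch the live robots.txt and test it against one of the indexed URLs (again using the placeholder domain):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.mywebsite.com/robots.txt")
rp.read()  # fetches and parses the live file

# If this prints True, the file being served is not actually blocking
# the URL, e.g. because of a typo or a misplaced User-agent line.
print(rp.can_fetch("Googlebot", "http://www.mywebsite.com/node/193"))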