Robots.txt - exclude any URL that contains “/node/”


How do I tell crawlers / bots not to index any URL that contains the /node/ pattern? The following rule has been in place since day one, but I've noticed that Google has still indexed a lot of URLs containing /node/, e.g. www.mywebsite.com/node/123/32

Disallow: /node/

Is there a directive that says "do not index any URL that contains /node/"? Should I write something like the following instead: Disallow: /node/*

Thanks

It's very simple with this rule:
User-agent: *
Disallow: /page-url

Write the URL of the page which you don't want Google to crawl.
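For example, to block two specific pages (the paths here are hypothetical, just to show the shape of the file):

User-agent: *
Disallow: /old-page.html
Disallow: /drafts/notes.html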

Hi Steev,

That's just not possible; there are far too many pages. Only a pattern-based restriction will do.

Regards

Disallow: /node/ is the correct syntax to disallow crawling of a directory called “node”. Disallow: /node/* is incorrect. You can find full details of how to use a robots.txt file here.
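Note that a Disallow line only takes effect as part of a record that begins with a User-agent line, so a minimal complete robots.txt for this case would be (a sketch, assuming the rule should apply to all bots):

User-agent: *
Disallow: /node/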

I have read somewhere - I no longer have any idea where - that Google likes to be addressed personally, so you write one version for Google and one for everybody else. e.g.

# For Googlebot
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /scripts/

# For all bots
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/

I've no idea how reliable that information is, but for the sake of a few bytes, it does no harm to include it. :-) It works for me.

That will work for any URL of the form example.com/node/whatever, but it won’t work for example.com/something/node/whatever … is that a problem?

I’ve not seen anything like that. What you can do is to give Googlebot (or any other bot) different restrictions by specifying those first and then doing a catch-all * for ‘all others’.
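For instance (a sketch; the extra path for Googlebot is hypothetical, just to illustrate the ordering):

# Stricter rules for Googlebot, listed first
User-agent: Googlebot
Disallow: /node/
Disallow: /scripts/

# Catch-all for every other bot
User-agent: *
Disallow: /scripts/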

The real problem is that despite having:
Disallow: /node/
in robots.txt, Google has indexed pages under this path, e.g. www.mywebsite.com/node/123/32

/node/ is not a physical directory; it is how Drupal 6 serves its content. I guess this is my problem: node is not a directory, merely part of the URLs Drupal generates for the content.

You can write the line using wildcards, like the one below:

Disallow: /node/*

Thanks topgrade, but that is a syntax error. Please check:
Disallow: /node/*
at:
http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php

You can, but don’t expect it to work. If you follow the link I gave above, you’ll find:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

I'm still not seeing any problem. Google couldn't care less whether it's a physical directory or not: all it will do is pattern-match the URL, and if the URL matches the pattern then bingo, it won't send Googlebot down that road.
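You can see that prefix matching in action with Python's standard-library robots.txt parser (a quick sketch, using the placeholder domain from this thread):

from urllib import robotparser

# Feed the parser the same rules the site uses.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /node/",
])

# Matching is purely on the URL path prefix; whether /node/ exists as a
# real directory on the server never enters into it.
print(rp.can_fetch("*", "http://www.mywebsite.com/node/123/32"))  # False
print(rp.can_fetch("*", "http://www.mywebsite.com/about"))        # True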

Hi Stevie,

I checked Google with:
site:www.mywebsite.com inurl:node
and it gives me hundreds of results, e.g. http://www.mywebsite.com/node/193

Does this mean Google is not respecting robots.txt? My robots.txt has existed since day one.
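One thing worth ruling out is a problem with the file as actually served. A quick sketch to fetch the live robots.txt and test it against one of the indexed URLs (again using the placeholder domain):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.mywebsite.com/robots.txt")
rp.read()  # fetches and parses the live file

# If this prints True, the file being served is not actually blocking
# the URL, e.g. because of a typo or a misplaced User-agent line.
print(rp.can_fetch("Googlebot", "http://www.mywebsite.com/node/193"))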