I have a (test) domain name I use for development. Perhaps I should be using a sub-domain name. But I don’t.
I know of no published link to this domain anywhere. Sometimes, while developing a new website, I forget to put a meta NOFOLLOW element in the head of the new pages I’m generating.
Interestingly, the content of those new and not-yet-published websites often ends up indexed on Google and turns up in keyword searches. How did Google ever know about that domain in the first place, when I never (ever) published a link to it?
I suppose it only has to be spidered once and then it’s a known entity forever. But I am still curious to know how they ever found out about my test domain in the first place. Is NOFOLLOW the only way to have a test domain? Or are sub-domains a better way to go?
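For the "I forget to put the meta element in" part, a small check run against generated pages can flag the ones that are missing it. Below is a minimal sketch in Python using only the standard library. The function name `robots_directives` and the sample HTML are made up for illustration. Note that `noindex` is the directive that asks search engines to keep a page out of the index, while `nofollow` only asks them not to follow the links on it, so the sketch reports whichever directives are present.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # content looks like "noindex, nofollow"; split it into directives
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Hypothetical generated page, used only to show the call.
sample = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(sample)))
```

A page for which the returned set lacks `noindex` is one where the element was forgotten or misconfigured.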
I was not able to find a definitive answer on this but have seen the same activity when I build websites. These sites are not linked to anything (that I know of or that I can find) and yet they appear in Google search results despite the claim that Googlebot finds pages based on other links.
Here is an interesting article that may point you in the right direction.
Ah. The Google toolbar. I have never used it. But some of my customers probably have. So when I sent my customer an email that said: “look at my test domain and tell me what you think” … then their browser made Google aware of that URL. Once Google is aware of it, they know about it forever.
Sometimes WHOIS records are replicated, so eventually Google’s crawler finds the linked A records and follows them.
If you don’t want them to be indexed or visited, the only sure way to prevent access is to throw up an .htaccess username/password prompt. Most hosts offer this through the control panel, under a name like “protected folder”.
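To confirm the password prompt is really in place, you can request a page without credentials and see whether the server answers with HTTP 401. A minimal sketch in Python, standard library only. The function name `is_password_protected` is hypothetical, and it assumes the protection answers unauthenticated requests with a 401 status, which is how HTTP Basic authentication normally behaves.

```python
import urllib.error
import urllib.request


def is_password_protected(url: str, timeout: float = 5.0) -> bool:
    """Return True if an unauthenticated request to `url` gets HTTP 401."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            # The page was served without any credentials.
            return False
    except urllib.error.HTTPError as err:
        # 401 means the server is asking for a username/password.
        return err.code == 401
```

Calling it with the address of the test site should return `True` once the protected folder is set up. Any other HTTP error (403, 404, 500) returns `False`, and connection failures raise `urllib.error.URLError` rather than being hidden.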
This (robots.txt) is the best solution. NOFOLLOW in the header is too hard to control with generated pages that rely on config files or database values…at least in the chaos of development time. Robots.txt tends to be stable throughout all of that.
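For anyone who wants to check what a robots.txt actually tells crawlers, Python's standard `urllib.robotparser` module can evaluate the rules locally. A minimal sketch follows. The two-line file is the usual "block everything" form, while the helper name `is_blocked` and the `test.example` URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` disallows `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical test-domain URL, used only to show the call.
print(is_blocked(ROBOTS_TXT, "Googlebot", "http://test.example/new-page.html"))
```

The file has to be served from the root of the test domain as /robots.txt, and it only works for crawlers that choose to honor it.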