Anyone had this site start scraping theirs?
It seems like a bit of an issue: scraping, linking to, and taking content.
I have most of my site now linked via updowner.com
Surely Google should know about this and deindex the site. I did a quick search and the site still comes up in the results.
Does anyone have any advice on how to block bots that I don't want coming in, via robots.txt or .htaccess?
I don't have vast knowledge of scraping, but I have looked into it. If you have RSS feeds set up, it's very easy for scrapers to automatically pull your content onto their sites through those feeds. There are also specialist programs built to scrape information without using RSS feeds. I'm not aware of any way to block these apart from disabling the RSS feeds.
What I have seen is that most websites that build their content by scraping others are very poor quality and heavily focused on advertising.
For SEO purposes, Google generally tries to credit the site where the content originally appeared, so in most cases you don't need to worry much if others are ripping content from your site. If your content is unique and was published on your site first, you usually have little to worry about.
Make sure that pages where information may be scraped have a reference back to your site. That way it's free advertising and extended PR. See scraping as a pat on the back for you, i.e. for providing great content that others want to promote.
robots.txt is useless for blocking unwanted bots, as bots don't have to obey it and bad bots are the least likely to take any notice of it. If you know the user-agent string or IP address of a particular bot, you can block it via .htaccess.
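For anyone wanting a starting point, here is a rough sketch of what such a block could look like in .htaccess. It assumes Apache 2.4 with mod_rewrite and mod_authz_core enabled, and a host that allows these directives in .htaccess. The bot names "BadBot" and "EvilScraper" and the address 203.0.113.45 are made-up placeholders, so swap in whatever you actually see in your own access logs.

```apache
# Return 403 Forbidden to any request whose User-Agent contains
# one of these strings (case-insensitive). Placeholder names only.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "BadBot|EvilScraper" [NC]
RewriteRule ^ - [F]

# Deny a specific IP address while allowing everyone else.
# 203.0.113.45 is a documentation-range placeholder address.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
</RequireAll>
```

Bear in mind that a user-agent string is trivial to fake, so this only stops bots that identify themselves honestly. The IP rule is more reliable as long as the scraper keeps coming from the same address.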
I think I will set up one of those bot traps.
One other question: what size or number of lines should an .htaccess file be kept to in order to keep the site fast?