Bots and Bandwidth

Hello guys, please take a look at these stats and give me some advice.

|  | Unique visitors | Number of visits | Pages | Hits | Bandwidth |
| --- | --- | --- | --- | --- | --- |
| Viewed traffic * | 2,229 | 5,069 (2.27 visits/visitor) | 19,189 (3.78 pages/visit) | 56,236 (11.09 hits/visit) | 413.92 MB (83.61 KB/visit) |
| Not viewed traffic * |  |  | 19,803 | 25,698 | 142.91 MB |

And this:

Robots/Spiders visitors (Top 25) - Full list - Last visit
16 different robots*

| Robot | Hits | Bandwidth | Last visit |
| --- | --- | --- | --- |
| Unknown robot (identified by ‘bot*’) | 3,162+383 | 36.68 MB | 23 Jul 2012 - 07:29 |
| Unknown robot (identified by ‘robot’) | 1,795+43 | 12.11 MB | 23 Jul 2012 - 06:59 |
| Unknown robot (identified by ‘*bot’) | 917+489 | 24.00 MB | 23 Jul 2012 - 01:50 |
| Googlebot | 1,304+54 | 5.96 MB | 23 Jul 2012 - 07:26 |
| Unknown robot (identified by ‘spider’) | 961+84 | 6.03 MB | 23 Jul 2012 - 05:11 |
| Unknown robot (identified by empty user agent string) | 754+15 | 9.98 MB | 23 Jul 2012 - 07:17 |
| Unknown robot (identified by ‘crawl’) | 662+50 | 7.78 MB | 23 Jul 2012 - 07:29 |
| Unknown robot (identified by ‘checker’) | 425 | 11.65 MB | 19 Jul 2012 - 01:18 |
| MSNBot | 274+28 | 1.71 MB | 23 Jul 2012 - 06:24 |
| Unknown robot (identified by hit on ‘robots.txt’) | 0+263 | 80.57 KB | 23 Jul 2012 - 06:38 |
| Alexa (IA Archiver) | 144+46 | 3.23 MB | 22 Jul 2012 - 21:06 |
| Yahoo Slurp | 132+41 | 917.40 KB | 23 Jul 2012 - 04:55 |
| Voyager | 8 | 0 | 19 Jul 2012 - 07:48 |
| Voila | 4+3 | 29.59 KB | 20 Jul 2012 - 19:15 |
| MSNBot-media | 1+4 | 5.79 KB | 19 Jul 2012 - 19:07 |
| Netcraft | 1 | 24.83 KB | 06 Jul 2012 - 12:55 |

  • Robots shown here gave hits or traffic “not viewed” by visitors, so they are not included in other charts. Numbers after + are successful hits on “robots.txt” files.

I think this is too much bandwidth, so how can I minimize the bandwidth used by the bots?? Thanks, guys.

Hi,
A few tens of MB is not that much bandwidth. It also depends on the size of the files included in your page content, such as “.css” and “.js” files and images.
The bots can be from search engines, like Google, Alexa, etc.
Anyway, try searching the net for “prevent hotlinking”.
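The usual approach is something along these lines, a rough .htaccess sketch with mod_rewrite (assuming your domain is example.com and mod_rewrite is enabled; adjust the extensions to the files you actually serve):

```apache
# Refuse image requests whose Referer is neither empty nor this site.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```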

Really??
So this is normal…

I thought it was not normal… heheh… sorry for bothering you.

Hi

This is normal… in most cases you want to encourage regular visits by search robots. @MarPlo was correct that you want to look at ways to block hotlinking. You also want to ensure that your hosting ISP uses intrusion detection, most likely a SNORT setup, to filter unwanted traffic. This is normally done at your ISP’s firewall level.

You can find a way to filter unwanted hotlinking on an Apache host in dklynn’s tutorial on mod_rewrite.

Regards,
Steve

Encouraging visits by Google, Bing/Yahoo, and other search engines is one thing, but there are bots out there that you do not want sucking down your data transfer, such as Brandwatch (and, for me, Yandex, Baidu, and other non-U.S. bots). I’ve had bots suck down almost 2 GB of data in a single day. I block the ones I don’t want using .htaccess.
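The blocking rules look something like this (a minimal .htaccess sketch in Apache 2.2-style syntax; the user agent patterns are only examples, so check your raw logs for the exact strings each bot sends):

```apache
# Send a 403 to user agents matching unwanted bots (example patterns only).
SetEnvIfNoCase User-Agent "Yandex" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```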

There are other methods to attempt to stop scrapers from downloading your entire site. I don’t use them as of yet.

How often do the unwanted bots that you try to block change their agent string or their I.P.?

I block by user agent string and they haven’t changed. But then, these are “legitimate” bots (like Brandwatch), the kind that aren’t trying to hide what they are. The scrapers and other bots trying to harvest your content are not going to be stopped using .htaccess unless you can block their IP range.
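If you do know the range, the block itself is straightforward in Apache 2.2-style syntax (the CIDR below is just a documentation placeholder):

```apache
# Deny an entire network range (192.0.2.0/24 is a placeholder).
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
```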

About the only thing I can think of to stop a scraper bot is to store information about it in a database, such as its IP address, with a little logic to see how many pages it is accessing. I’ve had scrapers download upwards of 5 pages a second. While I’ve considered writing the code to prevent this, I haven’t done it yet. I don’t think it would be too hard, though it would require database access on every pageview, which is a small performance hit.
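A rough sketch of the idea in Python with SQLite (the table name, thresholds, and file path are all made up; real code would hook into your application’s front controller and reuse its existing database connection):

```python
import sqlite3
import time

WINDOW_SECONDS = 10   # how far back to count requests
MAX_REQUESTS = 20     # more than this per window looks like a scraper

def is_scraper(ip, db_path="hits.db"):
    """Record this hit, then report whether the IP exceeds the rate limit."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts REAL)")
    now = time.time()
    con.execute("INSERT INTO hits (ip, ts) VALUES (?, ?)", (ip, now))
    con.execute("DELETE FROM hits WHERE ts < ?", (now - WINDOW_SECONDS,))
    (count,) = con.execute(
        "SELECT COUNT(*) FROM hits WHERE ip = ?", (ip,)
    ).fetchone()
    con.commit()
    con.close()
    return count > MAX_REQUESTS

# In a page controller you might then serve a 403 (or a captcha)
# instead of the page whenever is_scraper(visitor_ip) is True.
```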

Thanks. It is generally not desirable to block entire IP ranges, so .htaccess isn’t the most comprehensive solution.

I prefer using my firewall’s intrusion detection to block high request rates from a single host/user agent, and even DoS attacks. I also use SNORT, and sure, sometimes I have to tune it or block some particularly troublesome IPs, but overall this is the easiest approach for what I do.

Thanks,

Steve

SS,

If your target market is within your country, there’s no sense in allowing hackers from China, Romania, etc., to even visit your website. Therefore, I’d disagree with your statement about blocking IP addresses. Yes, it’s too easy to proxy attacks, but blocking still forces another step on the hackers (bots).

That said, I agree that your firewall is the optimum solution. :tup:

Regards,

DK

Good point! However, how do you best know which IPs/proxy IPs to block without blocking legitimate traffic?

Steve,

You can’t (obviously). That’s where the “target market” comes in. Sitting here in NZ, many of my clients are marketing ONLY to NZ, so I can easily test for NZ IP addresses and block everyone else (if that’s what the client wants, of course). There aren’t many proxy servers in NZ, either, so that makes my task easier.
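As a rough illustration, with MaxMind’s mod_geoip module enabled server-side (so the GEOIP_COUNTRY_CODE variable is set for each request), the country test can be a couple of mod_rewrite lines in .htaccess; this is only a sketch:

```apache
# Refuse any request whose GeoIP country code is not NZ.
RewriteEngine On
RewriteCond %{ENV:GEOIP_COUNTRY_CODE} !^NZ$
RewriteRule ^ - [F]
```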

Back to your question, though: you can’t. It then becomes a trade-off for the client whether to block or allow bots. With the advantage going to the bots (changing user agent strings and using proxies), I don’t recommend blocking, ergo my support for your firewall.

Regards,

DK

Thanks DK!

Well, some large site operators do block entire IP ranges, and often all IP ranges associated with problem countries. This is often better done at the IP level using iptables or its equivalent. The main problem that large sites tend to have is scrapers trying to download the site’s entire content. Many scrapers operate from hosting/VPN IP ranges, and consequently these ranges are often candidates to be blocked. With large sites, the standard operating procedure is to block on detection. The only thing that would concern the admin of a large site is whether it is more efficient to block with .htaccess or with a software firewall at the IP level.
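For instance, dropping a problem hosting range at the packet level is a one-liner (the CIDR below is a documentation placeholder; in practice you would load a whole blocklist, or use ipset for large lists):

```bash
# Drop all traffic from a troublesome hosting range (placeholder CIDR).
iptables -A INPUT -s 198.51.100.0/24 -j DROP
```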

Regards…jmcc

For this I would definitely use an Enterprise firewall; far more control.

Regards,
Steve

Polite crawlers will generally obey robots.txt, and this can be a good first line of defence for those that play by the rules. When blocking with .htaccess, if you are going to block a ‘legitimate’ search engine crawler, it may make some sense to exclude robots.txt from the block. Enterprise firewalls are good, but they can be expensive. Sometimes a good blocklist, iptables and perhaps mod_security can handle a lot of the problems a site will encounter.
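For example, if you were blocking a crawler by user agent but still wanted it to be able to read robots.txt (so it can see that it is disallowed), a mod_rewrite sketch might look like this (“ExampleBot” is just a placeholder pattern):

```apache
# Block ExampleBot everywhere except robots.txt.
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
RewriteRule ^ - [F]
```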

Regards…jmcc

Yes, for the last 7 years I have used pfSense, originally a fork of m0n0wall. pfSense is completely open source, has IP chaining at its core, intrusion detection, CARP (combining multiple WAN connections into one larger pipe, with failover), and virtual LANs (when you have switches that support this). You are quite correct, though, that iptables and mod_security can go quite far; still, I prefer pfSense :slight_smile:

Couldn’t you set up a trap for bots that don’t obey robots.txt?

You could write a disallow statement for a trap page (one which isn’t important for SEO), and if a bot visits it anyway, log its IP and then block that IP and its user agent string?
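Something like this, perhaps (only a sketch; the trap path, script and blacklist file are made-up names). First, disallow the trap in robots.txt:

```
User-agent: *
Disallow: /bot-trap/
```

Then the trap page records whoever fetches it despite the disallow, e.g. a small CGI-style Python sketch that appends the visitor’s IP and user agent to a blacklist file you could later feed into your deny rules:

```python
#!/usr/bin/env python
# /bot-trap/ handler (illustrative): anything requesting this ignored robots.txt.
import os

BLACKLIST = "/var/www/blacklist.txt"   # made-up path

ip = os.environ.get("REMOTE_ADDR", "unknown")
agent = os.environ.get("HTTP_USER_AGENT", "unknown")

with open(BLACKLIST, "a") as log:
    log.write("%s\t%s\n" % (ip, agent))

print("Content-Type: text/plain")
print("")
print("Goodbye.")
```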

You mean something like this? Although it just blocks the IP, not the user-string.

:tup:

Anyone had any experiences with this method of protection?

I’ve tried that one on a couple of sites and it seems very effective. I’ve also used this, which blocks bad bots, but doesn’t automatically ban the IP. It does, however, record the IP and give you the option to ban it manually.