Bots and Bandwidth

Very nice recommendation, @TechnoBear. I reviewed the PHP approach and it is pretty solid. It's great if you don't have a full-featured firewall, and it could also be used in conjunction with one.

Wow. That is a solid option and I was surprised it was free! The administrator's dashboard or "panel" is probably the most advanced I've seen - not that the smaller PHP options were any competition, though.

MBAs are taught to target their audience; the same applies to websites. If you have a client who, by virtue of their product or service, has a very limited target market, then consider allowing only IP blocks that are in that targeted location. I just signed up for a free account at http://ipinfodb.com to use their country lookup service (NZ is a small target market) to limit displaying a contact form to Kiwi-based IPs, and it works a treat!
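Roughly, the check looks something like this. It's only a PHP sketch: countryForIp() and its lookup URL are placeholders rather than ipinfodb's actual API, so plug in whatever geo-IP service you use and check their docs for the real endpoint.

<?php
// Rough sketch: only show the contact form to visitors whose IP looks like it's in NZ.
// countryForIp() and its URL are placeholders, not ipinfodb's actual API; plug in
// whatever geo-IP service you use, and cache the result so you don't hit it on every request.
function countryForIp($ip)
{
    $url  = 'https://example-geoip-service.test/lookup?ip=' . urlencode($ip); // placeholder endpoint
    $json = @file_get_contents($url);
    $data = $json ? json_decode($json, true) : null;
    return isset($data['countryCode']) ? $data['countryCode'] : 'UNKNOWN';
}

if (countryForIp($_SERVER['REMOTE_ADDR']) === 'NZ') {
    include 'contact-form.php';   // placeholder include: the real form
} else {
    echo '<p>Sorry, this form is only available to New Zealand visitors.</p>';
}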

Regards,

DK

What about SEO? Aren’t you relying on search engines having servers in the same areas as your audience?

bear,

Search engines are world-wide, but some specialize in a specific locality. Good point, though, so you'd need to "punch a hole" based on the search engines you want to invite in.

Regards,

DK

I read this thread a couple of days ago and would like to ask something. It is very relevant.

My site is live: if content changes, it updates the user's current view. The check is made every 10 seconds. I created a page that gets the RSS from the BBC feeds and embeds it into the current page. Rather than the script reading the BBC's RSS on the initial hit and on every subsequent "live" hit for each user, I check it once and save the HTML into its own file on the server. Subsequent hits then check when this was last done; if it was less than 10 seconds ago, they use the saved HTML, otherwise they go to the BBC and see if the feed has been updated. This makes sure that, however busy my site gets, the most requests it will ever send to the BBC is one every 10 seconds. It also improves the execution time for that particular script.

I was thinking of doing the same for my own pages. When I run a database query to build a public page, instead of just outputting it to the client I would save it first as an HTML file on the server. All requests in the next 10 seconds would use just that file, either with includes or by reading it as XML and outputting it to the screen. The first request after those 10 seconds repeats the process.
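Sketched in PHP for illustration (my own code is server-side script, but the pattern is the same in any language; the feed URL and cache path are just examples):

<?php
// The 10-second file cache described above, sketched in PHP.
// Feed URL and cache path are just examples.
$feedUrl   = 'http://feeds.bbci.co.uk/news/rss.xml';
$cacheFile = __DIR__ . '/cache/bbc-feed.html';
$ttl       = 10; // seconds

if (!is_file($cacheFile) || (time() - filemtime($cacheFile)) > $ttl) {
    // Cache missing or stale: fetch the feed once and save the rendered HTML.
    $xml = @simplexml_load_file($feedUrl);
    if ($xml !== false) {
        $html = '<ul>';
        foreach ($xml->channel->item as $item) {
            $html .= '<li><a href="' . htmlspecialchars((string) $item->link) . '">'
                   . htmlspecialchars((string) $item->title) . '</a></li>';
        }
        $html .= '</ul>';
        file_put_contents($cacheFile, $html, LOCK_EX);
    }
}

// Everyone hitting the page within the TTL just gets the saved file.
if (is_file($cacheFile)) {
    readfile($cacheFile);
}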

My theory is that if they are going to scrape your content they will find a way of doing it, and reducing the time it takes for you to give them that data may be a good defense.

Would something like that help?

Caching is a good idea wherever there are performance issues due to high traffic. You might also consider compressing the output to reduce bandwidth use. On a really busy site you'd want to cache to memory rather than disk, as it's a lot faster.

Thanks EastCoast. How do I go about compressing output from a server so it is still readable by a browser? I’m new to this part of webdev.

A few things you can do:

  • Remove white-space formatting from the HTML
  • Make sure images are as small as they can be without compromising quality too much
  • Use CSS sprites to combine small images into fewer requests
  • Avoid inline CSS and JavaScript
  • Avoid table layouts and excessive divs
  • Try to keep CSS and JavaScript files short - remove unnecessary code and white-space formatting
  • AJAX can reduce page requests, but be careful about its effects on accessibility.

Nice. I've been doing most of that already by outputting my pages with JavaScript instead of HTML and designing my site with mobile compatibility as a high priority. I'm heading in the right direction :slight_smile:
I thought it might have meant outputting in binary or something like that, which puzzled me a bit :slight_smile:

The good thing about server-side ("live") generated JavaScript is that variables and functions can be documented with comments in the server-side code. Here's my server-side JavaScript for cookies:

'---- Cookies ----

's = cookie name
't = default value (returned when the cookie isn't set)

response.write "function getCookie(s,t){"
 response.write "var a=document.cookie.split("";"");"
 response.write "for(var i=0;i<a.length;i++){"
  response.write "var b=a[i].replace("" "","""").split(""="");"
  response.write "if((b.length==2)&&(b[0]==s)) return unescape(b[1]);"
 response.write "}; return t;"
response.write "};"

(first example I could find). So although I might get confused if I wrote it like that directly in JavaScript, I don't so much when I generate it from the server-side script.


If you're using Apache, you can add something like this to your .htaccess file:

SetOutputFilter DEFLATE

or, to compress only certain file types:

<FilesMatch "\.(js|css|html)$">
SetOutputFilter DEFLATE
</FilesMatch>

which will compress files of the types specified. http://httpd.apache.org/docs/2.2/mod/mod_deflate.html
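If you can't change the server config, PHP can also compress its own output with output buffering. A minimal sketch, assuming the zlib extension is available and the server isn't already applying mod_deflate:

<?php
// Sketch: let PHP gzip its own output when you can't enable mod_deflate.
// Assumes the zlib extension is loaded and the server isn't already compressing.
if (!ob_start('ob_gzhandler')) {
    ob_start(); // browser doesn't accept gzip (or zlib missing): fall back to plain buffering
}
// ... build and echo the page as normal ...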

Hi,

This is normally done at the web server level; for example, if using Apache then mod_cache is configured. Alternatively, if using PHP, one can use output buffering to perform simple caching, or there are a number of PHP accelerators that perform caching as well as compression. You can find a lot of info via Google, but here is a Wikipedia article: http://en.wikipedia.org/wiki/List_of_PHP_accelerators. Opcode caches eliminate many inefficiencies during the execution phase on the server. You can also cache variables.
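For example, a very simple output-buffering file cache might look like this. It's only a sketch; the cache directory and TTL are placeholders, and the cache directory needs to exist and be writable.

<?php
// Simple output-buffering page cache: serve the saved copy if it's fresh,
// otherwise build the page, capture it, and save it for the next request.
// Cache directory and TTL are placeholders.
$cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl       = 60; // seconds

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    readfile($cacheFile);   // cache hit: no database work at all
    exit;
}

ob_start();                 // capture everything echoed from here on

// ... run the queries and echo the page here ...

file_put_contents($cacheFile, ob_get_contents(), LOCK_EX);  // save for the next visitor
ob_end_flush();             // and send it to the browser as usual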

If you have a site where content doesn't change regularly then it is likely a candidate for file caching. You can look at APC or one of the PEAR caching libraries.

For memory caching you might look at memcached, whose claim to fame is (from their home page):

Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
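For example, a sketch using the PHP Memcached extension: check memory first, fall back to the database, then store the result. The host, key name and fetch_latest_articles_from_db() helper are placeholders for whatever your own setup uses.

<?php
// Check memcached first, fall back to the database, then store the result.
// Host, key and the query helper are placeholders.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key      = 'homepage:latest-articles';
$articles = $mc->get($key);

if ($articles === false && $mc->getResultCode() === Memcached::RES_NOTFOUND) {
    $articles = fetch_latest_articles_from_db();   // hypothetical helper that runs the query
    $mc->set($key, $articles, 60);                 // keep it in memory for 60 seconds
}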

Steve

Mmm, thanks. That leads me on to another thing I've been puzzling over as well. My live scripts that I allow to be cached (because they only contain reused functions) seem to work a lot faster from the cache. Does the browser compile JavaScript before caching it?

Also, I read somewhere recently that files sent from the server can have a last-modified date in the header, and it's possible to get the browser to compare the last-modified date of a file on the server with the one in its cache. How would I go about implementing that, so the file is only fetched if the Last-Modified header date is newer?
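From what I've read, the pattern is roughly this; shown as a PHP sketch for illustration, with a placeholder file path (plain static files get these headers from the web server automatically). Is this the right idea?

<?php
// Send Last-Modified, and answer 304 Not Modified when the browser's copy is still current.
// File path is a placeholder; static files served directly get this behaviour from Apache.
$file         = __DIR__ . '/cache/page.html';
$lastModified = filemtime($file);

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');   // tell the browser to use its cached copy
    exit;
}

readfile($file);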

Hi

Search Google for "output buffering caching".

Hi

Regarding your initial question.

First of all, in reference to your bandwidth question, I wanted to point out a study released by our company <snip>removed advertising</snip> just a few months ago. It analysed traffic data from several thousand websites and, in the end, showed that 50% or more of traffic was bot-generated (80% for smaller sites). On average, 31% of those visits were made by malicious intruders (spammers, scrapers, etc.).

<snip>removed advertising</snip>

Finally, as suggested, limiting access from irrelevant geo-locations may sound like a good idea, but before doing so you should know that legitimate bots may sometimes use "weird" IPs - for example, Googlebot can originate from China, and I'm sure you don't want to block that…
Reference: http://productforums.google.com/forum/#!topic/webmasters/rEZQskC884s

So I would not recommend setting any non-specific rules, at least not before checking IP ranges for the most important bots out there.
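Better still, verify the big crawlers by DNS rather than by IP list. Here's a sketch of the reverse-then-forward lookup Google describes for confirming Googlebot; the function name is ours.

<?php
// Reverse-then-forward DNS check for verifying a visitor that claims to be Googlebot,
// rather than relying on published IP ranges. Function name is ours.
function isRealGooglebot($ip)
{
    $host = gethostbyaddr($ip);                      // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                                // no PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                                // hostname isn't Google's
    }
    $forward = gethostbynamel($host);                // forward lookup must point back
    return is_array($forward) && in_array($ip, $forward);
}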

Hope this helps.

Just thought I would mention that even though I have an entry in robots.txt to ban that Brandwatch magpie-crawler bot (and have for a long time), that bot will not go away. I have an entry in .htaccess to give the bot 403 Forbidden responses. I've been doing that for a couple of years now and it still will not go away. It's making upwards of 8 page requests per second at times. That has to be one of the worst bots I've come across.
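If anyone wants a belt-and-braces check at the application level as well, something like this works (a PHP sketch; user-agent strings are trivially spoofed, so treat it as a supplement to the .htaccess rule, not a replacement):

<?php
// Best-effort block for a named bad bot by User-Agent. Easily spoofed, so this only
// supplements the .htaccess 403, it doesn't replace it.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'magpie-crawler') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}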

Brandwatch are on Twitter; it might be an idea to give them some constructive criticism that is publicly visible on there.