ImcDocsRobots

Basically, the problem is that at the beginning of each month, all the web crawlers refresh their caches and meta-information, and the resulting load overwhelms the server.

Some doc

Some stats

On 2005-02-17, docs.indymedia.org-access.log (created on 2005-02-13):

% grep -i bot docs.indymedia.org-access.log | wc -l
44645
% grep -i bot docs.indymedia.org-access.log | grep  '?' | wc -l
19319

On 2005-02-24, docs.indymedia.org-access.log (created on 2005-02-20):

% wc -l docs.indymedia.org-access.log               
125493
% grep -i bot docs.indymedia.org-access.log | wc -l
6457
% grep -i bot docs.indymedia.org-access.log | grep  '?' | wc -l
1034

On 2005-04-03, docs.indymedia.org-access.log.1:

% wc -l docs.indymedia.org-access.log.1               
128913
% grep -i bot docs.indymedia.org-access.log.1 | wc -l
22503
% grep -i bot docs.indymedia.org-access.log.1 | grep  '?' | wc -l
4391

What's been tried

The measures below have improved the situation. Maybe we can even consider the issue solved. -- DjRom

Add Disallow: /*?* to robots.txt

I think it will work for Google, but the robots.txt validator says wildcards are non-standard (i.e. other spiders might ignore them).

Let's see. Activated on 2005-02-17.
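
For reference, the rule would sit in robots.txt roughly as follows. The User-agent: * line is an assumption: this page does not say which crawlers the rule was addressed to.

```
# Assumed record; only the Disallow line is taken from this page.
User-agent: *
# Block any URL containing a query string (wildcard syntax, non-standard).
Disallow: /*?*
```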

Disallow skin= for unauthenticated users

see ImcDocsLog#Performance_optimizations
Topic revision: r6 - 07 Nov 2006, IntRigeri