ImcDocsRobots
Basically, the problem is that at the beginning of each month all the web crawlers refresh their caches and meta-information, and the resulting load overwhelms the server.
Some doc
Some stats
On 2005-02-17, docs.indymedia.org-access.log (created on 2005-02-13):
% grep -i bot docs.indymedia.org-access.log | wc -l
44645
% grep -i bot docs.indymedia.org-access.log | grep '?' | wc -l
19319
On 2005-02-24, docs.indymedia.org-access.log (created on 2005-02-20):
% wc -l docs.indymedia.org-access.log
125493
% grep -i bot docs.indymedia.org-access.log | wc -l
6457
% grep -i bot docs.indymedia.org-access.log | grep '?' | wc -l
1034
On 2005-04-03, docs.indymedia.org-access.log.1:
% wc -l docs.indymedia.org-access.log.1
128913
% grep -i bot docs.indymedia.org-access.log.1 | wc -l
22503
% grep -i bot docs.indymedia.org-access.log.1 | grep '?' | wc -l
4391
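The three counts above can also be gathered in a single pass over a log. This is a minimal sketch, assuming a POSIX shell with awk; `bot_stats` is a hypothetical helper name, not something taken from the server setup. It uses the same criteria as the grep commands: "bot" matched case-insensitively, and a literal "?" anywhere in the line.

```shell
# Hypothetical helper: summarise crawler traffic in one access log.
# Usage: bot_stats <logfile>
bot_stats() {
    awk '
        { total++ }                           # every request line
        tolower($0) ~ /bot/ {                 # same match as: grep -i bot
            bots++
            if (index($0, "?") > 0) query++   # same match as: grep "?"
        }
        END {
            # Unset counters print as 0 with %d.
            printf "total=%d bots=%d bots_with_query=%d\n", total, bots, query
        }
    ' "$1"
}
```

For example, `bot_stats docs.indymedia.org-access.log.1` prints the three numbers on one line. On the figures above, bots account for roughly 17% of requests in the 2005-04-03 log (22503 of 128913), against roughly 5% in the 2005-02-24 log (6457 of 125493).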
What's been tried
The measures below have improved the situation. Maybe we can even consider the issue solved. -- DjRom
Added Disallow: /*?* to robots.txt
I think it will work for Google, but the robots.txt validator says wildcards are non-standard (i.e. other spiders might ignore them).
Let's see. Activated on 2005-02-17.
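To illustrate why the wildcard rule only helps with crawlers that understand it, here is a small sketch of the two matching behaviours. It assumes a POSIX shell with sed and grep; the function names and the example URL paths are made up for illustration, and the wildcard matcher only handles `*` (not the `$` end anchor that some crawlers also accept). It is not how any particular crawler is implemented.

```shell
# Original convention: a Disallow value is a literal path prefix.
# Usage: robots_match_literal <rule> <url-path>
robots_match_literal() {
    case "$2" in
        "$1"*) return 0 ;;   # quoted "$1" is compared literally
        *)     return 1 ;;
    esac
}

# Wildcard extension: "*" in the rule matches any run of characters.
# Usage: robots_match_wildcard <rule> <url-path>
robots_match_wildcard() {
    # Escape regex metacharacters (except "*"), then turn "*" into ".*".
    regex=$(printf '%s' "$1" | sed -e 's/[][\\.^$+?(){}|]/\\&/g' -e 's/\*/.*/g')
    printf '%s\n' "$2" | grep -Eq -- "^${regex}"
}
```

With the rule `/*?*`, `robots_match_wildcard` succeeds for a path such as `/bin/view/Main/WebHome?skin=print` and fails for `/bin/view/Main/WebHome`, while `robots_match_literal` fails for both, because no path literally starts with `/*?*`. A crawler that only does prefix matching therefore keeps fetching the query-string URLs.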
Disallow skin= for unauthenticated users
See ImcDocsLog#Performance_optimizations