Sysadmin.ImcDocsRobots r1.6 - 07 Nov 2006 - 13:56 - IntRigeri

ImcDocsRobots

Basically, the problem is that at the beginning of each month all the web crawlers refresh their caches and meta-information, which overloads the server.

Some doc

Some stats

on 2005-02-17, docs.indymedia.org-access.log (created on 2005-02-13) :

% grep -i bot docs.indymedia.org-access.log | wc -l
44645
% grep -i bot docs.indymedia.org-access.log | grep  '?' | wc -l
19319

on 2005-02-24, docs.indymedia.org-access.log (created on 2005-02-20) :

% wc -l docs.indymedia.org-access.log               
125493
% grep -i bot docs.indymedia.org-access.log | wc -l
6457
% grep -i bot docs.indymedia.org-access.log | grep  '?' | wc -l
1034

on 2005-04-03, docs.indymedia.org-access.log.1 :

% wc -l docs.indymedia.org-access.log.1               
128913
% grep -i bot docs.indymedia.org-access.log.1 | wc -l
22503
% grep -i bot docs.indymedia.org-access.log.1 | grep  '?' | wc -l
4391
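
The three counts above can be gathered in one pass with a small helper. This is only a sketch: the function name `bot_stats` and the `bot_share` percentage are illustrative additions, not part of the original notes, and like the commands above it treats any log line containing "bot" (case-insensitive) as crawler traffic.

```shell
# Summarise crawler traffic in an access log.
# Usage: bot_stats LOGFILE
# Note: "bot" is matched anywhere in the line, so requests for
# robots.txt from ordinary clients are counted as well.
bot_stats() {
    log="$1"
    # Arithmetic expansion strips the padding some wc versions emit.
    total=$(( $(wc -l < "$log") ))
    # grep -c prints 0 but exits non-zero when nothing matches.
    bots=$(grep -ci 'bot' "$log" || true)
    # Crawler requests that carry a query string.
    bot_queries=$(grep -i 'bot' "$log" | grep -c '?' || true)
    if [ "$total" -gt 0 ]; then
        share=$(( 100 * bots / total ))
    else
        share=0
    fi
    echo "total=$total bots=$bots bot_queries=$bot_queries bot_share=${share}%"
}
```

Called as `bot_stats docs.indymedia.org-access.log`. With the figures recorded above, the crawler share works out to roughly 5% of requests on 2005-02-24 and roughly 17% on 2005-04-03.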

What's been tried

The measures below improved the situation. Maybe we can even consider the issue solved. -- DjRom

Add Disallow: /*?* to robots.txt

I think it will work for Google, but the robots.txt validator says wildcards are non-standard (i.e. other spiders might ignore them).

Let's see. Activated on 2005-02-17.
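
For illustration, the rule might look like the fragment printed below. The notes only record the `Disallow` line; the `User-agent: *` line and the helper name `robots_rule` are assumptions made for this sketch, and the actual robots.txt may have scoped the rule differently.

```shell
# Print a robots.txt fragment that blocks URLs containing a query string.
# "User-agent: *" is an assumption; only the Disallow line is documented.
robots_rule() {
    cat <<'EOF'
User-agent: *
Disallow: /*?*
EOF
}

# Show the fragment; append it to a robots.txt by redirecting the output.
robots_rule
```

The quoted heredoc delimiter keeps `*` and `?` literal, so the shell does not try to expand them.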

Disallow skin= for unauthenticated users

see ImcDocsLog#Performance_optimizations

Copyright © 1999-2008 by the contributing authors.
All material on this collaboration platform is the property of the contributing authors.