IMC Archives working group

This is a working group that works on collecting and storing data from IMC's web sites (newswire and features) and from IMC's internal organizing tools (docs, lists, etc.), in an attempt to save this data for the future. Below is some information about the ongoing work of this group.

Table of contents:

Historical - email from micah about images.indymedia.org

images.indymedia.org is something from long long long long long ago that still exists. There is significant IMC history on this machine, and people need to know about this. It's very important and I am afraid it will be lost.

First and foremost, it is not backed up. When Troy wrote in January of this year[1] with usage statistics (it's over 200 GB of data), he said two things: 1. capacity will not be growing, and 2. the data is not backed up. At the time, I was advocating that people push for creating a large IMC backup archive to preserve this data, and people are going to have to find a solution to this problem when the disk fills up. Toya created this page[2] on the wiki about this project.

Ages ago, when indymedia first started, we had a sympathetic friend who worked at Loudeye, Troy Davis (troy@nack.net). For the WTO protests he set up a server on incredible bandwidth to serve rich-media content (movies, audio, etc.). When people uploaded this content to IMC sites, the media was sent to this server, the URLs were rewritten in Active, and the files were then removed from the IMC server. We didn't have the disk space or bandwidth to serve these files, so everything was sent to centerstage.loudeye.com (this is where images.indymedia.org pointed).

Pretty much every indymedia site used this service; in fact it's listed in the Active install notes in section 3.5.3[3]. Originally there was one account ('imc'), access was over FTP, and I still remember the password. When all the IMC sites (with a few exceptions) were located on stallman, there was a cronjob that handled this for everyone:

*/10 * * * * root /usr/local/sbin/pushtoeye.pl > /dev/null 2>&1

We encouraged www-features to use this service, instead of linking to images remotely[4].

After a while Troy left Loudeye, but he took the machine with all the great historical content, hooked it up at home on just as fast bandwidth, and has kept it available for everyone to use. It's been known as stream.paranode.com for a long time. The most recent email from Troy about this server was in June, about illegal content from some sites that needed to be deleted[5].

There were a lot of programs written to manage this content; pushtoeye was the one used for the longest time. Toni wrote 'multiplacer' at one point, announced in 2000[6].

In January 2003 there was an IMC-tech meeting[7] where JB talked about cleaning up Loudeye content and removing duplicates because of space problems. He had determined that at that time there were 227,479 unique rich-media files on Loudeye from all the various IMC sites. There were about 500,000 files in total, but many were duplicates. That was in 2003; this content has not been backed up for years, and many IMC sites still use this service (take for example chiapas[8]). In 2001 Troy posted some statistics[9] about usage: over 100k hits a day, with peak usage of more than 25 hits per second.

  1. http://lists.indymedia.org/pipermail/imc-tech/2007-January/0105-ue.html
  2. http://docs.indymedia.org/view/Global/ImcArchives
  3. http://www.active.org.au/doc/active/install.html
  4. http://archives.lists.indymedia.org/www-features/2001-March/000255.html
  5. http://lists.indymedia.org/pipermail/imc-tech/2007-June/0604-6j.html
  6. http://archives.lists.indymedia.org/imc-summaries/2002-May/000038.html
  7. http://docs.indymedia.org/view/Sysadmin/MeetingLog3Jan2003
  8. http://images.indymedia.org/imc/chiapas/
  9. http://archives.lists.indymedia.org/imc-tech/2001-July/004691.html

Proposal to IMC-Tech and IMC-Finance

Dear IMC-Tech and IMC-Finance,

We, as IMC volunteers from all around the world, would like to propose that the IMC Global Network have a storage server, so we can make sure our data is stored and preserved for future generations. The storage server is just the first step of this work: we still need to locate and collect the data; some of it is offline, and some we will have to search around for to find out who has it, and so on.

A call will go out to create a working group, so this work can be done and anyone can add themselves to the project. We built a page on the wiki to help coordinate this work. A call will be written on that page and translations can be added; the idea is to send it around to start collecting information about who we need to contact to get the data, to get more people involved, etc. Also on the wiki you can see a list of the sites we will try to collect data from; if you remember something that is missing from this list, please add it on the wiki.

This is very important work to be done, and talking to each other we realized that the network does have the resources to do it; the problem is a lack of people who are aware of this need and able to help out with it. As for resources, this is why we are writing to you folks. We would like to propose that IMC-Tech pay for the storage server and that IMC-Finance assume responsibility for paying for the server's connection (if it is needed).

Costs:

  • storage server - $$
  • hosting - $$

Server configuration information:

Location where we will host the information:

Technical considerations (to be debated/corrected/refuted)

  • It would be ideal to use the contact db as the source of information for which sites to archive. This could include sites marked both "inactive" and "live." For inactive sites, once a static copy had been made, there's no need to deal with the original site. For live sites, it will be necessary to continue to crawl the site for updates.

  • The contact db also gives us a canonical hostname, which means we could have a predictable location for the archive site. If the site was at host.domain.tld, it should be found in the archives at http://archive.indymedia.org/host.domain.tld

  • If we wanted to preserve links, we could also work with the DNS group to point inactive domain names at a vhost on archive.indymedia.org pointing to the most current backup, so that, at least for *.indymedia.org domains, old links would still work (a rough vhost sketch follows this list).

  • To prevent unnecessary bandwidth usage on the archive server, it might be desirable to keep the archives of "live" sites private until that status changes.

  • One problem is that, unattended, if we only keep one copy of a website, we run the risk of overwriting our archive copy with error messages or 404 pages if the tech infrastructure behind the site goes away before the formal process decision is made. Since it's unlikely we can detect in any automated fashion whether we are archiving the actual page rather than some unrelated error message (think of domain name expiry and parking pages, for example), it's imperative to keep multiple backups for each site, say on a weekly basis (see the snapshot sketch after this list).

  • This makes the issue of disk space even more of a problem than it already is. One technique might be to not store duplicates of identical files, which can probably be figured out (hopefully) from HTTP headers in many cases, especially for media files (see the deduplication sketch after this list).

  • Potential software: there is good software for making simple mirrors, like wget and httrack, but neither of these solves the multiple-copy issue on its own, although the latter seems extensible in a fairly straightforward manner to deal with saving multiple copies (see the callback documentation). Archive.org also maintains a really powerful crawler called Heritrix, which is worth looking into, but might be overkill. Interestingly, the Wayback Machine, with integrated search, is also available as open source software, and might solve many of the UI issues easily, at the cost of using the more complicated crawler. Another potential candidate for a more sophisticated "replay" interface, beyond straight rewritten HTML, is WERA.

  • Hidden articles are also potentially a huge problem - we need to think carefully about how our crawling and archiving strategy relates to the autonomy of local collectives to moderate and remove their content. If an article is hidden on a local imc, how do we know to remove our copies as well? There definitely needs to be a good procedure in place for fielding requests for deletion of material from the archive.
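
To make the "point inactive domains at the archive" idea above more concrete, here is a minimal Apache vhost sketch. It assumes DNS for an inactive site (the placeholder name host.domain.tld from the canonical-hostname bullet) has been pointed at archive.indymedia.org, and that snapshots live under a made-up /srv/archive/<hostname>/ layout with a "current" symlink to the newest good copy; none of this reflects an existing setup.

# Hypothetical vhost on archive.indymedia.org for one inactive site.
# host.domain.tld and the /srv/archive layout are placeholders.
<VirtualHost *:80>
    ServerName host.domain.tld
    # Serve whatever the "current" symlink points at (the newest good snapshot).
    DocumentRoot /srv/archive/host.domain.tld/current
    <Directory "/srv/archive/host.domain.tld/current">
        Options Indexes FollowSymLinks
        # Apache 2.4 syntax; on 2.2 this would be "Order allow,deny" / "Allow from all".
        Require all granted
    </Directory>
</VirtualHost>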
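The "keep multiple backups for each site" point could be realized with nothing more than wget and a date-stamped directory per crawl. Below is a minimal snapshot sketch in Python; the /srv/archive layout, the placeholder site list, and the size-based sanity check on the front page are all assumptions, not an agreed design.

#!/usr/bin/env python3
# Sketch: mirror each site into a date-stamped directory with wget, and only
# advance the "current" symlink if the front page looks plausible, so a
# parking or error page never silently becomes "the" archive copy.
import datetime
import os
import subprocess

ARCHIVE_ROOT = "/srv/archive"    # assumed layout, not an agreed path
SITES = ["host.domain.tld"]      # placeholder; the real list would come from the contact db

def snapshot(host):
    stamp = datetime.date.today().isoformat()
    target = os.path.join(ARCHIVE_ROOT, host, stamp)
    os.makedirs(target, exist_ok=True)
    # Standard wget mirroring options; --convert-links makes the copy browsable
    # offline, --page-requisites pulls in images and stylesheets.  wget exits
    # non-zero if any single fetch failed, so we don't treat that as fatal.
    subprocess.call([
        "wget", "--mirror", "--page-requisites", "--convert-links",
        "--adjust-extension", "--no-parent", "--wait=1",
        "-P", target, "http://%s/" % host,
    ])
    # Crude sanity check: the saved front page must exist and be non-trivial.
    index = os.path.join(target, host, "index.html")
    if os.path.exists(index) and os.path.getsize(index) > 1024:
        current = os.path.join(ARCHIVE_ROOT, host, "current")
        if os.path.islink(current):
            os.remove(current)
        os.symlink(stamp, current)   # what a vhost on the archive could serve
    else:
        print("snapshot of %s looks suspicious; leaving 'current' alone" % host)

if __name__ == "__main__":
    for site in SITES:
        snapshot(site)

Run weekly from cron, much like the old pushtoeye job, e.g. "0 3 * * 0 root /usr/local/sbin/archive_snapshot.py" (the script name is hypothetical).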
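The disk-space point suggests spotting duplicate files from HTTP headers at crawl time; a simpler variant of the same idea is an offline pass that hashes everything under the archive root and replaces byte-identical copies with hard links. A rough deduplication sketch, again assuming the /srv/archive layout:

#!/usr/bin/env python3
# Sketch: hash every file under the archive root and hard-link byte-identical
# duplicates to a single on-disk copy (only works within one filesystem).
import hashlib
import os

ARCHIVE_ROOT = "/srv/archive"    # assumed layout

def file_hash(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe(root):
    seen = {}    # (digest, size) -> first path with that content
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            key = (file_hash(path), os.path.getsize(path))
            if key in seen:
                # Same content already stored once: drop this copy and
                # hard-link it to the existing one.
                os.remove(path)
                os.link(seen[key], path)
            else:
                seen[key] = path

if __name__ == "__main__":
    dedupe(ARCHIVE_ROOT)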

Call for help

List of places to collect data from:

-- ToyaMileno - 25 Oct 2007 - added micah's email which explains why this proposal was created

-- JohnDudaAccount - 16 Jun 2007 (added "Technical Considerations")

-- ToyaMileno - 11 May 2007