Comments on this design document should be addressed to Active2-dev@ender.indymedia.org


-*- mode: outline -*-

Active2 Design [DRAFT0]
Mike Warren
July 20, 2002

* OVERVIEW

The overall system comprises three parts: the back-end, the middle
and the front-end. Very briefly, the back-end stores data and makes it
available to the middle which is responsible for filtering the
data into an internal representation. This representation is
understood by the front-end, which produces output for the world to
see (e.g. HTML, PDF, RSS, PostScript, Palm Pilot formats, etc.)

The backend will support P2P methods of information-sharing. The
middle and front will be written in Python. At least the
interface to the backend will probably also be Python, although this
hopefully won't matter.

There is a lexicon on the Wiki which explains terms we've discussed
already; such words are capitalized.


* FILE KEYS

There was lots of discussion of filekeys. The only important bit is
the hash, which is [almost certainly] all the back-end will use to
store/retrieve things (the rest is just hints). Filekeys look like:

IMC:mimetype:hash

where IMC and mimetype are optional. The order doesn't matter, EXCEPT
that the hash MUST come last. (Just to ease parsing). For example, all of
these are equivalent filekeys:

utopia.indymedia.org:text/plain:AFF456123...FF109
text/plain:AFF456123...FF109
utopia.indymedia.org:AFF456123...FF109
text/plain:utopia.indymedia.org:AFF456123...FF109
AFF456123...FF109
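
The parsing rule above can be sketched in Python as follows. This is
only an illustration: the names FileKey, parse_filekey and
same_content are hypothetical, the hash ABC123 in the usage note is a
made-up placeholder, and telling a mimetype from an IMC by looking for
a ``/'' is an assumption rather than part of the design.

```python
from typing import NamedTuple, Optional


class FileKey(NamedTuple):
    """A parsed filekey; only the digest (hash) is required."""
    digest: str
    imc: Optional[str] = None
    mimetype: Optional[str] = None


def parse_filekey(text: str) -> FileKey:
    """Split 'IMC:mimetype:hash'; the hints are optional and unordered."""
    # The hash MUST come last, so everything before it is a hint.
    *hints, digest = text.split(":")
    if not digest:
        raise ValueError("filekey has no hash: %r" % text)
    imc = mimetype = None
    for hint in hints:
        # Assumption: a hint containing '/' is a mimetype, else an IMC name.
        if "/" in hint:
            if mimetype is not None:
                raise ValueError("two mimetype hints: %r" % text)
            mimetype = hint
        else:
            if imc is not None:
                raise ValueError("two IMC hints: %r" % text)
            imc = hint
    return FileKey(digest, imc, mimetype)


def same_content(a: str, b: str) -> bool:
    """Filekeys are equivalent when their hashes match; hints are ignored."""
    return parse_filekey(a).digest == parse_filekey(b).digest
```

With this, same_content("text/plain:ABC123", "utopia.indymedia.org:ABC123")
is true, since only the trailing hash is compared.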

* BACK END

There are two types of data the backend must store: Metadata and
Content. The major difference is that Metadata is mutable while
Content is not.

Content consists of things like text, pictures, movies or sound. The
text can be in any format which the middle-end can parse (first target
will probably be some kind of plain-ish text and HTML). This could easily
be expanded to things like PDF, XML, etcetera.

Really, there are two types of Metadata as well: those that change and
those that ``don't''. The metadata which is associated with Content
will not change (well, it can, but see below). I will call immutable
Metadata ``Information'' instead, to be less confusing.

** file-based (at least by appearances)

Everything will appear to be a file to things talking to the
backend. Everything will be accessed via Filekeys (discussed
elsewhere). Metadata (mutable and non) will be in XML after being
retrieved (although it doesn't have to be *stored* in XML). Some of
this will be revealed as NewsML (e.g. Articles).

** mutable metadata (Metadata)

Mutable metadata will be stored in RDF assertions linking two Filekeys
with a verb. So, to specify author-information for an Image, for example,
one would say ``<filekey of Image> isAuthoredBy <filekey of author
information>'' (in RDF). All such RDF assertions will be
one-per-``file'' and each such assertion will have a Filekey. (So
therefore one may make RDF-assertions about RDF-assertions, for
example). This would allow people to, for example, comment on the
relationships between data or to reference *links* explicitly (rather
than just the end-points of links).

** immutable metadata (Information)

So, what did I mean when I said metadata-about-Content was sort-of
editable? Well, let's say you submitted some photographs and later on
decide that the titles for them should change. Since the
Image-information is immutable, you can't just directly edit
that. Instead, you submit new Image-information (with your new title)
and change the Metadata linking that Image-information to your
actual Image.

What this means is that the IMC releases a new signed RDF assertion
saying ``<filekey of new Image-information> isMetadataFor <filekey of
Image>''.

** versioning

Versioning of Content, Information and Metadata will be date-based;
newer versions are newer data.
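
As a rough sketch of how date-based versioning could pick the current
Information for a piece of Content: the function below assumes each
``isMetadataFor'' assertion carries a date alongside its two Filekeys.
The four-tuple layout and the name current_information are
hypothetical; the design does not say where the date is stored.

```python
import datetime


def current_information(assertions, content_key):
    """Return the filekey of the newest Information for content_key.

    `assertions` is an iterable of (subject, verb, object, date) tuples;
    this shape is an assumption made for the sketch.
    """
    candidates = [
        (date, subject)
        for (subject, verb, obj, date) in assertions
        if verb == "isMetadataFor" and obj == content_key
    ]
    if not candidates:
        return None
    # Newer versions are newer data: the latest date wins.
    return max(candidates)[1]
```

Re-titling a photograph then amounts to publishing a second, later
assertion; the older one stays in place but is no longer selected.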

** Conclusion

To the outside world, the backend looks like a filesystem which stores
Content, Information (about Content) and Metadata. Everything is
accessed via a Filekey. Information and Metadata are XML; Content is
whatever it is (e.g. text, PDF, .jpg, .png, etc.)

** API

There should be a Python wrapper which knows how to talk to the various
backends (and/or the backends should expose a common API). Luckily,
this is easy:

***  get( filekey ) => (mimetype,content) or None

Given a filekey, either returns a tuple consisting of the mimetype and
the actual content, or the None object.

*** put( content ) => [nothing]

The hash is computed and the content inserted into the
backend. Exceptions should happen on errors.
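
To make get() and put() concrete, here is a toy in-memory backend. It
is a sketch under several assumptions which the design leaves open:
the hash is taken to be a SHA-1 hex digest, put() takes an optional
mimetype argument so that get() has something to return, and the names
MemoryBackend and hash_of are hypothetical.

```python
import hashlib


class MemoryBackend:
    """Toy backend that keeps content in a dict keyed by hash."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def hash_of(content: bytes) -> str:
        # Assumption: SHA-1 hex digest; the design does not name a hash.
        return hashlib.sha1(content).hexdigest().upper()

    def put(self, content: bytes,
            mimetype: str = "application/octet-stream") -> None:
        """Compute the hash and insert the content; raise on errors."""
        if not isinstance(content, bytes):
            raise TypeError("content must be bytes")
        self._store[self.hash_of(content)] = (mimetype, content)

    def get(self, filekey: str):
        """Return (mimetype, content), or None if nothing is stored."""
        # Only the hash, the last ':'-separated field, is used for lookup.
        digest = filekey.rsplit(":", 1)[-1]
        return self._store.get(digest)
```

Because only the hash is used for lookup, a bare hash and a fully
hinted filekey retrieve the same stored tuple.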

*** get_metadata( filekeya, verb, filekeyb ) => list of (filekey,verb,filekey) three-tuples

This searches the RDF assertions for the verb specified. If either
filekey is None, then it is a wildcard.  Same for the verb. (All three
can't be None). (This is similar to the Mozilla RDF engine, if anyone
is familiar with that.) So, let's say we have the following RDF
assertions:

XX authorOf YY
ZZ informationFor YY
XX authorOf A
XX authorOf B
AA informationFor A
BB informationFor B

So if one called ``get_metadata( None, "informationFor", None )'', then
the following would be returned:

[(ZZ informationFor YY), (AA informationFor A), (BB informationFor B)]

``get_metadata( XX, None, None )'' would return:

[(XX authorOf YY), (XX authorOf A), (XX authorOf B)]
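
The wildcard matching can be sketched with a plain list of tuples
standing in for the RDF store. The class name AssertionStore is
hypothetical, the short strings stand in for real Filekeys, and the
verb is spelled ``informationFor'' throughout.

```python
class AssertionStore:
    """Holds (filekey, verb, filekey) assertions in insertion order."""

    def __init__(self, triples):
        self._triples = list(triples)

    def get_metadata(self, filekey_a, verb, filekey_b):
        """Return every assertion matching the pattern; None is a wildcard."""
        if filekey_a is None and verb is None and filekey_b is None:
            raise ValueError("all three arguments cannot be None")
        pattern = (filekey_a, verb, filekey_b)
        return [
            triple for triple in self._triples
            if all(want is None or want == got
                   for want, got in zip(pattern, triple))
        ]


store = AssertionStore([
    ("XX", "authorOf", "YY"),
    ("ZZ", "informationFor", "YY"),
    ("XX", "authorOf", "A"),
    ("XX", "authorOf", "B"),
    ("AA", "informationFor", "A"),
    ("BB", "informationFor", "B"),
])

# Three matches: the tuples whose subjects are ZZ, AA and BB.
info = store.get_metadata(None, "informationFor", None)
# Three matches: XX as author of YY, A and B.
by_xx = store.get_metadata("XX", None, None)
```

A linear scan like this is only meant to show the semantics; a real
backend would presumably index the assertions.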


** caching

Obviously, searching some arbitrary P2P system for RDF assertions
would probably take a long time (relative to how long Web users are
going to wait before getting antsy). However, each IMC can maintain a
database cache of the metadata they've used (or have a good chance of
using). Since it doesn't change (only gets upgraded with new
metadata), this is easy.

Updates to the local cache can happen for anything which is changed
via a particular IMC. If an IMC gets a request for a story it doesn't
have in its cache, it can ask the other IMCs about it, and/or do a
search of the backend. The latter might take a little time, so a
``please wait a few minutes; re-constituting foreign story'' type page
could appear for a while.
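
The lookup order described here (local cache, then other IMCs, then
the backend itself) might look roughly like this. The class name
CachingLookup and the idea of passing peers and the backend as plain
callables are assumptions made for the sketch, not part of the design.

```python
class CachingLookup:
    """Try the local cache first, then peer IMCs, then the slow backend."""

    def __init__(self, backend_get, peers=()):
        self._cache = {}                  # hash -> cached result
        self._backend_get = backend_get   # callable: filekey -> result or None
        self._peers = list(peers)         # callables with the same signature

    def fetch(self, filekey):
        # Cache by hash only, so differently hinted filekeys share an entry.
        digest = filekey.rsplit(":", 1)[-1]
        if digest in self._cache:
            return self._cache[digest]
        for lookup in (*self._peers, self._backend_get):
            result = lookup(filekey)
            if result is not None:
                # Stored data never changes, so it is safe to keep.
                self._cache[digest] = result
                return result
        return None
```

A web layer built on this could show the ``please wait'' page whenever
fetch() has to fall through to the backend.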

Upshot: any story in the entire IMC network could be read from any IMC
if one knows the Filekey for the Article (well, any node running
Active2, anyway). This means, for example, that if there's something going
down in a particular place and that IMC's Web server goes down, people
can still read all the stories which have been published by visiting
other IMCs and asking for the Filekeys of the Article they
want. (e.g. http://utopia.indymedia.org/article/<filekey>).


* MIDDLE

The middle will talk to the back-end to fetch the data needed by
the front-end. Basically, it's just a filter which converts back-end
data into the internal representation for the front-end. This is
probably the place where the cache would be used extensively, so it
might make sense to implement that as part of the middle. This
would help keep the back-end nice and clean.

It will also have nice hooks (i.e. whatever's needed) to assist the
front-end in producing output.

** internal representation

As per the example code I produced a while back, the internal
representation will be a stream of Python objects. These will be
equivalent to XML-type markup, but without the hassle. In my example,
I had the following classes: Document, Paragraph, and Title (a
Document contained Paragraphs and Titles).

Obviously, there need to be more. I would propose (at least):
Document, Paragraph, Text, Title, Media (text, image, movie, sound),
Link (some URL).

(Document contains Paragraphs and Titles; Paragraph contains Text and
Media).

Text could obviously have different hints (bold, large, underlined,
etc.), as could the others as needed. (For example, Document could be
a comment or an article.)

Media would likely have subclasses for the different types (Sound,
etc.) which would contain the metadata for that particular type of
Media.

Other ideas: Author, ***

Ideally, this will be as minimal as possible; the translation from
this to {PDF, HTML, etc.} will then be simplified in the front-end.
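
A minimal sketch of such classes, using dataclasses. The field names
(content, hints, children, kind, filekey) are hypothetical, as is
allowing a Link inside a Paragraph; plain_text() is only there to show
how little a front-end needs to do to walk the structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Text:
    content: str
    hints: Tuple[str, ...] = ()   # e.g. ("bold",); hint names are made up


@dataclass
class Link:
    url: str


@dataclass
class Media:
    filekey: str                  # where the actual Content lives


@dataclass
class Image(Media):
    pass                          # type-specific metadata would go here


@dataclass
class Title:
    content: str


@dataclass
class Paragraph:
    children: List[Union[Text, Media, Link]] = field(default_factory=list)


@dataclass
class Document:
    kind: str = "article"         # or "comment"
    children: List[Union[Title, Paragraph]] = field(default_factory=list)


def plain_text(doc: Document) -> str:
    """Render a Document as plain text, skipping anything that is not text."""
    lines = []
    for node in doc.children:
        if isinstance(node, Title):
            lines.append(node.content.upper())
        else:
            lines.append("".join(
                c.content for c in node.children if isinstance(c, Text)))
    return "\n".join(lines)
```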


** aliases

I think it makes sense to implement aliases in the
middle. Aliases are a nice way to refer to filekeys without having
to remember a giant string of hexadecimal digits. Aliases will be
IMC-local, and will be a one-to-one mapping to Article Filekeys. For
example the alias ``2002/06/14/kidnapping'' might refer to a
particular filekey.

These could be automatically assigned or manually assigned. Articles
should always reveal their canonical link-location which will be
``http://imc/article/<filekey>''

ALTERNATIVE: aliases could *all* be automagically generated from some
combination of the date of an Article and its title. Then, one
wouldn't need to worry about storing them all the time; if they were
lost, they could be re-constituted as-needed.

ALTERNATIVE: aliases could be semi-automated: authors could enter an
optional single-string ``keyword'' for their Article. If
``YYYY/MM/DD/keyword'' doesn't already exist as an alias, then that
becomes an alias to their Article. No keyword means no alias (or an
automatic alias).
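
The semi-automated alternative could be sketched as below. The method
names register and resolve are hypothetical, and returning None when
the alias is already taken is an assumption; the design does not say
what should happen in that case.

```python
import datetime


class AliasManager:
    """IMC-local mapping from ``YYYY/MM/DD/keyword'' aliases to filekeys."""

    def __init__(self):
        self._aliases = {}

    def register(self, filekey, date, keyword=None):
        """Create an alias for filekey; return it, or None if none was made."""
        if not keyword:
            return None           # no keyword means no alias
        alias = "%04d/%02d/%02d/%s" % (
            date.year, date.month, date.day, keyword)
        if alias in self._aliases:
            return None           # assumption: the first claimant keeps it
        self._aliases[alias] = filekey
        return alias

    def resolve(self, alias):
        """Return the filekey for alias, or None if it is unknown."""
        return self._aliases.get(alias)
```

For example, registering the keyword ``kidnapping'' with the date
2002-06-14 yields the alias ``2002/06/14/kidnapping''.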

SUGGESTION: to implement the ``canonical'' linking bit, I would
suggest that if someone types in a URL using an alias, the Active2 web
server should re-direct them to the canonical address and present the
aliased-address as a cut-n-paste option. In this way, people who are
viewing an article can ``add bookmark'' in the usual manner and
bookmark the canonical link, while people who want to send a link to
their friend can cut and paste the ``nice'' link instead. (The other
option is the inverse, but that would lead to most people bookmarking
the alias instead of the canonical link...)
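
A sketch of the redirect logic, independent of any particular web
server. The function name route_article, the (status, location,
nice_link) return shape and serving aliases under /article/ are all
assumptions; only the /article/<filekey> form comes from this document.

```python
def route_article(path, aliases):
    """Decide how to answer a request for path.

    `aliases` maps alias strings to filekeys. Returns a tuple of
    (HTTP status, location to serve or redirect to, alias to display).
    """
    prefix = "/article/"
    if not path.startswith(prefix):
        return (404, None, None)
    name = path[len(prefix):]
    if name in aliases:
        # Redirect to the canonical address, but keep the ``nice'' link
        # so the page can offer it for cut-n-paste.
        return (302, prefix + aliases[name], path)
    # Anything else is treated as a filekey and served as-is.
    return (200, path, None)
```

The browser's address bar then always ends up on the canonical link,
which is what gets bookmarked.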



* FRONT-END

The front end will request things from the middle, which gives
everything to the front-end in the known internal format. The
front-end will consist of two main areas: Python support classes for
the Cheetah templates and the Cheetah templates themselves.

There should be a set of ``base'' Cheetah templates which will be
subclassed into themes/different interfaces.

Since Cheetah templates support multiple-inheritance (just like
Python, not surprisingly) we should be able to hide most of the
messiness from template designers.

Ideally, for example, the base Article templates will have methods
called for each Paragraph (or whatever) which themes can override to
get different appearances.

The layout should look something like this (for the Article
templates):


TemplateBase -> ArticleBase -> RSSArticle  -> RSS1.1Article
UtilityBase  /              -> PDFArticle  -> ClassicPDFArticle
                                           -> UtopiaPDFArticle
                                           -> FunNewStylePDFArticle

                            -> HTMLArticle -> ClassicHTMLArticle
                                           -> UtopiaHTMLArticle
                                           -> FunNewStyleHTMLArticle

There will also need to be templates for Newswires (in the new-sense),
for example, which are collections of Article summaries, plus probably
a bunch of user information and so on.

The namespaces for themes should be worked out so that one may
subclass a theme and replace only those templates one is unhappy
with. That is, if you really like the Classic theme, but don't like
the font used for titles by the PDFArticle template, you should be
able to make a subclass of the Classic theme, and only subclass
PDFArticle and only replace the method which produces titles.
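
The override pattern can be shown with plain Python classes standing
in for the Cheetah templates. HTML is used instead of PDF only because
the output is easy to show; the method names render, render_title and
render_paragraph and the class MyClassicHTMLArticle are hypothetical.

```python
class ArticleBase:
    """Calls one overridable method per piece of the Article."""

    def render(self, title, paragraphs):
        parts = [self.render_title(title)]
        parts.extend(self.render_paragraph(p) for p in paragraphs)
        return "\n".join(parts)

    def render_title(self, title):
        raise NotImplementedError

    def render_paragraph(self, text):
        raise NotImplementedError


class HTMLArticle(ArticleBase):
    # No HTML escaping here; this is only a sketch.
    def render_title(self, title):
        return "<h1>%s</h1>" % title

    def render_paragraph(self, text):
        return "<p>%s</p>" % text


class ClassicHTMLArticle(HTMLArticle):
    """The supplied Classic theme; inherits everything for now."""


class MyClassicHTMLArticle(ClassicHTMLArticle):
    """A local theme that changes only how titles are produced."""

    def render_title(self, title):
        return '<h1 class="fancy">%s</h1>' % title
```

Paragraphs still come out exactly as in the Classic theme; only the
title method was replaced.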

* OVERALL CLASS LAYOUT

Here is what I think should be the overall class layout for the Python
packages. This is not 100% complete. Things under themes/ are just for
the ``supplied'' themes; if individual IMCs want different local
themes, there should be a well-defined directory for those.

This doesn't include any of the web-server stuff which will actually
use the stuff in frontend/ below. WebWare is a strong contender right
now, and subclasses of its various bits should go under frontend/
somewhere; they're just not there yet.

active2/

  backend/
    Interface
    freenet/
      FreenetInterface
    circle/
      CircleInterface
    filesystem/
      FileSystemInterface

  document/
    container/
      Document
      Paragraph
    media/
      Media
      Sound
      Image
      Video
      Audio
    text/
      Text
      Link
      Title
    author/
      Author

  frontend/
    aliasmanager/
      AliasManager
      Alias
    templates/
      base/
        ArticleBase
        NewswireBase
      utility/
        BackendRetrieveMixIn
        BackendInsertMixIn
      article/
        html/
          HTMLArticle
        pdf/
          PDFArticle
        rss/
          RSS10Article
          RSS11Article
    themes/
      classic/
        HTMLArticle
        PDFArticle
        RSSArticle
      utopia/
        HTMLArticle
        PDFArticle
        RSSArticle

  util/
    filekey/
      FileKey
  




Topic revision: r5 - 09 Sep 2002, MikeWarren