Storage Backend
===============

**WARNING:** since we rebuilt our backend, this documentation is deprecated. Check back in a few days for an updated version.

PyPLN can make use of various storage formats for both documents and analytical results. Since PyPLN is built with distributed processing in mind, the configuration of storage backends should, whenever possible, be distributed as well. That is, all databases should be available to workers and sinks as local resources, while also being part of a distributed infrastructure that is equally available to all machines in the cluster. Whenever possible, we use the MongoDB document database to handle storage, due to the simplicity of its deployment and usage in distributed environments.

File Storage
------------

In PyPLN, document storage happens at more than one stage of the pipeline. At the beginning, we have the document files in their original formats, prior to the text extraction phase. To avoid resorting to specific implementations of distributed filesystems, we use MongoDB's GridFS instead, which transparently handles the distribution of files across the cluster without the need for extra configuration.

Document Storage
----------------

To store the raw text versions of the files and their analyses, PyPLN takes advantage of the schemaless nature of MongoDB. MongoDB collections are sets of JSON objects (stored internally in a binary format called BSON), which in our case are constructed gradually. Initially, a document in a MongoDB collection will have only these fields::

    {'_id': ObjectId('...'),
     'meta': {
         'name': 'test.pdf',
         'created_on': '2012-05-29',
     },
     'analysis': {},
    }

When this document passes through a pipeline, each node of the pipeline (we call these nodes "apps") performs some kind of analysis on the document and optionally stores more information. For example:

- the ``extractor`` app extracts text from PDF files;
- the ``tokenizer`` app extracts tokens from the text;
- the ``freqdist`` app computes the frequency distribution of tokens in the text;
- the ``pos`` app does part-of-speech tagging;
- and so on...

.. figure:: _static/default-pipeline.png

   PyPLN's default pipeline

Each of these new pieces of data becomes available in the ``analysis`` dictionary inside the MongoDB document, for example::

    {'_id': ObjectId('...'),
     'meta': {
         'name': 'test.pdf',
         'created_on': '2012-05-29',
         'Author': u'Álvaro Justen',
         'Paper format': 'A4',
     },
     'analysis': {
         'text': 'This is a sentence, this is a test.',
         'tokens': ['This', 'is', 'a', 'sentence', ',', 'this', 'is', 'a',
                    'test', '.'],
         'freqdist': [('a', 2), ('this', 2), ('is', 2), ('sentence', 1),
                      (',', 1), ('.', 1), ('test', 1)],
         'pos': [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'),
                 (',', ','), ('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
                 ('test', 'NN'), ('.', '.')],
     },
    }

In the example above, each app created an entry under the ``analysis`` key of the MongoDB document (``extractor`` created ``text``, ``tokenizer`` created ``tokens``, ``freqdist`` created ``freqdist`` and ``pos`` created ``pos``). An app can also add file meta-information: for example, ``extractor`` can add ``document['meta']['Author']`` and ``document['meta']['Paper format']``, since this information is available in the PDF metadata and can be used for further analysis. The interface between the pipeline and the app must provide a way for the app to declare which keys will be (optionally) created in ``document['analysis']`` and ``document['meta']``.
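
To make the flow above concrete, here is a minimal sketch using pymongo and gridfs directly. The connection settings, collection name and the inline "tokenizer" are hypothetical illustrations, not PyPLN's actual worker API: it stores an original file in GridFS, creates the initial document with an empty ``analysis`` dictionary, and then updates that dictionary the way an app would::

    from datetime import date

    import gridfs
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)   # hypothetical connection settings
    db = client['pypln']
    fs = gridfs.GridFS(db)                     # GridFS stores the original files
    documents = db['documents']                # hypothetical collection name

    # 1. Store the original file in GridFS and create the initial document,
    #    containing only metadata and an empty 'analysis' dictionary.
    with open('test.pdf', 'rb') as pdf:
        file_id = fs.put(pdf, filename='test.pdf')

    doc_id = documents.insert_one({
        'meta': {'name': 'test.pdf',
                 'created_on': str(date.today()),
                 'file_id': file_id},
        'analysis': {},
    }).inserted_id

    # 2. Each app reads what it needs and writes its result under 'analysis'.
    #    Here a stand-in "tokenizer" adds the 'tokens' key.
    doc = documents.find_one({'_id': doc_id})
    text = doc['analysis'].get('text', '')     # produced earlier by 'extractor'
    tokens = text.split()                      # placeholder for a real tokenizer
    documents.update_one({'_id': doc_id},
                         {'$set': {'analysis.tokens': tokens}})

Because each update only sets its own keys under ``analysis``, several apps running on different machines can write their results to the same document without overwriting each other.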