WAeUP Data Center
*****************

The WAeUP data center takes care of managing CSV files and importing them.

:Test-Layer: unit

Creating a data center
======================

A data center can be created easily:

>>> from waeup.sirp.datacenter import DataCenter
>>> mydatacenter = DataCenter()
>>> mydatacenter
<waeup.sirp.datacenter.DataCenter object at 0x...>

Each data center has a location in the file system where files are
stored:

>>> storagepath = mydatacenter.storage
>>> storagepath
'/.../waeup/sirp/files'

Managing the storage path
-------------------------

We can set another storage path:

>>> import os
>>> os.mkdir('newlocation')
>>> newpath = os.path.abspath('newlocation')
>>> mydatacenter.setStoragePath(newpath)
[]

The result is a list of filenames that could not be copied. Luckily,
this list is empty.

When we set a new storage path, we can tell the data center to move
all files from the old location to the new one. To see this feature
in action, we first have to put a file into the current location:

>>> open(os.path.join(newpath, 'myfile.txt'), 'wb').write('hello')

Now we can set a new location and the file will be copied:

>>> verynewpath = os.path.abspath('verynewlocation')
>>> os.mkdir(verynewpath)
>>> mydatacenter.setStoragePath(verynewpath, move=True)
[]
>>> storagepath = mydatacenter.storage
>>> 'myfile.txt' in os.listdir(verynewpath)
True

We remove the created file to have a clean testing environment for
the upcoming examples:

>>> os.unlink(os.path.join(storagepath, 'myfile.txt'))

Uploading files
===============

We can get a list of files stored in that location:

>>> mydatacenter.getFiles()
[]

Let's put a file into the storage:

>>> import os
>>> filepath = os.path.join(storagepath, 'data.csv')
>>> open(filepath, 'wb').write('Some Content\n')

Now we can find the file:

>>> mydatacenter.getFiles()
[<waeup.sirp.datacenter.DataCenterFile object at 0x...>]

As we can see, the actual file is wrapped by a convenience wrapper
that enables us to fetch some data about the file. The data returned
is formatted as strings, so that it can easily be put into output
pages:

>>> datafile = mydatacenter.getFiles()[0]
>>> datafile.getSize()
'13 bytes'
>>> datafile.getDate() # Nearly current datetime...
'...'

Clean up:

>>> import shutil
>>> shutil.rmtree(newpath)
>>> shutil.rmtree(verynewpath)

Distributing processed files
============================

When files have been processed by a batch processor, we can put the
resulting files into the desired destinations.

We recreate the datacenter root in case it is missing:

>>> import os
>>> dc_root = mydatacenter.storage
>>> fin_dir = os.path.join(dc_root, 'finished')
>>> unfin_dir = os.path.join(dc_root, 'unfinished')

>>> def recreate_dc_storage():
...     if os.path.exists(dc_root):
...         shutil.rmtree(dc_root)
...     os.mkdir(dc_root)
...     mydatacenter.setStoragePath(mydatacenter.storage)
>>> recreate_dc_storage()

We define a function that creates a set of faked result files:

>>> import os
>>> import tempfile
>>> def create_fake_results(source_basename, create_pending=True):
...     tmp_dir = tempfile.mkdtemp()
...     src = os.path.join(dc_root, source_basename)
...     pending_src = None
...     if create_pending:
...         pending_src = os.path.join(tmp_dir, 'mypendingsource.csv')
...     finished_src = os.path.join(tmp_dir, 'myfinishedsource.csv')
...     for path in (src, pending_src, finished_src):
...         if path is not None:
...             open(path, 'wb').write('blah')
...     return tmp_dir, src, finished_src, pending_src
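To give an idea of what to expect from the distribution step, here is a
rough sketch of where the three files end up, based only on the two
simple cases shown below (a successful run and a failed run of a fresh
source). It is *not* the real implementation, and the extra bookkeeping
done for retried ``.pending`` sources is left out::

    import os, shutil

    def sketch_dist(successful, src, finished_src, pending_src, storage):
        # Illustration only -- the real method is DataCenter.distProcessedFiles.
        basename = os.path.basename(src)            # e.g. 'mysource.csv'
        stem = basename.rsplit('.csv', 1)[0]        # 'mysource'
        if successful:
            # the fully processed source goes to finished/
            shutil.move(src, os.path.join(storage, 'finished', basename))
        else:
            # the failed source goes to unfinished/, rows still to be
            # processed stay in the storage root as <stem>.pending.csv
            shutil.move(src, os.path.join(storage, 'unfinished', basename))
            shutil.move(pending_src,
                        os.path.join(storage, stem + '.pending.csv'))
        # the collected result rows always end up in finished/
        shutil.move(finished_src,
                    os.path.join(storage, 'finished', stem + '.finished.csv'))
        # the temporary dir holding the result files is removed as well
        shutil.rmtree(os.path.dirname(finished_src))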
Now we can create the set of result files that typically come after a
successful processing of a regular source.

Now we can try to distribute those files. Let's start with a source
file that was processed successfully:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.csv', create_pending=False)
>>> mydatacenter.distProcessedFiles(True, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.csv', 'mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
[]

The temporary dir created by `create_fake_results` is removed for us
by the datacenter. This way we can be sure that fewer temporary dirs
are left hanging around:

>>> os.path.exists(tmp_dir)
False

The root dir contains no input files anymore, while the original file
and the file containing all processed data were moved to 'finished/'.

Now we restart, but this time we fake an erroneous action:

>>> recreate_dc_storage()
>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'mysource.pending.csv', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
['mysource.csv']

While the original source was moved to the 'unfinished' dir, the
pending file went to the root and the set of already processed items
is stored in 'finished/'.

We fake processing the pending file and assume that everything went
well this time:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv', create_pending=False)
>>> mydatacenter.distProcessedFiles(True, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.csv', 'mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
[]

The result is the same as in the first case shown above.

We restart again, but this time we fake several non-working imports
in a row. We start with a faulty initial import:

>>> recreate_dc_storage()
>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)

We try to process the pending file, which fails again:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)

We try to process the new pending file, which fails once more:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'mysource.pending.csv', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
['mysource.csv']

Finally, we process the pending file and everything works:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv', create_pending=False)
>>> mydatacenter.distProcessedFiles(True, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.csv', 'mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
[]

The root dir contains no input files and only the files in the
'finished' subdirectory remain.

Clean up:

>>> shutil.rmtree(verynewpath)
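Put together, a batch run that uses this distribution scheme could
look roughly like the following sketch. ``run_import`` is a made-up
stand-in for an actual batch processor; only ``distProcessedFiles``
and the ``storage`` attribute are datacenter API used above::

    src = os.path.join(mydatacenter.storage, 'mysource.csv')
    while True:
        # run_import is hypothetical; it is assumed to return a success
        # flag and the paths of the 'finished' and 'pending' result files.
        success, finished_src, pending_src = run_import(src)
        mydatacenter.distProcessedFiles(success, src, finished_src,
                                        pending_src)
        if success:
            break
        # failed rows were put into <root>/mysource.pending.csv -- retry those
        src = os.path.join(mydatacenter.storage, 'mysource.pending.csv')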
Handling imports
================

Data centers can find objects ready for CSV imports and associate
appropriate importers with them.

Getting importers
-----------------

To do so, data centers look up their parents for the nearest ancestor
that implements `ICSVDataReceivers` and grab all attributes that
provide some importer.

We therefore have to set up a proper scenario first. We start by
creating a simple thing that is ready for receiving CSV data:

>>> class MyCSVReceiver(object):
...     pass

Then we create a container for such CSV receivers:

>>> import grok
>>> from waeup.sirp.interfaces import ICSVDataReceivers
>>> from waeup.sirp.datacenter import DataCenter
>>> class SomeContainer(grok.Container):
...     grok.implements(ICSVDataReceivers)
...     def __init__(self):
...         self.some_receiver = MyCSVReceiver()
...         self.other_receiver = MyCSVReceiver()
...         self.datacenter = DataCenter()

By implementing `ICSVDataReceivers`, a pure marker interface, we
indicate that we want instances of this class to be searched for CSV
receivers.

This root container has two CSV receivers. The datacenter is also an
attribute of our root container.

Before we can go into action, we also need an importer that is able
to import data into instances of `MyCSVReceiver`:

>>> from waeup.sirp.csvfile.interfaces import ICSVFile
>>> from waeup.sirp.interfaces import IWAeUPCSVImporter
>>> from waeup.sirp.utils.importexport import CSVImporter
>>> class MyCSVImporter(CSVImporter):
...     grok.adapts(ICSVFile, MyCSVReceiver)
...     grok.provides(IWAeUPCSVImporter)
...     datatype = u'My Stuff'
...     def doImport(self, filepath, clear_old_data=True,
...                  overwrite=True):
...         print "Data imported!"

We grok the components to get the importer (which is actually an
adapter) registered with the component architecture:

>>> grok.testing.grok('waeup')
>>> grok.testing.grok_component('MyCSVImporter', MyCSVImporter)
True

Now we can create an instance of `SomeContainer`:

>>> mycontainer = SomeContainer()

As we are not creating real sites and the objects are 'placeless'
from the ZODB point of view, we fake a location by telling the
datacenter that its parent is the container:

>>> mycontainer.datacenter.__parent__ = mycontainer
>>> datacenter = mycontainer.datacenter

When a datacenter is stored in the ZODB, this step will happen
automatically.

Before we can go on, we have to set a usable path where we can store
files without doing harm:

>>> os.mkdir('filestore')
>>> filestore = os.path.abspath('filestore')
>>> datacenter.setStoragePath(filestore)
[]

Furthermore we must create a file for possible import, as we will
only get importers for which an importable file is also available:

>>> import os
>>> filepath = os.path.join(datacenter.storage, 'mydata.csv')
>>> open(filepath, 'wb').write("""col1,col2
... 'ATerm','Something'
... """)

The datacenter is now able to find the CSV receivers in its parents:

>>> datacenter.getImporters()
[<MyCSVImporter object at 0x...>, <MyCSVImporter object at 0x...>]

Imports with the WAeUP portal
-----------------------------

The examples above look complicated, but this is the price for
modularity. If you create a new container type, you can define an
importer and it will be used automatically by other components.

In the WAeUP portal the only component that actually provides CSV
data importables is the `University` object.

Getting imports (not: importers)
--------------------------------

We can get 'imports':

>>> datacenter.getPossibleImports()
[(<...DataCenterFile object at 0x...>, [(<MyCSVImporter object at 0x...>, '...'), (<MyCSVImporter object at 0x...>, '...')])]

As we can see, an import is defined here as a tuple of a
DataCenterFile and a list of available importers together with an
associated data receiver (the thing the data should go to). The data
receiver is given as a ZODB object id (if the data receiver is
persistent) or a simple id (if it is not).
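A caller that wants to actually trigger one of these imports could
unpack the structure like this. The sketch below is not executed here;
it only assumes the `MyCSVImporter.doImport` signature defined above
and builds the file path from the storage directory, since asking a
`DataCenterFile` for its path is not shown in this document::

    for datafile, importers in datacenter.getPossibleImports():
        # each entry pairs a stored file with (importer, receiver id) tuples
        for importer, receiver_id in importers:
            # path of the file created above
            importer.doImport(os.path.join(datacenter.storage, 'mydata.csv'))
            # prints "Data imported!" for the MyCSVImporter defined above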
Clean up:

>>> import shutil
>>> shutil.rmtree(filestore)

Data center helpers
===================

Data centers provide several helper methods to make their usage more
convenient.

Receivers and receiver ids
--------------------------

As already mentioned above, imports are defined as triples containing

* a file to import,

* an importer to do the import and

* an object which should be updated by the data file.

The latter is normally some kind of container, like a faculty
container or similar. This is what we call a ``receiver``, as it
receives the data from the file via the importer.

The datacenter finds receivers by looking up its parents for a
component that implements `ICSVDataReceivers` and scanning that
component for attributes that can be adapted to `ICSVImporter`.

I.e., once an `ICSVDataReceivers` parent is found, the datacenter
gets all importers that can be applied to attributes of this
component. For each attribute there can be at most one importer.

When building the importer list for a certain file, we also check
that the headers of the file comply with what the respective
importers expect. So, if a file contains broken headers, the file
won't be offered for import at all.

The contexts of the found importers then build our list of available
receivers. This also means that for each receiver provided by the
datacenter there is an importer available. If no importer can be
found for a potential receiver, this receiver will be skipped.

As one type of importer might be able to serve several receivers, we
also have to provide a unique id for each receiver. This is where
``receiver ids`` come into play.

Receiver ids of objects are determined as

* the ZODB oid of the object if the object is persistent,

* the result of id(obj) otherwise.

The value obtained this way is a long integer which we turn into a
string. If the value was derived from the ZODB oid, we also prepend
it with a ``z`` to avoid any clash with ids of non-ZODB objects (they
might deliver the same id, although this is *very* unlikely).
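A rough sketch of how such an id can be derived is shown below. The
helper name is made up and the real datacenter code may compute the
value differently; ``ZODB.utils.u64`` converts an 8-byte oid into an
integer::

    from ZODB.utils import u64

    def receiver_id(obj):
        # Illustrative helper, not part of the datacenter API.
        oid = getattr(obj, '_p_oid', None)
        if oid is not None:
            # persistent object: use its ZODB oid, marked with a 'z' prefix
            return 'z%s' % u64(oid)
        # non-persistent object: fall back to the plain Python id
        return str(id(obj))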