:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Batch processing means reading CSV files in order to mass-create,
mass-remove, or mass-update data.

CSV files are fed to importers (also called processors), which are part
of the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows where inside a site (University) to store, remove, or update
  the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.

Output
------

The results of processing are written to log files. In addition, a new
CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named after the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, a file
'foo.csv.create.pending' is generated (if the operation mode was
'create'). The .pending file is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` that lists the reasons for
the processing failures.

The overall data flow looks like this::
 
     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |      |   +---------+            |  |.pending|
      +------+   ^                      |  |        |
                 |                      |  +--------+
           +-----++                     v
           |Inter-|                  -----+
           |face  |                 /     |
           +------+                | .msg |
                                   |      |
                                   +------+

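The naming rule for the ``.pending`` file and its extra error column can
be sketched with the standard ``csv`` module. This is only an
illustration written for Python 3, not the code used by
``waeup.utils.batching``; the helper names ``pending_path`` and
``write_pending`` are made up for this example.

```python
import csv
import io


def pending_path(csv_path, mode):
    """Build the name of the failed-rows file for an input file and mode."""
    return '%s.%s.pending' % (csv_path, mode)


def write_pending(stream, fieldnames, failed_rows):
    """Write failed rows plus an ``--ERRORS--`` column to ``stream``.

    ``failed_rows`` is a list of ``(row_dict, error_message)`` pairs.
    """
    writer = csv.DictWriter(
        stream,
        fieldnames=list(fieldnames) + ['--ERRORS--'],
        lineterminator='\n')
    writer.writeheader()
    for row, message in failed_rows:
        out = dict(row)               # copy, so the caller's row is untouched
        out['--ERRORS--'] = message
        writer.writerow(out)


# Hypothetical usage with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_pending(
    buffer,
    ['name', 'owner'],
    [({'name': 'Barneys Home', 'owner': 'Barney'},
      'This object already exists. Skipping.')])
pending_text = buffer.getvalue()
```

Writing to a stream rather than a path keeps the sketch free of side
effects; a real processor would open the path returned by
``pending_path`` instead.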

Creating a batch processor
==========================

We create our own batch processor for our own data type. This data type
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything is in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv', 
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob')
    (4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

When calling the processor, we passed more than just the path to the
CSV file. What does it all mean?

The first argument is the path to the CSV file. The second is an
ordered list of header names. These replace the header field names
that are actually in the file, so we can override faulty headers.
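How a list of header names can replace the header line stored in a file
is sketched below with the standard ``csv`` module (Python 3). This is
an assumption about the general technique, not a copy of what
``doImport`` does internally; ``read_rows`` is a name invented for the
example.

```python
import csv
import io

csv_text = (
    'nmae,dinoprts,owner\n'      # misspelled header line in the file
    'Barneys Home,2,Barney\n'
)


def read_rows(stream, headerfields):
    """Yield row dicts keyed by ``headerfields``, ignoring the file's header."""
    reader = csv.DictReader(stream, fieldnames=headerfields)
    next(reader)  # skip the header line that is stored in the file
    for row in reader:
        yield row


rows = list(read_rows(io.StringIO(csv_text), ['name', 'dinoports', 'owner']))
```

Because the field names come from the caller, the typos in the file's
own header line never reach the rows.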

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove``.

Finally, the ``user`` parameter is optional and only used for logging.

We can, by the way, see the results of our run in a logfile which is
named ``newcomers.csv.create.msg``:

    >>> print open('newcomers.csv.create.msg').read()
    Source: newcomers.csv
    Mode: create
    Date: ...
    User: Bob
    Failed datasets: newcomers.csv.create.pending
    Processing time: ... s (... s/item)
    Processed: 4 lines (4 successful/ 0 failed)
    <BLANKLINE>

As we can see, the processing was successful. Had there been problems,
they would be reported here, as we can see when we run the same
operation again:

    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob')
    (4, 4)

The log file will tell us this in more detail:

    >>> print open('newcomers.csv.create.msg').read()
    Source: newcomers.csv
    Mode: create
    Date: ...
    User: Bob
    Failed datasets: newcomers.csv.create.pending
    Processing time: ... s (... s/item)
    Processed: 4 lines (0 successful/ 4 failed)

This time a new file was created. It keeps all the rows we could not
process, plus an additional column with error messages:

    >>> print open('newcomers.csv.create.pending').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and retry afterwards
without the already processed rows getting in the way.

We also notice that the values of the taxpayer column are returned as
they appear in the input file. There we wrote '1' for ``True`` and '0'
for ``False`` (which is accepted by the converters).
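A converter that accepts '1' and '0' as boolean values could look
roughly like the following Python 3 sketch. The function ``to_bool`` is
hypothetical, and the additional spellings it accepts ('true', 'yes',
'false', 'no') are assumptions made for the example; the text above only
confirms '1' and '0'.

```python
def to_bool(text):
    """Convert a CSV cell to ``True`` or ``False``.

    Raises ``ValueError`` for anything that is not a known spelling.
    """
    value = text.strip().lower()
    if value in ('1', 'true', 'yes'):      # assumed set of truthy spellings
        return True
    if value in ('0', 'false', 'no'):      # assumed set of falsy spellings
        return False
    raise ValueError('Not a boolean value: %r' % text)


taxpayer_flags = [to_bool(cell) for cell in ['1', '1', '0', '0']]
```

The original string stays untouched in the row, which is why a
``.pending`` file can show the value exactly as it was written in the
input.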


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

Invalid values are spotted as well:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

We can also update only some columns and leave others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!
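A header check of this kind can be sketched in a few lines of Python 3.
The exception class below is a local stand-in defined only for this
example, and ``check_headers`` is a hypothetical helper; neither is
taken from ``waeup.utils.batching``.

```python
class FatalCSVError(Exception):
    """Local stand-in for the exception shown above (sketch only)."""


def check_headers(headerfields, location_fields):
    """Raise ``FatalCSVError`` unless all location fields are present."""
    missing = [name for name in location_fields if name not in headerfields]
    if missing:
        required = ', '.join("'%s'" % name for name in location_fields)
        raise FatalCSVError('Need at least columns %s for import!' % required)


check_headers(['name', 'owner'], ['name'])            # passes silently

try:
    check_headers(['dinoports', 'owner'], ['name'])   # 'name' is missing
except FatalCSVError as exc:
    error_message = str(exc)
```

Checking the headers before touching any row means that a file without
its location fields is rejected as a whole instead of failing row by
row.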

We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True
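The treatment of empty cells for a non-required integer field can be
illustrated with a small Python 3 sketch. The function
``to_optional_int`` and its ``required`` flag are assumptions made for
this example and only mimic the behaviour shown above.

```python
def to_optional_int(text, required=False):
    """Convert a CSV cell to ``int``; an empty cell becomes ``None``.

    Raises ``ValueError`` for non-numeric text, or for an empty cell
    when the field is required.
    """
    if text == '':
        if required:
            raise ValueError('Required value is missing')
        return None
    return int(text)


converted = [to_optional_int(cell) for cell in ['2', '', '110']]
```

With this rule, ``NOT-A-NUMBER`` still fails the conversion, while an
empty cell is a valid way to clear an optional field.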

Removing entries
----------------

In 'remove' mode we can delete entries. Here validity of values in
non-location fields doesn't matter because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.
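The three modes used in this walkthrough can be summarized in a small,
self-contained Python 3 sketch that works on a plain dictionary. It is
a simplified model, not the implementation of ``BatchProcessor``: the
function ``process`` is hypothetical, it skips value conversion and
logging, and it only mirrors the ``(processed, failed)`` result pairs
seen above.

```python
def process(rows, store, mode, location_field='name'):
    """Apply ``rows`` to ``store`` and return ``(processed, failed)``."""
    if mode not in ('create', 'update', 'remove'):
        raise ValueError('Unknown mode: %r' % mode)
    processed = failed = 0
    for row in rows:
        processed += 1
        key = row[location_field]
        exists = key in store
        if mode == 'create' and not exists:
            store[key] = dict(row)
        elif mode == 'update' and exists:
            store[key].update(row)
        elif mode == 'remove' and exists:
            del store[key]      # other columns of the row are ignored
        else:
            failed += 1         # duplicate on create, missing otherwise
    return processed, failed


caves = {}
newcomers = [{'name': 'Barneys Home', 'owner': 'Barney'},
             {'name': 'Wilmas Asylum', 'owner': 'Wilma'}]
first_run = process(newcomers, caves, 'create')
second_run = process(newcomers, caves, 'create')
removed = process([{'name': 'Wilmas Asylum'}], caves, 'remove')
```

Here ``first_run`` is ``(2, 0)``, ``second_run`` is ``(2, 2)`` because
both entries already exist, and ``removed`` is ``(1, 0)``.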


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.csv.create.pending')
    >>> os.unlink('newcomers.csv.create.msg')
    >>> os.unlink('newcomers.csv.remove.msg')
    >>> os.unlink('newcomers.csv.update.msg')