:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Batch processing means processing CSV files in order to mass-create,
mass-remove, or mass-update data. You feed CSV files to importers or
processors, which are part of the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.

Output
------

The results of processing are written to logfiles. In addition, a new
CSV file is created during processing, containing only those data sets
that could not be processed.

This new CSV file is named after the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, a file
'foo.csv.create.pending' will be generated (if the operation mode was
'create'). The .pending file is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.

It looks like this::

      -----+      +---------+
     /     |      |         |              +------+
    | .csv +----->|Batch-   |              |      |
    |      |      |processor+----changes-->| ZODB |
    |  +------+   |         |              |      |
    +--|      |   |         +              +------+
       | Mode +-->|         |                 -------+
       |      |   |         +----outputs-+-> /       |
       |      |   +---------+            |  |.pending|
       +------+        ^                 |  |        |
                       |                 |  +--------+
                 +-----++                v
                 |Inter-|              -----+
                 |face  |             /     |
                 +------+            | .msg |
                                     |      |
                                     +------+

Creating a batch processor
==========================

We create our own batch processor for our own data type. This data
type must be based on an interface that the batcher can use for
converting data.


Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...     """A cave."""
    ...     name = schema.TextLine(
    ...         title = u'Cave name',
    ...         default = u'Unnamed',
    ...         required = True)
    ...     dinoports = schema.Int(
    ...         title = u'Number of DinoPorts (tm)',
    ...         required = False,
    ...         default = 1)
    ...     owner = schema.TextLine(
    ...         title = u'Owner name',
    ...         required = True,
    ...         missing_value = 'Fred Estates Inc.')
    ...     taxpayer = schema.Bool(
    ...         title = u'Pays taxes',
    ...         required = True,
    ...         default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...     grok.implements(ICave)
    ...     def __init__(self, name=u'Unnamed', dinoports=2,
    ...                  owner='Fred Estates Inc.', taxpayer=False):
    ...         self.name = name
    ...         self.dinoports = dinoports
    ...         self.owner = owner
    ...         self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place to put all the lovely caves we
want to sell.

As a replacement for a real site, we define a place where all caves
can be stored: Stoneville! This is a lovely place for upperclass
cavemen (who are the only ones that can afford more than one
dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.


Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...     util_name = 'caveprocessor'
    ...     grok.name(util_name)
    ...     name = 'Cave Processor'
    ...     iface = ICave
    ...     location_fields = ['name']
    ...     factory_name = 'Lovely Cave'
    ...
    ...     def parentsExist(self, row, site):
    ...         return True
    ...
    ...     def getParent(self, row, site):
    ...         return stoneville
    ...
    ...     def entryExists(self, row, site):
    ...         return row['name'] in stoneville
    ...
    ...     def getEntry(self, row, site):
    ...         if not self.entryExists(row, site):
    ...             return None
    ...         return stoneville[row['name']]
    ...
    ...     def delEntry(self, row, site):
    ...         del stoneville[row['name']]
    ...
    ...     def addEntry(self, obj, row, site):
    ...         stoneville[row['name']] = obj
    ...
    ...     def updateEntry(self, obj, row, site):
    ...         for key, value in row.items():
    ...             setattr(obj, key, value)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                    ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob')
    (4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to
perform: ``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

We can, by the way, see the results of our run in a logfile which is
named ``newcomers.csv.create.msg``:

    >>> print open('newcomers.csv.create.msg').read()
    Source: newcomers.csv
    Mode: create
    Date: ...
    User: Bob
    Failed datasets: newcomers.csv.create.pending
    Processing time: ... s (... s/item)
    Processed: 4 lines (4 successful/ 0 failed)

As we can see, the processing was successful. Had it failed, the
problems would be listed here, as we can see if we run the same
operation again:

    >>> processor.doImport('newcomers.csv',
    ...                    ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob')
    (4, 4)

The log file will tell us this in more detail:

    >>> print open('newcomers.csv.create.msg').read()
    Source: newcomers.csv
    Mode: create
    Date: ...
    User: Bob
    Failed datasets: newcomers.csv.create.pending
    Processing time: ... s (... s/item)
    Processed: 4 lines (0 successful/ 4 failed)

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open('newcomers.csv.create.pending').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).

Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

Invalid values are spotted as well:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter, because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.csv.create.pending')
    >>> os.unlink('newcomers.csv.create.msg')
    >>> os.unlink('newcomers.csv.remove.msg')
    >>> os.unlink('newcomers.csv.update.msg')
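
The file naming and the retry workflow described in this document can
be sketched in plain standard-library Python. This is only an
illustration, not part of ``waeup.utils.batching``: the names
``sidecar_names``, ``split_pending`` and ``ERROR_COLUMN`` are
hypothetical helpers, and the sketch assumes that a .pending file has
exactly the layout shown in the listing further up (the input columns
plus one ``--ERRORS--`` column).

```python
import csv
import io

# Column name as it appears in the .pending listing of this document.
ERROR_COLUMN = '--ERRORS--'


def sidecar_names(csv_path, mode):
    """Return (pending_path, log_path) for an input file and a mode.

    Hypothetical helper: it only mirrors the naming scheme described in
    the text, e.g. 'foo.csv' + 'create' -> 'foo.csv.create.pending'.
    """
    base = '%s.%s' % (csv_path, mode)
    return base + '.pending', base + '.msg'


def split_pending(stream):
    """Split a .pending CSV into retryable rows and error messages.

    Returns (fieldnames, rows, errors). The rows no longer carry the
    error column, so they could be corrected and written out again as
    a fresh input file.
    """
    reader = csv.DictReader(stream)
    fieldnames = [name for name in (reader.fieldnames or [])
                  if name != ERROR_COLUMN]
    rows, errors = [], []
    for row in reader:
        # Remove the error message from the row and keep it separately.
        errors.append(row.pop(ERROR_COLUMN, ''))
        rows.append(row)
    return fieldnames, rows, errors


# Sample data copied from the .pending listing above (first row only).
sample = io.StringIO(
    u"owner,name,taxpayer,dinoports,--ERRORS--\n"
    u"Barney,Barneys Home,1,2,This object already exists. Skipping.\n"
)
fieldnames, rows, errors = split_pending(sample)
print(sidecar_names('newcomers.csv', 'create'))
print(fieldnames)
print(errors)
```

Keeping the error messages apart from the rows means the corrected
rows can be fed to the processor again without the extra column
getting in the way of the header check.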