source: main/waeup.kofa/trunk/src/waeup/kofa/doctests/batchprocessing.txt @ 15694

Last change on this file since 15694 was 12946, checked in by Henrik Bettermann, 10 years ago

Rename doctests again.

Batch Processing
****************

Batch processing is much more than pure data import.

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to processors that are part of
the batch-processing mechanism.

Processors
----------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates log entries (optional).

* creates CSV files containing the successfully and unsuccessfully
  processed data, respectively.

Output
------

The results of processing are written to a logger, if one was
given. Besides this, new CSV files are created during processing:

* a pending CSV file, containing datasets that could not be processed.

* a finished CSV file, containing datasets successfully processed.

The pending file is not created if everything works fine. The
respective path returned in that case is ``None``.

The pending file (if created) is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.

The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directories afterwards (the
datacenter's distProcessedFiles() method takes care of that).

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |  +----+->+---------+            |  |.pending|
      +--|Log |  ^                      |  |        |
         +----+  |                      |  +--------+
           +-----++                     v
           |Inter-|                  ----------+
           |face  |                 /          |
           +------+                | .finished |
                                   |           |
                                   +-----------+


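The split into a finished and a pending file described under Output can
be sketched with the Python 3 standard library alone. This is only an
illustration, not the code used by the batch-processing mechanism: the
function name ``split_results``, the file names and the ``process``
callable are hypothetical, and the shape of the returned tuple (rows
processed, rows failed, finished path, pending path or ``None``) is an
assumption made for the sketch.

```python
import csv
import os
import tempfile


def split_results(rows, fieldnames, process):
    """Process rows; write successes and failures to separate CSV files.

    ``process`` is a hypothetical callable that raises ``ValueError``
    (with the reason as message) when a row cannot be handled.
    """
    # Fresh temporary directory; the caller has to remove it afterwards.
    tmpdir = tempfile.mkdtemp()
    finished_path = os.path.join(tmpdir, 'data.finished.csv')
    pending_path = os.path.join(tmpdir, 'data.pending.csv')
    finished, pending = [], []
    for row in rows:
        try:
            process(row)
            finished.append(row)
        except ValueError as err:
            # Keep the failed row and append the reason in an extra column.
            failed_row = dict(row)
            failed_row['--ERRORS--'] = str(err)
            pending.append(failed_row)
    with open(finished_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, list(fieldnames))
        writer.writeheader()
        writer.writerows(finished)
    if not pending:
        # No failures: no pending file is written, its path is None.
        return len(rows), 0, finished_path, None
    with open(pending_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, list(fieldnames) + ['--ERRORS--'])
        writer.writeheader()
        writer.writerows(pending)
    return len(rows), len(pending), finished_path, pending_path
```

As with the real processors, the temporary directory stays on disk, so
a caller of this sketch would remove
``os.path.dirname(finished_path)`` when done.
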
Creating a Batch Processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.kofa.utils.batching import BatchProcessor
    >>> from waeup.kofa.interfaces import IGNORE_MARKER
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site, filename):
    ...     # This is not strictly necessary, as the default
    ...     # updateEntry method does exactly the same
    ...     for key, value in row.items():
    ...       if value != IGNORE_MARKER:
    ...         setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    >>> result
    (4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore, we get the path to a CSV file with successfully
processed entries and the path to a CSV file with erroneous entries.
As everything went well, the latter is ``None``. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

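The idea of replacing the header names found in the file with an
ordered list supplied by the caller can be illustrated with the
standard ``csv`` module (Python 3). This is a minimal sketch of the
technique, not the implementation behind ``doImport``; the helper name
``read_with_headers`` and the misspelled header in the sample data are
made up for the example.

```python
import csv
import io


def read_with_headers(csv_text, headerfields):
    """Read CSV text, using ``headerfields`` instead of the file's header."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=headerfields)
    # With explicit fieldnames the first line is treated as data,
    # so we skip the header line that is actually in the file.
    next(reader)
    return [dict(row) for row in reader]


# The file says 'nmae' (a faulty header); we override it with 'name'.
rows = read_with_headers(
    "nmae,dinoports\nBarneys Home,2\n", ['name', 'dinoports'])
```

Here ``rows`` holds one dictionary keyed by the supplied names
``name`` and ``dinoports``; the values are still strings at this
point.
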
We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    processed: newcomers.csv, create mode, 4 lines (4 successful/ 0 failed), ... s (... s/item)


We clean up the temporary dir created by doImport():

    >>> import shutil
    >>> import os
    >>> shutil.rmtree(os.path.dirname(result[2]))

As we can see, the processing was successful. Had there been problems,
they would be reported here, as we can see if we run the same
operation again:

    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    >>> result
    (4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    processed: newcomers.csv, create mode, 4 lines (4 successful/ 0 failed), ... s (... s/item)
    processed: newcomers.csv, create mode, 4 lines (0 successful/ 4 failed), ... s (... s/item)


This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open(result[3]).read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists.
    Wilma,Wilmas Asylum,1,1,This object already exists.
    Fred,Freds Dinoburgers,0,10,This object already exists.
    Joey,Joeys Drive-in,0,110,This object already exists.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).

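How string values from a CSV row might be turned into typed values,
while the raw strings stay untouched for the pending file, can be
sketched as follows. This is a stand-alone illustration that does not
use the real converters: the names ``to_bool``, ``CONVERTERS`` and
``convert_row`` are hypothetical, and accepting ``true``/``false`` in
addition to ``1``/``0`` is an assumption of the sketch.

```python
def to_bool(value):
    """Convert a CSV string to a boolean ('1'/'0' as in the text above)."""
    text = value.strip().lower()
    if text in ('1', 'true'):    # 'true' is an assumption of this sketch
        return True
    if text in ('0', 'false'):   # 'false' is an assumption of this sketch
        return False
    raise ValueError('not a boolean: %r' % value)


# One converter per field, mirroring the field types of ICave.
CONVERTERS = {
    'name': str,
    'dinoports': int,
    'owner': str,
    'taxpayer': to_bool,
}


def convert_row(row, converters=CONVERTERS):
    """Return (converted values, error messages) for one raw CSV row."""
    converted, errors = {}, []
    for key, raw in row.items():
        try:
            converted[key] = converters[key](raw)
        except ValueError as err:
            # The raw row is not modified; we only record what went wrong.
            errors.append('%s: %s' % (key, err))
    return converted, errors
```

Because ``convert_row`` never changes its input, a failed row can be
written back exactly as it was read, which matches the behaviour seen
in the pending file above.
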
Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We can also tell the processor to ignore some columns from the input by
passing ``--IGNORE--`` as the column name:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             '--IGNORE--', '--IGNORE--'],
    ...                    mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

If something goes wrong during processing, the respective ``--IGNORE--``
columns won't appear in the resulting pending file:

    >>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
    ...                             '--IGNORE--', '--IGNORE--'],
    ...                    mode='create', user='Bob')
    >>> result
    (4, 4, '...', '...')

    >>> print open(result[3], 'rb').read()
    name,dinoports,--ERRORS--
    Barneys Home,2,This object already exists.
    Wilmas Asylum,1,This object already exists.
    Freds Dinoburgers,10,This object already exists.
    Joeys Drive-in,110,This object already exists.


Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


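Dropping ignored columns amounts to pairing the supplied header names
with the values of a line and skipping every pair whose name is the
ignore marker. The following sketch shows that idea in isolation; the
helper ``drop_ignored`` is hypothetical and not part of the
batch-processing API.

```python
IGNORE = '--IGNORE--'  # column name used in the examples above


def drop_ignored(headerfields, values):
    """Pair header names with values, leaving out ignored columns."""
    return {
        name: value
        for name, value in zip(headerfields, values)
        if name != IGNORE
    }


row = drop_ignored(
    ['name', 'dinoports', IGNORE, IGNORE],
    ['Barneys Home', '2', 'Barney', '1'])
```

Only ``name`` and ``dinoports`` survive in ``row``, which is why a
pending file built from such rows has no owner or taxpayer column.
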
Updating Entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We start the processor:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                    'dinoports', 'owner'], mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 1, '...', '...')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                             mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.owner
    u'Barney'

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

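A header check of this kind can be sketched in a few lines: before any
row is touched, the supplied header names are compared with the
location fields, and processing is aborted if one is missing. The
exception class below is a local stand-in defined for the example (it
is not imported from ``waeup.kofa``), and ``check_headers`` is a
hypothetical helper; only the wording of the message follows the
traceback shown above.

```python
class FatalCSVError(Exception):
    """Local stand-in for the exception in the traceback above."""


def check_headers(headerfields, location_fields):
    """Raise FatalCSVError unless all location fields are among the headers."""
    missing = [name for name in location_fields if name not in headerfields]
    if missing:
        needed = ', '.join("'%s'" % name for name in location_fields)
        raise FatalCSVError('Need at least columns %s for import!' % needed)


try:
    check_headers(['dinoports', 'owner'], ['name'])
except FatalCSVError as err:
    message = str(err)
```

Failing early like this, instead of writing every row to a pending
file, makes sense because without the location field no single row
could be matched to an entry anyway.
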
Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports
    2

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

We can tell the processor to set dinoports to ``None``, although this
is not a number, because we declared the field as not required in the
interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","XXX","Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob', ignore_empty=False)
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

Removing Entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('stoneville.log')