source: waeup/trunk/src/waeup/utils/batching.txt @ 4895

Last change on this file since 4895 was 4895, checked in by uli, 15 years ago

Update tests.

:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.

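The header check mentioned above can be sketched roughly as follows.
Note that ``check_headers`` and its error message are hypothetical,
made up for illustration; the real processors derive the required
columns from their location fields.

```python
# Hypothetical sketch of a pre-import header check; not the actual
# waeup.utils.batching implementation.
import csv

def check_headers(csv_path, required_fields):
    """Return the header row of `csv_path`, raising ValueError
    if any required column is missing."""
    with open(csv_path) as f:
        headers = next(csv.reader(f))
    missing = [name for name in required_fields if name not in headers]
    if missing:
        raise ValueError(
            "Need at least columns %s for import!" % ', '.join(
                "'%s'" % name for name in missing))
    return headers
```

Checking headers first lets a processor refuse a malformed file
before any row touches the database.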
Output
------

The results of processing are written to logfiles. Besides this, a new
CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named after the input file, with ``.pending``
inserted before the filename extension. So, when the input file is
named 'foo.csv' and something went wrong during processing, a file
'foo.pending.csv' will be generated. The .pending file is a CSV file
that contains the failed rows, appended by a column ``--ERRORS--`` in
which the reasons for processing failures are listed.

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |      |   +---------+            |  |.pending|
      +------+   ^                      |  |        |
                 |                      |  +--------+
           +-----++                     v
           |Inter-|                  -----+
           |face  |                 /     |
           +------+                | .msg |
                                   |      |
                                   +------+


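The ``.pending`` output sketched in the diagram can be approximated
with the standard ``csv`` module. The helper below is a simplified
illustration with made-up names, not the actual waeup implementation:

```python
# Simplified illustration of writing failed rows to a .pending CSV
# with an extra --ERRORS-- column; write_pending is a made-up helper,
# not part of waeup.utils.batching.
import csv

def write_pending(input_path, fieldnames, failed_rows):
    """failed_rows is a list of (row_dict, error_message) pairs.
    Returns the path of the generated .pending file."""
    pending_path = input_path.replace('.csv', '.pending.csv')
    with open(pending_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames + ['--ERRORS--'])
        writer.writeheader()
        for row, error in failed_rows:
            out = dict(row)
            out['--ERRORS--'] = error
            writer.writerow(out)
    return pending_path
```

Because the failed rows keep their original values, the generated
file can be corrected and fed back to the processor.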
Creating a batch processor
==========================

We create our own batch processor for our own data type. This data
type must be based on an interface that the batcher can use for
converting data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

Furthermore, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upperclass cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything is in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives who slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore we get a file path to a CSV file with the
successfully processed entries and a file path to a CSV file with the
erroneous entries. As everything went well, the latter is ``None``.
Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to
perform: ``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

As we can see, the processing was successful. Otherwise, all problems
could be read here, as we can see if we do the same operation again:

    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.pending.csv
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could
not process, plus an additional column with error messages:

    >>> print open('newcomers.pending.csv').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry,
without having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).


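The conversion behaviour just described can be sketched as follows.
These helper functions are illustrative stand-ins only; the real
converters are derived from the schema fields of the interface.

```python
# Illustrative stand-ins for the schema-driven converters: CSV cells
# arrive as strings; '1'/'0' map to True/False for Bool fields, and
# an empty cell becomes None (for fields that are not required).
def convert_bool(cell):
    if cell == '':
        return None
    return cell == '1'

def convert_int(cell):
    if cell == '':
        return None
    return int(cell)
```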
Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second
dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has risen.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1, '...', '...')

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

We can tell the processor to set dinoports to ``None``, although this
is not a number, as we declared the field not required in the
interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter, because those fields are ignored.

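Why can 'remove' mode ignore broken values? Because it only needs the
location fields to look up the entry; the other columns never reach
the converters. Roughly (an illustrative sketch with an assumed
``remove_entry`` helper, not the actual implementation):

```python
# Illustrative sketch: remove mode looks up the entry via the
# location field(s) only, so other columns are never converted.
def remove_entry(row, container, location_fields):
    """Delete the entry addressed by the location fields of `row`."""
    key = row[location_fields[0]]  # e.g. the cave name
    del container[key]
```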
    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0, '...', None)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.finished.csv')
    >>> os.unlink('stoneville.log')