:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove, or update the data (see the method sketch below).

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.
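
The following is a rough, hypothetical sketch of the hooks such a
processor provides (the names mirror the ``CaveProcessor`` example
further below; the actual base class may expect more or fewer
overrides)::

    class MyProcessor(BatchProcessor):
        iface = IMyDataType          # the data type handled
        location_fields = ['name']   # columns that locate an entry
        factory_name = 'My Factory'  # factory used in 'create' mode

        def getParent(self, row, site):
            """Return the container objects are stored in."""

        def entryExists(self, row, site):
            """Tell whether an entry for `row` already exists."""

        def addEntry(self, obj, row, site):
            """Store a freshly created object ('create' mode)."""

        def updateEntry(self, obj, row, site):
            """Apply the `row` values to an existing object ('update' mode)."""

        def delEntry(self, row, site):
            """Remove the object for `row` ('remove' mode)."""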

Output
------

The results of processing are written to logfiles. Besides this, a
new CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named like the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, then a file
'foo.csv.create.pending' will be generated (if the operation mode was
'create'). The .pending file is a CSV file that contains the failed
rows plus an additional column ``--ERRORS--``, in which the reasons
for the processing failures are listed.
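
As a hypothetical illustration of that naming rule (not necessarily
the code used internally), the pending filename could be derived like
this::

    def pending_filename(input_path, mode):
        # 'foo.csv' processed in 'create' mode -> 'foo.csv.create.pending'
        return '%s.%s.pending' % (input_path, mode)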

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |      |   +---------+            |  |.pending|
      +------+   ^                      |  |        |
                 |                      |  +--------+
           +-----++                     v
           |Inter-|                  -----+
           |face  |                 /     |
           +------+                | .msg |
                                   |      |
                                   +------+


Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

Furthermore, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upperclass cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives who slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

If we also want the results to be logged, we can provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to
perform: ``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.
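
To sum up, a full call, as used in this test, looks like this (all
names as introduced above)::

    processor.doImport(
        'newcomers.csv',                             # path to the CSV file
        ['name', 'dinoports', 'owner', 'taxpayer'],  # header override
        mode='create',                               # or 'update'/'remove'
        user='Bob',                                  # only used for logging
        logger=logger)                               # optional logger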

We can, by the way, see the results of our run in a logfile, if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

As we can see, the processing was successful. Otherwise, all problems
could be read here, as we can see if we do the same operation again:

    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 4)

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.csv.create.pending
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could
not process, plus an additional column with error messages:

    >>> print open('newcomers.csv.create.pending').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and retry afterwards,
without having the already processed rows in the way.
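
A retry could then, hypothetically, feed the corrected rows back into
the same processor, e.g. after saving them (without the ``--ERRORS--``
column) to a new file::

    # 'corrected.csv' is assumed to hold the fixed rows in the column
    # order of the .pending file shown above.
    processor.doImport('corrected.csv',
                       ['owner', 'name', 'taxpayer', 'dinoports'],
                       mode='create', user='Bob', logger=logger)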

We also notice that the values of the taxpayer column are returned as
they appear in the input file. There we wrote '1' for ``True`` and
'0' for ``False`` (which is accepted by the converters).
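
In other words, the converter for the boolean ``taxpayer`` field maps
such strings to booleans, roughly like this (a simplified sketch, not
the actual converter code)::

    def to_bool(value):
        # The real converters may accept more spellings than these.
        return {'1': True, '0': False}[value]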


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0)

Now we want to record that Wilma got an extra port for her second
dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this is the one that tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

We can set dinoports to ``None``, although this is not a number, as we
declared the field not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Generally, empty strings are considered ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values
in non-location fields doesn't matter, because those fields are
ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.csv.create.pending')
    >>> os.unlink('stoneville.log')