source: waeup/trunk/src/waeup/utils/batching.txt @ 4879

Last change on this file since 4879 was 4879, checked in by uli, 15 years ago

Update tests.

:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

You feed CSV files to importers or processors, which are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.

Output
------

The results of processing are written to logfiles. In addition, a new
CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named after the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, a file 'foo.csv.create.pending'
will be generated (if the operation mode was 'create'). The .pending
file is a CSV file that contains the failed rows, extended by a column
``--ERRORS--`` in which the reasons for processing failures are
listed.

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |      |   +---------+            |  |.pending|
      +------+   ^                      |  |        |
                 |                      |  +--------+
           +-----++                     v
           |Inter-|                  -----+
           |face  |                 /     |
           +------+                | .msg |
                                   |      |
                                   +------+

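The naming rule for the output files can be sketched in a few lines.
This is only an illustration and not the code of
``waeup.utils.batching``; the helper name ``output_paths`` is made up
for this example.

```python
def output_paths(csv_path, mode):
    """Return (pending_path, msg_path) for an input file and a mode.

    Hypothetical helper: it only mirrors the naming scheme described
    in the text (input name + mode + '.pending' or '.msg').
    """
    if mode not in ('create', 'update', 'remove'):
        raise ValueError('unknown mode: %r' % (mode,))
    base = '%s.%s' % (csv_path, mode)
    return base + '.pending', base + '.msg'


pending_path, msg_path = output_paths('foo.csv', 'create')
print(pending_path)
print(msg_path)
```

Each mode gets its own pair of files, so a 'create' run never
overwrites the results of an 'update' run on the same input file.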

Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards easier
to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place to put all the
lovely caves we want to sell.

So, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything is in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

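How a processor might call these hooks depends on the mode. The
following standalone sketch is an assumption about the control flow and
not the actual ``BatchProcessor`` code: the names ``DictStore`` and
``process_row`` are hypothetical, the ``site`` argument is left out,
entries are plain dicts, and only the "already exists" message is taken
from the real output shown later in this document.

```python
class DictStore(object):
    """Hypothetical stand-in for a processor; entries are plain dicts."""

    def __init__(self):
        self.entries = {}

    def entryExists(self, row):
        return row['name'] in self.entries

    def getEntry(self, row):
        return self.entries.get(row['name'])

    def addEntry(self, obj, row):
        self.entries[row['name']] = obj

    def updateEntry(self, obj, row):
        obj.update(row)

    def delEntry(self, row):
        del self.entries[row['name']]


def process_row(store, row, mode):
    """Apply one row; return None on success, else an error message."""
    exists = store.entryExists(row)
    if mode == 'create':
        if exists:
            return 'This object already exists. Skipping.'
        obj = {}
        store.updateEntry(obj, row)
        store.addEntry(obj, row)
    elif mode == 'update':
        if not exists:
            return 'no such entry'  # made-up message
        store.updateEntry(store.getEntry(row), row)
    elif mode == 'remove':
        if not exists:
            return 'no such entry'  # made-up message
        store.delEntry(row)
    else:
        raise ValueError('unknown mode: %r' % (mode,))
    return None


store = DictStore()
row = {'name': 'Barneys Home', 'owner': 'Barney'}
first = process_row(store, row, 'create')
second = process_row(store, row, 'create')
print(first)
print(second)
```

Returning a message instead of raising lets the caller collect all
failed rows and continue with the remaining ones.
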
Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob')
    (4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

As you can see, when calling the processor we passed more than just
the CSV file path. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

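Overriding the headers of a file can be illustrated with the standard
``csv`` module. This is a minimal sketch under the assumption that the
first line of the file is a header line that gets skipped; the function
name ``read_rows`` is made up, and the real processor may work
differently.

```python
import csv
import io


def read_rows(stream, headerfields):
    """Read CSV rows as dicts keyed by the given header names.

    The header line found in the stream is read and thrown away, so a
    misspelled header in the file does no harm.
    """
    reader = csv.DictReader(stream, fieldnames=headerfields)
    next(reader)  # skip the header line that is actually in the file
    return [dict(row) for row in reader]


# The file header contains a typo ('nmae'); we override it.
data = io.StringIO(u"nmae,dinoports,owner\nWilmas Asylum,1,Wilma\n")
rows = read_rows(data, ['name', 'dinoports', 'owner'])
print(rows[0]['name'])
```

Note that all values are still strings at this point; converting them
is a separate step.
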
We can, by the way, see the results of our run in a logfile which is
named ``newcomers.csv.create.msg``:

    >>> print open('newcomers.csv.create.msg').read()
    Source: newcomers.csv
    Mode: create
    Date: ...
    User: Bob
    Failed datasets: newcomers.csv.create.pending
    Processing time: ... s (... s/item)
    Processed: 4 lines (4 successful/ 0 failed)
    <BLANKLINE>

As we can see, the processing was successful. Had there been problems,
they would be reported here, as we can see if we run the same operation
again:

    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob')
    (4, 4)

The log file will tell us this in more detail:

    >>> print open('newcomers.csv.create.msg').read()
    Source: newcomers.csv
    Mode: create
    Date: ...
    User: Bob
    Failed datasets: newcomers.csv.create.pending
    Processing time: ... s (... s/item)
    Processed: 4 lines (0 successful/ 4 failed)

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open('newcomers.csv.create.pending').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

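Writing such a file of failed rows is straightforward with
``csv.DictWriter``. The following is a sketch only: ``write_pending`` is
a hypothetical helper, not part of ``waeup.utils.batching``, and the
column order of the real file may differ.

```python
import csv
import io


def write_pending(stream, fieldnames, failed):
    """Write failed rows plus an '--ERRORS--' column to stream.

    failed is a list of (row_dict, error_message) pairs; the rows keep
    their original, unconverted string values.
    """
    writer = csv.DictWriter(
        stream, fieldnames=list(fieldnames) + ['--ERRORS--'],
        lineterminator='\n')
    writer.writeheader()
    for row, error in failed:
        out = dict(row)
        out['--ERRORS--'] = error
        writer.writerow(out)


buf = io.StringIO()
write_pending(
    buf, ['name', 'owner'],
    [({'name': 'Wilmas Asylum', 'owner': 'Wilma'},
      'This object already exists. Skipping.')])
print(buf.getvalue())
```

Because the original strings are written back, the file can be edited
and fed to the processor again.
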
We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).

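The conversion from CSV strings to typed values could look roughly like
the following. This sketch does not use ``zope.schema``; the converter
table ``CONVERTERS`` and the functions ``to_bool`` and ``convert_row``
are assumptions made for illustration, and the real converters may
accept more spellings than '1' and '0'.

```python
def to_bool(text):
    """Convert '1' / '0' to True / False; reject anything else."""
    if text == '1':
        return True
    if text == '0':
        return False
    raise ValueError('not a boolean: %r' % (text,))


# Hypothetical field-to-converter table; unlisted fields stay strings.
CONVERTERS = {'dinoports': int, 'taxpayer': to_bool}


def convert_row(row):
    """Return (converted_values, errors) for a dict of strings."""
    converted, errors = {}, {}
    for key, text in row.items():
        convert = CONVERTERS.get(key, str)
        try:
            converted[key] = convert(text)
        except ValueError as err:
            errors[key] = str(err)
    return converted, errors


values, errors = convert_row(
    {'name': 'Barneys Home', 'dinoports': '2', 'taxpayer': '1'})
print(values['dinoports'] + 1)
```

The input row itself is left untouched, which is why a .pending file
can show the values exactly as they were written in the input file.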

Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

Invalid values are spotted as well:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

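A header check of this kind can be sketched as follows. The exception
class below is a local stand-in that merely shares its name with the
one in the traceback above, and ``check_headers`` is a hypothetical
function; how the real processor builds its message is an assumption.

```python
class FatalCSVError(Exception):
    """Local stand-in for the exception shown in the traceback."""


def check_headers(headerfields, location_fields):
    """Raise FatalCSVError if a location field is not in the headers."""
    missing = [name for name in location_fields
               if name not in headerfields]
    if missing:
        names = ', '.join("'%s'" % name for name in missing)
        raise FatalCSVError(
            'Need at least columns %s for import!' % names)


check_headers(['name', 'owner'], ['name'])  # passes silently
try:
    check_headers(['dinoports', 'owner'], ['name'])
    message = None
except FatalCSVError as err:
    message = str(err)
print(message)
```

Checking the headers before touching any row means that a file without
the location field is rejected as a whole instead of producing one
failure per row.
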
We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

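Both spellings of an empty cell arrive as the same empty string once
the ``csv`` module has parsed the line, which is why they can be
treated alike. The sketch below shows this with a made-up helper
``read_with_none``; mapping empty strings to ``None`` at this stage is
an assumption about where the real processor does it.

```python
import csv
import io


def read_with_none(text, fieldnames):
    """Parse headerless CSV text; empty cells become None."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    return [
        dict((key, value if value != '' else None)
             for key, value in row.items())
        for row in reader]


fields = ['name', 'dinoports', 'owner']
unquoted = read_with_none(u'Wilmas Asylum,,Wilma\n', fields)
quoted = read_with_none(u'"Wilmas Asylum","","Wilma"\n', fields)
print(unquoted == quoted)
```
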
Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.csv.create.pending')
    >>> os.unlink('newcomers.csv.create.msg')
    >>> os.unlink('newcomers.csv.remove.msg')
    >>> os.unlink('newcomers.csv.update.msg')