source: main/waeup.sirp/trunk/src/waeup/sirp/utils/batching.txt @ 11102

Last change on this file since 11102 was 6824, checked in by Henrik Bettermann, 13 years ago

Skip ignored columns in failed and finished data files.

In the finished data file we now clearly see which fields have been imported. In the pending data file (failed data file) ignored columns are omitted.

We could think about saving the original import file elsewhere.

:mod:`waeup.sirp.utils.batching` -- Batch processing
****************************************************

Batch processing is much more than pure data import.

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates log entries (optional).

* creates CSV files containing the successfully and the
  unsuccessfully processed data, respectively.

Output
------

The results of processing are written to loggers, if a logger was
given. Besides this, new CSV files are created during processing:

* a pending CSV file, containing datasets that could not be processed.

* a finished CSV file, containing datasets successfully processed.

The pending file is not created if everything works fine. The
respective path returned in that case is ``None``.

The pending file (if created) is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.

The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directories afterwards (the
datacenter's ``distProcessedFiles()`` method takes care of that).

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |  +----+->+---------+            |  |.pending|
      +--|Log |  ^                      |  |        |
         +----+  |                      |  +--------+
           +-----++                     v
           |Inter-|                  ----------+
           |face  |                 /          |
           +------+                | .finished |
                                   |           |
                                   +-----------+

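
The split into a finished and a pending file can be pictured with a
small standalone sketch. It uses only the standard library and is not
the actual ``BatchProcessor`` code; the function ``split_rows``, the
``check`` callback and the file names are made up for illustration.

```python
import csv
import os
import tempfile


def split_rows(rows, fieldnames, check):
    """Write rows to a finished and (if needed) a pending CSV file.

    ``check(row)`` is a hypothetical callback that returns an error
    message for rows that cannot be processed, or ``None`` otherwise.
    Returns (total, failed, finished_path, pending_path_or_None).
    """
    fieldnames = list(fieldnames)
    tmpdir = tempfile.mkdtemp()  # the caller has to remove this dir
    finished_path = os.path.join(tmpdir, 'data.finished.csv')
    pending_path = os.path.join(tmpdir, 'data.pending.csv')
    failed_rows = []
    total = 0
    with open(finished_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, fieldnames)
        writer.writeheader()
        for row in rows:
            total += 1
            error = check(row)
            if error is None:
                writer.writerow(row)
            else:
                failed_row = dict(row)
                failed_row['--ERRORS--'] = error
                failed_rows.append(failed_row)
    if not failed_rows:
        # No pending file is created if everything works fine.
        return total, 0, finished_path, None
    with open(pending_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, fieldnames + ['--ERRORS--'])
        writer.writeheader()
        writer.writerows(failed_rows)
    return total, len(failed_rows), finished_path, pending_path
```

As with the real processor, whoever calls such a function is
responsible for removing the temporary directory afterwards.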

Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

So, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.sirp.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     # This is not strictly necessary, as the default
    ...     # updateEntry method does exactly the same
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    >>> result
    (4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore, we get the filepath of a CSV file with the
successfully processed entries and the filepath of a CSV file with
the erroneous entries. As everything went well, the latter is
``None``. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.
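
How a list of header names can replace the ones stored in a file may
be illustrated with the standard ``csv`` module. This is only a sketch
of the idea, assuming that the first line of the file is a header line
which gets thrown away; the helper ``read_with_headers`` is
hypothetical and not part of ``waeup.sirp``.

```python
import csv
import io


def read_with_headers(stream, headerfields):
    """Yield rows as dicts keyed by ``headerfields``.

    The header line found in the stream is read and discarded, so
    faulty header names in the file do not matter.
    """
    reader = csv.reader(stream)
    next(reader, None)  # skip the header line stored in the file
    for values in reader:
        yield dict(zip(headerfields, values))


# A file with misspelled headers, read with the correct names.
data = io.StringIO("nmae,dinoprts\nBarneys Home,2\n")
rows = list(read_with_headers(data, ['name', 'dinoports']))
```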

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

The ``user`` parameter, finally, is optional and only used for logging.

We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

We clean up the temporary dir created by doImport():

    >>> import shutil
    >>> import os
    >>> shutil.rmtree(os.path.dirname(result[2]))

As we can see, the processing was successful. Otherwise, all problems
would be reported here, as we can see if we do the same operation again:

    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    >>> result
    (4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.pending.csv
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open(result[3]).read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists in the same container. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists in the same container. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists in the same container. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists in the same container. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

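
How a converter might turn the '1' and '0' of the taxpayer column into
booleans can be sketched like this. The function ``to_bool`` is
hypothetical and only illustrates the idea; the real converters of
``waeup.sirp`` are not shown in this document and may well accept
further spellings.

```python
def to_bool(value):
    """Convert the CSV strings '1' and '0' to ``True`` and ``False``.

    Anything else is rejected with a ``ValueError``. This strictness
    is an assumption of the sketch, not documented behaviour.
    """
    mapping = {'1': True, '0': False}
    try:
        return mapping[value.strip()]
    except KeyError:
        raise ValueError('Not a boolean value: %r' % value)
```

The input string itself stays untouched, which fits the observation
that the pending file shows the values as they were written in the
input file.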

We can also tell the processor to ignore some columns of the input by
passing ``--IGNORE--`` as column name:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             '--IGNORE--', '--IGNORE--'],
    ...                    mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

If something goes wrong during processing, the respective ``--IGNORE--``
columns are omitted from the resulting pending file:

    >>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
    ...                             '--IGNORE--', '--IGNORE--'],
    ...                    mode='create', user='Bob')
    >>> result
    (4, 4, '...', '...')

    >>> print open(result[3], 'rb').read()
    name,dinoports,--ERRORS--
    Barneys Home,2,This object already exists in the same container. Skipping.
    Wilmas Asylum,1,This object already exists in the same container. Skipping.
    Freds Dinoburgers,10,This object already exists in the same container. Skipping.
    Joeys Drive-in,110,This object already exists in the same container. Skipping.


Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

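
Omitting ``--IGNORE--`` columns boils down to filtering header names
and values in parallel. A minimal sketch, assuming that rows arrive as
plain lists of strings; ``drop_ignored`` and ``IGNORE_MARKER`` are
made-up names and not part of the batching module:

```python
IGNORE_MARKER = '--IGNORE--'


def drop_ignored(headerfields, values):
    """Return (fields, values) without the columns marked as ignored."""
    kept = [(field, value)
            for field, value in zip(headerfields, values)
            if field != IGNORE_MARKER]
    fields = [field for field, _ in kept]
    remaining = [value for _, value in kept]
    return fields, remaining


fields, remaining = drop_ignored(
    ['name', 'dinoports', '--IGNORE--', '--IGNORE--'],
    ['Barneys Home', '2', 'Barney', '1'])
```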

Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We start the processor:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                    'dinoports', 'owner'], mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 1, '...', '...')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                             mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.owner
    u'Barney'

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!
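
A header check of this kind might look roughly as follows. The sketch
is standalone: the ``FatalCSVError`` class defined here is only a
stand-in for the real exception, and ``check_headers`` is a
hypothetical helper, not the method the processor actually uses.

```python
class FatalCSVError(Exception):
    """Stand-in for the exception raised by the real processor."""


def check_headers(headerfields, location_fields):
    """Raise ``FatalCSVError`` unless all location fields are given."""
    missing = [field for field in location_fields
               if field not in headerfields]
    if missing:
        needed = ', '.join("'%s'" % field for field in location_fields)
        raise FatalCSVError(
            "Need at least columns %s for import!" % needed)
```

Raising an exception instead of writing a pending file seems sensible
here: without the location fields not a single row could be assigned
to an entry.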

We can tell the processor to set dinoports to ``None``, although this
is not a number, because we declared the field as not required in the
interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

Generally, empty strings are considered as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

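
The treatment of empty strings for a non-required integer field can be
sketched as follows. ``to_int_or_none`` is a hypothetical helper that
only mimics the observed behaviour; it is not the converter used by
the batching module.

```python
def to_int_or_none(value):
    """Convert a CSV string to ``int``; empty strings become ``None``.

    Non-empty strings that are not numbers raise a ``ValueError``,
    which a processor could report in its pending file.
    """
    if value is None or value.strip() == '':
        return None
    return int(value)
```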

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter, because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.
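
The three modes we have seen can be summarised in a small standalone
sketch that works on a plain ``dict``, much like Stoneville. The
function ``apply_row`` is made up for illustration, and the error
messages for 'update' and 'remove' are assumptions; only the message
for 'create' appears in the pending files shown above.

```python
def apply_row(container, row, mode, key='name'):
    """Apply one row to ``container``; return an error message or None."""
    name = row[key]
    if mode == 'create':
        if name in container:
            return ('This object already exists in the same '
                    'container. Skipping.')
        container[name] = dict(row)
    elif mode == 'update':
        if name not in container:
            return 'Cannot update: no such entry.'  # assumed wording
        container[name].update(row)
    elif mode == 'remove':
        if name not in container:
            return 'Cannot remove: no such entry.'  # assumed wording
        # Only the location field is looked at; other values are ignored.
        del container[name]
    else:
        raise ValueError('Unknown mode: %r' % mode)
    return None
```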

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('stoneville.log')