source: main/waeup.sirp/trunk/src/waeup/sirp/utils/batching.txt @ 6503

Last change on this file since 6503 was 6273, checked in by uli, 14 years ago

Finally make the new converter work. API-wise it is as good as the old one (it can import everything the old one could),
but design-wise it might be much more powerful. Basically it can handle/convert all content types for which one can
create an Add- or EditForm successfully. In other words: if you manage to write an edit form for some content type,
then you can also create an importer for that content type. Fine-tuning is still needed (for dates, bool data, etc.), but
the main things work.

:mod:`waeup.sirp.utils.batching` -- Batch processing
****************************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates log entries (optional).

* creates CSV files containing the successfully and the unsuccessfully
  processed data, respectively.

Output
------

The results of processing are written to a logger, if one was
given. In addition, new CSV files are created during processing:

* a pending CSV file, containing datasets that could not be processed.

* a finished CSV file, containing datasets successfully processed.

The pending file is not created if everything works fine. The
respective path returned in that case is ``None``.

The pending file (if created) is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.

The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directory afterwards (the
datacenter's ``distProcessedFiles()`` method takes care of that).

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |  +----+->+---------+            |  |.pending|
      +--|Log |  ^                      |  |        |
         +----+  |                      |  +--------+
           +-----++                     v
           |Inter-|                  ----------+
           |face  |                 /          |
           +------+                | .finished |
                                   |           |
                                   +-----------+


Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using the factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.sirp.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     # This is not strictly necessary, as the default
    ...     # updateEntry method does exactly the same
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    >>> result
    (4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore, we get the filepath of a CSV file with the
successfully processed entries and the filepath of a CSV file with
the erroneous entries. As everything went well, the latter is
``None``. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.
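
The header override described above can be pictured with the standard
``csv`` module. This is only a minimal sketch and not the code used by
``BatchProcessor``; the helper name ``read_rows`` and the sample data
are assumptions made for illustration.

```python
import csv
import io

def read_rows(stream, headerfields):
    """Yield each data row as a dict keyed by ``headerfields``.

    The header line stored in the file is read and thrown away, so
    the names passed in by the caller win. Hypothetical helper.
    """
    reader = csv.reader(stream)
    next(reader, None)  # discard the (possibly faulty) header line
    for values in reader:
        yield dict(zip(headerfields, values))

# The file header contains a typo ('nmae'); the caller overrides it.
source = io.StringIO("nmae,ports,owner\nBarneys Home,2,Barney\n")
rows = list(read_rows(source, ['name', 'dinoports', 'owner']))
```

In this sketch the values are still plain strings; converting them to
the types declared in the interface would be a separate step.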

We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

We clean up the temporary dir created by doImport():

    >>> import shutil
    >>> import os
    >>> shutil.rmtree(os.path.dirname(result[2]))
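
Since this cleanup is repeated after every run, a caller might wrap it
in a small helper. The following is a sketch under the assumption that
the result is the 4-tuple shown above (processed, warnings, finished
path, pending path) and that the finished path is always set; the name
``cleanup_result`` is hypothetical and not part of the package.

```python
import os
import shutil
import tempfile

def cleanup_result(result):
    """Remove the temporary directory holding the result files.

    ``result`` is assumed to be the 4-tuple
    (processed, warnings, finished_path, pending_path).
    Returns the path of the removed directory.
    """
    num, num_warnings, finished_path, pending_path = result
    tmpdir = os.path.dirname(finished_path)
    shutil.rmtree(tmpdir, ignore_errors=True)
    return tmpdir

# Simulate a result tuple without running a real import.
tmpdir = tempfile.mkdtemp()
finished = os.path.join(tmpdir, 'newcomers.finished.csv')
open(finished, 'w').close()
removed = cleanup_result((4, 0, finished, None))
```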

As the log shows, the processing was successful. Otherwise, all problems
would be listed there, as we can see if we do the same operation again:

    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    >>> result
    (4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.pending.csv
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open(result[3]).read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists in the same container. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists in the same container. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists in the same container. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists in the same container. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).
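
The layout of the pending file shown above (the raw input values plus
an ``--ERRORS--`` column) can be reproduced with ``csv.DictWriter``.
This is a minimal sketch, not the writer used by ``BatchProcessor``;
the helper name ``write_pending``, the ``(row, error)`` pairs and the
error text are assumptions made for illustration.

```python
import csv
import io

def write_pending(stream, fieldnames, failed_rows):
    """Write failed rows plus an ``--ERRORS--`` column to ``stream``.

    ``failed_rows`` is a list of ``(row_dict, error_message)`` pairs.
    Hypothetical helper that only mimics the file layout.
    """
    writer = csv.DictWriter(
        stream, fieldnames=list(fieldnames) + ['--ERRORS--'])
    writer.writeheader()
    for row, error in failed_rows:
        out = dict(row)  # keep the raw input values untouched
        out['--ERRORS--'] = error
        writer.writerow(out)

buf = io.StringIO()
write_pending(
    buf,
    ['name', 'dinoports'],
    [({'name': 'Barneys Home', 'dinoports': '2'}, 'Entry already exists.')],
)
pending_text = buf.getvalue()
```

Writing the unconverted input values back is what allows a corrected
pending file to be fed to the processor again.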

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We can also tell the processor to ignore some columns from the input
by passing ``--IGNORE--`` as column name:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             '--IGNORE--', '--IGNORE--'],
    ...                    mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

If something goes wrong during processing, the respective
``--IGNORE--`` columns will be populated correctly in the resulting
pending file:

    >>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
    ...                             '--IGNORE--', '--IGNORE--'],
    ...                    mode='create', user='Bob')
    >>> result
    (4, 4, '...', '...')

    >>> print open(result[3], 'rb').read()
    --IGNORE--,name,--IGNORE--,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists in the same container. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists in the same container. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists in the same container. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists in the same container. Skipping.

The first ignored column ('owner') provides different contents than
the second one ('taxpayer').
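
Because several columns may carry the same ``--IGNORE--`` name, their
values cannot simply be stored in one dict under that key. One way to
keep them apart is to collect them by position. The following is a
sketch of that idea only; ``split_row`` and ``IGNORE_MARKER`` are
hypothetical names and nothing here claims to mirror the actual
implementation.

```python
IGNORE_MARKER = '--IGNORE--'

def split_row(headerfields, values):
    """Split one CSV row into used and ignored parts.

    Returns ``(kept, ignored)``: ``kept`` maps field names to raw
    values, ``ignored`` lists the raw values of all ignored columns
    in their original order.
    """
    kept, ignored = {}, []
    for name, value in zip(headerfields, values):
        if name == IGNORE_MARKER:
            ignored.append(value)
        else:
            kept[name] = value
    return kept, ignored

kept, ignored = split_row(
    ['name', 'dinoports', '--IGNORE--', '--IGNORE--'],
    ['Barneys Home', '2', 'Barney', '1'])
```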

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We start the processor:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                    'dinoports', 'owner'], mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


If we try to update a non-existing entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 1, '...', '...')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                             mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.owner
    u'Barney'

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

Generally, empty strings are considered as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))
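
The treatment of empty cells and invalid numbers seen in the last few
runs can be summarized in a small stand-alone converter. This is a
rough sketch of the observed behaviour, not the schema-based converter
the batch processor really uses; the function ``convert_int`` and the
rule that a required field rejects empty input are assumptions.

```python
def convert_int(raw, required=False):
    """Convert a raw CSV cell to ``int`` or ``None``.

    Empty cells become ``None`` unless the field is required.
    Hypothetical helper for illustration only.
    """
    if raw is None or raw.strip() == '':
        if required:
            raise ValueError('Required input is missing.')
        return None
    # int() raises ValueError for input such as 'NOT-A-NUMBER'
    return int(raw)

empty_value = convert_int('')
number_value = convert_int('2')
```

A processor built along these lines would catch the ``ValueError`` and
move the offending row to the pending file instead of aborting.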


Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter, because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                             'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Finally, we remove the files created for this test:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('stoneville.log')