source: waeup/trunk/src/waeup/utils/batching.txt @ 4895

Last change on this file since 4895 was 4895, checked in by uli, 15 years ago

Update tests.

:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.

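The header check mentioned above can be sketched roughly as follows.
Note that ``check_headers`` and its error message are hypothetical,
made up for illustration; the real processors derive the required
columns from their location fields.

```python
# Hypothetical sketch of a pre-import header check; not the actual
# waeup.utils.batching implementation.
import csv

def check_headers(csv_path, required_fields):
    """Return the header row of `csv_path`, raising ValueError
    if any required column is missing."""
    with open(csv_path) as f:
        headers = next(csv.reader(f))
    missing = [name for name in required_fields if name not in headers]
    if missing:
        raise ValueError(
            "Need at least columns %s for import!" % ', '.join(
                "'%s'" % name for name in missing))
    return headers
```

Checking headers first lets a processor refuse a malformed file
before any row touches the database.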
Output
------

The results of processing are written to logfiles. Besides this, a new
CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named after the input file, with ``.pending``
inserted before the filename extension. So, when the input file is
named 'foo.csv' and something went wrong during processing, a file
'foo.pending.csv' will be generated. The .pending file is a CSV file
that contains the failed rows, appended by a column ``--ERRORS--`` in
which the reasons for processing failures are listed.

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |      |   +---------+            |  |.pending|
      +------+   ^                      |  |        |
                 |                      |  +--------+
           +-----++                     v
           |Inter-|                  -----+
           |face  |                 /     |
           +------+                | .msg |
                                   |      |
                                   +------+


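The ``.pending`` output sketched in the diagram can be approximated
with the standard ``csv`` module. The helper below is a simplified
illustration with made-up names, not the actual waeup implementation:

```python
# Simplified illustration of writing failed rows to a .pending CSV
# with an extra --ERRORS-- column; write_pending is a made-up helper,
# not part of waeup.utils.batching.
import csv

def write_pending(input_path, fieldnames, failed_rows):
    """failed_rows is a list of (row_dict, error_message) pairs.
    Returns the path of the generated .pending file."""
    pending_path = input_path.replace('.csv', '.pending.csv')
    with open(pending_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames + ['--ERRORS--'])
        writer.writeheader()
        for row, error in failed_rows:
            out = dict(row)
            out['--ERRORS--'] = error
            writer.writerow(out)
    return pending_path
```

Because the failed rows keep their original values, the generated
file can be corrected and fed back to the processor.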
Creating a batch processor
==========================

We create our own batch processor for our own data type. This data
type must be based on an interface that the batcher can use for
converting data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

Furthermore, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upperclass cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything is in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives who slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore we get a file path to a CSV file with the
successfully processed entries and a file path to a CSV file with the
erroneous entries. As everything went well, the latter is ``None``.
Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to
perform: ``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

As we can see, the processing was successful. Otherwise, all problems
could be read here, as we can see if we do the same operation again:

    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.pending.csv
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could
not process, plus an additional column with error messages:

    >>> print open('newcomers.pending.csv').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry,
without having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).


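The conversion behaviour just described can be sketched as follows.
These helper functions are illustrative stand-ins only; the real
converters are derived from the schema fields of the interface.

```python
# Illustrative stand-ins for the schema-driven converters: CSV cells
# arrive as strings; '1'/'0' map to True/False for Bool fields, and
# an empty cell becomes None (for fields that are not required).
def convert_bool(cell):
    if cell == '':
        return None
    return cell == '1'

def convert_int(cell):
    if cell == '':
        return None
    return int(cell)
```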
Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second
dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has risen.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1, '...', '...')

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

We can tell the processor to set dinoports to ``None``, although this
is not a number, as we declared the field not required in the
interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter, because those fields are ignored.

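Why can 'remove' mode ignore broken values? Because it only needs the
location fields to look up the entry; the other columns never reach
the converters. Roughly (an illustrative sketch with an assumed
``remove_entry`` helper, not the actual implementation):

```python
# Illustrative sketch: remove mode looks up the entry via the
# location field(s) only, so other columns are never converted.
def remove_entry(row, container, location_fields):
    """Delete the entry addressed by the location fields of `row`."""
    key = row[location_fields[0]]  # e.g. the cave name
    del container[key]
```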
    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0, '...', None)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.finished.csv')
    >>> os.unlink('stoneville.log')