source: waeup/trunk/src/waeup/utils/batching.txt @ 4886

Last change on this file since 4886 was 4886, checked in by uli, 15 years ago
Update tests.

:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

You feed CSV files to importers or processors, which are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows the places inside a site (University) where the data is
  stored, removed or updated.

* can check headers before processing data.

* supports the modes 'create', 'update' and 'remove'.

* creates logs and failed-data CSV files.

Output
------

The results of processing are written to logfiles. In addition, a new
CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named after the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, a file
'foo.csv.create.pending' will be generated (if the operation mode was
'create'). The .pending file is a CSV file that contains the failed
rows plus an additional column ``--ERRORS--`` in which the reasons for
the processing failures are listed.
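
The naming rule and the extra error column can be sketched in a few
lines. This is only an illustration, written for Python 3 with the
standard library; ``pending_name`` and ``write_pending`` are
hypothetical helpers and not part of ``waeup.utils.batching``:

```python
import csv
import io


def pending_name(source_path, mode):
    """Build the name of the failed-rows file (hypothetical helper)."""
    return "%s.%s.pending" % (source_path, mode)


def write_pending(stream, fieldnames, failed_rows):
    """Write (row, error message) pairs as CSV with an extra error column."""
    columns = list(fieldnames) + ["--ERRORS--"]
    writer = csv.DictWriter(stream, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row, message in failed_rows:
        out = dict(row)              # copy, so the caller's row is untouched
        out["--ERRORS--"] = message  # reason why this row failed
        writer.writerow(out)


buf = io.StringIO()
write_pending(
    buf,
    ["name", "owner"],
    [({"name": "Barneys Home", "owner": "Barney"},
      "This object already exists. Skipping.")],
)
print(pending_name("foo.csv", "create"))
print(buf.getvalue(), end="")
```

The real processor decides on its own column order and file location;
the sketch only mirrors the naming scheme described above.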

It looks like this::

      -----+      +---------+
     /     |      |         |              +------+
    | .csv +----->|Batch-   |              |      |
    |      |      |processor+----changes-->| ZODB |
    |  +------+   |         |              |      |
    +--|      |   |         +              +------+
       | Mode +-->|         |                 -------+
       |      |   |         +----outputs-+-> /       |
       |      |   +---------+            |  |.pending|
       +------+        ^                 |  |        |
                       |                 |  +--------+
                 +-----++                v
                 |Inter-|              -----+
                 |face  |             /     |
                 +------+            | .msg |
                                     |      |
                                     +------+


Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...     """A cave."""
    ...     name = schema.TextLine(
    ...         title = u'Cave name',
    ...         default = u'Unnamed',
    ...         required = True)
    ...     dinoports = schema.Int(
    ...         title = u'Number of DinoPorts (tm)',
    ...         required = False,
    ...         default = 1)
    ...     owner = schema.TextLine(
    ...         title = u'Owner name',
    ...         required = True,
    ...         missing_value = 'Fred Estates Inc.')
    ...     taxpayer = schema.Bool(
    ...         title = u'Pays taxes',
    ...         required = True,
    ...         default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...     grok.implements(ICave)
    ...     def __init__(self, name=u'Unnamed', dinoports=2,
    ...                  owner='Fred Estates Inc.', taxpayer=False):
    ...         self.name = name
    ...         self.dinoports = dinoports
    ...         self.owner = owner
    ...         self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we define a place where all caves
can be stored: Stoneville! This is a lovely place for upper-class
cavemen (who are the only ones that can afford more than one
dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything is in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...     util_name = 'caveprocessor'
    ...     grok.name(util_name)
    ...     name = 'Cave Processor'
    ...     iface = ICave
    ...     location_fields = ['name']
    ...     factory_name = 'Lovely Cave'
    ...
    ...     def parentsExist(self, row, site):
    ...         return True
    ...
    ...     def getParent(self, row, site):
    ...         return stoneville
    ...
    ...     def entryExists(self, row, site):
    ...         return row['name'] in stoneville.keys()
    ...
    ...     def getEntry(self, row, site):
    ...         if not self.entryExists(row, site):
    ...             return None
    ...         return stoneville[row['name']]
    ...
    ...     def delEntry(self, row, site):
    ...         del stoneville[row['name']]
    ...
    ...     def addEntry(self, obj, row, site):
    ...         stoneville[row['name']] = obj
    ...
    ...     def updateEntry(self, obj, row, site):
    ...         for key, value in row.items():
    ...             setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                    ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

As you may have noticed, when calling the processor we passed more
than just the CSV file path. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.
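
How supplied header names can take precedence over the header line in
the file is easy to picture with the standard ``csv`` module. The
following Python 3 sketch is an assumption about the general
technique, not the code used by ``BatchProcessor``; ``read_rows`` is a
hypothetical helper:

```python
import csv
import io


def read_rows(stream, headerfields):
    """Yield rows as dicts keyed by the supplied header names.

    The header line stored in the file is read and thrown away, so a
    misspelled header in the file does no harm.
    """
    reader = csv.DictReader(stream, fieldnames=headerfields)
    next(reader, None)  # discard the header line found in the file
    for row in reader:
        yield row


# The file header is faulty, the names we pass in are correct.
data = io.StringIO("nmae,dinoprts\nBarneys Home,2\n")
rows = list(read_rows(data, ["name", "dinoports"]))
print(rows)
```

Note that the values are still plain strings at this point; turning
``'2'`` into a number is a separate conversion step.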

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.
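
To make the three modes and the returned pair of numbers more
tangible, here is a much simplified Python 3 sketch. It works on a
plain dict instead of a site, and the function name ``process`` and
its ``key`` argument are made up for this illustration; the real
processor does considerably more (conversion, logging, pending files):

```python
def process(rows, store, mode, key="name"):
    """Apply rows to a dict-like store; return (processed, warnings)."""
    if mode not in ("create", "update", "remove"):
        raise ValueError("unknown mode: %r" % mode)
    processed = warnings = 0
    for row in rows:
        processed += 1
        exists = row[key] in store
        if mode == "create" and not exists:
            store[row[key]] = dict(row)
        elif mode == "update" and exists:
            store[row[key]].update(row)
        elif mode == "remove" and exists:
            del store[row[key]]
        else:
            # creating a duplicate, or updating/removing a missing entry
            warnings += 1
    return processed, warnings


store = {}
rows = [{"name": "Barneys Home", "owner": "Barney"}]
first = process(rows, store, "create")   # entry is new
second = process(rows, store, "create")  # entry already exists
print(first, second)
```

Running the same 'create' twice yields a warning for every row the
second time, because the entries already exist.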

We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> #print open('newcomers.csv.create.msg').read()
    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

As we can see, the processing was successful. Otherwise, all problems
would be reported here, as we can see if we run the same operation
again:

    >>> processor.doImport('newcomers.csv',
    ...                    ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 4)

The log file will tell us this in more detail:

    >>> #print open('newcomers.csv.create.msg').read()
    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.csv.create.pending
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open('newcomers.csv.create.pending').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.
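
One possible way to prepare such a retry is to copy the pending file
while dropping the error column. Whether the processor would accept
the extra column as is, is not stated here, so the Python 3 sketch
below simply removes it; ``strip_error_column`` is a hypothetical
helper:

```python
import csv
import io


def strip_error_column(src, dst, error_column="--ERRORS--"):
    """Copy CSV data from src to dst without the error column."""
    reader = csv.DictReader(src)
    fieldnames = [f for f in reader.fieldnames if f != error_column]
    writer = csv.DictWriter(
        dst, fieldnames=fieldnames,
        extrasaction="ignore",  # silently skip the dropped column
        lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


pending = io.StringIO(
    "owner,name,--ERRORS--\n"
    "Barney,Barneys Home,This object already exists. Skipping.\n")
retry = io.StringIO()
strip_error_column(pending, retry)
print(retry.getvalue(), end="")
```

After fixing the offending values by hand, the resulting file can be
fed to the processor again.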

We also notice that the values of the taxpayer column are written back
as they appear in the input file. There we wrote '1' for ``True`` and
'0' for ``False`` (which is accepted by the converters).
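
A converter of this kind might look roughly as follows. This Python 3
sketch only covers the two spellings mentioned above; which other
spellings the real converters accept is not covered here, and
``to_bool`` is a hypothetical name:

```python
def to_bool(text):
    """Convert the CSV strings '1' and '0' to True and False."""
    mapping = {"1": True, "0": False}
    try:
        return mapping[text.strip()]
    except KeyError:
        raise ValueError("not a boolean: %r" % text)


print(to_bool("1"), to_bool("0"))
```

The string from the input file stays untouched; only the value stored
on the object is converted.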


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

Invalid values are spotted as well:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!
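
A header check of this kind can be imagined as below. The Python 3
sketch defines its own stand-in ``FatalCSVError`` class and a
hypothetical ``check_headers`` function; only the wording of the
message is taken from the traceback above:

```python
class FatalCSVError(Exception):
    """Stand-in for the exception of the same name shown above."""


def check_headers(headerfields, location_fields):
    """Raise FatalCSVError unless all location fields are present."""
    missing = [f for f in location_fields if f not in headerfields]
    if missing:
        needed = ", ".join("'%s'" % f for f in location_fields)
        raise FatalCSVError("Need at least columns %s for import!" % needed)


check_headers(["name", "owner"], ["name"])  # fine, nothing happens
try:
    check_headers(["dinoports", "owner"], ["name"])
except FatalCSVError as err:
    message = str(err)
    print(message)
```

Because the check runs before any row is touched, a missing location
field aborts the whole run instead of producing per-row warnings.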

We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Generally, empty strings are considered as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True
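
The rule "empty string means ``None``" for a non-required number field
can be sketched like this in Python 3. ``to_optional_int`` is a
hypothetical helper and not the converter used by the batch
processor:

```python
def to_optional_int(text):
    """Return None for an empty string, otherwise the integer value."""
    text = text.strip()
    if text == "":
        return None
    return int(text)  # raises ValueError for things like 'NOT-A-NUMBER'


print(to_optional_int(""), to_optional_int("2"))
```

An invalid value such as ``NOT-A-NUMBER`` still fails, which matches
the warning we got for it earlier.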

Removing entries
----------------

In 'remove' mode we can delete entries. Here validity of values in
non-location fields doesn't matter because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.csv.create.pending')
    >>> os.unlink('stoneville.log')
