source: waeup/trunk/src/waeup/utils/batching.txt @ 4879

Last change on this file since 4879 was 4879, checked in by uli, 15 years ago
Update tests.
File size: 13.2 KB

:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where the data
  is to be stored, removed, or updated.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.

Output
------

The results of processing are written to log files. In addition, a new
CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named after the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, a file
'foo.csv.create.pending' will be generated (if the operation mode was
'create'). The .pending file is a CSV file that contains the failed
rows plus an additional column ``--ERRORS--`` in which the reasons for
the processing failures are listed.

It looks like this::

      -----+      +---------+
     /     |      |         |              +------+
    | .csv +----->|Batch-   |              |      |
    |      |      |processor+----changes-->| ZODB |
    |  +------+   |         |              |      |
    +--|      |   |         +              +------+
       | Mode +-->|         |                  -------+
       |      |   |         +----outputs-+->  /       |
       |      |   +---------+            |   |.pending|
       +------+        ^                 |   |        |
                       |                 |   +--------+
                 +-----++                v
                 |Inter-|              -----+
                 |face  |             /     |
                 +------+            | .msg |
                                     |      |
                                     +------+

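The naming scheme and the extra error column described above can be
sketched with the standard library alone. This is only an
illustration, not the code used by ``waeup.utils.batching``: the
helper names ``pending_name`` and ``write_pending`` are made up for
this example, and the sketch targets a current Python 3 rather than
the Python 2 used in the doctests of this file.

```python
import csv
import io


def pending_name(path, mode):
    """Build the name of the failed-rows file from input path and mode."""
    return '%s.%s.pending' % (path, mode)


def write_pending(stream, fieldnames, failed_rows):
    """Write failed rows plus an ``--ERRORS--`` column to ``stream``.

    ``failed_rows`` is a list of (row_dict, error_message) pairs.
    """
    writer = csv.DictWriter(
        stream, fieldnames=list(fieldnames) + ['--ERRORS--'],
        lineterminator='\n')
    writer.writeheader()
    for row, error in failed_rows:
        out = dict(row)              # do not modify the caller's row
        out['--ERRORS--'] = error
        writer.writerow(out)


buf = io.StringIO()
write_pending(
    buf, ['name', 'owner'],
    [({'name': 'Barneys Home', 'owner': 'Barney'},
      'This object already exists. Skipping.')])
print(pending_name('foo.csv', 'create'))
print(buf.getvalue())
```

Keeping the original columns and appending the error text as the last
column means a corrected .pending file can be fed back to a processor
after dropping that one column.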

Creating a batch processor
==========================

We create our own batch processor for our own data type. This data
type must be based on an interface that the batcher can use for
converting data.

Founding Stoneville
-------------------

We start with the interface:

>>> from zope.interface import Interface
>>> from zope import schema
>>> class ICave(Interface):
...     """A cave."""
...     name = schema.TextLine(
...         title = u'Cave name',
...         default = u'Unnamed',
...         required = True)
...     dinoports = schema.Int(
...         title = u'Number of DinoPorts (tm)',
...         required = False,
...         default = 1)
...     owner = schema.TextLine(
...         title = u'Owner name',
...         required = True,
...         missing_value = 'Fred Estates Inc.')
...     taxpayer = schema.Bool(
...         title = u'Pays taxes',
...         required = True,
...         default = False)

Now a class that implements this interface:

>>> import grok
>>> class Cave(object):
...     grok.implements(ICave)
...     def __init__(self, name=u'Unnamed', dinoports=2,
...                  owner='Fred Estates Inc.', taxpayer=False):
...         self.name = name
...         self.dinoports = dinoports
...         self.owner = owner
...         self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

>>> from zope.component import getGlobalSiteManager
>>> from zope.component.factory import Factory
>>> from zope.component.interfaces import IFactory
>>> gsm = getGlobalSiteManager()
>>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
>>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

>>> from zope.component import createObject
>>> createObject('Lovely Cave')
<Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we define a place where all caves
can be stored: Stoneville! This is a lovely place for upper-class
cavemen (the only ones who can afford more than one dinoport).

We found Stoneville:

>>> stoneville = dict()

Everything is in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives who slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

>>> cave1 = createObject('Lovely Cave')
>>> cave1.name = "Fred's home"
>>> cave1.owner = "Fred"
>>> stoneville[cave1.name] = cave1

and Stoneville has exactly

>>> len(stoneville)
1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner,taxpayer
... Barneys Home,2,Barney,1
... Wilmas Asylum,1,Wilma,1
... Freds Dinoburgers,10,Fred,0
... Joeys Drive-in,110,Joey,0
... """)

All we need now is a batch processor.

>>> from waeup.utils.batching import BatchProcessor
>>> class CaveProcessor(BatchProcessor):
...     util_name = 'caveprocessor'
...     grok.name(util_name)
...     name = 'Cave Processor'
...     iface = ICave
...     location_fields = ['name']
...     factory_name = 'Lovely Cave'
...
...     def parentsExist(self, row, site):
...         return True
...
...     def getParent(self, row, site):
...         return stoneville
...
...     def entryExists(self, row, site):
...         return row['name'] in stoneville.keys()
...
...     def getEntry(self, row, site):
...         if not self.entryExists(row, site):
...             return None
...         return stoneville[row['name']]
...
...     def delEntry(self, row, site):
...         del stoneville[row['name']]
...
...     def addEntry(self, obj, row, site):
...         stoneville[row['name']] = obj
...
...     def updateEntry(self, obj, row, site):
...         for key, value in row.items():
...             setattr(obj, key, value)

Create the fellows:

>>> processor = CaveProcessor()
>>> processor.doImport('newcomers.csv',
...                    ['name', 'dinoports', 'owner', 'taxpayer'],
...                    mode='create', user='Bob')
(4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

>>> sorted(stoneville.keys())
[u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

>>> barney = stoneville['Barneys Home']
>>> barney.dinoports
2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV file path. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.
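
How an ordered list of header names can replace the header line of a
file is easy to show with the standard ``csv`` module. The following
is a minimal sketch under the assumption that the first line of the
file is always a header line to be discarded; the function name
``read_rows`` is hypothetical and not part of
``waeup.utils.batching``, and the code is written for Python 3.

```python
import csv
import io


def read_rows(stream, headerfields):
    """Return row dicts keyed by ``headerfields``.

    The header line stored in the file itself is read and thrown away,
    so a misspelled header in the file does no harm.
    """
    reader = csv.DictReader(stream, fieldnames=headerfields)
    next(reader, None)   # skip the file's own header line
    return list(reader)


# The file's header contains a typo ('nmae'), which we override.
data = io.StringIO("nmae,dinoports,owner\nBarneys Home,2,Barney\n")
rows = read_rows(data, ['name', 'dinoports', 'owner'])
print(rows[0]['name'])
```

Note that all values are still plain strings at this point; turning
them into numbers or booleans is a separate step.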

We can, by the way, see the results of our run in a log file which is
named ``newcomers.csv.create.msg``:

>>> print open('newcomers.csv.create.msg').read()
Source: newcomers.csv
Mode: create
Date: ...
User: Bob
Failed datasets: newcomers.csv.create.pending
Processing time: ... s (... s/item)
Processed: 4 lines (4 successful/ 0 failed)
<BLANKLINE>

As we can see, the processing was successful. Had there been problems,
they would be reported here, as we can see if we run the same
operation again:

>>> processor.doImport('newcomers.csv',
...                    ['name', 'dinoports', 'owner', 'taxpayer'],
...                    mode='create', user='Bob')
(4, 4)

The log file will tell us this in more detail:

>>> print open('newcomers.csv.create.msg').read()
Source: newcomers.csv
Mode: create
Date: ...
User: Bob
Failed datasets: newcomers.csv.create.pending
Processing time: ... s (... s/item)
Processed: 4 lines (0 successful/ 4 failed)

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

>>> print open('newcomers.csv.create.pending').read()
owner,name,taxpayer,dinoports,--ERRORS--
Barney,Barneys Home,1,2,This object already exists. Skipping.
Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).

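The conversion from CSV strings to typed values can be pictured with a
small stand-alone sketch. It does not use the real converters, which
are derived from the interface; the ``CONVERTERS`` table and the
functions ``to_bool`` and ``convert_row`` are assumptions made for
this example only, and rejecting every boolean spelling other than
'1' and '0' is a simplification. The code is written for Python 3.

```python
def to_bool(text):
    """Map '1' to True and '0' to False; reject anything else."""
    if text == '1':
        return True
    if text == '0':
        return False
    raise ValueError('not a boolean: %r' % text)


# Hypothetical mapping from column name to converter function.
CONVERTERS = {'dinoports': int, 'taxpayer': to_bool}


def convert_row(row):
    """Return a copy of ``row`` with known columns converted from strings."""
    return dict(
        (key, CONVERTERS.get(key, str)(value))
        for key, value in row.items())


row = {'name': 'Barneys Home', 'dinoports': '2',
       'owner': 'Barney', 'taxpayer': '1'}
converted = convert_row(row)
print(converted['dinoports'] + 1)   # arithmetic works, the value is an int
print(converted['taxpayer'])
```

Because the input row is copied rather than changed, the original
strings stay available and can be written unchanged to a .pending
file when a row fails.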

Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='update', user='Bob')
(4, 0)

Now we want to record that Wilma got an extra port for her second dino:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... Wilmas Asylum,2,Wilma
... """)

>>> wilma = stoneville['Wilmas Asylum']
>>> wilma.dinoports
1

We start the processor:

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='update', user='Bob')
(1, 0)

>>> wilma = stoneville['Wilmas Asylum']
>>> wilma.dinoports
2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... NOT-WILMAS-ASYLUM,2,Wilma
... """)

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='update', user='Bob')
(1, 1)

Invalid values are spotted as well:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... Wilmas Asylum,NOT-A-NUMBER,Wilma
... """)

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='update', user='Bob')
(1, 1)

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

>>> open('newcomers.csv', 'wb').write(
... """name,owner
... Wilmas Asylum,Barney
... """)

>>> processor.doImport('newcomers.csv', ['name', 'owner'],
...                    mode='update', user='Bob')
(1, 0)

>>> wilma.owner
u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... 2,Wilma
... """)

>>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
...                    mode='update', user='Bob')
Traceback (most recent call last):
...
FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!
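
A check of this kind might look roughly like the following sketch. It
is not the actual header check of ``BatchProcessor``: the
``FatalCSVError`` class is redefined locally as a stand-in, the
function name ``check_headers`` is hypothetical, and only the error
message is modelled on the traceback above. The code is written for
Python 3.

```python
class FatalCSVError(Exception):
    """Local stand-in for the exception shown in the traceback above."""


def check_headers(headerfields, location_fields):
    """Raise FatalCSVError unless all location fields are among the headers."""
    missing = [name for name in location_fields if name not in headerfields]
    if missing:
        needed = ', '.join("'%s'" % name for name in location_fields)
        raise FatalCSVError('Need at least columns %s for import!' % needed)


check_headers(['name', 'owner'], ['name'])   # passes silently
try:
    check_headers(['dinoports', 'owner'], ['name'])
except FatalCSVError as err:
    message = str(err)
    print(message)
```

Raising an exception before any row is read, instead of logging one
failure per row, fits the situation: without the location field no
row at all could be assigned to an entry.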

We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum",,"Wilma"
... """)

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='update', user='Bob')
(1, 0)

>>> wilma.dinoports is None
True

Generally, empty strings are treated as ``None``:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","","Wilma"
... """)

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='update', user='Bob')
(1, 0)

>>> wilma.dinoports is None
True

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter because those fields are ignored.

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","ILLEGAL-NUMBER",""
... """)

>>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
...                    mode='remove', user='Bob')
(1, 0)

>>> sorted(stoneville.keys())
[u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

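The behaviour of the three modes seen in this document can be
summarised in a small stand-alone sketch. It is a simplified model and
not the implementation of ``BatchProcessor``: the function ``process``
is hypothetical, entries are plain dicts keyed by the location field
``name``, and treating the removal of a missing entry as a failure is
an assumption, since that case is not exercised above. The code is
written for Python 3.

```python
def process(rows, store, mode):
    """Apply ``rows`` to ``store`` and return (processed, failed)."""
    if mode not in ('create', 'update', 'remove'):
        raise ValueError('unknown mode: %r' % mode)
    processed = failed = 0
    for row in rows:
        processed += 1
        name = row['name']           # the location field
        exists = name in store
        if mode == 'create' and not exists:
            store[name] = dict(row)
        elif mode == 'update' and exists:
            store[name].update(row)  # columns left out stay untouched
        elif mode == 'remove' and exists:
            del store[name]          # all other columns are ignored
        else:
            failed += 1              # such a row would go to the .pending file
    return processed, failed


caves = {}
newcomers = [{'name': 'Barneys Home', 'owner': 'Barney'},
             {'name': 'Wilmas Asylum', 'owner': 'Wilma'}]
first = process(newcomers, caves, 'create')
second = process(newcomers, caves, 'create')
removed = process([{'name': 'Wilmas Asylum'}], caves, 'remove')
print(first, second, removed, sorted(caves))
```

As in the doctests above, running the same 'create' import twice
reports every row as failed the second time, because all entries
already exist.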

Clean up:

>>> import os
>>> os.unlink('newcomers.csv')
>>> os.unlink('newcomers.csv.create.pending')
>>> os.unlink('newcomers.csv.create.msg')
>>> os.unlink('newcomers.csv.remove.msg')
>>> os.unlink('newcomers.csv.update.msg')
