:mod:`waeup.utils.batching` -- Batch processing
***********************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove, or update the data (see the method sketch below).

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates logs and failed-data CSV files.
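
The following is a rough, hypothetical sketch of the hooks such a
processor provides (the names mirror the ``CaveProcessor`` example
further below; the actual base class may expect more or fewer
overrides)::

    class MyProcessor(BatchProcessor):
        iface = IMyDataType          # the data type handled
        location_fields = ['name']   # columns that locate an entry
        factory_name = 'My Factory'  # factory used in 'create' mode

        def getParent(self, row, site):
            """Return the container objects are stored in."""

        def entryExists(self, row, site):
            """Tell whether an entry for `row` already exists."""

        def addEntry(self, obj, row, site):
            """Store a freshly created object ('create' mode)."""

        def updateEntry(self, obj, row, site):
            """Apply the `row` values to an existing object ('update' mode)."""

        def delEntry(self, row, site):
            """Remove the object for `row` ('remove' mode)."""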

Output
------

The results of processing are written to logfiles. Besides this, a
new CSV file is created during processing, containing only those data
sets that could not be processed.

This new CSV file is named like the input file, with the mode and
'.pending' appended. So, when the input file is named 'foo.csv' and
something went wrong during processing, then a file
'foo.csv.create.pending' will be generated (if the operation mode was
'create'). The .pending file is a CSV file that contains the failed
rows plus an additional column ``--ERRORS--``, in which the reasons
for the processing failures are listed.
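
As a hypothetical illustration of that naming rule (not necessarily
the code used internally), the pending filename could be derived like
this::

    def pending_filename(input_path, mode):
        # 'foo.csv' processed in 'create' mode -> 'foo.csv.create.pending'
        return '%s.%s.pending' % (input_path, mode)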

It looks like this::

     -----+      +---------+
    /     |      |         |              +------+
   | .csv +----->|Batch-   |              |      |
   |      |      |processor+----changes-->| ZODB |
   |  +------+   |         |              |      |
   +--|      |   |         +              +------+
      | Mode +-->|         |                 -------+
      |      |   |         +----outputs-+-> /       |
      |      |   +---------+            |  |.pending|
      +------+   ^                      |  |        |
                 |                      |  +--------+
           +-----++                     v
           |Inter-|                  -----+
           |face  |                 /     |
           +------+                | .msg |
                                   |      |
                                   +------+


Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...   """A cave."""
    ...   name = schema.TextLine(
    ...     title = u'Cave name',
    ...     default = u'Unnamed',
    ...     required = True)
    ...   dinoports = schema.Int(
    ...     title = u'Number of DinoPorts (tm)',
    ...     required = False,
    ...     default = 1)
    ...   owner = schema.TextLine(
    ...     title = u'Owner name',
    ...     required = True,
    ...     missing_value = 'Fred Estates Inc.')
    ...   taxpayer = schema.Bool(
    ...     title = u'Pays taxes',
    ...     required = True,
    ...     default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...   grok.implements(ICave)
    ...   def __init__(self, name=u'Unnamed', dinoports=2,
    ...                owner='Fred Estates Inc.', taxpayer=False):
    ...     self.name = name
    ...     self.dinoports = dinoports
    ...     self.owner = owner
    ...     self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

Furthermore, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upperclass cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives who slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...   util_name = 'caveprocessor'
    ...   grok.name(util_name)
    ...   name = 'Cave Processor'
    ...   iface = ICave
    ...   location_fields = ['name']
    ...   factory_name = 'Lovely Cave'
    ...
    ...   def parentsExist(self, row, site):
    ...     return True
    ...
    ...   def getParent(self, row, site):
    ...     return stoneville
    ...
    ...   def entryExists(self, row, site):
    ...     return row['name'] in stoneville.keys()
    ...
    ...   def getEntry(self, row, site):
    ...     if not self.entryExists(row, site):
    ...       return None
    ...     return stoneville[row['name']]
    ...
    ...   def delEntry(self, row, site):
    ...     del stoneville[row['name']]
    ...
    ...   def addEntry(self, obj, row, site):
    ...     stoneville[row['name']] = obj
    ...
    ...   def updateEntry(self, obj, row, site):
    ...     for key, value in row.items():
    ...       setattr(obj, key, value)

If we also want the results to be logged, we can provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 0)

The result means: four entries were processed and no warnings
occurred. Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to
perform: ``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.
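
To sum up, a full call, as used in this test, looks like this (all
names as introduced above)::

    processor.doImport(
        'newcomers.csv',                             # path to the CSV file
        ['name', 'dinoports', 'owner', 'taxpayer'],  # header override
        mode='create',                               # or 'update'/'remove'
        user='Bob',                                  # only used for logging
        logger=logger)                               # optional logger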

We can, by the way, see the results of our run in a logfile, if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

As we can see, the processing was successful. Otherwise, all problems
could be read here, as we can see if we do the same operation again:

    >>> processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                    mode='create', user='Bob', logger=logger)
    (4, 4)

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.csv.create.pending
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could
not process, plus an additional column with error messages:

    >>> print open('newcomers.csv.create.pending').read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and retry afterwards,
without having the already processed rows in the way.
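
A retry could then, hypothetically, feed the corrected rows back into
the same processor, e.g. after saving them (without the ``--ERRORS--``
column) to a new file::

    # 'corrected.csv' is assumed to hold the fixed rows in the column
    # order of the .pending file shown above.
    processor.doImport('corrected.csv',
                       ['owner', 'name', 'taxpayer', 'dinoports'],
                       mode='create', user='Bob', logger=logger)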

We also notice that the values of the taxpayer column are returned as
they appear in the input file. There we wrote '1' for ``True`` and
'0' for ``False`` (which is accepted by the converters).
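
In other words, the converter for the boolean ``taxpayer`` field maps
such strings to booleans, roughly like this (a simplified sketch, not
the actual converter code)::

    def to_bool(value):
        # The real converters may accept more spellings than these.
        return {'1': True, '0': False}[value]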


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (4, 0)

Now we want to record that Wilma got an extra port for her second
dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

We start the processor:

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has increased.

If we try to update a nonexistent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

Invalid values will also be spotted:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 1)

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.owner
    u'Barney'

We cannot, however, leave out the 'location field' ('name' in our
case), as this is the one that tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

We can set dinoports to ``None``, although this is not a number, as we
declared the field not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Generally, empty strings are considered ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='update', user='Bob')
    (1, 0)

    >>> wilma.dinoports is None
    True

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values
in non-location fields doesn't matter, because those fields are
ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> processor.doImport('newcomers.csv', ['name', 'dinoports', 'owner'],
    ...                    mode='remove', user='Bob')
    (1, 0)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.


Clean up:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('newcomers.csv.create.pending')
    >>> os.unlink('stoneville.log')