
source: main/waeup.sirp/trunk/src/waeup/sirp/utils/batching.txt @ 5669

Last change on this file since 5669 was 4985, checked in by uli, 15 years ago
Add comment.
File size: 16.8 KB

:mod:`waeup.sirp.utils.batching` -- Batch processing
****************************************************

Batch processing is much more than pure data import.

:test-layer: functional

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to importers or processors that are part of
the batch-processing mechanism.

Importers/Processors
--------------------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows the places inside a site (University) where data is to be
  stored, removed or updated.

* can check headers before processing data.

* supports the modes 'create', 'update', 'remove'.

* creates log entries (optional)

* creates CSV files containing the successfully and the unsuccessfully
  processed data respectively.

Output
------

The results of processing are written to a logger, if one was
given. Besides this, new CSV files are created during processing:

* a pending CSV file, containing datasets that could not be processed

* a finished CSV file, containing datasets successfully processed.

The pending file is not created if everything works fine. The
respective path returned in that case is ``None``.

The pending file (if created) is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.

The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directories afterwards (the
datacenter's distProcessedFiles() method takes care of that).

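The output rules above can be sketched with the standard library alone.
This is a minimal illustration for a current Python 3, not the code
used by ``BatchProcessor``: the function name ``write_results``, the
``process_row`` callable, and the fixed file names are assumptions made
for the example.

```python
import csv
import os
import tempfile


def write_results(rows, fieldnames, process_row):
    """Split rows into a finished and, if needed, a pending CSV file.

    ``process_row`` is a hypothetical callable returning an error
    message for a row, or ``None`` if the row was processed fine.
    """
    tmp_dir = tempfile.mkdtemp()  # the caller has to remove this later
    finished, pending = [], []
    for row in rows:
        error = process_row(row)
        if error is None:
            finished.append(row)
        else:
            # Failed rows get an extra column holding the reason.
            pending.append(dict(row, **{'--ERRORS--': error}))

    finished_path = os.path.join(tmp_dir, 'newcomers.finished.csv')
    with open(finished_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, list(fieldnames))
        writer.writeheader()
        writer.writerows(finished)

    pending_path = None  # stays None if nothing failed
    if pending:
        pending_path = os.path.join(tmp_dir, 'newcomers.pending.csv')
        with open(pending_path, 'w', newline='') as fd:
            writer = csv.DictWriter(fd, list(fieldnames) + ['--ERRORS--'])
            writer.writeheader()
            writer.writerows(pending)
    return finished_path, pending_path
```

Both paths share one temporary directory, so a caller can clean up
with ``shutil.rmtree(os.path.dirname(finished_path))``.
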
It looks like this::

     -----+       +---------+
    /     |       |         |              +------+
    | .csv +----->|Batch-   |              |      |
    |     |       |processor+----changes-->| ZODB |
    |  +------+   |         |              |      |
    +--|      |   |         +              +------+
       | Mode +-->|         |                 -------+
       |      |   |         +----outputs-+-> /       |
       |  +----+->+---------+            |   |.pending|
       +--|Log |    ^                    |   |        |
          +----+    |                    |   +--------+
              +-----++                   v
              |Inter-|               ----------+
              |face  |              /          |
              +------+              | .finished |
                                    |          |
                                    +-----------+


Creating a batch processor
==========================

We create our own batch processor for our own data type. This data
type must be based on an interface that the batcher can use for
converting data.

Founding Stoneville
-------------------

We start with the interface:

>>> from zope.interface import Interface
>>> from zope import schema
>>> class ICave(Interface):
...     """A cave."""
...     name = schema.TextLine(
...         title = u'Cave name',
...         default = u'Unnamed',
...         required = True)
...     dinoports = schema.Int(
...         title = u'Number of DinoPorts (tm)',
...         required = False,
...         default = 1)
...     owner = schema.TextLine(
...         title = u'Owner name',
...         required = True,
...         missing_value = 'Fred Estates Inc.')
...     taxpayer = schema.Bool(
...         title = u'Pays taxes',
...         required = True,
...         default = False)

Now a class that implements this interface:

>>> import grok
>>> class Cave(object):
...     grok.implements(ICave)
...     def __init__(self, name=u'Unnamed', dinoports=2,
...                  owner='Fred Estates Inc.', taxpayer=False):
...         self.name = name
...         self.dinoports = dinoports
...         self.owner = owner
...         self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

>>> from zope.component import getGlobalSiteManager
>>> from zope.component.factory import Factory
>>> from zope.component.interfaces import IFactory
>>> gsm = getGlobalSiteManager()
>>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
>>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

>>> from zope.component import createObject
>>> createObject('Lovely Cave')
<Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

Therefore, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

>>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

>>> cave1 = createObject('Lovely Cave')
>>> cave1.name = "Fred's home"
>>> cave1.owner = "Fred"
>>> stoneville[cave1.name] = cave1

and Stoneville has exactly

>>> len(stoneville)
1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner,taxpayer
... Barneys Home,2,Barney,1
... Wilmas Asylum,1,Wilma,1
... Freds Dinoburgers,10,Fred,0
... Joeys Drive-in,110,Joey,0
... """)

All we need now is a batch processor.

>>> from waeup.sirp.utils.batching import BatchProcessor
>>> class CaveProcessor(BatchProcessor):
...     util_name = 'caveprocessor'
...     grok.name(util_name)
...     name = 'Cave Processor'
...     iface = ICave
...     location_fields = ['name']
...     factory_name = 'Lovely Cave'
...
...     def parentsExist(self, row, site):
...         return True
...
...     def getParent(self, row, site):
...         return stoneville
...
...     def entryExists(self, row, site):
...         return row['name'] in stoneville.keys()
...
...     def getEntry(self, row, site):
...         if not self.entryExists(row, site):
...             return None
...         return stoneville[row['name']]
...
...     def delEntry(self, row, site):
...         del stoneville[row['name']]
...
...     def addEntry(self, obj, row, site):
...         stoneville[row['name']] = obj
...
...     def updateEntry(self, obj, row, site):
...         # This is not strictly necessary, as the default
...         # updateEntry method does exactly the same
...         for key, value in row.items():
...             setattr(obj, key, value)

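The hook methods above suggest how a driver loop might treat a single
row in each mode. The following sketch is an assumption for
illustration only: it uses a plain dict as the store instead of the
hooks, it is not the ``doImport`` implementation, and apart from the
"already exists" message (which appears in the pending file shown
later in this document) the error wording is made up.

```python
MODES = ('create', 'update', 'remove')


def process_row(row, mode, store):
    """Apply one CSV row to ``store``; return an error message or None."""
    if mode not in MODES:
        raise ValueError('unknown mode: %r' % (mode,))
    key = row['name']  # 'name' is the location field in this example
    if mode == 'create':
        if key in store:  # compare entryExists()
            return 'This object already exists. Skipping.'
        store[key] = dict(row)  # compare addEntry()
        return None
    if key not in store:
        # Hypothetical wording; the real message is not shown here.
        return 'Cannot %s: no such entry.' % mode
    if mode == 'update':
        store[key].update(row)  # compare updateEntry()
    else:
        del store[key]  # compare delEntry()
    return None
```

A returned message marks the row as failed, which is what would send
it to the pending file.
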
If we also want the results to be logged, we must provide a logger
(this is optional):

>>> import logging
>>> logger = logging.getLogger('stoneville')
>>> logger.setLevel(logging.DEBUG)
>>> logger.propagate = False
>>> handler = logging.FileHandler('stoneville.log', 'w')
>>> logger.addHandler(handler)

Create the fellows:

>>> processor = CaveProcessor()
>>> result = processor.doImport('newcomers.csv',
...                   ['name', 'dinoports', 'owner', 'taxpayer'],
...                   mode='create', user='Bob', logger=logger)
>>> result
(4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore, we get a file path to a CSV file with
successfully processed entries and a file path to a CSV file with
erroneous entries. As everything went well, the latter is ``None``.
Let's check:

>>> sorted(stoneville.keys())
[u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

>>> barney = stoneville['Barneys Home']
>>> barney.dinoports
2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV file path. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

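Overriding the header names stored in a file can be illustrated with
the standard ``csv`` module. This is a sketch under the assumption
that the first line of the file is always a header line; the helper
name ``read_rows`` is hypothetical and the code is not taken from
``BatchProcessor``.

```python
import csv
import io


def read_rows(csv_text, headerfields):
    """Read CSV rows using ``headerfields`` instead of the file's header."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=headerfields)
    next(reader)  # skip the (possibly faulty) header line in the file
    return [dict(row) for row in reader]
```

With a misspelled header such as ``nmae,dinoports``, passing
``['name', 'dinoports']`` still yields rows keyed by ``name``.
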
We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

>>> print open('stoneville.log').read()
--------------------
Bob: Batch processing finished: OK
Bob: Source: newcomers.csv
Bob: Mode: create
Bob: User: Bob
Bob: Processing time: ... s (... s/item)
Bob: Processed: 4 lines (4 successful/ 0 failed)
--------------------

We clean up the temporary directory created by doImport():

>>> import shutil
>>> import os
>>> shutil.rmtree(os.path.dirname(result[2]))

As we can see, the processing was successful. Had it failed, the
problems would be reported here, as we can see if we repeat the same
operation:

>>> result = processor.doImport('newcomers.csv',
...                   ['name', 'dinoports', 'owner', 'taxpayer'],
...                   mode='create', user='Bob', logger=logger)
>>> result
(4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

>>> print open('stoneville.log').read()
--------------------
...
--------------------
Bob: Batch processing finished: FAILED
Bob: Source: newcomers.csv
Bob: Mode: create
Bob: User: Bob
Bob: Failed datasets: newcomers.pending.csv
Bob: Processing time: ... s (... s/item)
Bob: Processed: 4 lines (0 successful/ 4 failed)
--------------------

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

>>> print open(result[3]).read()
owner,name,taxpayer,dinoports,--ERRORS--
Barney,Barneys Home,1,2,This object already exists. Skipping.
Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
they were in the input file. There we wrote '1' for ``True`` and '0'
for ``False`` (which is accepted by the converters).

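The '1'/'0' convention can be written down as a tiny converter. This is
only a sketch: the function name ``to_bool`` is hypothetical, and
whether the real converters accept further spellings is not stated
here, so the example accepts exactly these two.

```python
def to_bool(value):
    """Convert the CSV strings '1' and '0' to True and False."""
    if value == '1':
        return True
    if value == '0':
        return False
    raise ValueError('not a valid boolean value: %r' % (value,))
```

The original string stays in the row data, which is why the pending
file can show the values exactly as they were typed.
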
Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We can also tell the processor to ignore some columns of the input by
passing ``--IGNORE--`` as the column name:

>>> result = processor.doImport('newcomers.csv', ['name',
...                             '--IGNORE--', '--IGNORE--'],
...                             mode='update', user='Bob')
>>> result
(4, 0, '...', None)

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))

If something goes wrong during processing, the respective ``--IGNORE--``
columns will be populated correctly in the resulting pending file:

>>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
...                             '--IGNORE--', '--IGNORE--'],
...                             mode='create', user='Bob')
>>> result
(4, 4, '...', '...')

>>> print open(result[3], 'rb').read()
--IGNORE--,name,--IGNORE--,dinoports,--ERRORS--
Barney,Barneys Home,1,2,This object already exists. Skipping.
Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

The first ignored column ('owner') provides different contents than
the second one ('taxpayer').

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


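Dropping ignored columns while reading can be sketched as follows. The
helper name ``read_rows_ignoring`` is an assumption, and the sketch
only covers the reading side; it does not reproduce how the pending
file keeps the ignored values.

```python
import csv
import io

IGNORE_MARKER = '--IGNORE--'


def read_rows_ignoring(csv_text, headerfields):
    """Return row dicts without the columns marked as ignored."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line stored in the file
    rows = []
    for values in reader:
        # Pair each value with its header name by position, so that
        # several ignored columns cannot overwrite each other.
        rows.append({name: value
                     for name, value in zip(headerfields, values)
                     if name != IGNORE_MARKER})
    return rows
```

Pairing by position rather than by name matters here, because the
marker may appear more than once in the header list.
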
Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

>>> result = processor.doImport('newcomers.csv', ['name',
...                             'dinoports', 'owner'],
...                             mode='update', user='Bob')
>>> result
(4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second dino:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... Wilmas Asylum,2,Wilma
... """)

>>> wilma = stoneville['Wilmas Asylum']
>>> wilma.dinoports
1

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We start the processor:

>>> result = processor.doImport('newcomers.csv', ['name',
...                     'dinoports', 'owner'], mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma = stoneville['Wilmas Asylum']
>>> wilma.dinoports
2

Wilma's number of dinoports has increased.

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


If we try to update a non-existent entry, an error occurs:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... NOT-WILMAS-ASYLUM,2,Wilma
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...                             'dinoports', 'owner'],
...                             mode='update', user='Bob')
>>> result
(1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


Invalid values are detected as well:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... Wilmas Asylum,NOT-A-NUMBER,Wilma
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...                             'dinoports', 'owner'],
...                             mode='update', user='Bob')
>>> result
(1, 1, '...', '...')

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

>>> open('newcomers.csv', 'wb').write(
... """name,owner
... Wilmas Asylum,Barney
... """)

>>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
...                             mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma.owner
u'Barney'

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... 2,Wilma
... """)

>>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
...                    mode='update', user='Bob')
Traceback (most recent call last):
...
FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

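A header check of this kind could look like the sketch below. The
``FatalCSVError`` class defined here is a local stand-in, not the one
from ``waeup.sirp``, and the function name ``check_location_fields``
is hypothetical; only the message text is taken from the traceback
above.

```python
class FatalCSVError(Exception):
    """Stand-in for the exception raised on unusable CSV headers."""


def check_location_fields(headerfields, location_fields):
    """Raise FatalCSVError unless all location fields are present."""
    missing = [name for name in location_fields
               if name not in headerfields]
    if missing:
        needed = ', '.join("'%s'" % name for name in location_fields)
        raise FatalCSVError(
            'Need at least columns %s for import!' % needed)
```

Checking the headers before touching any row means a missing location
field aborts the whole run instead of producing one pending row per
line.
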
We can tell the processor to set dinoports to ``None``, although this
is not a number, because we declared the field as not required in the
interface:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum",,"Wilma"
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...                             'dinoports', 'owner'],
...                             mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma.dinoports is None
True

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))

Generally, empty strings are considered as ``None``:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","","Wilma"
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...                             'dinoports', 'owner'],
...                             mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma.dinoports is None
True

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


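The treatment of empty and invalid values seen in the update examples
can be summarised in a small converter. This is a sketch under stated
assumptions: the name ``to_int_or_none`` and its ``required`` flag are
made up for the example, and the real conversion is driven by the
interface schema rather than by a function like this.

```python
def to_int_or_none(value, required=False):
    """Convert a CSV string to an int, treating '' as None."""
    if value is None or value.strip() == '':
        if required:
            raise ValueError('a value is required')
        return None  # empty strings count as "no value"
    return int(value)  # raises ValueError for e.g. 'NOT-A-NUMBER'
```

A ``ValueError`` raised here corresponds to a row that would end up in
the pending file.
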
Removing entries
----------------

In 'remove' mode we can delete entries. Here validity of values in
non-location fields doesn't matter because those fields are ignored.

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","ILLEGAL-NUMBER",""
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...                             'dinoports', 'owner'],
...                             mode='remove', user='Bob')
>>> result
(1, 0, '...', None)

>>> sorted(stoneville.keys())
[u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


Clean up:

>>> import os
>>> os.unlink('newcomers.csv')
>>> os.unlink('stoneville.log')
