source: main/waeup.kofa/trunk/src/waeup/kofa/utils/batching.txt @ 8613

Last change on this file since 8613 was 8330, checked in by Henrik Bettermann, 12 years ago
When using catalogs existing objects must not necessarily be in the same container.
:mod:`waeup.kofa.utils.batching` -- Batch processing
****************************************************

Batch processing is much more than pure data import.

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

So you can feed CSV files to processors that are part of
the batch-processing mechanism.

Processors
----------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', 'remove'.

* creates log entries (optional).

* creates CSV files containing the successfully and the unsuccessfully
  processed data, respectively.

Output
------

The results of processing are written to a logger, if one was
given. Besides this, new CSV files are created during processing:

* a pending CSV file, containing datasets that could not be processed.

* a finished CSV file, containing datasets successfully processed.

The pending file is not created if everything works fine. The
respective path returned in that case is ``None``.

The pending file (if created) is a CSV file that contains the failed
rows with an additional column ``--ERRORS--``, in which the reasons for
processing failures are listed.
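
The layout of such a pending file can be pictured with a few lines of
plain standard-library Python. This is only a sketch of the file format
described above, not the code used by ``BatchProcessor``; the function
name ``write_pending`` and the shape of its ``rows`` argument are
assumptions made for illustration.

```python
import csv
import io

# Column name as it appears in the pending file output of this document.
ERRORS_COLUMN = "--ERRORS--"


def write_pending(rows, fieldnames, out):
    """Write failed rows plus an error column to the file-like object `out`.

    `rows` is a list of (row_dict, error_message) pairs (hypothetical shape).
    """
    writer = csv.DictWriter(out, fieldnames=list(fieldnames) + [ERRORS_COLUMN])
    writer.writeheader()
    for row, message in rows:
        record = dict(row)            # copy, so the caller's row stays untouched
        record[ERRORS_COLUMN] = message
        writer.writerow(record)


buffer = io.StringIO()
write_pending(
    [({"name": "Barneys Home", "dinoports": "2"},
      "This object already exists. Skipping.")],
    ["name", "dinoports"],
    buffer,
)
pending_text = buffer.getvalue()
```

Each failed row keeps its original values, so the file can be corrected
and fed to a processor again.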

The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directories afterwards (the
datacenter's ``distProcessedFiles()`` method takes care of that).
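
The cleanup duty of the caller can be sketched as follows. The
directory here is created by hand with ``tempfile.mkdtemp()`` and only
stands in for the directory a real import would create; the file name
is taken from the examples in this document.

```python
import os
import shutil
import tempfile

# Stand-in for the temporary directory created during an import.
workdir = tempfile.mkdtemp()
finished_path = os.path.join(workdir, "newcomers.finished.csv")
with open(finished_path, "w") as handle:
    handle.write("name,dinoports\n")

existed_before = os.path.exists(finished_path)

# The caller only knows the returned file path, so the directory to
# remove is derived from it.
shutil.rmtree(os.path.dirname(finished_path))
exists_after = os.path.exists(workdir)
```

Removing the whole directory, not just the single file, also gets rid
of a pending file that may sit next to the finished one.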

The overall data flow looks like this::

         -----+      +---------+
        /     |      |         |              +------+
       | .csv +----->|Batch-   |              |      |
       |      |      |processor+----changes-->| ZODB |
       |  +------+   |         |              |      |
       +--|      |   |         +              +------+
          | Mode +-->|         |                  -------+
          |      |   |         +----outputs-+->  /       |
          |  +----+->+---------+            |   |.pending|
          +--|Log |  ^                      |   |        |
             +----+  |                      |   +--------+
               +-----++                     v
               |Inter-|                 ----------+
               |face  |                /          |
               +------+               | .finished |
                                      |           |
                                      +-----------+


Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

>>> from zope.interface import Interface
>>> from zope import schema
>>> class ICave(Interface):
...     """A cave."""
...     name = schema.TextLine(
...         title = u'Cave name',
...         default = u'Unnamed',
...         required = True)
...     dinoports = schema.Int(
...         title = u'Number of DinoPorts (tm)',
...         required = False,
...         default = 1)
...     owner = schema.TextLine(
...         title = u'Owner name',
...         required = True,
...         missing_value = 'Fred Estates Inc.')
...     taxpayer = schema.Bool(
...         title = u'Pays taxes',
...         required = True,
...         default = False)

Now a class that implements this interface:

>>> import grok
>>> class Cave(object):
...     grok.implements(ICave)
...     def __init__(self, name=u'Unnamed', dinoports=2,
...                  owner='Fred Estates Inc.', taxpayer=False):
...         self.name = name
...         self.dinoports = dinoports
...         self.owner = owner
...         self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

>>> from zope.component import getGlobalSiteManager
>>> from zope.component.factory import Factory
>>> from zope.component.interfaces import IFactory
>>> gsm = getGlobalSiteManager()
>>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
>>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

>>> from zope.component import createObject
>>> createObject('Lovely Cave')
<Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

So, as a replacement for a real site, we define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

>>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

>>> cave1 = createObject('Lovely Cave')
>>> cave1.name = "Fred's home"
>>> cave1.owner = "Fred"
>>> stoneville[cave1.name] = cave1

and Stoneville has exactly

>>> len(stoneville)
1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner,taxpayer
... Barneys Home,2,Barney,1
... Wilmas Asylum,1,Wilma,1
... Freds Dinoburgers,10,Fred,0
... Joeys Drive-in,110,Joey,0
... """)

All we need now is a batch processor.

>>> from waeup.kofa.utils.batching import BatchProcessor
>>> from waeup.kofa.interfaces import IGNORE_MARKER
>>> class CaveProcessor(BatchProcessor):
...     util_name = 'caveprocessor'
...     grok.name(util_name)
...     name = 'Cave Processor'
...     iface = ICave
...     location_fields = ['name']
...     factory_name = 'Lovely Cave'
...
...     def parentsExist(self, row, site):
...         return True
...
...     def getParent(self, row, site):
...         return stoneville
...
...     def entryExists(self, row, site):
...         return row['name'] in stoneville.keys()
...
...     def getEntry(self, row, site):
...         if not self.entryExists(row, site):
...             return None
...         return stoneville[row['name']]
...
...     def delEntry(self, row, site):
...         del stoneville[row['name']]
...
...     def addEntry(self, obj, row, site):
...         stoneville[row['name']] = obj
...
...     def updateEntry(self, obj, row, site):
...         # This is not strictly necessary, as the default
...         # updateEntry method does exactly the same
...         for key, value in row.items():
...             if value != IGNORE_MARKER:
...                 setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

>>> import logging
>>> logger = logging.getLogger('stoneville')
>>> logger.setLevel(logging.DEBUG)
>>> logger.propagate = False
>>> handler = logging.FileHandler('stoneville.log', 'w')
>>> logger.addHandler(handler)

Create the fellows:

>>> processor = CaveProcessor()
>>> result = processor.doImport('newcomers.csv',
...     ['name', 'dinoports', 'owner', 'taxpayer'],
...     mode='create', user='Bob', logger=logger)
>>> result
(4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore, we get the filepath of a CSV file with the
successfully processed entries and the filepath of a CSV file with the
erroneous entries. As everything went well, the latter is ``None``.
Let's check:

>>> sorted(stoneville.keys())
[u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

>>> barney = stoneville['Barneys Home']
>>> barney.dinoports
2

which is a number, not a string.

Apparently, when calling the processor, we gave some more info than
only the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.
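
The idea of overriding the header row can be illustrated with the
standard ``csv`` module. This is a sketch of the principle only and
makes no claim about how ``doImport`` reads files internally; the helper
``read_rows`` and the faulty header in ``csv_text`` are made up for the
example.

```python
import csv
import io

# A file whose header row contains typos (hypothetical input).
csv_text = "nme,ports,owner\nBarneys Home,2,Barney\n"


def read_rows(stream, headerfields):
    """Read CSV rows, using `headerfields` instead of the file's own header."""
    reader = csv.DictReader(stream, fieldnames=headerfields)
    next(reader)  # skip the header line that is physically in the file
    return [dict(row) for row in reader]


rows = read_rows(io.StringIO(csv_text), ["name", "dinoports", "owner"])
```

Because the names are applied by position, the order of the list
matters, which is why an ordered list is required.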

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.
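
How the three modes differ can be sketched against a plain dictionary
like ``stoneville``. The function ``apply_row`` is hypothetical and not
part of ``waeup.kofa``; only the message for already existing objects is
taken from this document, the other two messages are invented
placeholders.

```python
def apply_row(store, row, mode, key_field="name"):
    """Apply one CSV row to `store`; return an error message or None."""
    key = row[key_field]
    if mode == "create":
        if key in store:
            return "This object already exists. Skipping."
        store[key] = dict(row)
    elif mode == "update":
        if key not in store:
            return "No such entry to update."   # placeholder message
        store[key].update(row)
    elif mode == "remove":
        if key not in store:
            return "No such entry to remove."   # placeholder message
        del store[key]
    else:
        raise ValueError("unknown mode: %r" % (mode,))
    return None


store = {}
first = apply_row(store, {"name": "Wilmas Asylum", "dinoports": 1}, "create")
second = apply_row(store, {"name": "Wilmas Asylum", "dinoports": 1}, "create")
apply_row(store, {"name": "Wilmas Asylum", "dinoports": 2}, "update")
updated_ports = store["Wilmas Asylum"]["dinoports"]
apply_row(store, {"name": "Wilmas Asylum"}, "remove")
```

A row that returns a message would end up in the pending file, a row
that returns ``None`` in the finished file.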

Finally, the ``user`` parameter is optional and only used for logging.

We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

>>> print open('stoneville.log').read()
--------------------
Bob: Batch processing finished: OK
Bob: Source: newcomers.csv
Bob: Mode: create
Bob: User: Bob
Bob: Processing time: ... s (... s/item)
Bob: Processed: 4 lines (4 successful/ 0 failed)
--------------------

We clean up the temporary dir created by doImport():

>>> import shutil
>>> import os
>>> shutil.rmtree(os.path.dirname(result[2]))

As we can see, the processing was successful. Had it failed, the
problems would be listed here, as we can see if we run the same
operation again:

>>> result = processor.doImport('newcomers.csv',
...     ['name', 'dinoports', 'owner', 'taxpayer'],
...     mode='create', user='Bob', logger=logger)
>>> result
(4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

>>> print open('stoneville.log').read()
--------------------
...
--------------------
Bob: Batch processing finished: FAILED
Bob: Source: newcomers.csv
Bob: Mode: create
Bob: User: Bob
Bob: Failed datasets: newcomers.pending.csv
Bob: Processing time: ... s (... s/item)
Bob: Processed: 4 lines (0 successful/ 4 failed)
--------------------

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

>>> print open(result[3]).read()
owner,name,taxpayer,dinoports,--ERRORS--
Barney,Barneys Home,1,2,This object already exists. Skipping.
Wilma,Wilmas Asylum,1,1,This object already exists. Skipping.
Fred,Freds Dinoburgers,0,10,This object already exists. Skipping.
Joey,Joeys Drive-in,0,110,This object already exists. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).
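
The conversion from CSV strings to typed values can be pictured
roughly like this. The sketch does not use the real converters, which
are derived from the interface; ``convert_row``, ``CONVERTERS`` and
``BOOL_VALUES`` are hypothetical names, and only the two spellings '1'
and '0' mentioned above are mapped.

```python
# Only the two spellings that appear in this document are handled here.
BOOL_VALUES = {"1": True, "0": False}

# Field name -> callable turning a CSV string into a typed value.
CONVERTERS = {"dinoports": int, "taxpayer": BOOL_VALUES.__getitem__}


def convert_row(row, converters):
    """Return (converted_row, names_of_fields_that_failed)."""
    converted, errors = {}, []
    for field, raw in row.items():
        convert = converters.get(field, str)   # unknown fields stay strings
        try:
            converted[field] = convert(raw)
        except (KeyError, ValueError):
            errors.append(field)
    return converted, errors


good, good_errors = convert_row(
    {"name": "Barneys Home", "dinoports": "2", "taxpayer": "1"}, CONVERTERS)
bad, bad_errors = convert_row(
    {"name": "Wilmas Asylum", "dinoports": "NOT-A-NUMBER"}, CONVERTERS)
```

The input row itself is left unchanged, which matches the observation
that the pending file shows the values exactly as they were written.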

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We can also tell the processor to ignore some columns of the input by
passing ``--IGNORE--`` as the column name:

>>> result = processor.doImport('newcomers.csv', ['name',
...     '--IGNORE--', '--IGNORE--'],
...     mode='update', user='Bob')
>>> result
(4, 0, '...', None)

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))

If something goes wrong during processing, the respective ``--IGNORE--``
columns won't be populated in the resulting pending file:

>>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
...     '--IGNORE--', '--IGNORE--'],
...     mode='create', user='Bob')
>>> result
(4, 4, '...', '...')

>>> print open(result[3], 'rb').read()
name,dinoports,--ERRORS--
Barneys Home,2,This object already exists. Skipping.
Wilmas Asylum,1,This object already exists. Skipping.
Freds Dinoburgers,10,This object already exists. Skipping.
Joeys Drive-in,110,This object already exists. Skipping.


Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))
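
The effect of the ``--IGNORE--`` marker on a single row can be
sketched in a few lines. The helper ``drop_ignored`` is an assumption
for illustration and not a function of ``waeup.kofa``.

```python
IGNORE = "--IGNORE--"  # marker string as passed in the header list above


def drop_ignored(headerfields, values):
    """Pair header names with row values, leaving out ignored columns."""
    return {
        name: value
        for name, value in zip(headerfields, values)
        if name != IGNORE
    }


row = drop_ignored(
    ["name", "dinoports", IGNORE, IGNORE],
    ["Barneys Home", "2", "Barney", "1"],
)
```

Only ``name`` and ``dinoports`` survive, which is the same set of
columns the pending file above contains.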


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'],
...     mode='update', user='Bob')
>>> result
(4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second dino:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... Wilmas Asylum,2,Wilma
... """)

>>> wilma = stoneville['Wilmas Asylum']
>>> wilma.dinoports
1

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We start the processor:

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'], mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma = stoneville['Wilmas Asylum']
>>> wilma.dinoports
2

Wilma's number of dinoports has increased.

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


If we try to update a non-existent entry, an error occurs:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... NOT-WILMAS-ASYLUM,2,Wilma
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'],
...     mode='update', user='Bob')
>>> result
(1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


Invalid values will be spotted as well:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... Wilmas Asylum,NOT-A-NUMBER,Wilma
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'],
...     mode='update', user='Bob')
>>> result
(1, 1, '...', '...')

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

>>> open('newcomers.csv', 'wb').write(
... """name,owner
... Wilmas Asylum,Barney
... """)

>>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
...     mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma.owner
u'Barney'

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... 2,Wilma
... """)

>>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
...     mode='update', user='Bob')
Traceback (most recent call last):
...
FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!
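
A header check of this kind can be sketched as follows. The exception
class ``MissingColumnsError`` is a hypothetical stand-in for
``FatalCSVError``, and ``check_headers`` is not the method used by
``BatchProcessor``; only the wording of the message is copied from the
traceback above.

```python
class MissingColumnsError(Exception):
    """Hypothetical stand-in for the FatalCSVError shown above."""


def check_headers(headerfields, location_fields):
    """Raise if a location field is missing from the given header names."""
    missing = [name for name in location_fields if name not in headerfields]
    if missing:
        quoted = ", ".join("'%s'" % name for name in missing)
        raise MissingColumnsError(
            "Need at least columns %s for import!" % quoted)


check_headers(["name", "owner"], ["name"])  # passes silently

try:
    check_headers(["dinoports", "owner"], ["name"])
    message = None
except MissingColumnsError as exc:
    message = str(exc)
```

Checking the header names before touching any row means a file with a
missing location field is rejected as a whole instead of producing one
failed row per line.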

Generally, empty strings are treated as ``None``. In the following run
the empty ``dinoports`` value leaves the stored number untouched:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","","Wilma"
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'],
...     mode='update', user='Bob')
>>> result
(1, 0, '...', None)

>>> wilma.dinoports
2

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))

We can tell the processor to set dinoports to ``None``, although this
is not a number, as we declared the field as not required in the
interface:

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","XXX","Wilma"
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'],
...     mode='update', user='Bob', ignore_empty=False)
>>> result
(1, 0, '...', None)

>>> wilma.dinoports is None
True

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))

Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter because those fields are ignored.

>>> open('newcomers.csv', 'wb').write(
... """name,dinoports,owner
... "Wilmas Asylum","ILLEGAL-NUMBER",""
... """)

>>> result = processor.doImport('newcomers.csv', ['name',
...     'dinoports', 'owner'],
...     mode='remove', user='Bob')
>>> result
(1, 0, '...', None)

>>> sorted(stoneville.keys())
[u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

Clean up:

>>> shutil.rmtree(os.path.dirname(result[2]))


Finally, we remove the input file and the log file:

>>> import os
>>> os.unlink('newcomers.csv')
>>> os.unlink('stoneville.log')
