source: main/waeup.kofa/trunk/src/waeup/kofa/utils/batching.txt @ 8059

Last change on this file since 8059 was 7933, checked in by Henrik Bettermann, 13 years ago
Rename importers to processors.
:mod:`waeup.kofa.utils.batching` -- Batch processing
****************************************************

Batch processing is much more than pure data import.

Overview
========

Basically, it means processing CSV files in order to mass-create,
mass-remove, or mass-update data.

To do so, you feed CSV files to processors, which are part of
the batch-processing mechanism.

Processors
----------

Each CSV file processor

* accepts a single data type identified by an interface.

* knows about the places inside a site (University) where to store,
  remove or update the data.

* can check headers before processing data.

* supports the modes 'create', 'update', and 'remove'.

* creates log entries (optional).

* creates CSV files containing the successfully and the unsuccessfully
  processed data respectively.

Output
------

The results of processing are written to a logger, if a logger was
given. In addition, new CSV files are created during processing:

* a pending CSV file, containing datasets that could not be processed.

* a finished CSV file, containing datasets successfully processed.

The pending file is not created if everything works fine. The
respective path returned in that case is ``None``.

The pending file (if created) is a CSV file that contains the failed
rows, extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.

The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directories afterwards (the
datacenter's ``distProcessedFiles()`` method takes care of that).

It looks like this::

         -----+      +---------+
        /     |      |         |              +------+
       | .csv +----->|Batch-   |              |      |
       |      |      |processor+----changes-->| ZODB |
       |  +------+   |         |              |      |
       +--|      |   |         +              +------+
          | Mode +-->|         |                  -------+
          |      |   |         +----outputs-+->  /       |
          |  +----+->+---------+            |   |.pending|
          +--|Log |  ^                      |   |        |
             +----+  |                      |   +--------+
               +-----++                     v
               |Inter-|                  ----------+
               |face  |                 /          |
               +------+                | .finished |
                                       |           |
                                       +-----------+


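The caller's cleanup duty mentioned in the Output section can be
wrapped in a small helper. The following is a minimal sketch and not
part of ``waeup.kofa``: the name ``cleanup_import_result`` is
hypothetical, and the code only assumes that the result is a 4-tuple
whose third and fourth items are the finished and pending paths (or
``None``). It is written for Python 3, and the demo fakes a result
tuple instead of running a real import.

```python
import os
import shutil
import tempfile


def cleanup_import_result(result):
    """Remove the temporary directories that hold the result files.

    `result` is assumed to look like
    (num_processed, num_warnings, finished_path, pending_path),
    where either path may be None.
    """
    removed = set()
    for path in result[2:4]:
        if path is None:
            continue
        tmp_dir = os.path.dirname(path)
        # Both files normally share one directory; remove it only once.
        if tmp_dir not in removed and os.path.isdir(tmp_dir):
            shutil.rmtree(tmp_dir)
            removed.add(tmp_dir)
    return sorted(removed)


# Demo with a faked result tuple (no real import is run here).
demo_dir = tempfile.mkdtemp()
demo_finished = os.path.join(demo_dir, 'newcomers.finished.csv')
with open(demo_finished, 'w') as fd:
    fd.write('name\n')
demo_removed = cleanup_import_result((1, 0, demo_finished, None))
```

Returning the removed directories makes it easy to log what was
deleted.
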
Creating a batch processor
==========================

We create our own batch processor for our own datatype. This datatype
must be based on an interface that the batcher can use for converting
data.

Founding Stoneville
-------------------

We start with the interface:

    >>> from zope.interface import Interface
    >>> from zope import schema
    >>> class ICave(Interface):
    ...     """A cave."""
    ...     name = schema.TextLine(
    ...         title = u'Cave name',
    ...         default = u'Unnamed',
    ...         required = True)
    ...     dinoports = schema.Int(
    ...         title = u'Number of DinoPorts (tm)',
    ...         required = False,
    ...         default = 1)
    ...     owner = schema.TextLine(
    ...         title = u'Owner name',
    ...         required = True,
    ...         missing_value = 'Fred Estates Inc.')
    ...     taxpayer = schema.Bool(
    ...         title = u'Pays taxes',
    ...         required = True,
    ...         default = False)

Now a class that implements this interface:

    >>> import grok
    >>> class Cave(object):
    ...     grok.implements(ICave)
    ...     def __init__(self, name=u'Unnamed', dinoports=2,
    ...                  owner='Fred Estates Inc.', taxpayer=False):
    ...         self.name = name
    ...         # Store the passed value instead of a hard-coded 2
    ...         self.dinoports = dinoports
    ...         self.owner = owner
    ...         self.taxpayer = taxpayer

We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.

    >>> from zope.component import getGlobalSiteManager
    >>> from zope.component.factory import Factory
    >>> from zope.component.interfaces import IFactory
    >>> gsm = getGlobalSiteManager()
    >>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
    >>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')

Now we can create caves using a factory:

    >>> from zope.component import createObject
    >>> createObject('Lovely Cave')
    <Cave object at 0x...>

This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we therefore define a place where
all caves can be stored: Stoneville! This is a lovely place for
upper-class cavemen (who are the only ones that can afford more than
one dinoport).

We found Stoneville:

    >>> stoneville = dict()

Everything in place.

Now, to improve local health conditions, imagine we want to populate
Stoneville with lots of new happy dino-hunting natives that slept on
the bare ground in former times and had no idea of
bathrooms. Disgusting, isn't it?

Lots of cavemen need lots of caves.

Of course we can do something like:

    >>> cave1 = createObject('Lovely Cave')
    >>> cave1.name = "Fred's home"
    >>> cave1.owner = "Fred"
    >>> stoneville[cave1.name] = cave1

and Stoneville has exactly

    >>> len(stoneville)
    1

inhabitant. But we don't want to do this for hundreds or thousands of
citizens-to-be, do we?

It is much easier to create a simple CSV list, where we put in all the
data and let a batch processor do the job.

The list is already here:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner,taxpayer
    ... Barneys Home,2,Barney,1
    ... Wilmas Asylum,1,Wilma,1
    ... Freds Dinoburgers,10,Fred,0
    ... Joeys Drive-in,110,Joey,0
    ... """)

All we need now is a batch processor.

    >>> from waeup.kofa.utils.batching import BatchProcessor
    >>> class CaveProcessor(BatchProcessor):
    ...     util_name = 'caveprocessor'
    ...     grok.name(util_name)
    ...     name = 'Cave Processor'
    ...     iface = ICave
    ...     location_fields = ['name']
    ...     factory_name = 'Lovely Cave'
    ...
    ...     def parentsExist(self, row, site):
    ...         return True
    ...
    ...     def getParent(self, row, site):
    ...         return stoneville
    ...
    ...     def entryExists(self, row, site):
    ...         return row['name'] in stoneville.keys()
    ...
    ...     def getEntry(self, row, site):
    ...         if not self.entryExists(row, site):
    ...             return None
    ...         return stoneville[row['name']]
    ...
    ...     def delEntry(self, row, site):
    ...         del stoneville[row['name']]
    ...
    ...     def addEntry(self, obj, row, site):
    ...         stoneville[row['name']] = obj
    ...
    ...     def updateEntry(self, obj, row, site):
    ...         # This is not strictly necessary, as the default
    ...         # updateEntry method does exactly the same
    ...         for key, value in row.items():
    ...             setattr(obj, key, value)

If we also want the results to be logged, we must provide a logger
(this is optional):

    >>> import logging
    >>> logger = logging.getLogger('stoneville')
    >>> logger.setLevel(logging.DEBUG)
    >>> logger.propagate = False
    >>> handler = logging.FileHandler('stoneville.log', 'w')
    >>> logger.addHandler(handler)

Create the fellows:

    >>> processor = CaveProcessor()
    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                   mode='create', user='Bob', logger=logger)
    >>> result
    (4, 0, '/.../newcomers.finished.csv', None)

The result means: four entries were processed and no warnings
occurred. Furthermore, we get the path to a CSV file with the
successfully processed entries and the path to a CSV file with the
erroneous entries. As everything went well, the latter is ``None``.
Let's check:

    >>> sorted(stoneville.keys())
    [u'Barneys Home', ..., u'Wilmas Asylum']

The values of the Cave instances have the correct type:

    >>> barney = stoneville['Barneys Home']
    >>> barney.dinoports
    2

which is a number, not a string.

Apparently, when calling the processor, we gave some more information
than just the CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.

The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

Finally, the ``user`` parameter is optional and only used for logging.

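How an ordered header list can replace the header row of a file may be
easier to see in code. The sketch below uses only the standard ``csv``
module and Python 3. It illustrates the idea and is not the code that
``BatchProcessor`` uses internally: the function name ``read_rows`` and
the sample data are hypothetical, and dropping ``--IGNORE--`` columns
is modelled on the behaviour described elsewhere in this document.

```python
import csv
import io


def read_rows(csv_text, headerfields, ignore_marker='--IGNORE--'):
    """Yield one dict per data row, keyed by the given header names.

    The header row of the file itself is skipped, so faulty header
    names in the file do not matter.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row found in the file
    for values in reader:
        yield {
            name: value
            for name, value in zip(headerfields, values)
            if name != ignore_marker
        }


# The file's own header row is misspelled on purpose.
SAMPLE = "nmae,ports,owner\nBarneys Home,2,Barney\n"
rows = list(read_rows(SAMPLE, ['name', '--IGNORE--', 'owner']))
```

Here ``rows`` holds a single dict with the keys ``name`` and ``owner``;
the misspelled headers of the sample never show up.
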
We can, by the way, see the results of our run in a logfile if we
provided a logger during the call:

    >>> print open('stoneville.log').read()
    --------------------
    Bob: Batch processing finished: OK
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (4 successful/ 0 failed)
    --------------------

We clean up the temporary directory created by doImport():

    >>> import shutil
    >>> import os
    >>> shutil.rmtree(os.path.dirname(result[2]))

As we can see, the processing was successful. Had there been any
problems, they would have been listed here, as we can see if we run
the same operation again:

    >>> result = processor.doImport('newcomers.csv',
    ...                   ['name', 'dinoports', 'owner', 'taxpayer'],
    ...                   mode='create', user='Bob', logger=logger)
    >>> result
    (4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

This time we also get a path to a .pending file.

The log file will tell us this in more detail:

    >>> print open('stoneville.log').read()
    --------------------
    ...
    --------------------
    Bob: Batch processing finished: FAILED
    Bob: Source: newcomers.csv
    Bob: Mode: create
    Bob: User: Bob
    Bob: Failed datasets: newcomers.pending.csv
    Bob: Processing time: ... s (... s/item)
    Bob: Processed: 4 lines (0 successful/ 4 failed)
    --------------------

This time a new file was created, which keeps all the rows we could not
process and an additional column with error messages:

    >>> print open(result[3]).read()
    owner,name,taxpayer,dinoports,--ERRORS--
    Barney,Barneys Home,1,2,This object already exists in the same container. Skipping.
    Wilma,Wilmas Asylum,1,1,This object already exists in the same container. Skipping.
    Fred,Freds Dinoburgers,0,10,This object already exists in the same container. Skipping.
    Joey,Joeys Drive-in,0,110,This object already exists in the same container. Skipping.

This way we can correct the faulty entries and afterwards retry without
having the already processed rows in the way.

We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).

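Preparing such a retry can be partly automated. The following sketch
(Python 3, standard library only) reads the text of a pending file,
collects the error messages, and writes the rows back without the
``--ERRORS--`` column, so that the result can be corrected and fed to
a processor again. The function name ``split_pending`` is an
assumption made for illustration; only the ``--ERRORS--`` column name
and the sample row are taken from the output above.

```python
import csv
import io


def split_pending(pending_text, error_col='--ERRORS--'):
    """Return (retry_csv_text, errors) for the text of a pending file.

    `errors` is a list of (row_number, message) tuples; row numbers
    start at 1 for the first data row.
    """
    reader = csv.DictReader(io.StringIO(pending_text))
    fields = [name for name in (reader.fieldnames or [])
              if name != error_col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    errors = []
    for number, row in enumerate(reader, start=1):
        # Take the message out of the row before writing it back.
        errors.append((number, row.pop(error_col, '')))
        writer.writerow(row)
    return out.getvalue(), errors


PENDING = (
    "owner,name,taxpayer,dinoports,--ERRORS--\n"
    "Barney,Barneys Home,1,2,"
    "This object already exists in the same container. Skipping.\n"
)
retry_text, errors = split_pending(PENDING)
```

The sketch keeps the column order of the pending file, so the header
list passed to the next import run has to follow that order.
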
Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


We can also tell the processor to ignore some columns of the input by
passing ``--IGNORE--`` as the column name:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              '--IGNORE--', '--IGNORE--'],
    ...                   mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

If something goes wrong during processing, the ignored columns will
not appear in the resulting pending file:

    >>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
    ...                              '--IGNORE--', '--IGNORE--'],
    ...                   mode='create', user='Bob')
    >>> result
    (4, 4, '...', '...')

    >>> print open(result[3], 'rb').read()
    name,dinoports,--ERRORS--
    Barneys Home,2,This object already exists in the same container. Skipping.
    Wilmas Asylum,1,This object already exists in the same container. Skipping.
    Freds Dinoburgers,10,This object already exists in the same container. Skipping.
    Joeys Drive-in,110,This object already exists in the same container. Skipping.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Updating entries
----------------

To update entries, we just call the batch processor in a different
mode:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              'dinoports', 'owner'],
    ...                   mode='update', user='Bob')
    >>> result
    (4, 0, '...', None)

Now we want to record that Wilma got an extra port for her second dino:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,2,Wilma
    ... """)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    1

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

We start the processor:

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                   'dinoports', 'owner'], mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma = stoneville['Wilmas Asylum']
    >>> wilma.dinoports
    2

Wilma's number of dinoports has been raised.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

If we try to update a non-existent entry, an error occurs:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... NOT-WILMAS-ASYLUM,2,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              'dinoports', 'owner'],
    ...                   mode='update', user='Bob')
    >>> result
    (1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

Invalid values will be spotted as well:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... Wilmas Asylum,NOT-A-NUMBER,Wilma
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              'dinoports', 'owner'],
    ...                   mode='update', user='Bob')
    >>> result
    (1, 1, '...', '...')

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,owner
    ... Wilmas Asylum,Barney
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
    ...                   mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.owner
    u'Barney'

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... 2,Wilma
    ... """)

    >>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
    ...                   mode='update', user='Bob')
    Traceback (most recent call last):
    ...
    FatalCSVError: Need at least columns 'name' for import!

This time we even get an exception!

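The check behind this exception can be pictured as a comparison
between the given header names and the location fields. The sketch
below is an assumption about the general idea, not the actual
``waeup.kofa`` code: the exception class and the function
``check_location_fields`` are defined here only for illustration, and
only the error message mirrors the one shown above.

```python
class FatalCSVError(Exception):
    """Stand-in for the exception shown above (defined for this demo)."""


def check_location_fields(headerfields, location_fields):
    """Raise FatalCSVError if any location field is missing."""
    missing = [name for name in location_fields
               if name not in headerfields]
    if missing:
        raise FatalCSVError(
            "Need at least columns %s for import!"
            % ', '.join("'%s'" % name for name in location_fields))


# All location fields present: passes silently.
check_location_fields(['name', 'owner'], ['name'])

# 'name' is missing: the check fails before any row is processed.
try:
    check_location_fields(['dinoports', 'owner'], ['name'])
except FatalCSVError as exc:
    message = str(exc)
```

Failing before the first row is read explains why no pending file is
produced in this case.
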
We can set dinoports to ``None``, although this is not a number,
because we declared the field as not required in the interface:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum",,"Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              'dinoports', 'owner'],
    ...                   mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

Generally, empty strings are treated as ``None``:

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","","Wilma"
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              'dinoports', 'owner'],
    ...                   mode='update', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> wilma.dinoports is None
    True

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))


Removing entries
----------------

In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter, because those fields are ignored.

    >>> open('newcomers.csv', 'wb').write(
    ... """name,dinoports,owner
    ... "Wilmas Asylum","ILLEGAL-NUMBER",""
    ... """)

    >>> result = processor.doImport('newcomers.csv', ['name',
    ...                              'dinoports', 'owner'],
    ...                   mode='remove', user='Bob')
    >>> result
    (1, 0, '...', None)

    >>> sorted(stoneville.keys())
    [u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']

Oops! Wilma is gone.

Clean up:

    >>> shutil.rmtree(os.path.dirname(result[2]))

Finally, we remove the input file and the log file:

    >>> import os
    >>> os.unlink('newcomers.csv')
    >>> os.unlink('stoneville.log')
