batchprocessing.txt @ 14086

Last change on this file since 14086 was 12946, checked in by Henrik Bettermann, 10 years ago
Rename doctests again.
File size: 16.2 KB

Rev	Line
[12920]	1	Batch Processing
	2	****************
[4837]	3
	4	Batch processing is much more than pure data import.
	5
	6	Overview
	7	========
	8
	9	Basically, it means processing CSV files in order to mass-create,
	10	mass-remove, or mass-update data.
	11
So you can feed CSV files to processors that are part of
the batch-processing mechanism.
[4837]	14
[7933]	15	Processors
	16	----------
[4837]	17
[4847]	18	Each CSV file processor
[4837]	19
	20	* accepts a single data type identified by an interface.
	21
	22	* knows about the places inside a site (University) where to store,
	23	remove or update the data.
	24
	25	* can check headers before processing data.
	26
* supports the modes 'create', 'update', and 'remove'.
	28
* creates log entries (optional).

* creates CSV files containing the successfully processed and the
  failed data, respectively.
	33
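The three modes can be pictured with a small stand-alone sketch. It is
not the ``BatchProcessor`` implementation: the function ``process_rows``
and the plain-dict store are hypothetical, and the "no such entry"
message is invented for illustration. Only the message for already
existing objects is taken from the examples further down.

```python
def process_rows(rows, store, mode, key='name'):
    """Apply rows to store in the given mode; return (total, failed).

    Hypothetical helper for illustration only, not part of waeup.kofa.
    ``failed`` is a list of (row, error message) pairs.
    """
    if mode not in ('create', 'update', 'remove'):
        raise ValueError('unknown mode: %r' % mode)
    failed = []
    for row in rows:
        name = row[key]
        exists = name in store
        if mode == 'create' and exists:
            failed.append((row, 'This object already exists.'))
        elif mode in ('update', 'remove') and not exists:
            # Invented message, the real processor may word this differently.
            failed.append((row, 'Cannot %s: no such entry.' % mode))
        elif mode == 'create':
            store[name] = dict(row)
        elif mode == 'update':
            store[name].update(row)
        else:  # mode == 'remove'
            del store[name]
    return len(rows), failed


store = {}
rows = [{'name': 'Barneys Home', 'owner': 'Barney'}]
total, failed = process_rows(rows, store, 'create')    # creates one entry
total2, failed2 = process_rows(rows, store, 'create')  # same row now fails
```
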
[4837]	34	Output
	35	------
	36
The results of processing are written to a logger, if one was
given. Besides this, new CSV files are created during processing:
[4837]	39
[4903]	40	* a pending CSV file, containing datasets that could not be processed
[4837]	41
[4903]	42	* a finished CSV file, containing datasets successfully processed.
	43
	44	The pending file is not created if everything works fine. The
	45	respective path returned in that case is ``None``.
	46
The pending file (if created) is a CSV file that contains the failed
rows extended by a column ``--ERRORS--`` in which the reasons for
processing failures are listed.
	50
The complete paths of these files are returned. They will be in a
temporary directory created only for this purpose. It is the caller's
responsibility to remove the temporary directory afterwards (the
datacenter's distProcessedFiles() method takes care of that).
	55
[4837]	56	It looks like this::
	57
         -----+      +---------+
        /     |      |         |              +------+
       | .csv +----->|Batch-   |              |      |
       |      |      |processor+----changes-->| ZODB |
       |  +------+   |         |              |      |
       +--|      |   |         +              +------+
          | Mode +-->|         |                 -------+
          |      |   |         +----outputs-+->  /      |
          |  +----+->+---------+            |   |.pending|
          +--|Log |  ^                      |   |        |
             +----+  |                      |   +--------+
               +-----++                     v
               |Inter-|                  ----------+
               |face  |                 /          |
               +------+                | .finished |
                                       |           |
                                       +-----------+
[4837]	75
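The output files described in this section can be sketched with the
standard library alone. The helper ``write_results`` below is
hypothetical and only mirrors the behaviour described here: both files
go into a fresh temporary directory, the pending file gets an extra
``--ERRORS--`` column, and the pending path is ``None`` if nothing
failed. The file names follow the pattern visible in the results shown
later in this document.

```python
import csv
import os
import shutil
import tempfile


def write_results(basename, fieldnames, finished_rows, failed_rows):
    """Write finished/pending CSV files into a new temporary directory.

    ``failed_rows`` is a list of (row, error message) pairs. Returns
    (finished_path, pending_path); pending_path is None if nothing
    failed. Hypothetical helper, not the waeup.kofa implementation.
    """
    tmpdir = tempfile.mkdtemp()
    finished_path = os.path.join(tmpdir, basename + '.finished.csv')
    with open(finished_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, fieldnames)
        writer.writeheader()
        writer.writerows(finished_rows)
    if not failed_rows:
        return finished_path, None
    pending_path = os.path.join(tmpdir, basename + '.pending.csv')
    with open(pending_path, 'w', newline='') as fd:
        writer = csv.DictWriter(fd, fieldnames + ['--ERRORS--'])
        writer.writeheader()
        for row, error in failed_rows:
            out = dict(row)
            out['--ERRORS--'] = error
            writer.writerow(out)
    return finished_path, pending_path


fields = ['name', 'owner']
ok = [{'name': 'Barneys Home', 'owner': 'Barney'}]
bad = [({'name': 'Wilmas Asylum', 'owner': 'Wilma'},
        'This object already exists.')]
finished, pending = write_results('newcomers', fields, ok, bad)
with open(pending, newline='') as fd:
    pending_rows = list(csv.DictReader(fd))
# The caller is responsible for removing the temporary directory.
shutil.rmtree(os.path.dirname(finished))
```
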
	76
[12920]	77	Creating a Batch Processor
[4837]	78	==========================
	79
We create our own batch processor for our own data type. This data
type must be based on an interface that the batch processor can use
for converting data.
	83
	84	Founding Stoneville
	85	-------------------
	86
	87	We start with the interface:
	88
>>> from zope.interface import Interface
>>> from zope import schema
>>> class ICave(Interface):
...     """A cave."""
...     name = schema.TextLine(
...         title = u'Cave name',
...         default = u'Unnamed',
...         required = True)
...     dinoports = schema.Int(
...         title = u'Number of DinoPorts (tm)',
...         required = False,
...         default = 1)
...     owner = schema.TextLine(
...         title = u'Owner name',
...         required = True,
...         missing_value = 'Fred Estates Inc.')
...     taxpayer = schema.Bool(
...         title = u'Pays taxes',
...         required = True,
...         default = False)
[4837]	109
	110	Now a class that implements this interface:
	111
>>> import grok
>>> class Cave(object):
...     grok.implements(ICave)
...     def __init__(self, name=u'Unnamed', dinoports=2,
...                  owner='Fred Estates Inc.', taxpayer=False):
...         self.name = name
...         self.dinoports = dinoports
...         self.owner = owner
...         self.taxpayer = taxpayer
[4837]	121
We also provide a factory for caves. Strictly speaking, this is not
necessary, but it makes the batch processor we create afterwards
easier to understand.
	125
	126	>>> from zope.component import getGlobalSiteManager
	127	>>> from zope.component.factory import Factory
	128	>>> from zope.component.interfaces import IFactory
	129	>>> gsm = getGlobalSiteManager()
	130	>>> cave_maker = Factory(Cave, 'A cave', 'Buy caves here!')
	131	>>> gsm.registerUtility(cave_maker, IFactory, 'Lovely Cave')
	132
	133	Now we can create caves using a factory:
	134
	135	>>> from zope.component import createObject
	136	>>> createObject('Lovely Cave')
	137	<Cave object at 0x...>
	138
This is nice, but we still lack a place where we can put all the
lovely caves we want to sell.

As a replacement for a real site, we define a place where all caves
can be stored: Stoneville! This is a lovely place for upper-class
cavemen (who are the only ones that can afford more than one
dinoport).
	146
	147	We found Stoneville:
	148
	149	>>> stoneville = dict()
	150
Everything is in place.
	152
	153	Now, to improve local health conditions, imagine we want to populate
	154	Stoneville with lots of new happy dino-hunting natives that slept on
	155	the bare ground in former times and had no idea of
	156	bathrooms. Disgusting, isn't it?
	157
	158	Lots of cavemen need lots of caves.
	159
	160	Of course we can do something like:
	161
	162	>>> cave1 = createObject('Lovely Cave')
	163	>>> cave1.name = "Fred's home"
	164	>>> cave1.owner = "Fred"
	165	>>> stoneville[cave1.name] = cave1
	166
	167	and Stoneville has exactly
	168
	169	>>> len(stoneville)
	170	1
	171
	172	inhabitant. But we don't want to do this for hundreds or thousands of
	173	citizens-to-be, do we?
	174
	175	It is much easier to create a simple CSV list, where we put in all the
	176	data and let a batch processor do the job.
	177
	178	The list is already here:
	179
	180	>>> open('newcomers.csv', 'wb').write(
[4871]	181	... """name,dinoports,owner,taxpayer
	182	... Barneys Home,2,Barney,1
	183	... Wilmas Asylum,1,Wilma,1
	184	... Freds Dinoburgers,10,Fred,0
	185	... Joeys Drive-in,110,Joey,0
[4837]	186	... """)
	187
All we need now is a batch processor.
	189
>>> from waeup.kofa.utils.batching import BatchProcessor
>>> from waeup.kofa.interfaces import IGNORE_MARKER
>>> class CaveProcessor(BatchProcessor):
...     util_name = 'caveprocessor'
...     grok.name(util_name)
...     name = 'Cave Processor'
...     iface = ICave
...     location_fields = ['name']
...     factory_name = 'Lovely Cave'
...
...     def parentsExist(self, row, site):
...         return True
...
...     def getParent(self, row, site):
...         return stoneville
...
...     def entryExists(self, row, site):
...         return row['name'] in stoneville.keys()
...
...     def getEntry(self, row, site):
...         if not self.entryExists(row, site):
...             return None
...         return stoneville[row['name']]
...
...     def delEntry(self, row, site):
...         del stoneville[row['name']]
...
...     def addEntry(self, obj, row, site):
...         stoneville[row['name']] = obj
...
...     def updateEntry(self, obj, row, site, filename):
...         # This is not strictly necessary, as the default
...         # updateEntry method does exactly the same
...         for key, value in row.items():
...             if value != IGNORE_MARKER:
...                 setattr(obj, key, value)
[4837]	226
If we also want the results to be logged, we must provide a logger
(this is optional):
	229
	230	>>> import logging
	231	>>> logger = logging.getLogger('stoneville')
	232	>>> logger.setLevel(logging.DEBUG)
	233	>>> logger.propagate = False
	234	>>> handler = logging.FileHandler('stoneville.log', 'w')
	235	>>> logger.addHandler(handler)
	236
[4837]	237	Create the fellows:
	238
	239	>>> processor = CaveProcessor()
[6273]	240	>>> result = processor.doImport('newcomers.csv',
[4871]	241	... ['name', 'dinoports', 'owner', 'taxpayer'],
[4886]	242	... mode='create', user='Bob', logger=logger)
[4902]	243	>>> result
[4895]	244	(4, 0, '/.../newcomers.finished.csv', None)
[4837]	245
The result means: four entries were processed and no warnings
occurred. Furthermore we get a filepath to a CSV file with successfully
processed entries and a filepath to a CSV file with erroneous entries.
As everything went well, the latter is ``None``. Let's check:
[4837]	250
	251	>>> sorted(stoneville.keys())
	252	[u'Barneys Home', ..., u'Wilmas Asylum']
	253
The values of the Cave instances have the correct type:
	255
	256	>>> barney = stoneville['Barneys Home']
	257	>>> barney.dinoports
	258	2
	259
	260	which is a number, not a string.
	261
When calling the processor, we passed more information than just the
CSV filepath. What does it all mean?

While the first argument is the path to the CSV file, we also have to
give an ordered list of header names. These replace the header field
names that are actually in the file. This way we can override faulty
headers.
	269
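Replacing the header line of a file by a list of our own can be
sketched with the standard ``csv`` module. The function
``read_with_headers`` and the misspelled sample headers are made up for
this illustration; they only show the idea and say nothing about how
``doImport`` does it internally.

```python
import csv
import io


def read_with_headers(stream, headers):
    """Read CSV rows, replacing the file's own header line by headers."""
    reader = csv.DictReader(stream, fieldnames=headers)
    next(reader)  # skip the header line that is actually in the file
    return list(reader)


# A file with faulty header names (invented sample data).
data = io.StringIO("nmae,dinoprots,owner\nBarneys Home,2,Barney\n")
rows = read_with_headers(data, ['name', 'dinoports', 'owner'])
```

Each row is then keyed by the names we passed in, regardless of what
the file itself claims.
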
The ``mode`` parameter tells what kind of operation we want to perform:
``create``, ``update``, or ``remove`` data.

The ``user`` parameter, finally, is optional and only used for logging.
	274
[4886]	275	We can, by the way, see the results of our run in a logfile if we
	276	provided a logger during the call:
[4837]	277
[4886]	278	>>> print open('stoneville.log').read()
[9739]	279	processed: newcomers.csv, create mode, 4 lines (4 successful/ 0 failed), ... s (... s/item)
[4837]	280
[9739]	281
We clean up the temporary directory created by doImport():
	283
	284	>>> import shutil
	285	>>> import os
	286	>>> shutil.rmtree(os.path.dirname(result[2]))
	287
As we can see, the processing was successful. Had there been problems,
they would be reported here too, as we can see if we run the same
operation again:
	290
[4902]	291	>>> result = processor.doImport('newcomers.csv',
[4871]	292	... ['name', 'dinoports', 'owner', 'taxpayer'],
[4886]	293	... mode='create', user='Bob', logger=logger)
[4902]	294	>>> result
[4895]	295	(4, 4, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')
[4837]	296
[4895]	297	This time we also get a path to a .pending file.
	298
[4837]	299	The log file will tell us this in more detail:
	300
[4886]	301	>>> print open('stoneville.log').read()
[9739]	302	processed: newcomers.csv, create mode, 4 lines (4 successful/ 0 failed), ... s (... s/item)
	303	processed: newcomers.csv, create mode, 4 lines (0 successful/ 4 failed), ... s (... s/item)
[4837]	304
[9739]	305
[4837]	306	This time a new file was created, which keeps all the rows we could not
[4877]	307	process and an additional column with error messages:
[4837]	308
[4902]	309	>>> print open(result[3]).read()
[4877]	310	owner,name,taxpayer,dinoports,--ERRORS--
[12868]	311	Barney,Barneys Home,1,2,This object already exists.
	312	Wilma,Wilmas Asylum,1,1,This object already exists.
	313	Fred,Freds Dinoburgers,0,10,This object already exists.
	314	Joey,Joeys Drive-in,0,110,This object already exists.
[4837]	315
	316	This way we can correct the faulty entries and afterwards retry without
	317	having the already processed rows in the way.
	318
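Preparing a retry from a pending file mostly means dropping the error
column again once the rows are corrected. A minimal sketch, assuming
the pending file layout shown above; the helper ``pending_to_retry`` is
hypothetical and not part of the batching module:

```python
import csv
import io


def pending_to_retry(pending_stream, retry_stream, error_col='--ERRORS--'):
    """Copy a pending CSV to retry_stream without the error column."""
    reader = csv.DictReader(pending_stream)
    fieldnames = [name for name in reader.fieldnames if name != error_col]
    # extrasaction='ignore' silently drops the error column of each row.
    writer = csv.DictWriter(retry_stream, fieldnames,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


pending = io.StringIO(
    "name,dinoports,--ERRORS--\n"
    "Barneys Home,2,This object already exists.\n")
retry = io.StringIO()
pending_to_retry(pending, retry)
retry_text = retry.getvalue()
```
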
We also notice that the values of the taxpayer column are returned as
in the input file. There we wrote '1' for ``True`` and '0' for
``False`` (which is accepted by the converters).
[4837]	322
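What such a conversion step does can be sketched as follows. The
converter table and the function names are assumptions made for this
example; the real converters are derived from the interface and may
accept more spellings than '1' and '0'.

```python
def to_bool(value):
    """Convert the CSV strings '1' and '0' to True and False."""
    if value == '1':
        return True
    if value == '0':
        return False
    raise ValueError('not a boolean: %r' % value)


# Hypothetical per-field converters, loosely mirroring ICave.
CONVERTERS = {'name': str, 'owner': str,
              'dinoports': int, 'taxpayer': to_bool}


def convert_row(row):
    """Return (converted_row, errors) for a dict of CSV strings."""
    converted, errors = {}, []
    for key, value in row.items():
        try:
            converted[key] = CONVERTERS[key](value)
        except ValueError as err:
            errors.append('%s: %s' % (key, err))
    return converted, errors


good, good_errors = convert_row(
    {'name': 'Barneys Home', 'dinoports': '2', 'taxpayer': '1'})
bad, bad_errors = convert_row(
    {'name': 'Wilmas Asylum', 'dinoports': 'NOT-A-NUMBER'})
```

The input row is left unchanged, which matches the observation that
pending files show the values exactly as they were written in the
input.
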
[4902]	323	Clean up:
[4871]	324
[4902]	325	>>> shutil.rmtree(os.path.dirname(result[2]))
	326
[4912]	327
We can also tell the processor to ignore some columns of the input by
passing ``--IGNORE--`` as the column name:
	330
	331	>>> result = processor.doImport('newcomers.csv', ['name',
	332	... '--IGNORE--', '--IGNORE--'],
	333	... mode='update', user='Bob')
	334	>>> result
	335	(4, 0, '...', None)
	336
	337	Clean up:
	338
	339	>>> shutil.rmtree(os.path.dirname(result[2]))
	340
If something goes wrong during processing, the respective ``--IGNORE--``
columns are left out of the resulting pending file:
[4912]	343
	344	>>> result = processor.doImport('newcomers.csv', ['name', 'dinoports',
	345	... '--IGNORE--', '--IGNORE--'],
	346	... mode='create', user='Bob')
	347	>>> result
	348	(4, 4, '...', '...')
	349
	350	>>> print open(result[3], 'rb').read()
[6824]	351	name,dinoports,--ERRORS--
[12868]	352	Barneys Home,2,This object already exists.
	353	Wilmas Asylum,1,This object already exists.
	354	Freds Dinoburgers,10,This object already exists.
	355	Joeys Drive-in,110,This object already exists.
[4912]	356
	357
	358	Clean up:
	359
	360	>>> shutil.rmtree(os.path.dirname(result[2]))
	361
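Ignoring columns by name can be sketched like this. The function
``read_ignoring`` is made up for illustration and works on plain CSV
text; it only demonstrates that columns labelled ``--IGNORE--`` never
reach the processed row.

```python
import csv
import io

IGNORE = '--IGNORE--'


def read_ignoring(stream, headers):
    """Read CSV rows under headers, dropping columns named IGNORE."""
    reader = csv.reader(stream)
    next(reader)  # skip the header line of the file
    rows = []
    for values in reader:
        rows.append({name: value for name, value in zip(headers, values)
                     if name != IGNORE})
    return rows


data = io.StringIO(
    "name,dinoports,owner,taxpayer\n"
    "Barneys Home,2,Barney,1\n")
rows = read_ignoring(data, ['name', 'dinoports', IGNORE, IGNORE])
```
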
	362
[12920]	363	Updating Entries
[4837]	364	----------------
	365
To update entries, we just call the batch processor in a different
mode:
	368
[4902]	369	>>> result = processor.doImport('newcomers.csv', ['name',
	370	... 'dinoports', 'owner'],
[4837]	371	... mode='update', user='Bob')
[4902]	372	>>> result
[4895]	373	(4, 0, '...', None)
[4837]	374
Now we want to record that Wilma got an extra port for her second dino:
[4837]	376
	377	>>> open('newcomers.csv', 'wb').write(
	378	... """name,dinoports,owner
	379	... Wilmas Asylum,2,Wilma
	380	... """)
	381
	382	>>> wilma = stoneville['Wilmas Asylum']
	383	>>> wilma.dinoports
	384	1
	385
[4902]	386	Clean up:
	387
	388	>>> shutil.rmtree(os.path.dirname(result[2]))
	389
	390
[4837]	391	We start the processor:
	392
[4902]	393	>>> result = processor.doImport('newcomers.csv', ['name',
	394	... 'dinoports', 'owner'], mode='update', user='Bob')
	395	>>> result
[4895]	396	(1, 0, '...', None)
[4837]	397
	398	>>> wilma = stoneville['Wilmas Asylum']
	399	>>> wilma.dinoports
	400	2
	401
Wilma's number of dinoports has increased.
	403
[4902]	404	Clean up:
	405
	406	>>> shutil.rmtree(os.path.dirname(result[2]))
	407
	408
If we try to update a non-existent entry, an error occurs:
	410
	411	>>> open('newcomers.csv', 'wb').write(
	412	... """name,dinoports,owner
	413	... NOT-WILMAS-ASYLUM,2,Wilma
	414	... """)
	415
[4902]	416	>>> result = processor.doImport('newcomers.csv', ['name',
	417	... 'dinoports', 'owner'],
[4837]	418	... mode='update', user='Bob')
[4902]	419	>>> result
[4895]	420	(1, 1, '/.../newcomers.finished.csv', '/.../newcomers.pending.csv')
[4902]	421
	422	Clean up:
	423
	424	>>> shutil.rmtree(os.path.dirname(result[2]))
	425
[4837]	426
Invalid values will also be spotted:
	428
	429	>>> open('newcomers.csv', 'wb').write(
	430	... """name,dinoports,owner
	431	... Wilmas Asylum,NOT-A-NUMBER,Wilma
	432	... """)
	433
[4902]	434	>>> result = processor.doImport('newcomers.csv', ['name',
	435	... 'dinoports', 'owner'],
[4837]	436	... mode='update', user='Bob')
[4902]	437	>>> result
[4895]	438	(1, 1, '...', '...')
[4837]	439
[4902]	440	Clean up:
	441
	442	>>> shutil.rmtree(os.path.dirname(result[2]))
	443
	444
We can also update only some columns, leaving others out. We skip the
'dinoports' column in the next run:
	447
	448	>>> open('newcomers.csv', 'wb').write(
	449	... """name,owner
	450	... Wilmas Asylum,Barney
	451	... """)
	452
[4902]	453	>>> result = processor.doImport('newcomers.csv', ['name', 'owner'],
	454	... mode='update', user='Bob')
	455	>>> result
[4895]	456	(1, 0, '...', None)
[4837]	457
	458	>>> wilma.owner
	459	u'Barney'
	460
[4902]	461	Clean up:
	462
	463	>>> shutil.rmtree(os.path.dirname(result[2]))
	464
	465
We cannot, however, leave out the 'location field' ('name' in our
case), as this one tells us which entry to update:
	468
	469	>>> open('newcomers.csv', 'wb').write(
	470	... """name,dinoports,owner
	471	... 2,Wilma
	472	... """)
	473
	474	>>> processor.doImport('newcomers.csv', ['dinoports', 'owner'],
	475	... mode='update', user='Bob')
	476	Traceback (most recent call last):
	477	...
	478	FatalCSVError: Need at least columns 'name' for import!
	479
This time we even get an exception!
	481
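A header check of this kind can be sketched as follows. The class
``FatalCSVError`` below is a local stand-in for the exception seen in
the traceback, and ``check_headers`` is a hypothetical helper; only the
wording of the message is taken from the output above.

```python
class FatalCSVError(Exception):
    """Local stand-in for the exception shown in the traceback."""


def check_headers(headers, location_fields):
    """Raise FatalCSVError unless all location fields are in headers."""
    missing = [field for field in location_fields if field not in headers]
    if missing:
        raise FatalCSVError(
            'Need at least columns %s for import!'
            % ', '.join(repr(field) for field in location_fields))


check_headers(['name', 'owner'], ['name'])  # passes silently
try:
    check_headers(['dinoports', 'owner'], ['name'])
except FatalCSVError as err:
    message = str(err)
```

Failing before any row is touched makes sense here: without the
location fields no row could be matched to an entry at all.
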
Generally, empty strings are treated as ``None``; the existing value
is left untouched:
[4837]	483
	484	>>> open('newcomers.csv', 'wb').write(
	485	... """name,dinoports,owner
[8227]	486	... "Wilmas Asylum","","Wilma"
[4837]	487	... """)
	488
[4902]	489	>>> result = processor.doImport('newcomers.csv', ['name',
	490	... 'dinoports', 'owner'],
[8227]	491	... mode='update', user='Bob')
[4902]	492	>>> result
[4895]	493	(1, 0, '...', None)
[4837]	494
[8227]	495	>>> wilma.dinoports
	496	2
[4837]	497
[4902]	498	Clean up:
	499
	500	>>> shutil.rmtree(os.path.dirname(result[2]))
	501
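One way to picture why the old value survives is an update step that
skips marked values, as in the ``updateEntry`` method of our
``CaveProcessor``. The sketch below is an assumption about the
mechanism, not the actual implementation: the marker value and the
helper ``prepare_row`` are invented, and ``Cave`` is a simplified
stand-in for the class defined earlier.

```python
IGNORE_MARKER = '<IGNORE>'  # placeholder value, assumed for this sketch


def prepare_row(row):
    """Replace empty strings by IGNORE_MARKER (assumed default behaviour)."""
    return {key: (IGNORE_MARKER if value == '' else value)
            for key, value in row.items()}


def update_entry(obj, row):
    """Set attributes from row, skipping values equal to IGNORE_MARKER."""
    for key, value in row.items():
        if value != IGNORE_MARKER:
            setattr(obj, key, value)


class Cave(object):
    """Simplified stand-in for the Cave class defined earlier."""
    dinoports = 2
    owner = 'Wilma'


wilma = Cave()
update_entry(wilma, prepare_row({'dinoports': '', 'owner': 'Barney'}))
```

After the call the owner has changed while the number of dinoports is
still the old one.
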
We can also have dinoports set to ``None``, although this is not a
number, because we declared the field as not required in the interface:
[4837]	504
	505	>>> open('newcomers.csv', 'wb').write(
	506	... """name,dinoports,owner
[8227]	507	... "Wilmas Asylum","XXX","Wilma"
[4837]	508	... """)
	509
[4902]	510	>>> result = processor.doImport('newcomers.csv', ['name',
	511	... 'dinoports', 'owner'],
[8227]	512	... mode='update', user='Bob', ignore_empty=False)
[4902]	513	>>> result
[4895]	514	(1, 0, '...', None)
[4837]	515
	516	>>> wilma.dinoports is None
	517	True
	518
[4902]	519	Clean up:
	520
	521	>>> shutil.rmtree(os.path.dirname(result[2]))
	522
[12920]	523	Removing Entries
[4837]	524	----------------
	525
In 'remove' mode we can delete entries. Here the validity of values in
non-location fields doesn't matter because those fields are ignored.
	528
	529	>>> open('newcomers.csv', 'wb').write(
	530	... """name,dinoports,owner
	531	... "Wilmas Asylum","ILLEGAL-NUMBER",""
	532	... """)
	533
[4902]	534	>>> result = processor.doImport('newcomers.csv', ['name',
	535	... 'dinoports', 'owner'],
[4837]	536	... mode='remove', user='Bob')
[4902]	537	>>> result
[4895]	538	(1, 0, '...', None)
[4837]	539
	540	>>> sorted(stoneville.keys())
	541	[u'Barneys Home', "Fred's home", u'Freds Dinoburgers', u'Joeys Drive-in']
	542
	543	Oops! Wilma is gone.
	544
[4902]	545	Clean up:
[4837]	546
[4902]	547	>>> shutil.rmtree(os.path.dirname(result[2]))
	548
	549
[4837]	550	Clean up:
	551
	552	>>> import os
	553	>>> os.unlink('newcomers.csv')
[4886]	554	>>> os.unlink('stoneville.log')

source: main/waeup.kofa/trunk/src/waeup/kofa/doctests/batchprocessing.txt @ 14086
