WAeUP Data Center
*****************

The WAeUP data center manages CSV files and imports them.

:Test-Layer: unit

Creating a data center
======================

A data center can be created easily:

    >>> from waeup.sirp.datacenter import DataCenter
    >>> mydatacenter = DataCenter()
    >>> mydatacenter
    <waeup.sirp.datacenter.DataCenter object at 0x...>

Each data center has a location in the file system where files are
stored:

    >>> storagepath = mydatacenter.storage
    >>> storagepath
    '/.../waeup/sirp/files'


Managing the storage path
-------------------------

We can set another storage path:

    >>> import os
    >>> os.mkdir('newlocation')
    >>> newpath = os.path.abspath('newlocation')
    >>> mydatacenter.setStoragePath(newpath)
    []

The result here is a list of filenames that could not be
copied. Luckily, this list is empty.
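
If the returned list is non-empty, the files named in it are still in
the old location and need manual attention. A minimal sketch of how a
caller might handle this (the logging setup is illustrative, not part
of the datacenter API):

    import logging

    failed = mydatacenter.setStoragePath(newpath)
    for name in failed:
        # Each entry names a file that could not be copied over.
        logging.warning('file not copied to new location: %s' % name)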

When we set a new storage path, we can request that all files in the
old location be moved to the new one. To see this feature in action,
we first have to put a file into the old location:

    >>> open(os.path.join(newpath, 'myfile.txt'), 'wb').write('hello')

Now we can set a new location and the file will be moved there:

    >>> verynewpath = os.path.abspath('verynewlocation')
    >>> os.mkdir(verynewpath)

    >>> mydatacenter.setStoragePath(verynewpath, move=True)
    []

    >>> storagepath = mydatacenter.storage
    >>> 'myfile.txt' in os.listdir(verynewpath)
    True

We remove the created file to have a clean testing environment for
upcoming examples:

    >>> os.unlink(os.path.join(storagepath, 'myfile.txt'))

Uploading files
===============

We can get a list of files stored in that location:

    >>> mydatacenter.getFiles()
    []

Let's put a file into the storage:

    >>> import os
    >>> filepath = os.path.join(storagepath, 'data.csv')
    >>> open(filepath, 'wb').write('Some Content\n')

Now we can find that file:

    >>> mydatacenter.getFiles()
    [<waeup.sirp.datacenter.DataCenterFile object at 0x...>]

As we can see, the actual file is wrapped by a convenience wrapper
that enables us to fetch some data about the file. The data returned
is formatted as strings, so that it can easily be put into output
pages:

    >>> datafile = mydatacenter.getFiles()[0]
    >>> datafile.getSize()
    '13 bytes'

    >>> datafile.getDate() # Nearly current datetime...
    '...'
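
The size formatting seen in `getSize()` could be implemented roughly
like the following sketch (the helper name ``format_size`` and the
unit ladder are assumptions for illustration):

    def format_size(num_bytes):
        # Walk up the unit ladder until the value is small enough.
        for unit in ('bytes', 'kB', 'MB', 'GB'):
            if num_bytes < 1024:
                return '%s %s' % (num_bytes, unit)
            num_bytes = num_bytes / 1024.0
        return '%s TB' % num_bytes

    format_size(13)   # -> '13 bytes'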

Clean up:

    >>> import shutil
    >>> shutil.rmtree(newpath)
    >>> shutil.rmtree(verynewpath)


Distributing processed files
============================

Once files have been processed by a batch processor, we can put the
resulting files into their desired destinations.

We recreate the datacenter root in case it is missing:

    >>> import os
    >>> dc_root = mydatacenter.storage
    >>> fin_dir = os.path.join(dc_root, 'finished')
    >>> unfin_dir = os.path.join(dc_root, 'unfinished')

    >>> def recreate_dc_storage():
    ...   if os.path.exists(dc_root):
    ...     shutil.rmtree(dc_root)
    ...   os.mkdir(dc_root)
    ...   mydatacenter.setStoragePath(mydatacenter.storage)
    >>> recreate_dc_storage()

We define a function that creates a set of faked result files:

    >>> import os
    >>> import tempfile
    >>> def create_fake_results(source_basename, create_pending=True):
    ...   tmp_dir = tempfile.mkdtemp()
    ...   src = os.path.join(dc_root, source_basename)
    ...   pending_src = None
    ...   if create_pending:
    ...     pending_src = os.path.join(tmp_dir, 'mypendingsource.csv')
    ...   finished_src = os.path.join(tmp_dir, 'myfinishedsource.csv')
    ...   for path in (src, pending_src, finished_src):
    ...     if path is not None:
    ...       open(path, 'wb').write('blah')
    ...   return tmp_dir, src, finished_src, pending_src

Now we can create the set of result files that typically comes from
successful processing of a regular source, and try to distribute
them. Let's start with a source file that was processed successfully:

    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.csv', create_pending=False)
    >>> mydatacenter.distProcessedFiles(True, src, finished_src,
    ...                            pending_src)
    >>> sorted(os.listdir(dc_root))
    ['finished', 'logs', 'unfinished']

    >>> sorted(os.listdir(fin_dir))
    ['mysource.csv', 'mysource.finished.csv']

    >>> sorted(os.listdir(unfin_dir))
    []

The created dir will be removed for us by the datacenter. This way we
can be assured that fewer temporary dirs are left hanging around:

    >>> os.path.exists(tmp_dir)
    False

The root dir is empty, while the original file and the file containing
all processed data were moved to 'finished/'.
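
The routing rule behind `distProcessedFiles()`, as observed in this
and the following examples, could be sketched like this (a rough
illustration of the behaviour, not the actual implementation):

    import os
    import shutil

    def route_results(successful, src, finished_src, pending_src,
                      dc_root, fin_dir, unfin_dir):
        # Strip a possible '.pending' infix from the source name.
        basename = os.path.basename(src).replace('.pending', '')
        stem = os.path.splitext(basename)[0]
        if successful:
            # Source and finished data both end up in finished/.
            shutil.move(src, os.path.join(fin_dir, basename))
        else:
            # The source is parked in unfinished/ and the pending
            # file goes back to the root for a retry.
            shutil.move(src, os.path.join(unfin_dir, basename))
            shutil.move(pending_src,
                        os.path.join(dc_root, stem + '.pending.csv'))
        shutil.move(finished_src,
                    os.path.join(fin_dir, stem + '.finished.csv'))
        # The temporary dir holding the result files is cleaned up.
        shutil.rmtree(os.path.dirname(finished_src))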

Now we restart, but this time we fake an erroneous action:

    >>> recreate_dc_storage()
    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.csv')
    >>> mydatacenter.distProcessedFiles(False, src, finished_src,
    ...                                 pending_src)
    >>> sorted(os.listdir(dc_root))
    ['finished', 'logs', 'mysource.pending.csv', 'unfinished']

    >>> sorted(os.listdir(fin_dir))
    ['mysource.finished.csv']

    >>> sorted(os.listdir(unfin_dir))
    ['mysource.csv']

While the original source was moved to the 'unfinished' dir, the
pending file went to the root and the set of already processed items
is stored in 'finished/'.

We fake processing the pending file and assume that everything went
well this time:

    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.pending.csv', create_pending=False)
    >>> mydatacenter.distProcessedFiles(True, src, finished_src,
    ...                                 pending_src)

    >>> sorted(os.listdir(dc_root))
    ['finished', 'logs', 'unfinished']

    >>> sorted(os.listdir(fin_dir))
    ['mysource.csv', 'mysource.finished.csv']

    >>> sorted(os.listdir(unfin_dir))
    []

The result is the same as in the first case shown above.
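
This cycle can be driven in a loop until an import finally succeeds.
Schematically (``process_file`` is a hypothetical stand-in for the
batch processor that produces the result files):

    src = os.path.join(dc_root, 'mysource.csv')
    while True:
        ok, finished_src, pending_src = process_file(src)
        mydatacenter.distProcessedFiles(ok, src, finished_src,
                                        pending_src)
        if ok:
            break
        # Retry with the rows that could not be imported yet.
        src = os.path.join(dc_root, 'mysource.pending.csv')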

We restart again, but this time we fake several non-working imports in
a row.

We start with a faulty initial import:

    >>> recreate_dc_storage()
    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.csv')
    >>> mydatacenter.distProcessedFiles(False, src, finished_src,
    ...                                 pending_src)

We try to process the pending file, which fails again:

    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.pending.csv')
    >>> mydatacenter.distProcessedFiles(False, src, finished_src,
    ...                                 pending_src)

We try to process the new pending file:

    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.pending.csv')
    >>> mydatacenter.distProcessedFiles(False, src, finished_src,
    ...                                 pending_src)

    >>> sorted(os.listdir(dc_root))
    ['finished', 'logs', 'mysource.pending.csv', 'unfinished']

    >>> sorted(os.listdir(fin_dir))
    ['mysource.finished.csv']

    >>> sorted(os.listdir(unfin_dir))
    ['mysource.csv']

Finally, we process the pending file and everything works:

    >>> tmp_dir, src, finished_src, pending_src = create_fake_results(
    ...  'mysource.pending.csv', create_pending=False)
    >>> mydatacenter.distProcessedFiles(True, src, finished_src,
    ...                                 pending_src)

    >>> sorted(os.listdir(dc_root))
    ['finished', 'logs', 'unfinished']

    >>> sorted(os.listdir(fin_dir))
    ['mysource.csv', 'mysource.finished.csv']

    >>> sorted(os.listdir(unfin_dir))
    []

The root dir is empty (contains no input files) and only the files in
the 'finished' subdirectory remain.

Clean up:

    >>> shutil.rmtree(verynewpath)

Handling imports
================

Data centers can find objects ready for CSV imports and associate
appropriate importers with them.

Getting importers
-----------------

To do so, data centers search their ancestors for the nearest one
that implements `ICSVDataReceivers` and grab all attributes that
provide some importer.

We therefore have to set up a proper scenario first.

We start by creating a simple thing that is ready for receiving CSV
data:

    >>> class MyCSVReceiver(object):
    ...   pass

Then we create a container for such a CSV receiver:

    >>> import grok
    >>> from waeup.sirp.interfaces import ICSVDataReceivers
    >>> from waeup.sirp.datacenter import DataCenter
    >>> class SomeContainer(grok.Container):
    ...   grok.implements(ICSVDataReceivers)
    ...   def __init__(self):
    ...     self.some_receiver = MyCSVReceiver()
    ...     self.other_receiver = MyCSVReceiver()
    ...     self.datacenter = DataCenter()

By implementing `ICSVDataReceivers`, a pure marker interface, we
indicate that we want instances of this class to be searched for CSV
receivers.

This root container has two CSV receivers.

The datacenter is also an attribute of our root container.

Before we can go into action, we also need an importer that is able
to import data into instances of `MyCSVReceiver`:

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> from waeup.sirp.interfaces import IWAeUPCSVImporter
    >>> from waeup.sirp.utils.importexport import CSVImporter
    >>> class MyCSVImporter(CSVImporter):
    ...   grok.adapts(ICSVFile, MyCSVReceiver)
    ...   grok.provides(IWAeUPCSVImporter)
    ...   datatype = u'My Stuff'
    ...   def doImport(self, filepath, clear_old_data=True,
    ...                                overwrite=True):
    ...     print "Data imported!"

We grok the components to get the importer (which is actually an
adapter) registered with the component architecture:

    >>> grok.testing.grok('waeup')
    >>> grok.testing.grok_component('MyCSVImporter', MyCSVImporter)
    True
Now we can create an instance of `SomeContainer`:

    >>> mycontainer = SomeContainer()

As we are not creating real sites and the objects are 'placeless' from
the ZODB point of view, we fake a location by telling the datacenter
that its parent is the container:

    >>> mycontainer.datacenter.__parent__ = mycontainer
    >>> datacenter = mycontainer.datacenter

When a datacenter is stored in the ZODB, this step will happen
automatically.

Before we can go on, we have to set a usable path where we can store
files without doing harm:

    >>> os.mkdir('filestore')
    >>> filestore = os.path.abspath('filestore')
    >>> datacenter.setStoragePath(filestore)
    []

Furthermore we must create a file for possible import, as we will
only get importers for which an importable file is also available:

    >>> import os
    >>> filepath = os.path.join(datacenter.storage, 'mydata.csv')
    >>> open(filepath, 'wb').write("""col1,col2
    ... 'ATerm','Something'
    ... """)

The datacenter is now able to find the CSV receivers in its parents:

    >>> datacenter.getImporters()
    [<MyCSVImporter object at 0x...>, <MyCSVImporter object at 0x...>]


Imports with the WAeUP portal
-----------------------------

The example above looks complicated, but this is the price of
modularity: if you create a new container type, you can define an
importer for it and it will be picked up automatically by other
components.

In the WAeUP portal the only component that actually provides CSV data
importables is the `University` object.


Getting imports (not: importers)
--------------------------------

We can get 'imports':

    >>> datacenter.getPossibleImports()
    [(<...DataCenterFile object at 0x...>,
      [(<MyCSVImporter object at 0x...>, '...'),
       (<MyCSVImporter object at 0x...>, '...')])]

As we can see, an import is defined here as a tuple of a
DataCenterFile and a list of available importers, each with an
associated data receiver (the thing the data should go to).

The data receiver is given as a ZODB object id (if the data receiver
is persistent) or a simple id (if it is not).

Clean up:

    >>> import shutil
    >>> shutil.rmtree(filestore)


Data center helpers
===================

Data centers provide several helper methods to make their usage more
convenient.


Receivers and receiver ids
--------------------------

As already mentioned above, imports are defined as triples containing

* a file to import,

* an importer to do the import and

* an object which should be updated by the data file.

The latter is normally some kind of container, like a faculty
container or similar. This is what we call a ``receiver``, as it
receives the data from the file via the importer.

The datacenter finds receivers by looking up its ancestors for a
component that implements `ICSVDataReceivers` and scanning that
component for attributes that can be adapted to `ICSVImporter`.

That is, once an `ICSVDataReceivers` parent is found, the datacenter
gets all importers that can be applied to attributes of this
component. For each attribute there can be at most one importer, as
sketched below.
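
In sketch form, using the standard `zope.component` multi-adapter
lookup (the helper name and the exact interface used for the lookup
are assumptions for illustration):

    from zope.component import queryMultiAdapter
    from waeup.sirp.interfaces import IWAeUPCSVImporter

    def find_importers(ancestor, csv_file):
        # Try to adapt each (file, attribute) pair to an importer;
        # attributes without a registered importer are skipped.
        importers = []
        for name in dir(ancestor):
            attr = getattr(ancestor, name)
            importer = queryMultiAdapter((csv_file, attr),
                                         IWAeUPCSVImporter)
            if importer is not None:
                importers.append(importer)
        return importers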

When building the importer list for a certain file, we also check
that the headers of the file comply with what the respective importers
expect. So, if a file contains broken headers, the file won't be
offered for import at all.
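
Such a header check could look roughly like the following sketch (the
``required_headers`` attribute is an assumption made up for this
illustration):

    import csv

    def headers_ok(filepath, importer):
        # Read the first row of the CSV file and compare it with the
        # headers the importer expects.
        reader = csv.reader(open(filepath, 'rb'))
        headers = reader.next()
        return set(importer.required_headers).issubset(set(headers))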

The contexts of the found importers then make up our list of available
receivers. This also means that for each receiver provided by the
datacenter, an importer is available as well.

If no importer can be found for a potential receiver, this receiver
will be skipped.

As one type of importer might be able to serve several receivers, we
also have to provide a unique id for each receiver. This is where
``receiver ids`` come into play.

Receiver ids of objects are determined as

* the ZODB oid of the object if the object is persistent,

* the result of id(obj) otherwise.

The value obtained this way is a long integer which we turn into a
string. If the value was taken from the ZODB oid, we also prepend it
with a ``z`` to avoid any clash with non-ZODB objects (they might
deliver the same id, although this is *very* unlikely).
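
A helper computing such ids could look like this sketch (the function
name is made up; ``_p_oid`` is the standard attribute carrying the
oid of a ZODB persistent object):

    from struct import unpack

    def get_receiver_id(obj):
        # Persistent objects carry an 8-byte ZODB oid; unpack it into
        # an integer and prepend 'z' to avoid clashes with plain id()
        # values of non-persistent objects.
        oid = getattr(obj, '_p_oid', None)
        if oid is not None:
            return 'z%s' % unpack('>Q', oid)[0]
        return '%s' % id(obj)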