WAeUP Data Center
*****************

The WAeUP data center takes care of managing CSV files and importing them.

:Test-Layer: unit

Creating a data center
======================

A data center can be created easily:

>>> from waeup.sirp.datacenter import DataCenter
>>> mydatacenter = DataCenter()
>>> mydatacenter
<waeup.sirp.datacenter.DataCenter object at 0x...>

Each data center has a location in the file system where files are
stored:

>>> storagepath = mydatacenter.storage
>>> storagepath
'/.../waeup/sirp/files'

Managing the storage path
-------------------------

We can set another storage path:

>>> import os
>>> os.mkdir('newlocation')
>>> newpath = os.path.abspath('newlocation')
>>> mydatacenter.setStoragePath(newpath)
[]

The result is a list of filenames that could not be copied. Luckily,
this list is empty.

When we set a new storage path, we can tell the data center to move
all files from the old location to the new one. To see this feature
in action, we first have to put a file into the current location:

>>> open(os.path.join(newpath, 'myfile.txt'), 'wb').write('hello')

Now we can set a new location and the file will be copied:

>>> verynewpath = os.path.abspath('verynewlocation')
>>> os.mkdir(verynewpath)
>>> mydatacenter.setStoragePath(verynewpath, move=True)
[]
>>> storagepath = mydatacenter.storage
>>> 'myfile.txt' in os.listdir(verynewpath)
True

We remove the created file to have a clean testing environment for
the upcoming examples:

>>> os.unlink(os.path.join(storagepath, 'myfile.txt'))

Uploading files
===============

We can get a list of files stored in that location:

>>> mydatacenter.getFiles()
[]

Let's put a file into the storage:

>>> import os
>>> filepath = os.path.join(storagepath, 'data.csv')
>>> open(filepath, 'wb').write('Some Content\n')

Now we can find the file:

>>> mydatacenter.getFiles()
[<waeup.sirp.datacenter.DataCenterFile object at 0x...>]

As we can see, the actual file is wrapped by a convenience wrapper
that enables us to fetch some data about the file. The data returned
is formatted as strings, so that it can easily be put into output
pages:

>>> datafile = mydatacenter.getFiles()[0]
>>> datafile.getSize()
'13 bytes'
>>> datafile.getDate() # Nearly current datetime...
'...'

Clean up:

>>> import shutil
>>> shutil.rmtree(newpath)
>>> shutil.rmtree(verynewpath)

Distributing processed files
============================

When files have been processed by a batch processor, we can put the
resulting files into the desired destinations.

We recreate the datacenter root in case it is missing:

>>> import os
>>> dc_root = mydatacenter.storage
>>> fin_dir = os.path.join(dc_root, 'finished')
>>> unfin_dir = os.path.join(dc_root, 'unfinished')

>>> def recreate_dc_storage():
...     if os.path.exists(dc_root):
...         shutil.rmtree(dc_root)
...     os.mkdir(dc_root)
...     mydatacenter.setStoragePath(mydatacenter.storage)
>>> recreate_dc_storage()

We define a function that creates a set of faked result files:

>>> import os
>>> import tempfile
>>> def create_fake_results(source_basename, create_pending=True):
...     tmp_dir = tempfile.mkdtemp()
...     src = os.path.join(dc_root, source_basename)
...     pending_src = None
...     if create_pending:
...         pending_src = os.path.join(tmp_dir, 'mypendingsource.csv')
...     finished_src = os.path.join(tmp_dir, 'myfinishedsource.csv')
...     for path in (src, pending_src, finished_src):
...         if path is not None:
...             open(path, 'wb').write('blah')
...     return tmp_dir, src, finished_src, pending_src
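To give an idea of what to expect from the distribution step, here is a
rough sketch of where the three files end up, based only on the two
simple cases shown below (a successful run and a failed run of a fresh
source). It is *not* the real implementation, and the extra bookkeeping
done for retried ``.pending`` sources is left out::

    import os, shutil

    def sketch_dist(successful, src, finished_src, pending_src, storage):
        # Illustration only -- the real method is DataCenter.distProcessedFiles.
        basename = os.path.basename(src)            # e.g. 'mysource.csv'
        stem = basename.rsplit('.csv', 1)[0]        # 'mysource'
        if successful:
            # the fully processed source goes to finished/
            shutil.move(src, os.path.join(storage, 'finished', basename))
        else:
            # the failed source goes to unfinished/, rows still to be
            # processed stay in the storage root as <stem>.pending.csv
            shutil.move(src, os.path.join(storage, 'unfinished', basename))
            shutil.move(pending_src,
                        os.path.join(storage, stem + '.pending.csv'))
        # the collected result rows always end up in finished/
        shutil.move(finished_src,
                    os.path.join(storage, 'finished', stem + '.finished.csv'))
        # the temporary dir holding the result files is removed as well
        shutil.rmtree(os.path.dirname(finished_src))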
Now we can create the set of result files that typically come after a
successful processing of a regular source.

Now we can try to distribute those files. Let's start with a source
file that was processed successfully:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.csv', create_pending=False)
>>> mydatacenter.distProcessedFiles(True, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.csv', 'mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
[]

The temporary dir created by `create_fake_results` is removed for us
by the datacenter. This way we can be sure that fewer temporary dirs
are left hanging around:

>>> os.path.exists(tmp_dir)
False

The root dir contains no input files anymore, while the original file
and the file containing all processed data were moved to 'finished/'.

Now we restart, but this time we fake an erroneous action:

>>> recreate_dc_storage()
>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'mysource.pending.csv', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
['mysource.csv']

While the original source was moved to the 'unfinished' dir, the
pending file went to the root and the set of already processed items
is stored in 'finished/'.

We fake processing the pending file and assume that everything went
well this time:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv', create_pending=False)
>>> mydatacenter.distProcessedFiles(True, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.csv', 'mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
[]

The result is the same as in the first case shown above.

We restart again, but this time we fake several non-working imports
in a row. We start with a faulty initial import:

>>> recreate_dc_storage()
>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)

We try to process the pending file, which fails again:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)

We try to process the new pending file, which fails once more:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv')
>>> mydatacenter.distProcessedFiles(False, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'mysource.pending.csv', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
['mysource.csv']

Finally, we process the pending file and everything works:

>>> tmp_dir, src, finished_src, pending_src = create_fake_results(
...     'mysource.pending.csv', create_pending=False)
>>> mydatacenter.distProcessedFiles(True, src, finished_src,
...                                 pending_src)
>>> sorted(os.listdir(dc_root))
['finished', 'logs', 'unfinished']
>>> sorted(os.listdir(fin_dir))
['mysource.csv', 'mysource.finished.csv']
>>> sorted(os.listdir(unfin_dir))
[]

The root dir contains no input files and only the files in the
'finished' subdirectory remain.

Clean up:

>>> shutil.rmtree(verynewpath)
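Put together, a batch run that uses this distribution scheme could
look roughly like the following sketch. ``run_import`` is a made-up
stand-in for an actual batch processor; only ``distProcessedFiles``
and the ``storage`` attribute are datacenter API used above::

    src = os.path.join(mydatacenter.storage, 'mysource.csv')
    while True:
        # run_import is hypothetical; it is assumed to return a success
        # flag and the paths of the 'finished' and 'pending' result files.
        success, finished_src, pending_src = run_import(src)
        mydatacenter.distProcessedFiles(success, src, finished_src,
                                        pending_src)
        if success:
            break
        # failed rows were put into <root>/mysource.pending.csv -- retry those
        src = os.path.join(mydatacenter.storage, 'mysource.pending.csv')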
Handling imports
================

Data centers can find objects ready for CSV imports and associate
appropriate importers with them.

Getting importers
-----------------

To do so, data centers look up their parents for the nearest ancestor
that implements `ICSVDataReceivers` and grab all attributes that
provide some importer.

We therefore have to set up a proper scenario first. We start by
creating a simple thing that is ready for receiving CSV data:

>>> class MyCSVReceiver(object):
...     pass

Then we create a container for such CSV receivers:

>>> import grok
>>> from waeup.sirp.interfaces import ICSVDataReceivers
>>> from waeup.sirp.datacenter import DataCenter
>>> class SomeContainer(grok.Container):
...     grok.implements(ICSVDataReceivers)
...     def __init__(self):
...         self.some_receiver = MyCSVReceiver()
...         self.other_receiver = MyCSVReceiver()
...         self.datacenter = DataCenter()

By implementing `ICSVDataReceivers`, a pure marker interface, we
indicate that we want instances of this class to be searched for CSV
receivers.

This root container has two CSV receivers. The datacenter is also an
attribute of our root container.

Before we can go into action, we also need an importer that is able
to import data into instances of `MyCSVReceiver`:

>>> from waeup.sirp.csvfile.interfaces import ICSVFile
>>> from waeup.sirp.interfaces import IWAeUPCSVImporter
>>> from waeup.sirp.utils.importexport import CSVImporter
>>> class MyCSVImporter(CSVImporter):
...     grok.adapts(ICSVFile, MyCSVReceiver)
...     grok.provides(IWAeUPCSVImporter)
...     datatype = u'My Stuff'
...     def doImport(self, filepath, clear_old_data=True,
...                  overwrite=True):
...         print "Data imported!"

We grok the components to get the importer (which is actually an
adapter) registered with the component architecture:

>>> grok.testing.grok('waeup')
>>> grok.testing.grok_component('MyCSVImporter', MyCSVImporter)
True

Now we can create an instance of `SomeContainer`:

>>> mycontainer = SomeContainer()

As we are not creating real sites and the objects are 'placeless'
from the ZODB point of view, we fake a location by telling the
datacenter that its parent is the container:

>>> mycontainer.datacenter.__parent__ = mycontainer
>>> datacenter = mycontainer.datacenter

When a datacenter is stored in the ZODB, this step will happen
automatically.

Before we can go on, we have to set a usable path where we can store
files without doing harm:

>>> os.mkdir('filestore')
>>> filestore = os.path.abspath('filestore')
>>> datacenter.setStoragePath(filestore)
[]

Furthermore we must create a file for possible import, as we will
only get importers for which an importable file is also available:

>>> import os
>>> filepath = os.path.join(datacenter.storage, 'mydata.csv')
>>> open(filepath, 'wb').write("""col1,col2
... 'ATerm','Something'
... """)

The datacenter is now able to find the CSV receivers in its parents:

>>> datacenter.getImporters()
[<MyCSVImporter object at 0x...>, <MyCSVImporter object at 0x...>]

Imports with the WAeUP portal
-----------------------------

The examples above look complicated, but this is the price for
modularity. If you create a new container type, you can define an
importer and it will be used automatically by other components.

In the WAeUP portal the only component that actually provides CSV
data importables is the `University` object.

Getting imports (not: importers)
--------------------------------

We can get 'imports':

>>> datacenter.getPossibleImports()
[(<...DataCenterFile object at 0x...>, [(<MyCSVImporter object at 0x...>, '...'), (<MyCSVImporter object at 0x...>, '...')])]

As we can see, an import is defined here as a tuple of a
DataCenterFile and a list of available importers together with an
associated data receiver (the thing the data should go to). The data
receiver is given as a ZODB object id (if the data receiver is
persistent) or a simple id (if it is not).
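A caller that wants to actually trigger one of these imports could
unpack the structure like this. The sketch below is not executed here;
it only assumes the `MyCSVImporter.doImport` signature defined above
and builds the file path from the storage directory, since asking a
`DataCenterFile` for its path is not shown in this document::

    for datafile, importers in datacenter.getPossibleImports():
        # each entry pairs a stored file with (importer, receiver id) tuples
        for importer, receiver_id in importers:
            # path of the file created above
            importer.doImport(os.path.join(datacenter.storage, 'mydata.csv'))
            # prints "Data imported!" for the MyCSVImporter defined above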
Clean up:

>>> import shutil
>>> shutil.rmtree(filestore)

Data center helpers
===================

Data centers provide several helper methods to make their usage more
convenient.

Receivers and receiver ids
--------------------------

As already mentioned above, imports are defined as triples containing

* a file to import,

* an importer to do the import and

* an object which should be updated by the data file.

The latter is normally some kind of container, like a faculty
container or similar. This is what we call a ``receiver``, as it
receives the data from the file via the importer.

The datacenter finds receivers by looking up its parents for a
component that implements `ICSVDataReceivers` and scanning that
component for attributes that can be adapted to `ICSVImporter`.

I.e., once an `ICSVDataReceivers` parent is found, the datacenter
gets all importers that can be applied to attributes of this
component. For each attribute there can be at most one importer.

When building the importer list for a certain file, we also check
that the headers of the file comply with what the respective
importers expect. So, if a file contains broken headers, the file
won't be offered for import at all.

The contexts of the found importers then build our list of available
receivers. This also means that for each receiver provided by the
datacenter there is an importer available. If no importer can be
found for a potential receiver, this receiver will be skipped.

As one type of importer might be able to serve several receivers, we
also have to provide a unique id for each receiver. This is where
``receiver ids`` come into play.

Receiver ids of objects are determined as

* the ZODB oid of the object if the object is persistent,

* the result of id(obj) otherwise.

The value obtained this way is a long integer which we turn into a
string. If the value was derived from the ZODB oid, we also prepend
it with a ``z`` to avoid any clash with ids of non-ZODB objects (they
might deliver the same id, although this is *very* unlikely).
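A rough sketch of how such an id can be derived is shown below. The
helper name is made up and the real datacenter code may compute the
value differently; ``ZODB.utils.u64`` converts an 8-byte oid into an
integer::

    from ZODB.utils import u64

    def receiver_id(obj):
        # Illustrative helper, not part of the datacenter API.
        oid = getattr(obj, '_p_oid', None)
        if oid is not None:
            # persistent object: use its ZODB oid, marked with a 'z' prefix
            return 'z%s' % u64(oid)
        # non-persistent object: fall back to the plain Python id
        return str(id(obj))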