source: main/waeup.sirp/trunk/src/waeup/sirp/csvfile/README.txt @ 4931

Last change on this file since 4931 was 4920, checked in by uli, 15 years ago

Make unit tests run again with the new package layout.

:mod:`waeup.sirp.csvfile` -- generic support for handling CSV files
*******************************************************************

:Test-Layer: unit

.. module:: waeup.sirp.csvfile
   :synopsis: generic support for handling CSV files.


.. note::

   This version of the :mod:`waeup.sirp.csvfile` module doesn't support
   Unicode input.  Also, there are currently some issues regarding
   ASCII NUL characters.  Accordingly, all input should be UTF-8 or
   printable ASCII to be safe. These restrictions will be removed in
   the future.


Module Contents
================

:class:`CSVFile`
----------------

.. class:: CSVFile(filepath)

   Wrapper around the path to a real CSV file. :class:`CSVFile` is an
   adapter that adapts basestring objects (aka regular and unicode
   strings).

   :class:`CSVFile` is designed as a base for derived, more
   specialized types of CSV file wrappers, although it can serve as a
   basic wrapper for simple, unspecified CSV files.

   .. method:: grok.context(basestring)
      :noindex:

      We bind to basestring objects.

   .. method:: grok.implements(ICSVFile)
      :noindex:

   .. attribute:: required_fields=[]

      A list of header fields (strings) required for this kind of
      CSVFile. Using the default constructor will fail with paths to
      files that do *not* provide those fields.

      The default value (empty list) means: no special fields required
      at all.

      Deriving classes can override this attribute to accept only
      files that provide the appropriate header fields. The default
      constructor already checks this.

   .. attribute:: path

      A string describing the path to the associated CSV file.

   .. method:: getData()

      Returns a generator delivering one data row of the denoted file
      at a time.

      Each data row is delivered as a dictionary mapping from header
      field names to the values. Therefore a source line like this::

        field1,field2
        ...
        data 1,data 2

      will become a dictionary like this::

        {'field1' : 'data 1',
         'field2' : 'data 2'}

      The :meth:`getData` method does not check the file values for
      correct data types.

   .. method:: getHeaderFields()

      Get a sorted list of header fields in the wrapped CSV file.

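To illustrate the behaviour of :meth:`getData` and
:meth:`getHeaderFields` described above, here is a minimal sketch that
uses only the standard :mod:`csv` module. It is not the implementation
of :class:`CSVFile`: the function names ``get_data`` and
``get_header_fields`` are made up for this example, and the code is
written for Python 3, whereas the doctests in this file use Python 2.

```python
import csv
import os
import tempfile


def get_header_fields(path):
    """Return the sorted header fields of the CSV file at ``path``."""
    with open(path, newline="") as fd:
        reader = csv.DictReader(fd)
        # fieldnames is None for an empty file
        return sorted(reader.fieldnames or [])


def get_data(path):
    """Yield one dict per data row, mapping header fields to raw values."""
    with open(path, newline="") as fd:
        for row in csv.DictReader(fd):
            yield dict(row)


# Usage: write a small sample file, read it back, clean up.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as out:
    out.write("field1,field2\ndata 1,data 2\n")

rows = list(get_data(sample_path))
headers = get_header_fields(sample_path)
os.unlink(sample_path)
```

Because ``get_data`` is a generator function, the file is only opened
once the caller starts iterating over the result.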

:func:`getCSVFile`
------------------

.. function:: getCSVFile(filepath)

   `filepath`
      String with a filepath to an existing CSV file.

   Get a CSV file wrapper for the given `filepath`. :func:`getCSVFile`
   knows about all registered :class:`CSVFile` wrappers and
   searches them for the most appropriate one.

   If none can be found, ``None`` is returned.

   .. seealso::

      :ref:`getcsvfilewrapper`,

      :ref:`getcsvfiledecision`

Helpers
=======

Some helper functions make handling CSV data more convenient.

:func:`toBool`
--------------

.. function:: toBool(string)

   `string`
      String containing some CSV data.

   Turn a string into a boolean value.

   If the string contains one of the values ``'true'``, ``'yes'``,
   ``'y'``, ``'on'`` or ``'checked'`` then ``True`` is returned,
   ``False`` otherwise.

   The string can be uppercase, lowercase or mixed:

     >>> from waeup.sirp.csvfile import toBool
     >>> toBool('y')
     True

     >>> toBool('Yes')
     True

     >>> toBool('TRUE')
     True

     >>> toBool('no')
     False

   If we pass in a boolean then this will be returned unchanged:

     >>> toBool(True)
     True

     >>> toBool(False)
     False

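The rules for :func:`toBool` can be sketched in a few lines. This is
an assumption about how such a helper might look, not the code shipped
in :mod:`waeup.sirp.csvfile`; in particular the name ``to_bool`` is
hypothetical, and the sketch assumes the whole string must equal one of
the listed values after lowercasing.

```python
# Values that count as true, compared case-insensitively.
TRUE_VALUES = frozenset(["true", "yes", "y", "on", "checked"])


def to_bool(value):
    """Turn CSV string data into a boolean; booleans pass through."""
    if isinstance(value, bool):
        return value
    return value.lower() in TRUE_VALUES


results = [to_bool(v) for v in ("y", "Yes", "TRUE", "no", True, False)]
```

Anything not in the set, including the empty string, yields ``False``
in this sketch.
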

Basic example
=============

To initialize the whole framework we have to grok the :mod:`waeup.sirp`
package first:

    >>> import grok
    >>> grok.testing.grok('waeup.sirp')

Create a file:

    >>> path = 'mycsvfile.csv'
    >>> open(path, 'wb').write(
    ... """col1,col2
    ... dataitem1,dataitem2
    ... item3,item4
    ... """)

A plain file path is difficult to handle in a component framework.
Therefore we get a wrapper for it that makes a CSV file object out of
the path string:

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> src = ICSVFile(path)
    >>> src
    <waeup.sirp.csvfile.csvfile.CSVFile object at 0x...>

Create a receiver:

    >>> from waeup.sirp.csvfile.interfaces import ICSVDataReceiver
    >>> class Receiver(object):
    ...   grok.implements(ICSVDataReceiver)
    ...   def receive(self, data):
    ...     print "RECEIVED: ", data

    >>> recv = Receiver()

Try to find a connector. None is registered yet, so the lookup fails:

    >>> from zope.component import getMultiAdapter
    >>> from waeup.sirp.csvfile.interfaces import ICSVDataConnector
    >>> conn = getMultiAdapter((src, recv), ICSVDataConnector)
    Traceback (most recent call last):
    ...
    ComponentLookupError: ((<waeup.sirp.csvfile.csvfile.CSVFile object at 0x...>,
                            <Receiver object at 0x...>),
        <InterfaceClass waeup.sirp.csvfile.interfaces.ICSVDataConnector>, u'')

So we create a connector:

    >>> class Connector1(grok.MultiAdapter):
    ...   grok.adapts(ICSVFile, ICSVDataReceiver)
    ...   grok.provides(ICSVDataConnector)
    ...   def __init__(self, source, receiver):
    ...     self.source = source
    ...     self.receiver = receiver
    ...   def doImport(self):
    ...     self.receiver.receive(
    ...        self.source.getData())

    >>> grok.testing.grok_component('Connector1', Connector1)
    True

Now the lookup succeeds:

    >>> conn = getMultiAdapter((src, recv), ICSVDataConnector)
    >>> conn
    <Connector1 object at 0x...>

    >>> conn.doImport()
    RECEIVED: <generator object at 0x...>

Clean up:

    >>> import os
    >>> os.unlink(path)

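The lookup pattern used in the basic example can be modelled without
any component framework. The following sketch is only an analogy: the
``registry`` dictionary and ``get_connector`` function are hypothetical
stand-ins for what ``getMultiAdapter`` does, and none of the class
names below belong to :mod:`waeup.sirp.csvfile`.

```python
class Source:
    """Stands in for a CSV file wrapper; delivers rows lazily."""

    def __init__(self, rows):
        self.rows = rows

    def get_data(self):
        for row in self.rows:
            yield row


class Receiver:
    """Collects whatever data it is handed."""

    def __init__(self):
        self.received = []

    def receive(self, data):
        # The receiver, not the connector, consumes the generator.
        self.received.extend(data)


# Hypothetical registry: (source type, receiver type) -> connector class.
registry = {}


def get_connector(source, receiver):
    """Return a connector for the pair or raise LookupError."""
    for (src_type, recv_type), factory in registry.items():
        if isinstance(source, src_type) and isinstance(receiver, recv_type):
            return factory(source, receiver)
    raise LookupError("no connector registered for this pair")


class Connector:
    def __init__(self, source, receiver):
        self.source = source
        self.receiver = receiver

    def do_import(self):
        self.receiver.receive(self.source.get_data())


src = Source([{"col1": "dataitem1", "col2": "dataitem2"}])
recv = Receiver()

# Before registration the lookup fails, as in the doctest above.
try:
    get_connector(src, recv)
    found_before = True
except LookupError:
    found_before = False

registry[(Source, Receiver)] = Connector
get_connector(src, recv).do_import()
```

As in the doctest, the connector passes the generator on untouched;
here the receiver iterates over it and stores the rows.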

CSV file wrappers
=================

CSV file wrappers can extract data from CSV files denoted by a path.

We create a CSV file:

    >>> path = 'mycsvfile.csv'
    >>> open(path, 'wb').write(
    ... """col1,col2
    ... dataitem1,dataitem2
    ... item3,item4
    ... """)

Now we get a CSV file wrapper for it. This is simply done by asking
for an adapter to the path string:

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> wrapper = ICSVFile(path)
    >>> wrapper
    <waeup.sirp.csvfile.csvfile.CSVFile object at 0x...>

This wrapper can return the CSV data as a sequence of dicts:

    >>> wrapper.getData()
    <generator object at 0x...>

As we see, the dicts (each representing one row) are delivered by a
generator. We can list them:

    >>> list(wrapper.getData())
    [{'col2': 'dataitem2', 'col1': 'dataitem1'},
     {'col2': 'item4', 'col1': 'item3'}]

We can get a list of the header fields found in the file:

    >>> wrapper.getHeaderFields()
    ['col1', 'col2']

.. _getcsvfilewrapper:

Getting a wrapper
=================

If we want to get the wrapper best suited for our purposes, we can also
use the :func:`getCSVFile` function:

    >>> from waeup.sirp.csvfile.csvfile import getCSVFile
    >>> wrapper = getCSVFile(path)
    >>> wrapper
    <waeup.sirp.csvfile.csvfile.CSVFile object at 0x...>

As we currently have only one type of wrapper, we get that one. Let's
create another wrapper that requires a column 'col1':

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> from waeup.sirp.csvfile import CSVFile
    >>> class ICSVFileWithCol1(ICSVFile):
    ...   """A CSV file that contains a 'col1' column.
    ...   """

    >>> class CSVFileWithCol1(CSVFile):
    ...   required_fields = ['col1']
    ...   grok.implements(ICSVFileWithCol1)
    ...   grok.provides(ICSVFileWithCol1)

We have to grok the new class:

    >>> grok.testing.grok_component('CSVFileWithCol1', CSVFileWithCol1)
    True

Now we can ask for a wrapper again, but this time we will get a
CSVFileWithCol1 instance:

    >>> getCSVFile(path)
    <CSVFileWithCol1 object at 0x...>

If we cannot get a wrapper at all, ``None`` is returned:

    >>> getCSVFile('not-existent-file') is None
    True

.. _getcsvfiledecision:


How :func:`getCSVFile` decides which wrapper to use
---------------------------------------------------

Apparently, :func:`getCSVFile` performs some magic: given a certain
CSV file, it decides which one of all registered wrappers suits the
file best.

This decision is based on a score, which is computed as shown below
for each registered wrapper.

Before we can show this, we create some more CSV files.

One file that does not contain valid CSV data:

     >>> nocsvpath = 'nocsvfile.csv'
     >>> open(nocsvpath, 'wb').write(
     ... """blah blah blah.
     ... blubb blubb.
     ... """)

One file that contains a 'col1' and a 'col3' column:

    >>> path2 = 'mycsvfile2.csv'
    >>> open(path2, 'wb').write(
    ... """col1,col3
    ... dataitem1b,dataitem2b
    ... item3b,item4b
    ... """)

We create a wrapper that requires 'col1' and 'col2' but does not check
in its constructor whether this requirement is met:

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> from waeup.sirp.csvfile import CSVFile
    >>> class ICSVFile13(ICSVFile):
    ...   """A CSV file that should contain 'col1' and 'col2' columns.
    ...   """

    >>> class CSVFile13(CSVFile):
    ...   required_fields = ['col1', 'col2']
    ...   grok.context(basestring)
    ...   grok.implements(ICSVFile13)
    ...   grok.provides(ICSVFile13)
    ...   def __init__(self, context):
    ...     self.path = context

    >>> grok.testing.grok_component('CSVFile13',  CSVFile13)
    True

.. warning:: This is bad design as :class:`ICSVFile` instances should
          always raise an exception in their constructor if a file
          does not meet the basic requirements.

          The base constructor will check for the correct values. So,
          if you do not override the base constructor, instances will
          check at creation time whether they can handle the desired
          file.


We create a wrapper that requires 'col1' and 'col2':

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> from waeup.sirp.csvfile import CSVFile
    >>> class ICSVFile12(ICSVFile):
    ...   """A CSV file that contains 'col1' and 'col2' columns.
    ...   """

    >>> class CSVFile12(CSVFile):
    ...   required_fields = ['col1', 'col2']
    ...   grok.context(basestring)
    ...   grok.implements(ICSVFile12)
    ...   grok.provides(ICSVFile12)

    >>> grok.testing.grok_component('CSVFile12',  CSVFile12)
    True

Now the rules applied to all wrappers are:

* If no instance of a certain wrapper can be created from the given
  path (i.e. ``__init__`` raises some kind of exception), the score
  is -1:

    >>> from waeup.sirp.csvfile.csvfile import getScore
    >>> getScore('nonexistant', CSVFile)
    -1

* If a wrapper requires at least one header field and the given file
  does not provide all of the required fields, the score is -1:

    >>> getScore(path2, CSVFile12)
    -1

* If a wrapper requires no header fields at all
  (i.e. `required_fields` equals the empty list), the score is 0 (zero):

    >>> getScore(path, CSVFile)
    0

* If a wrapper requires at least one header field and all required
  fields also appear in the file, the score is the number of required
  fields:

    >>> getScore(path, CSVFileWithCol1)
    1

    >>> getScore(path, CSVFile12)
    2

If several wrappers get the same score for a certain file, the result
is undefined.

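The scoring rules above can be expressed compactly in plain Python. The
sketch below is an illustration under stated assumptions, not the real
``getScore`` or :func:`getCSVFile`: the wrapper classes are simplified
stand-ins without any grok registration, the list of candidate wrappers
is passed in explicitly instead of being looked up, and the names
``read_header_fields``, ``get_score`` and ``get_csv_file`` are
hypothetical. It targets Python 3.

```python
import csv
import os
import tempfile


def read_header_fields(path):
    """Return the fields of the first line of the CSV file at ``path``."""
    with open(path, newline="") as fd:
        return next(csv.reader(fd), [])


class CSVFile:
    """Stand-in wrapper: construction fails if the file cannot be read."""

    required_fields = []

    def __init__(self, path):
        read_header_fields(path)
        self.path = path


class CSVFileWithCol1(CSVFile):
    required_fields = ["col1"]


class CSVFile12(CSVFile):
    required_fields = ["col1", "col2"]


def get_score(path, wrapper_class):
    """Score ``wrapper_class`` for ``path`` following the rules above."""
    try:
        wrapper_class(path)
        headers = read_header_fields(path)
    except Exception:
        return -1  # no instance can be created
    required = wrapper_class.required_fields
    if any(name not in headers for name in required):
        return -1  # a required field is missing
    return len(required)  # 0 if nothing is required


def get_csv_file(path, wrapper_classes):
    """Return an instance of the best-scoring wrapper, or None."""
    scored = [(get_score(path, cls), cls) for cls in wrapper_classes]
    candidates = [(score, cls) for score, cls in scored if score >= 0]
    if not candidates:
        return None
    # max() keeps the first of several equal scores; the rules above
    # leave the outcome of a tie open.
    _, best_class = max(candidates, key=lambda pair: pair[0])
    return best_class(path)


def write_temp_csv(text):
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as out:
        out.write(text)
    return path


wrappers = [CSVFile, CSVFileWithCol1, CSVFile12]
path = write_temp_csv("col1,col2\ndataitem1,dataitem2\nitem3,item4\n")
path2 = write_temp_csv("col1,col3\ndataitem1b,dataitem2b\n")

scores = [get_score(path, cls) for cls in wrappers]
score_missing_col = get_score(path2, CSVFile12)
best = get_csv_file(path, wrappers)
nothing = get_csv_file(path + ".missing", wrappers)

os.unlink(path)
os.unlink(path2)
```

Checking the header fields inside ``get_score`` as well covers wrappers
whose constructor does not validate them, which is the situation the
warning above describes.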

How to build custom CSV file wrappers
=====================================

A typical CSV file wrapper can be built like this:

    >>> import grok
    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> from waeup.sirp.csvfile import CSVFile

    >>> class ICustomCSVFile(ICSVFile):
    ...   """A marker for custom CSV files."""

    >>> class CustomCSVFile(CSVFile):
    ...   required_fields = ['somecol', 'othercol']
    ...   grok.implements(ICustomCSVFile)
    ...   grok.provides(ICustomCSVFile)

    >>> grok.testing.grok_component('CustomCSVFile',  CustomCSVFile)
    True

The special things here are:

* Derive from :class:`CSVFile`

  :func:`getCSVFile` looks only for classes that are derived from
  :class:`CSVFile`. So if you want your wrapper to be found by this function,
  derive from :class:`CSVFile`.

  As :class:`CSVFile` is an adapter, our custom wrapper becomes one
  as well (adapting strings). Adapting a file that lacks the required
  columns fails:

     >>> ICustomCSVFile(path)
     Traceback (most recent call last):
     ...
     TypeError: Missing columns in CSV file: ['somecol', 'othercol']

  If our input file provides the required columns, it will work:

     >>> custompath = 'mycustom.csv'
     >>> open(custompath, 'wb').write(
     ... """somecol,othercol,thirdcol
     ... dataitem1,dataitem2,dataitem3
     ... """)

     >>> ICustomCSVFile(custompath)
     <CustomCSVFile object at 0x...>

* Provide and implement a custom interface

  A custom CSV file wrapper should provide and implement its own
  interface. Otherwise other components could not require it
  explicitly.

  We have to provide *and* implement the custom interface because
  otherwise instances would not implement the required interface
  (maybe due to a flaw in `grok`/`martian`).

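The constructor check that makes ``ICustomCSVFile(path)`` fail above
can be sketched as follows. This is a plain-Python approximation for
Python 3 and assumes nothing about how :class:`CSVFile` is really
implemented; only the ``required_fields`` attribute and the wording of
the error message are taken from this document, and the helper
``write_temp_csv`` is made up for the example.

```python
import csv
import os
import tempfile


class CSVFile:
    """Stand-in base class whose constructor validates the header."""

    required_fields = []

    def __init__(self, path):
        with open(path, newline="") as fd:
            headers = next(csv.reader(fd), [])
        missing = [name for name in self.required_fields
                   if name not in headers]
        if missing:
            raise TypeError("Missing columns in CSV file: %r" % (missing,))
        self.path = path


class CustomCSVFile(CSVFile):
    # Overriding the attribute is all a derived wrapper has to do.
    required_fields = ["somecol", "othercol"]


def write_temp_csv(text):
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as out:
        out.write(text)
    return path


good = write_temp_csv("somecol,othercol,thirdcol\nitem1,item2,item3\n")
bad = write_temp_csv("col1,col2\ndataitem1,dataitem2\n")

wrapper = CustomCSVFile(good)
try:
    CustomCSVFile(bad)
    error = None
except TypeError as exc:
    error = str(exc)

os.unlink(good)
os.unlink(bad)
```

Extra columns such as ``thirdcol`` are accepted; only missing required
columns make construction fail.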

Common Use Cases
================

Get a wrapper for a certain type of CSV file
--------------------------------------------

The type of a :class:`CSVFile` is determined by the interfaces it
provides.

If we want a wrapper that is guaranteed to support certain fields, and
want the request to fail otherwise, then we already know the wanted
type and can ask for its interface directly.

We create a file that does not have a 'special_col' field:

    >>> path = 'mycsvfile.csv'
    >>> open(path, 'wb').write(
    ... """col1,col2
    ... dataitem1,dataitem2
    ... item3,item4
    ... """)

Now we create a wrapper that requires that field:

    >>> from waeup.sirp.csvfile.interfaces import ICSVFile
    >>> from waeup.sirp.csvfile import CSVFile
    >>> class ICSVFileWithSpecialCol(ICSVFile):
    ...   """A CSV file that contains a 'special_col' column.
    ...   """

    >>> class CSVFileWithSpecialCol(CSVFile):
    ...   required_fields = ['special_col']
    ...   grok.provides(ICSVFileWithSpecialCol)

    >>> grok.testing.grok_component('CSVFileWithSpecialCol',
    ...                             CSVFileWithSpecialCol)
    True

If we ask for a wrapper of that kind for our file, the request fails:

    >>> ICSVFileWithSpecialCol(path)
    Traceback (most recent call last):
    ...
    TypeError: Missing columns in CSV file: ['special_col']

If the required column is available, however, we get the wrapper:

    >>> path2 = 'mycsvfile2.csv'
    >>> open(path2, 'wb').write(
    ... """col1,col2,special_col
    ... dataitem1,dataitem2,dataitem3
    ... item4,item5,item6
    ... """)
    >>> ICSVFileWithSpecialCol(path2)
    <CSVFileWithSpecialCol object at 0x...>

Build an importer framework
---------------------------

We can also build an importer framework with CSV file support using
the components described above.

To model this, we start with two files that we will import later on:

    >>> path = 'mycsvfile.csv'
    >>> open(path, 'wb').write(
    ... """col1,col2
    ... dataitem1a,dataitem2a
    ... item3a,item4a
    ... """)

    >>> path2 = 'mycsvfile2.csv'
    >>> open(path2, 'wb').write(
    ... """col1,col3
    ... dataitem1b,dataitem2b
    ... item3b,item4b
    ... """)

Then we create two receivers for CSV file data:

    >>> from zope.interface import Interface
    >>> class IReceiver1(Interface):
    ...   """A CSV data receiver."""

    >>> class IReceiver2(Interface):
    ...   """Another CSV data receiver."""

    >>> class Receiver1(object):
    ...   grok.implements(IReceiver1)
    ...   def receive(self, data):
    ...     print "Receiver1 received: ", data

    >>> class Receiver2(object):
    ...   grok.implements(IReceiver2)
    ...   def receive(self, data):
    ...     print "Receiver2 received: ", data


If we want to be sure that a wrapper provides the 'col1' and 'col2'
fields, we ask for ICSVFile12:

    >>> wrapper1 = ICSVFile12(path)
    >>> wrapper1
    <CSVFile12 object at 0x...>

We cannot use this interface (adapter) with the other CSV file:

    >>> wrapper2 = ICSVFile12(path2)
    Traceback (most recent call last):
    ...
    TypeError: Missing columns in CSV file: ['col2']

The last step is to build a bridge between the receivers and the
sources. We call it a connector or importer here:

    >>> class IImporter(Interface):
    ...   """Import sources to receivers."""

    >>> class IImporter12(IImporter):
    ...   """Imports ICSVFile12 data into IReceiver1 objects."""

    >>> class Importer12(grok.MultiAdapter):
    ...   grok.adapts(ICSVFile12, IReceiver1)
    ...   grok.implements(IImporter12)
    ...   def __init__(self, csvfile, receiver):
    ...     self.csvfile = csvfile
    ...     self.receiver = receiver
    ...   def doImport(self):
    ...     self.receiver.receive(self.csvfile.getData())

    >>> grok.testing.grok_component('Importer12',  Importer12)
    True

We can create an importer if we know the type of CSV file:

    >>> myrecv = Receiver1()
    >>> myfile = ICSVFile12(path)
    >>> myfile
    <CSVFile12 object at 0x...>

    >>> ICSVFile12.providedBy(myfile)
    True

    >>> from zope.component import getMultiAdapter
    >>> myimporter = getMultiAdapter((myfile, myrecv), IImporter12)
    >>> myimporter
    <Importer12 object at 0x...>


We can also create an importer without knowing the type of CSV file
beforehand, using the :func:`getCSVFile` function:

    >>> from waeup.sirp.csvfile import getCSVFile
    >>> myfile = getCSVFile(path)
    >>> ICSVFile12.providedBy(myfile)
    True

Apparently, :func:`getCSVFile` knows that a CSVFile12 suits the
contents of our file best. We did not specify what type of file
wrapper we want.

Getting an importer now is easy:

    >>> myimporter = getMultiAdapter((myfile, myrecv), IImporter12)
    >>> myimporter
    <Importer12 object at 0x...>


Clean up:

    >>> import os
    >>> os.unlink(path)
    >>> os.unlink(path2)
    >>> os.unlink(nocsvpath)
    >>> os.unlink(custompath)