From: Tim on 5 Jul 2010 13:01

Hullo

CSV is a very common format for publishing data as a form of primitive integration. It's an annoyingly brittle approach, so I'd like to ensure that I capture errors as soon as possible, so that I can get the upstream processes fixed, or at worst put in some correction mechanisms and avoid getting polluted data into my analyses.

A symptom of several types of errors is that the number of fields being interpreted varies over a file (e.g. from wrongly embedded quote strings or mishandled embedded newlines). My preferred approach would be to get DictReader to throw an exception when encountering such oddities, but at the moment it seems to try to patch over the error and fill in the blanks for short lines, or ignore long lines. I know that I can use the restval parameter and then check for what's been parsed when I get my results back, but this seems brittle, as whatever I use for restval could legitimately be in the data.

Is there any way to get csv.DictReader to throw an exception on such simple line errors, or am I going to have to use csv.reader and explicitly check the number of fields read in on each line?

cheers
Tim
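For reference, the patching-over behaviour described above can be reproduced with a small sketch. The sample data is made up for illustration, and the code assumes Python 3. Note that extra fields on a long line are not dropped outright: they are collected in a list under the restkey key, which defaults to None.

```python
import csv
import io

# Hypothetical sample: a three-field header, then one well-formed row,
# one short row and one long row.
sample = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"

rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row)

# The short row has its missing field filled with restval (default None).
short_row = rows[1]
# The long row keeps its surplus fields in a list under restkey (default None).
long_row = rows[2]
```

No exception is raised for either malformed row, so the mismatch only shows up if the caller inspects each dict afterwards.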
From: Peter Otten on 5 Jul 2010 13:32
Tim wrote:

> CSV is a very common format for publishing data as a form of primitive
> integration. It's an annoyingly brittle approach, so I'd like to
> ensure that I capture errors as soon as possible, so that I can get
> the upstream processes fixed, or at worst put in some correction
> mechanisms and avoid getting polluted data into my analyses.
>
> A symptom of several types of errors is that the number of fields
> being interpreted varies over a file (e.g. from wrongly embedded quote
> strings or mishandled embedded newlines). My preferred approach would
> be to get DictReader to throw an exception when encountering such
> oddities, but at the moment it seems to try to patch over the error
> and fill in the blanks for short lines, or ignore long lines. I know
> that I can use the restval parameter and then check for what's been
> parsed when I get my results back, but this seems brittle as whatever
> I use for restval could legitimately be in the data.
>
> Is there any way to get csv.DictReader to throw an exception on such
> simple line errors, or am I going to have to use csv.reader and
> explicitly check for the number of fields read in on each line?

I think you have to use csv.reader. Untested:

    import csv

    def DictReader(f, fieldnames=None, *args, **kw):
        reader = csv.reader(f, *args, **kw)
        # Take the header from the first row unless the caller supplied one.
        if fieldnames is None:
            fieldnames = next(reader)
        for row in reader:
            # Skip blank lines; csv.reader returns them as empty lists.
            if row:
                if len(fieldnames) != len(row):
                    raise ValueError("expected %d fields, got %d"
                                     % (len(fieldnames), len(row)))
                yield dict(zip(fieldnames, row))

Peter
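A possible variation on the generator above, offered only as a sketch: it reports where the bad row was found, using the reader's line_num attribute, and it returns quietly on an empty file instead of letting StopIteration escape the generator (which becomes a RuntimeError on Python 3.7 and later). The name strict_dict_reader and the error message format are made up for illustration.

```python
import csv
import io

def strict_dict_reader(f, fieldnames=None, *args, **kw):
    """Yield one dict per row; raise ValueError on a field-count mismatch."""
    reader = csv.reader(f, *args, **kw)
    if fieldnames is None:
        # An empty input has no header row, so there is nothing to yield.
        fieldnames = next(reader, None)
        if fieldnames is None:
            return
    for row in reader:
        if not row:
            continue  # blank lines come back as empty lists
        if len(row) != len(fieldnames):
            # line_num counts physical lines read so far, so for a record
            # spanning several lines it points at the record's last line.
            raise ValueError("line %d: expected %d fields, got %d"
                             % (reader.line_num, len(fieldnames), len(row)))
        yield dict(zip(fieldnames, row))

# Usage with hypothetical data: the third line is one field short.
try:
    for record in strict_dict_reader(io.StringIO("a,b\n1,2\n3\n")):
        print(record)
except ValueError as exc:
    print("bad input:", exc)
```

Because this is a generator, the error only surfaces when the offending row is reached, so rows before it will already have been handed to the caller.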