From: Hector Santos on 21 Jan 2010 22:52

Stanza wrote:
> What is the easiest way of reading a line at a time through a textual
> CSV file, and then extracting the comma-separated elements from each line?

"Easiest" depends on what language and framework you are using and how
you hold, store, and process the data in memory.

Assuming the C language, the traditional implementation is to use
strtok(). Here is a simple C/C++ example:

// File: d:\wc5beta\testtok.cpp
// compile with: cl testtok.cpp

#include <stdio.h>
#include <string.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    //
    // get file name from command line
    //
    char *pfn = (argc > 1) ? argv[1] : NULL;
    if (!pfn) {
        printf("- syntax: testtok csv_filename\n");
        return 1;
    }

    //
    // open text file for reading
    //
    FILE *fv = fopen(pfn, "rt");
    if (!fv) {
        printf("ERROR %d Opening file\n", GetLastError());
        return 1;
    }

    //
    // read each line using fgets() and parse
    // the "," and cr/lf (\r\n) token characters.
    //
    char *tok = ",\r\n";
    int nLine = 0;
    char szLine[1024];
    memset(szLine, 0, sizeof(szLine));
    while (fgets(szLine, sizeof(szLine) - 1, fv)) {
        nLine++;
        printf("# %d | %s", nLine, szLine);
        //
        // parse the line by the tok characters
        //
        char *fld = strtok(szLine, tok);
        while (fld) {
            printf("- [%s]\n", fld);
            fld = strtok(NULL, tok);
        }
    }
    fclose(fv);
    return 0;
}

So, for example, a testdata.csv file containing these lines:

hector santos,email1(a)whatever.com
stanza,email2(a)whatever2.com
Joe Newcomer,email3(a)whatever3.com

compiling and running

   testtok testdata.csv

you get:

# 1 | hector santos,email1(a)whatever.com
- [hector santos]
- [email1(a)whatever.com]
# 2 | stanza,email2(a)whatever2.com
- [stanza]
- [email2(a)whatever2.com]
# 3 | Joe Newcomer,email3(a)whatever3.com
- [Joe Newcomer]
- [email3(a)whatever3.com]

This is very simplistic and doesn't address many design issues related
to parsing CSV-based files. The #1 design issue is the idea of
"escaping" the token character you are using to separate fields, in
this case the comma (','), because it is possible to have commas within
the field strings. Whether that happens depends on the type and
specification of the data. Maybe your program doesn't expect them, and
maybe the creator of the file will never add them, or escapes them. All
of this is implementation-specific.

For example, the data file can have a third field that is a
description-like field, or the name field itself can contain commas,
which introduces the need for escaping. That is, the data file can look
like this:

hector santos,email1(a)whatever.com,whatever,whatever,whatever
stanza,email2(a)whatever2.com,"whatever,whatever,whatever"
Joe Newcomer,email3(a)whatever3.com
Serlace, tom,email4(a)whatever4.com

So you can roll up your sleeves and use the simple C/C++ code above as
a basis to fine-tune the reading requirements for your CSV by adding
token-escaping concepts, or you can use the 3rd party libraries and
functions available to do these things - in which case your requirement
is that those 3rd party libraries and functions support escaping tokens.

Now, I purposely created the testdata.csv above so that it would
normally be considered bad formatting and doesn't promote or help good
CSV reading. A good practice is to surround the fields with double
quotes, and that MAY be enough for escaping embedded commas. For
example, the first line has a third field:

   whatever,whatever,whatever

but if you parse only by comma, that field comes out as just "whatever".
So what is normally done is to use lines like the 2nd line, where the
3rd field is quoted:

   "whatever,whatever,whatever"

The same issue appears on the 4th line, where the first "expected"
field contains:

   Serlace, tom

and this causes your fields to be shifted and off.

There are other concepts to deal with, namely how you read the data
into memory storage, if that is needed at all, or whether you process
each line and forget about it. So a robust CSV reader takes into
account common design requirements such as:

- escaping and embedded tokens
- reading into memory

It really isn't that hard. I would encourage you to learn from and gain
the rewarding experience of programming this yourself. It covers ideas
that will come up again and again in a programmer's life.

I will say that it sometimes pays to write just a byte-stream parser
instead of using strtok(), checking each possible token and delimiter,
double-quoted strings, etc. For example, instead of the strtok block of
lines, you can use something like:

char *p = szLine;
while (*p) {
    switch (*p) {
        case '\r':
            ... add logic for this ...
            break;
        case '\n':
            ... add logic for this ...
            break;
        case '\"':
            ... add logic for this ...
            break;
        case ',':
            ... add logic for this ...
            break;
    }
    p++;
}

It can range from simple to complex depending on the CSV reading
requirements.

Anyway, if you just wish to get a solution, you can use one of the many
3rd party libraries and classes that will do these things for you. If
you are using another language, the same ideas apply, but some
languages already have a good library - .NET, for example. It has an
excellent text I/O reader class in its collections library; see
OpenTextFieldParser(). It supports CSV reading and covers the two
important ideas above: escaping and storage.

-- 
HLS
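[To make the byte-stream switch idea above concrete, here is a minimal
sketch of a quote-aware field splitter. It assumes fields are
double-quoted only to escape embedded commas and that a doubled quote
("") inside a quoted field stands for a literal quote; SplitCsvLine()
and its signature are illustrative, not a standard API.]

#include <stdio.h>

// Split one line into fields in place, honoring double quotes.
// Returns the number of fields stored in flds.
int SplitCsvLine(char *line, char *flds[], int maxFlds)
{
    char *out = line;       // write position (compacts the line in place)
    char *p = line;         // read position
    int inQuotes = 0;
    int n = 0;

    if (n < maxFlds)
        flds[n++] = out;    // first field starts at the beginning
    while (*p) {
        switch (*p) {
        case '\"':
            if (inQuotes && p[1] == '\"') {
                *out++ = '\"';          // doubled quote -> literal quote
                p++;
            } else {
                inQuotes = !inQuotes;   // toggle quoted state, drop the quote
            }
            break;
        case ',':
            if (inQuotes) {
                *out++ = ',';           // comma inside quotes is data
            } else {
                *out++ = '\0';          // terminate field, start the next
                if (n < maxFlds)
                    flds[n++] = out;
            }
            break;
        case '\r':
        case '\n':
            if (inQuotes)
                *out++ = *p;            // keep embedded line breaks as data
            break;                      // otherwise strip the line ending
        default:
            *out++ = *p;
        }
        p++;
    }
    *out = '\0';
    return n;
}

int main(void)
{
    char line[] = "stanza,email2(a)whatever2.com,\"whatever,whatever,whatever\"";
    char *flds[16];
    int n = SplitCsvLine(line, flds, 16);
    for (int i = 0; i < n; i++)
        printf("- [%s]\n", flds[i]);
    return 0;
}

[Run against the quoted line from the sample data, this prints the
third field as one piece, [whatever,whatever,whatever], instead of
three separate fields.]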
From: Hector Santos on 22 Jan 2010 04:41

Hector Santos wrote:

> Goran,
>
> Many times, even with 3rd party libraries, you still have to learn
> how to use them. Many times, the attempt to generalize does not cover
> all bases. What if there is a bug? Many times with CSV, it might
> require an up-front field definition, or it's all viewed as strings.
> So the "easiest" does not always mean a 3rd party solution.
>
> Of course the devil is in the details, and it helps when the OP
> provides info, like what language and platform. If he said .NET, as I
> mentioned, the MS .NET collection library has a pretty darn good
> reader class, with the benefit of supporting OOPS as well, which
> allows you to create a data "class" that you pass to the line reader.
>
> Guess what? There is still a learning curve here to understand the
> interface, to use it right, as there would be with any library.
>
> So the easiest? For me, it all depends - a simple text reader and
> strtok() parser, with the escaping issues worked in, can be both very
> easy and super fast! with no dependency on 3rd party QA issues.
>
> For me, I have never come across a library or class that could handle
> everything, and if it did, it required a data definition interface of
> some sort - like the .NET collection class offers. If he is using
> .NET, then I recommend that class as the "easiest."

Case in point. Even with the excellent .NET text I/O class and a CSV
reader wrapper, it only offers a generalized method to parse fields.
This still requires proper setup and handling of the conditions that
might occur. It might require specific additional logic for situations
it does not cover, like when fields span multiple lines. For example:

   1,2,3,4,5,"hector , santos",6
   7,8
   9,10

That might be 1 data record with 10 fields. However, even if the
library allows you to do this, in my opinion, only an experienced
implementor knows what to look for and how to do it with the library to
properly address this.

Here is a VB.NET test program I wrote a few years back for a VERY long
thread regarding this topic and how to handle the situation for a fella
who had this need of fields spanning multiple rows.

------------- CUT HERE -------------------
'--------------------------------------------------------------
' File : D:\Local\wcsdk\wcserver\dotnet\Sandbox\readcsf4.vb
' About:
'--------------------------------------------------------------
Option Strict Off
Option Explicit On

Imports System
Imports System.Diagnostics
Imports System.Console
Imports System.Reflection
Imports System.Collections.Generic
Imports System.Text

Module module1

    '
    ' Dump an object
    '
    Sub dumpObject(ByVal o As Object)
        Dim t As Type = o.GetType()
        WriteLine("Type: {0} Fields: {1}", t, t.GetFields().Length)
        For Each s As FieldInfo In t.GetFields()
            Dim ft As Type = s.FieldType()
            WriteLine("- {0,-10} {1,-15} => {2}", s.Name, ft, s.GetValue(o))
        Next
    End Sub

    '
    ' Data definition "TRecord" class; for this example,
    ' 9 fields are expected per data record.
    '
    Public Class TRecord
        Public f1 As String
        Public f2 As String
        Public f3 As String
        Public f4 As String
        Public f5 As String
        Public f6 As String
        Public f7 As String
        Public f8 As String
        Public f9 As String

        Public Sub Convert(ByRef flds As List(Of String))
            Dim fi As FieldInfo() = Me.GetType().GetFields()
            Dim i As Integer = 0
            For Each s As FieldInfo In fi
                If (i < flds.Count) Then
                    If TypeOf (s.GetValue(Me)) Is Integer Then
                        s.SetValue(Me, CInt(flds.Item(i)))
                    Else
                        s.SetValue(Me, flds.Item(i))
                    End If
                End If
                i += 1
            Next
        End Sub

        Public Sub New()
        End Sub

        Public Sub New(ByVal flds As List(Of String))
            Convert(flds)
        End Sub

        Public Shared Narrowing Operator CType( _
                ByVal flds As List(Of String)) As TRecord
            Return New TRecord(flds)
        End Operator

        Public Shared Narrowing Operator CType( _
                ByVal flds As String()) As TRecord
            Dim sl As New List(Of String)
            For i As Integer = 1 To flds.Length
                sl.Add(flds(i - 1))
            Next
            Return New TRecord(sl)
        End Operator

    End Class

    Public Class ReaderCVS

        Public Shared data As New List(Of TRecord)

        '
        ' Read csv file with max_fields, optional eolfilter
        '
        Public Function ReadCSV( _
                ByVal fn As String, _
                Optional ByVal max_fields As Integer = 0, _
                Optional ByVal eolfilter As Boolean = True) As Boolean
            Try
                Dim tr As New TRecord
                max_fields = tr.GetType().GetFields().Length()
                data.Clear()
                Dim rdr As FileIO.TextFieldParser
                rdr = My.Computer.FileSystem.OpenTextFieldParser(fn)
                rdr.SetDelimiters(",")
                Dim flds As New List(Of String)
                While Not rdr.EndOfData()
                    Dim lines As String() = rdr.ReadFields()
                    For Each fld As String In lines
                        If eolfilter Then
                            fld = fld.Replace(vbCr, " ").Replace(vbLf, "")
                        End If
                        flds.Add(fld)
                        If flds.Count = max_fields Then
                            tr = flds
                            data.Add(tr)
                            flds = New List(Of String)
                        End If
                    Next
                End While
                If flds.Count > 0 Then
                    tr = flds
                    data.Add(tr)
                End If
                rdr.Close()
                Return True
            Catch ex As Exception
                WriteLine(ex.Message)
                WriteLine(ex.StackTrace)
                Return False
            End Try
        End Function

        Public Sub Dump()
            WriteLine("------- DUMP ")
            Debug.WriteLine("Dump")
            For i As Integer = 1 To data.Count
                dumpObject(data(i - 1))
            Next
        End Sub

    End Class

    Sub main(ByVal args() As String)
        Dim csv As New ReaderCVS
        csv.ReadCSV("test1.csf")
        csv.Dump()
    End Sub

End Module
------------- CUT HERE -------------------

Mind you, the above was written two years ago while I was still
learning the .NET library and participating in support questions to
teach myself common concepts in the .NET environment. Is the above
simple for most beginners? I wouldn't say so, but then again, I tend to
be a "tools" writer and try to generalize a tool; hence I spent the
time to implement a data class with an object-dump function to debug it
all. Not everyone needs this. Most of the time the field types are
known, so a reduction can be done - or better yet, you can take the
above, have it read the first line as the field definition line, and
generalize the TRecord class to make it all dynamic.

-- 
HLS
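[For readers following the thread in C/C++ rather than VB.NET, here is
a loose analogue of the record-assembly loop above: a minimal sketch
assuming a naive comma split with no quote handling (the quoted-comma
case would need the escaping logic shown earlier in the thread), with
an illustrative MAX_FIELDS constant standing in for the 9-field
TRecord.]

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Record;
const size_t MAX_FIELDS = 9;   // expected fields per logical record

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::cout << "- syntax: readcsv csv_filename\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cout << "ERROR opening " << argv[1] << "\n";
        return 1;
    }

    std::vector<Record> data;   // all assembled records
    Record rec;                 // record currently being assembled
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string fld;
        while (std::getline(ss, fld, ',')) {   // naive split; no escaping
            rec.push_back(fld);
            if (rec.size() == MAX_FIELDS) {    // record complete: store it
                data.push_back(rec);
                rec.clear();
            }
        }
    }
    if (!rec.empty())
        data.push_back(rec);                   // trailing partial record

    for (size_t i = 0; i < data.size(); i++) {
        std::cout << "Record " << i + 1 << ":\n";
        for (size_t j = 0; j < data[i].size(); j++)
            std::cout << "- [" << data[i][j] << "]\n";
    }
    return 0;
}

[The idea is the same as in the VB.NET version: fields accumulate
across physical rows, and a record is emitted each time the expected
field count is reached.]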
From: Hector Santos on 22 Jan 2010 04:55

Note, if anyone is trying this out: I added the C/C++ inline //
comments after I posted the code (C++ being my default language these
days). For VB.NET the comment character is a single quote. So if you
get a compiler error, change the inline comment characters.

-- 
HLS

Hector Santos wrote:
> Module module1
>
>     //
>     // Dump an object
>     //
From: Goran on 22 Jan 2010 05:05

> Guess what? There is still a learning curve here to understand the
> interface, to use it right, as there would be with any library.

Perhaps. But imagine:

class CCsvReader
{
    CCsvReader(fileName);
    size_t GetRecordCount() const;
    const CCsvRecord& operator[](size_t index) const;
};

class CCsvRecord
{
    CString operator[](size_t fieldIndex);
    CString operator[](LPCTSTR FieldName);
};

Effort in learning __that__ certainly beats the effort of rolling your
own. Of course, that's provided it fits your use-case and that there is
a similar library. But that's done by Googling, newsgrouping and
reading.

> So the easiest? For me, it all depends - a simple text reader and
> strtok() parser, with the escaping issues worked in, can be both very
> easy and super fast! with no dependency on 3rd party QA issues.

Are you suggesting that a long-existing library, probably seen by many
internet eyes, will have quality issues and your own code just won't?
This, frankly, smacks of hubris.

Goran.
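[To make the comparison concrete, here is a minimal, self-contained
sketch of what implementing such an interface could look like, with
std::string standing in for CString so it compiles without MFC; the
class bodies and the naive comma split are purely illustrative, not an
existing library.]

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: one parsed line; naive split, no quote escaping.
class CCsvRecord
{
    std::vector<std::string> m_fields;
public:
    explicit CCsvRecord(const std::string& line)
    {
        std::istringstream ss(line);
        std::string fld;
        while (std::getline(ss, fld, ','))
            m_fields.push_back(fld);
    }
    size_t GetFieldCount() const { return m_fields.size(); }
    const std::string& operator[](size_t i) const { return m_fields[i]; }
};

// Illustrative only: loads the whole file into memory up front.
class CCsvReader
{
    std::vector<CCsvRecord> m_records;
public:
    explicit CCsvReader(const char* fileName)
    {
        std::ifstream in(fileName);
        std::string line;
        while (std::getline(in, line))
            m_records.push_back(CCsvRecord(line));
    }
    size_t GetRecordCount() const { return m_records.size(); }
    const CCsvRecord& operator[](size_t i) const { return m_records[i]; }
};

int main()
{
    CCsvReader reader("testdata.csv");
    for (size_t i = 0; i < reader.GetRecordCount(); ++i) {
        const CCsvRecord& rec = reader[i];
        if (rec.GetFieldCount() > 0)
            std::cout << "Record " << i + 1
                      << ", first field: [" << rec[0] << "]\n";
    }
    return 0;
}

[From the caller's side, the entire learning curve is the constructor
and the two operator[] calls - which is the point being made about
learning an interface versus rolling your own.]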
From: Hector Santos on 22 Jan 2010 06:04
Goran wrote:

> Effort in learning __that__ certainly beats the effort of rolling
> your own. Of course, that's provided it fits your use-case and that
> there is a similar library. But that's done by Googling, newsgrouping
> and reading.

Still a learning curve for most. You know the moral of the old saying,
"Teach a man to fish...."

> Are you suggesting that a long-existing library, probably seen by
> many internet eyes, will have quality issues and your own code just
> won't? This, frankly, smacks of hubris.

Since I didn't say or imply it, the rhetorical suggestion is what
walks, talks, and smells of hubris. :)

-- 
HLS