From: Hector Santos on 21 Jan 2010 22:52

Stanza wrote:
> What is the easiest way of reading a line at a time through a textual
> CSV file, and then extracting the comma-separated elements from each line?

"Easiest" depends on what language and framework you are using and how
you hold, store, and process the data in memory.

Assuming the C language, the traditional implementation is to use
strtok(). Here is a simple C/C++ example:

// File: d:\wc5beta\testtok.cpp
// compile with: cl testtok.cpp

#include <stdio.h>
#include <string.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    //
    // get file name from command line
    //
    char *pfn = (argc > 1) ? argv[1] : NULL;
    if (!pfn) {
        printf("- syntax: testtok csv_filename\n");
        return 1;
    }

    //
    // open text file for reading
    //
    FILE *fv = fopen(pfn, "rt");
    if (!fv) {
        printf("ERROR %d Opening file\n", GetLastError());
        return 1;
    }

    //
    // read each line using fgets() and parse
    // the "," and cr/lf (\r\n) token characters.
    //
    char *tok = ",\r\n";
    int nLine = 0;
    char szLine[1024];
    memset(szLine, 0, sizeof(szLine));
    while (fgets(szLine, sizeof(szLine) - 1, fv)) {
        nLine++;
        printf("# %d | %s", nLine, szLine);
        //
        // parse the line by the tok characters
        //
        char *fld = strtok(szLine, tok);
        while (fld) {
            printf("- [%s]\n", fld);
            fld = strtok(NULL, tok);
        }
    }
    fclose(fv);
    return 0;
}

So, for example, a testdata.csv file containing these lines:

hector santos,email1(a)whatever.com
stanza,email2(a)whatever2.com
Joe Newcomer,email3(a)whatever3.com

compiling and running

   testtok testdata.csv

you get:

# 1 | hector santos,email1(a)whatever.com
- [hector santos]
- [email1(a)whatever.com]
# 2 | stanza,email2(a)whatever2.com
- [stanza]
- [email2(a)whatever2.com]
# 3 | Joe Newcomer,email3(a)whatever3.com
- [Joe Newcomer]
- [email3(a)whatever3.com]

This is very simplistic and doesn't address many design issues related
to parsing CSV-based files. The #1 design issue is the idea of
"escaping" the token character you are using to separate fields, in
this case the comma (','), because it is possible to have commas within
the field strings. Whether that happens depends on the type and
specification of the data. Maybe your program doesn't expect them, and
maybe the creator of the file will never add them, or escapes them. All
of this is implementation-specific.

For example, the data file can have a third field that is a
description-like field, or the name field itself can contain commas,
which introduces the need for escaping. That is, the data file can look
like this:

hector santos,email1(a)whatever.com,whatever,whatever,whatever
stanza,email2(a)whatever2.com,"whatever,whatever,whatever"
Joe Newcomer,email3(a)whatever3.com
Serlace, tom,email4(a)whatever4.com

So you can roll up your sleeves and use the simple C/C++ code above as
a basis to fine-tune the reading requirements for your CSV by adding
token-escaping concepts, or you can use the 3rd party libraries and
functions available to do these things - in which case your requirement
is that those 3rd party libraries and functions support escaping tokens.

Now, I purposely created the testdata.csv above so that it would
normally be considered bad formatting and doesn't promote or help good
CSV reading. A good practice is to surround the fields with double
quotes, and that MAY be enough for escaping embedded commas. For
example, the first line has a third field:

   whatever,whatever,whatever

but if you parse only by comma, that field comes out as just "whatever".
So what is normally done is to use lines like the 2nd line, where the
3rd field is quoted:

   "whatever,whatever,whatever"

The same issue appears on the 4th line, where the first "expected"
field contains:

   Serlace, tom

and this causes your fields to be shifted and off.

There are other concepts to deal with, namely how you read the data
into memory storage, if that is needed at all, or whether you process
each line and forget about it. So a robust CSV reader takes into
account common design requirements such as:

- escaping and embedded tokens
- reading into memory

It really isn't that hard. I would encourage you to learn from and gain
the rewarding experience of programming this yourself. It covers ideas
that will come up again and again in a programmer's life.

I will say that it sometimes pays to write just a byte-stream parser
instead of using strtok(), checking each possible token and delimiter,
double-quoted strings, etc. For example, instead of the strtok block of
lines, you can use something like:

char *p = szLine;
while (*p) {
    switch (*p) {
        case '\r':
            ... add logic for this ...
            break;
        case '\n':
            ... add logic for this ...
            break;
        case '\"':
            ... add logic for this ...
            break;
        case ',':
            ... add logic for this ...
            break;
    }
    p++;
}

It can range from simple to complex depending on the CSV reading
requirements.

Anyway, if you just wish to get a solution, you can use one of the many
3rd party libraries and classes that will do these things for you. If
you are using another language, the same ideas apply, but some
languages already have a good library - .NET, for example. It has an
excellent text I/O reader class in its collections library; see
OpenTextFieldParser(). It supports CSV reading and covers the two
important ideas above: escaping and storage.

-- 
HLS
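[To make the byte-stream switch idea above concrete, here is a minimal
sketch of a quote-aware field splitter. It assumes fields are
double-quoted only to escape embedded commas and that a doubled quote
("") inside a quoted field stands for a literal quote; SplitCsvLine()
and its signature are illustrative, not a standard API.]

#include <stdio.h>

// Split one line into fields in place, honoring double quotes.
// Returns the number of fields stored in flds.
int SplitCsvLine(char *line, char *flds[], int maxFlds)
{
    char *out = line;       // write position (compacts the line in place)
    char *p = line;         // read position
    int inQuotes = 0;
    int n = 0;

    if (n < maxFlds)
        flds[n++] = out;    // first field starts at the beginning
    while (*p) {
        switch (*p) {
        case '\"':
            if (inQuotes && p[1] == '\"') {
                *out++ = '\"';          // doubled quote -> literal quote
                p++;
            } else {
                inQuotes = !inQuotes;   // toggle quoted state, drop the quote
            }
            break;
        case ',':
            if (inQuotes) {
                *out++ = ',';           // comma inside quotes is data
            } else {
                *out++ = '\0';          // terminate field, start the next
                if (n < maxFlds)
                    flds[n++] = out;
            }
            break;
        case '\r':
        case '\n':
            if (inQuotes)
                *out++ = *p;            // keep embedded line breaks as data
            break;                      // otherwise strip the line ending
        default:
            *out++ = *p;
        }
        p++;
    }
    *out = '\0';
    return n;
}

int main(void)
{
    char line[] = "stanza,email2(a)whatever2.com,\"whatever,whatever,whatever\"";
    char *flds[16];
    int n = SplitCsvLine(line, flds, 16);
    for (int i = 0; i < n; i++)
        printf("- [%s]\n", flds[i]);
    return 0;
}

[Run against the quoted line from the sample data, this prints the
third field as one piece, [whatever,whatever,whatever], instead of
three separate fields.]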
From: Hector Santos on 22 Jan 2010 04:41

Hector Santos wrote:

> Goran,
>
> Many times, even with 3rd party libraries, you still have to learn
> how to use them. Many times, the attempt to generalize does not cover
> all bases. What if there is a bug? Many times with CSV, it might
> require an up-front field definition, or it's all viewed as strings.
> So the "easiest" does not always mean a 3rd party solution.
>
> Of course the devil is in the details, and it helps when the OP
> provides info, like what language and platform. If he said .NET, as I
> mentioned, the MS .NET collection library has a pretty darn good
> reader class, with the benefit of supporting OOPS as well, which
> allows you to create a data "class" that you pass to the line reader.
>
> Guess what? There is still a learning curve here to understand the
> interface, to use it right, as there would be with any library.
>
> So the easiest? For me, it all depends - a simple text reader and
> strtok() parser, with the escaping issues worked in, can be both very
> easy and super fast! with no dependency on 3rd party QA issues.
>
> For me, I have never come across a library or class that could handle
> everything, and if it did, it required a data definition interface of
> some sort - like the .NET collection class offers. If he is using
> .NET, then I recommend that class as the "easiest."

Case in point. Even with the excellent .NET text I/O class and a CSV
reader wrapper, it only offers a generalized method to parse fields.
This still requires proper setup and handling of the conditions that
might occur. It might require specific additional logic for situations
it does not cover, like when fields span multiple lines. For example:

   1,2,3,4,5,"hector , santos",6
   7,8
   9,10

That might be 1 data record with 10 fields. However, even if the
library allows you to do this, in my opinion, only an experienced
implementor knows what to look for and how to do it with the library to
properly address this.

Here is a VB.NET test program I wrote a few years back for a VERY long
thread regarding this topic and how to handle the situation for a fella
who had this need of fields spanning multiple rows.

------------- CUT HERE -------------------
'--------------------------------------------------------------
' File : D:\Local\wcsdk\wcserver\dotnet\Sandbox\readcsf4.vb
' About:
'--------------------------------------------------------------
Option Strict Off
Option Explicit On

Imports System
Imports System.Diagnostics
Imports System.Console
Imports System.Reflection
Imports System.Collections.Generic
Imports System.Text

Module module1

    '
    ' Dump an object
    '
    Sub dumpObject(ByVal o As Object)
        Dim t As Type = o.GetType()
        WriteLine("Type: {0} Fields: {1}", t, t.GetFields().Length)
        For Each s As FieldInfo In t.GetFields()
            Dim ft As Type = s.FieldType()
            WriteLine("- {0,-10} {1,-15} => {2}", s.Name, ft, s.GetValue(o))
        Next
    End Sub

    '
    ' Data definition "TRecord" class; for this example,
    ' 9 fields are expected per data record.
    '
    Public Class TRecord
        Public f1 As String
        Public f2 As String
        Public f3 As String
        Public f4 As String
        Public f5 As String
        Public f6 As String
        Public f7 As String
        Public f8 As String
        Public f9 As String

        Public Sub Convert(ByRef flds As List(Of String))
            Dim fi As FieldInfo() = Me.GetType().GetFields()
            Dim i As Integer = 0
            For Each s As FieldInfo In fi
                If (i < flds.Count) Then
                    If TypeOf (s.GetValue(Me)) Is Integer Then
                        s.SetValue(Me, CInt(flds.Item(i)))
                    Else
                        s.SetValue(Me, flds.Item(i))
                    End If
                End If
                i += 1
            Next
        End Sub

        Public Sub New()
        End Sub

        Public Sub New(ByVal flds As List(Of String))
            Convert(flds)
        End Sub

        Public Shared Narrowing Operator CType( _
                ByVal flds As List(Of String)) As TRecord
            Return New TRecord(flds)
        End Operator

        Public Shared Narrowing Operator CType( _
                ByVal flds As String()) As TRecord
            Dim sl As New List(Of String)
            For i As Integer = 1 To flds.Length
                sl.Add(flds(i - 1))
            Next
            Return New TRecord(sl)
        End Operator

    End Class

    Public Class ReaderCVS

        Public Shared data As New List(Of TRecord)

        '
        ' Read csv file with max_fields, optional eolfilter
        '
        Public Function ReadCSV( _
                ByVal fn As String, _
                Optional ByVal max_fields As Integer = 0, _
                Optional ByVal eolfilter As Boolean = True) As Boolean
            Try
                Dim tr As New TRecord
                max_fields = tr.GetType().GetFields().Length()
                data.Clear()
                Dim rdr As FileIO.TextFieldParser
                rdr = My.Computer.FileSystem.OpenTextFieldParser(fn)
                rdr.SetDelimiters(",")
                Dim flds As New List(Of String)
                While Not rdr.EndOfData()
                    Dim lines As String() = rdr.ReadFields()
                    For Each fld As String In lines
                        If eolfilter Then
                            fld = fld.Replace(vbCr, " ").Replace(vbLf, "")
                        End If
                        flds.Add(fld)
                        If flds.Count = max_fields Then
                            tr = flds
                            data.Add(tr)
                            flds = New List(Of String)
                        End If
                    Next
                End While
                If flds.Count > 0 Then
                    tr = flds
                    data.Add(tr)
                End If
                rdr.Close()
                Return True
            Catch ex As Exception
                WriteLine(ex.Message)
                WriteLine(ex.StackTrace)
                Return False
            End Try
        End Function

        Public Sub Dump()
            WriteLine("------- DUMP ")
            Debug.WriteLine("Dump")
            For i As Integer = 1 To data.Count
                dumpObject(data(i - 1))
            Next
        End Sub

    End Class

    Sub main(ByVal args() As String)
        Dim csv As New ReaderCVS
        csv.ReadCSV("test1.csf")
        csv.Dump()
    End Sub

End Module
------------- CUT HERE -------------------

Mind you, the above was written two years ago while I was still
learning the .NET library and participating in support questions to
teach myself common concepts in the .NET environment. Is the above
simple for most beginners? I wouldn't say so, but then again, I tend to
be a "tools" writer and try to generalize a tool; hence I spent the
time to implement a data class with an object-dump function to debug it
all. Not everyone needs this. Most of the time the field types are
known, so a reduction can be done - or better yet, you can take the
above, have it read the first line as the field definition line, and
generalize the TRecord class to make it all dynamic.

-- 
HLS
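[For readers following the thread in C/C++ rather than VB.NET, here is
a loose analogue of the record-assembly loop above: a minimal sketch
assuming a naive comma split with no quote handling (the quoted-comma
case would need the escaping logic shown earlier in the thread), with
an illustrative MAX_FIELDS constant standing in for the 9-field
TRecord.]

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Record;
const size_t MAX_FIELDS = 9;   // expected fields per logical record

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::cout << "- syntax: readcsv csv_filename\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cout << "ERROR opening " << argv[1] << "\n";
        return 1;
    }

    std::vector<Record> data;   // all assembled records
    Record rec;                 // record currently being assembled
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string fld;
        while (std::getline(ss, fld, ',')) {   // naive split; no escaping
            rec.push_back(fld);
            if (rec.size() == MAX_FIELDS) {    // record complete: store it
                data.push_back(rec);
                rec.clear();
            }
        }
    }
    if (!rec.empty())
        data.push_back(rec);                   // trailing partial record

    for (size_t i = 0; i < data.size(); i++) {
        std::cout << "Record " << i + 1 << ":\n";
        for (size_t j = 0; j < data[i].size(); j++)
            std::cout << "- [" << data[i][j] << "]\n";
    }
    return 0;
}

[The idea is the same as in the VB.NET version: fields accumulate
across physical rows, and a record is emitted each time the expected
field count is reached.]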
From: Hector Santos on 22 Jan 2010 04:55

Note, if anyone is trying this out: I added the C/C++ inline //
comments after I posted the code (C++ being my default language these
days). For VB.NET the comment character is a single quote. So if you
get a compiler error, change the inline comment characters.

-- 
HLS

Hector Santos wrote:
> Module module1
>
>     //
>     // Dump an object
>     //
From: Goran on 22 Jan 2010 05:05

> Guess what? There is still a learning curve here to understand the
> interface, to use it right, as there would be with any library.

Perhaps. But imagine:

class CCsvReader
{
    CCsvReader(fileName);
    size_t GetRecordCount() const;
    const CCsvRecord& operator[](size_t index) const;
};

class CCsvRecord
{
    CString operator[](size_t fieldIndex);
    CString operator[](LPCTSTR FieldName);
};

Effort in learning __that__ certainly beats the effort of rolling your
own. Of course, that's provided it fits your use-case and that there is
a similar library. But that's done by Googling, newsgrouping and
reading.

> So the easiest? For me, it all depends - a simple text reader and
> strtok() parser, with the escaping issues worked in, can be both very
> easy and super fast! with no dependency on 3rd party QA issues.

Are you suggesting that a long-existing library, probably seen by many
internet eyes, will have quality issues and your own code just won't?
This, frankly, smacks of hubris.

Goran.
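[To make the comparison concrete, here is a minimal, self-contained
sketch of what implementing such an interface could look like, with
std::string standing in for CString so it compiles without MFC; the
class bodies and the naive comma split are purely illustrative, not an
existing library.]

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: one parsed line; naive split, no quote escaping.
class CCsvRecord
{
    std::vector<std::string> m_fields;
public:
    explicit CCsvRecord(const std::string& line)
    {
        std::istringstream ss(line);
        std::string fld;
        while (std::getline(ss, fld, ','))
            m_fields.push_back(fld);
    }
    size_t GetFieldCount() const { return m_fields.size(); }
    const std::string& operator[](size_t i) const { return m_fields[i]; }
};

// Illustrative only: loads the whole file into memory up front.
class CCsvReader
{
    std::vector<CCsvRecord> m_records;
public:
    explicit CCsvReader(const char* fileName)
    {
        std::ifstream in(fileName);
        std::string line;
        while (std::getline(in, line))
            m_records.push_back(CCsvRecord(line));
    }
    size_t GetRecordCount() const { return m_records.size(); }
    const CCsvRecord& operator[](size_t i) const { return m_records[i]; }
};

int main()
{
    CCsvReader reader("testdata.csv");
    for (size_t i = 0; i < reader.GetRecordCount(); ++i) {
        const CCsvRecord& rec = reader[i];
        if (rec.GetFieldCount() > 0)
            std::cout << "Record " << i + 1
                      << ", first field: [" << rec[0] << "]\n";
    }
    return 0;
}

[From the caller's side, the entire learning curve is the constructor
and the two operator[] calls - which is the point being made about
learning an interface versus rolling your own.]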
From: Hector Santos on 22 Jan 2010 06:04
Goran wrote:

> Effort in learning __that__ certainly beats the effort of rolling
> your own. Of course, that's provided it fits your use-case and that
> there is a similar library. But that's done by Googling, newsgrouping
> and reading.

Still a learning curve for most. You know the moral of the old saying,
"Teach a man to fish...."

> Are you suggesting that a long-existing library, probably seen by
> many internet eyes, will have quality issues and your own code just
> won't? This, frankly, smacks of hubris.

Since I didn't say or imply it, the rhetorical suggestion is what
walks, talks, and smells of hubris. :)

-- 
HLS