From: analyst41 on 11 Oct 2009 10:20
On Oct 11, 9:20 am, dpb <n...(a)non.net> wrote:
> analys...(a)hotmail.com wrote:
> > ...
>
> > I am asking this question for a practical reason. I extract a csv
> > file from a database and apparently their native character encoding
> > isn't 7-bit ASCII. When I read this extracted file with fortran two
> > things happen
>
> > (1) (trivial) some characters are misinterpreted - even excel, notepad
> > etc, do the same thing.
>
> > (2) (non-trivial) spurious "End of record" markers are seen by
> > Fortran, Excel etc. (If I eliminate the character fields from the
> > database extract, this doesn't happen) and the file as read in sees
> > more records than there are rows in the database.
>
> > (3) I posed this problem earlier to the ng. and although I received
> > some suggestions, I still haven't solved the problem.
>
> This doesn't seem to be a Fortran problem, really, but one in the file
> generation from the database.
>
> What hardware, OS, database, etc., ... might lead to somebody having
> input to resolving the problem.
>
> --

The database is Microsoft Windows SQL. The download engine is the
Microsoft SQL client. The OS is Windows XP running Lahey Fortran.

But you are right - it is not necessarily a Fortran problem since Excel,
Notepad etc. have the same problem - I thought I might be able to
reconcile rows and records using "byte by byte" processing in Fortran -
but with no luck so far.

The suggestions I received helped me to resolve delimiters embedded
within delimiters (thanks to everybody who contributed - the SOBs who
built the database even use '|' as a datum instead of a delimiter - and
of course commas, spaces and periods within '"' happen all the time, as
do HTML-type delimiters "</" and "/>") - but I don't know how to remove
spurious EOR markers (even DOS's 'Type' command sees them.)

I can provide any other info. needed.
From: Dan Nagle on 11 Oct 2009 10:33
Hello,

On 2009-10-11 10:20:34 -0400, analyst41(a)hotmail.com said:

> But you are right - it is
> not necessarily a Fortran problem since Excel, Notepad etc. have the
> same problem - I thought I might be able to reconcile rows and
> records using "byte by byte" processing using Fortran - but with no
> luck so far. The suggestions I received helped me to resolve
> delimiters embedded within delimiters (thanks to everybody who
> contributed - the SOBs who built the database even use '|' as a datum
> instead of a delimiter - and of course commas, spaces and periods within
> '"' happen all the time, as do HTML-type delimiters "</" and "/>") -
> but I don't know how to remove spurious EOR markers (even DOS's 'Type'
> command sees them.)

If you can find a compiler that supports the f08 i/o encoding=
specifier, you might be able to tinker with the character set
(just as another knob to twist).

--
Cheers!

Dan Nagle
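Before twisting the encoding knob, it may help to check what the extract actually contains. The following is only an illustrative sketch in Python, not something posted in the thread; `sniff_encoding` is a hypothetical helper name. It reports a byte-order mark if one is present, the number of bytes outside 7-bit ASCII, and the number of NUL bytes. Many NUL bytes usually point to a 16-bit encoding such as UTF-16, which a byte-oriented text reader can easily misread.

```python
import codecs

# Byte-order marks checked here; UTF-32 is deliberately left out of this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes) -> dict:
    """Return rough hints about the encoding of raw file bytes."""
    # First BOM that matches the start of the data, or None if there is none.
    bom = next((name for mark, name in BOMS if data.startswith(mark)), None)
    return {
        "bom": bom,
        # Bytes with the high bit set cannot be 7-bit ASCII.
        "non_ascii": sum(1 for b in data if b >= 0x80),
        # NUL bytes are rare in 8-bit text but common in UTF-16.
        "nul": data.count(b"\x00"),
    }
```

Typical use would be to read the whole extract in binary mode, for example `sniff_encoding(open("extract.csv", "rb").read())`, where `extract.csv` is a placeholder file name.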
From: dpb on 11 Oct 2009 10:35
analyst41(a)hotmail.com wrote:
....
>>> (2) (non-trivial) spurious "End of record" markers are seen by
>>> Fortran, Excel etc. (If I eliminate the character fields from the
>>> database extract, this doesn't happen) and the file as read in sees
>>> more records than there are rows in the database.
....
> but I don't know how to remove spurious EOR markers (even DOS's 'Type'
> command sees them.)
....

That implies these are embedded into the character fields' data then?

If so, I would see only two ways to attack--

1) Get the originators of the database to fix the problem (not likely, I
gather)

2) Clean up after their mess (which is obviously what you're trying to do)

I don't think 2) is possible unequivocally unless there is some way to
tell what a record and field length should be a priori and know there
are a fixed number of fields per record. Or, iff the field separators
are reliable, then you should be able to count fields.

If that is the case, then it would seem that the only way would be to open
the file first as "binary" (sorry for the vernacular usage, Richard :) )
stream and count field delimiters and simply toss out EOR characters
that don't belong and rewrite the file before processing it as csv.

That seems to me to be doable in theory; whether it would work in
practice I don't know.

--
From: dpb on 11 Oct 2009 10:39
dpb wrote:
....
> I don't think 2) is possible unequivocally unless there is some way to
> tell what a record and field length should be a priori and know there
> are a fixed number of fields per record. Or, iff the field separators
> are reliable, then you should be able to count fields.
>
> If that is the case, then it would seem that the only way would be to open
> the file first as "binary" (sorry for the vernacular usage, Richard :) )
> stream and count field delimiters and simply toss out EOR characters
> that don't belong and rewrite the file before processing it as csv.

Of course, it would require all your previous logic on parsing character
fields correctly as I presume there will be embedded record delimiters
in them as well so it isn't simply counting their occurrence.

Sounds like a pita, indeed... :(

--
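The count-the-delimiters idea, including the caveat about delimiters inside character fields, can be sketched roughly as follows. This is an illustration in Python rather than Fortran, and it rests on assumptions the thread does not confirm: every row has a known, fixed number of fields (`nfields`), fields are separated by a comma, and text fields are wrapped in double quotes with `""` used for an embedded quote. The name `repair_records` is hypothetical. The same byte-by-byte loop could be written in Fortran with stream access on a compiler that supports it.

```python
CR, LF = 13, 10  # byte values of carriage return and line feed


def repair_records(data: bytes, nfields: int,
                   delim: bytes = b",", quote: bytes = b'"') -> bytes:
    """Replace line breaks that cannot be true record ends with a space."""
    out = bytearray()
    in_quotes = False   # True while scanning inside a quoted field
    seps = 0            # delimiters seen outside quotes in the current record
    d, q = delim[0], quote[0]
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b in (CR, LF):
            # Treat a CR immediately followed by LF as one line break.
            step = 2 if (b == CR and i + 1 < n and data[i + 1] == LF) else 1
            if not in_quotes and seps == nfields - 1:
                out += b"\n"   # all fields seen: accept as a genuine record end
                seps = 0
            else:
                out += b" "    # too early or inside quotes: spurious break
            i += step
            continue
        if b == q:
            # A doubled quote toggles twice, so the state ends up unchanged.
            in_quotes = not in_quotes
        elif b == d and not in_quotes:
            seps += 1
        out.append(b)
        i += 1
    return bytes(out)
```

One known limitation: a line break inside an unquoted last field is indistinguishable from a true record end under these assumptions, so it is kept as a record boundary. Unbalanced quotes in the data would also throw the scan off.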
From: analyst41 on 11 Oct 2009 12:01
On Oct 11, 10:35 am, dpb <n...(a)non.net> wrote:
> analys...(a)hotmail.com wrote:
>
> ...
> >>> (2) (non-trivial) spurious "End of record" markers are seen by
> >>> Fortran, Excel etc. (If I eliminate the character fields from the
> >>> database extract, this doesn't happen) and the file as read in sees
> >>> more records than there are rows in the database.
> ...
> > but I don't know how to remove spurious EOR markers (even DOS's 'Type'
> > command sees them.)
>
> ...
>
> That implies these are embedded into the character fields' data then?
>
> If so, I would see only two ways to attack--
>
> 1) Get the originators of the database to fix the problem (not likely, I
> gather)
>
> 2) Clean up after their mess (which is obviously what you're trying to do)
>
> I don't think 2) is possible unequivocally unless there is some way to
> tell what a record and field length should be a priori and know there
> are a fixed number of fields per record. Or, iff the field separators
> are reliable, then you should be able to count fields.

The problems are caused by large varchar columns - so I don't think
the notion of record length makes sense here.

>
> If that is the case, then it would seem that the only way would be to open
> the file first as "binary" (sorry for the vernacular usage, Richard :) )
> stream and count field delimiters and simply toss out EOR characters
> that don't belong and rewrite the file before processing it as csv.

That sounds interesting: we know that "true" EORs can only occur after
the last column in the database. So if one sees them "in between", one
can throw them out.

Is there a Windows/DOS tool that will let me see the EOR characters?

I haven't used binary files in ages - any pointers as to how I can do
that would be appreciated. I suppose I can look for "EOR" (real or
spurious) using the IACHAR value of the EOR marker (it is control M on
unix but I don't exactly know what the csv downloader in the database
client puts at the end of records.)
>
> That seems to me to be doable in theory; whether it would work in
> practice I don't know.
>
> --
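On the question of seeing the EOR characters: one way is to read the file as raw bytes and tally the line-break bytes directly. For reference, carriage return (Ctrl-M) has code 13 and line feed (Ctrl-J) has code 10, which are the values IACHAR would return in Fortran; Unix text files normally end lines with LF alone, while DOS/Windows text files normally use CR followed by LF. The sketch below is illustrative Python, with hypothetical helper names `count_line_endings` and `show_control_bytes`; what the database client actually writes is not known from the thread.

```python
CR, LF = 13, 10  # carriage return (Ctrl-M) and line feed (Ctrl-J)


def count_line_endings(data: bytes) -> dict:
    """Count CR LF pairs, lone CRs and lone LFs in raw bytes."""
    counts = {"CRLF": 0, "CR": 0, "LF": 0}
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == CR and i + 1 < n and data[i + 1] == LF:
            counts["CRLF"] += 1
            i += 2          # skip both bytes of the pair
            continue
        if b == CR:
            counts["CR"] += 1
        elif b == LF:
            counts["LF"] += 1
        i += 1
    return counts


def show_control_bytes(data: bytes) -> str:
    """Render printable ASCII as-is and every other byte as <HH> in hex."""
    return "".join(chr(b) if 32 <= b < 127 else f"<{b:02X}>" for b in data)
```

For example, `count_line_endings(open("extract.csv", "rb").read())` would give the tallies for a file (the file name is a placeholder). If the counts show mostly CR LF pairs plus a smaller number of lone LFs or CRs, the lone ones are plausible candidates for the spurious markers, though that is a guess that would need checking against the row count in the database.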