From: analyst41 on 27 Jan 2010 18:51

I posted on this topic before, and this is my latest take on it:

(1) In my case the messy files are CSV extracts from a database (whose character encoding is Unicode; I don't know if that has anything to do with the problem).

(2) I discovered that Fortran sees spurious EOR markers within character fields, and I couldn't see any rhyme or reason why.

(3) But since I control the input, I inserted row numbers at the beginning and end of each row extracted from the database, and I added 2000000000 to the row number to make sure it's unlikely that this data would show up naturally.

(4) I then read each record and make sure that it has at least 18 characters (if not, it is simply concatenated to cum_buffer; see below).

I use the statement (adapted from Cooper Redwine's book)

    read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, size = num_chars) buffer

You must get EOR, EOF, or an error on each read; otherwise the buffer is too small and the program has to be halted.

I then check whether the record number showing up at the end is the same as the one on the left. If yes, you have a complete record. If not, you have a spurious EOR and simply concatenate the buffer to another buffer called cum_buffer.

When cum_buffer looks like

    2000000127stuff2000000127

you have a facsimile of row 127 from the database.

You might still have to struggle with separating 'stuff' into fields, but that's a purely programming task having nothing to do with the file system, operating system, or character encoding schemes.

I hope others find this useful, and suggestions for improvements would be good.
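The reassembly described in steps (3) and (4) might be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual program: the file name `demo_rows.csv`, the sizes `buf_len` and `tag_len`, and the demonstration data are all made up here, and the code relies on Fortran 2003/2008 features (`is_iostat_eor`, deferred-length character variables, `newunit=`) that a Fortran 95 compiler may not provide.

```fortran
program rejoin_rows
  implicit none
  integer, parameter :: buf_len = 4096   ! assumed upper bound on one physical line
  integer, parameter :: tag_len = 10     ! 2000000000 + row number has 10 digits
  character(len=buf_len) :: buffer
  character(len=:), allocatable :: cum_buffer
  integer :: nn, read_stat, num_chars, n

  ! Write a small made-up file: row 127 is broken over three physical lines.
  open (newunit=nn, file='demo_rows.csv', status='replace')
  write (nn, '(A)') '2000000126,plain,row2000000126'
  write (nn, '(A)') '2000000127,some <br />'
  write (nn, '(A)') 'text </B>'
  write (nn, '(A)') ',more2000000127'
  close (nn)

  open (newunit=nn, file='demo_rows.csv', status='old')
  cum_buffer = ''
  do
     num_chars = 0
     read (unit=nn, fmt='(A)', advance='no', iostat=read_stat, size=num_chars) buffer
     if (is_iostat_end(read_stat)) exit
     if (.not. is_iostat_eor(read_stat)) then
        ! iostat == 0: buffer filled before the end of the record;
        ! iostat  > 0: an I/O error. Either way, halt as the post says.
        stop 'buffer too small or read error'
     end if

     ! Append this physical line to whatever has been collected so far.
     cum_buffer = cum_buffer // buffer(1:num_chars)
     n = len(cum_buffer)

     ! A row is complete once the same marker sits at both ends.
     if (n >= 2*tag_len) then
        if (cum_buffer(1:tag_len) == cum_buffer(n-tag_len+1:n)) then
           print '(A)', cum_buffer
           cum_buffer = ''
        end if
     end if
  end do
  close (nn, status='delete')
end program rejoin_rows
```

With the demonstration data, the three fragments of row 127 are intended to come out as a single line. Note that the sketch only compares the first and last `tag_len` characters; it does not check that they are digits, so a stricter version might also verify that the marker is numeric and at least 2000000000.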
From: Arjen Markus on 28 Jan 2010 03:15

On 28 jan, 00:51, "analys...(a)hotmail.com" <analys...(a)hotmail.com> wrote:
> I posted on this topic before and this is my latest take on it:
> [...]
> I hope others find this useful and suggestions for improvements would
> be good.

I do not remember your previous postings, but I am curious about these end-of-records. Can you send me an example? (I want to look at CSV files more closely, as I recently was confronted with some of their nastier aspects in the context of my Flibs project - http://flibs.sf.net).

Regards,

Arjen
From: robert.corbett on 28 Jan 2010 03:28

On Jan 27, 3:51 pm, "analys...(a)hotmail.com" <analys...(a)hotmail.com> wrote:
> I posted on this topic before and this is my latest take on it:
>
> (1) In my case the messy files are csv extracts from a database (whose
> character encoding is Unicode - I don't know if it has anything to do
> with the problem).

It might.

> (2) I discovered that Fortran sees spurious EOR markers within
> character fields and I couldn't see a rhyme or reason why.

Are the characters in the file using 16-bit Unicode characters, 32-bit characters, or a multi-byte encoding such as UTF-8? If they are 16-bit or 32-bit characters, then some of the bytes that form a character could have the value of an end-of-record character. If the file contains multi-byte encodings, that is not the problem.

Bob Corbett
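To make the point about 16-bit characters concrete: in UTF-16LE the character U+010A is stored as the two bytes 0A 01, and the first of those is the line-feed value, whereas every byte of a multi-byte UTF-8 sequence is 0x80 or higher. The sketch below is one hedged way to check which case applies by looking for a byte-order mark at the start of the file. The file name `demo_utf16.txt` and its four bytes are invented for the demonstration, stream access needs a Fortran 2003 compiler, and the code assumes `char`/`ichar` cover the range 0 to 255 (true of common compilers but processor dependent). A file without a byte-order mark cannot be classified this way.

```fortran
program sniff_bom
  implicit none
  character(len=1) :: c(4)
  integer :: nn, ios, k(4)

  ! Made-up demonstration file: UTF-16LE byte-order mark (FF FE),
  ! followed by U+010A, which UTF-16LE stores as the bytes 0A 01.
  open (newunit=nn, file='demo_utf16.txt', access='stream', &
        form='unformatted', status='replace')
  write (nn) char(255), char(254), char(10), char(1)
  close (nn)

  ! Read the first four bytes back as raw bytes, with no record processing.
  open (newunit=nn, file='demo_utf16.txt', access='stream', &
        form='unformatted', status='old')
  read (nn, iostat=ios) c
  close (nn, status='delete')
  if (ios /= 0) stop 'file shorter than 4 bytes or read error'
  k = ichar(c)

  ! The UTF-32LE mark starts with the UTF-16LE mark, so test it first.
  if (all(k(1:3) == [239, 187, 191])) then
     print '(A)', 'UTF-8 byte-order mark'
  else if (all(k == [255, 254, 0, 0])) then
     print '(A)', 'UTF-32LE byte-order mark'
  else if (all(k == [0, 0, 254, 255])) then
     print '(A)', 'UTF-32BE byte-order mark'
  else if (all(k(1:2) == [255, 254])) then
     print '(A)', 'UTF-16LE byte-order mark'
  else if (all(k(1:2) == [254, 255])) then
     print '(A)', 'UTF-16BE byte-order mark'
  else
     print '(A)', 'no byte-order mark found; encoding not determined'
  end if

  ! A byte equal to 10 here is part of a character, not a line end.
  if (any(k == 10)) print '(A)', 'byte value 10 (LF) occurs in the first 4 bytes'
end program sniff_bom
```

To inspect a real extract, the write step would be dropped and the second `open` pointed at the actual file.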
From: analyst41 on 28 Jan 2010 18:39

On Jan 28, 3:15 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> I do not remember your previous postings, but I am curious about these
> end-of-records. Can you send me an example? (I want to look at CSV
> files more closely, as I recently was confronted with some of their
> nastier aspects in the context of my Flibs project - http://flibs.sf.net).
>
> Regards,
>
> Arjen

I'd love to give you actual files that show fake EORs, but it is copyrighted/proprietary data and I don't have the time to clean it up from that standpoint.

But here are three cases (the occurrence of these strings causes Fortran to see a fake EOR; this is LF95 running on Windows):

    <br />
    </STRONG>
    </B>

These seem to be terminators of HTML phrases. I don't know why Fortran thinks these are EORs.

Excel would trip up similarly, as would the language R. In fact, Fortran, R, and Excel may see a different number of rows in the same CSV file.
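One way to investigate what the compiler sees after those strings is to read the file as raw bytes and print the values that immediately follow each occurrence. The sketch below does that for `<br />`. It rests on a guess that is not confirmed anywhere in this thread, namely that the text field may hold a literal carriage return or line feed right after the tag. The file name `demo_tag.csv` and its contents are invented, and stream access plus `inquire (size=...)` require a Fortran 2003 compiler.

```fortran
program bytes_after_tag
  implicit none
  character(len=*), parameter :: tag = '<br />'
  character(len=:), allocatable :: text
  integer :: nn, fsize, pos, start, j

  ! Made-up demonstration file: the tag is followed by CR LF inside a field.
  open (newunit=nn, file='demo_tag.csv', access='stream', &
        form='unformatted', status='replace')
  write (nn) '1,"first<br />' // char(13) // char(10) // 'second",x' // char(10)
  close (nn)

  ! Read the whole file into one string, bypassing record processing.
  open (newunit=nn, file='demo_tag.csv', access='stream', &
        form='unformatted', status='old')
  inquire (unit=nn, size=fsize)
  allocate (character(len=fsize) :: text)
  read (nn) text
  close (nn, status='delete')

  ! For each occurrence of the tag, show the next two byte values.
  start = 1
  do
     pos = index(text(start:), tag)
     if (pos == 0) exit
     pos = start + pos - 1 + len(tag)      ! position of first byte after the tag
     write (*, '(A,I0,A)', advance='no') 'bytes from position ', pos, ':'
     do j = pos, min(pos + 1, fsize)
        write (*, '(1X,I0)', advance='no') ichar(text(j:j))
     end do
     write (*, '(A)') ''
     start = pos
  end do
end program bytes_after_tag
```

Values of 13 or 10 after the tag would mean the line break is really in the data, which could also explain why different tools count different numbers of rows. Any other values would point elsewhere.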
From: analyst41 on 28 Jan 2010 18:54

On Jan 28, 3:28 am, robert.corb...(a)sun.com wrote:
> > (1) In my case the messy files are csv extracts from a database (whose
> > character encoding is Unicode - I don't know if it has anything to do
> > with the problem).
>
> It might.
>
> > (2) I discovered that Fortran sees spurious EOR markers within
> > character fields and I couldn't see a rhyme or reason why.
>
> Are the characters in the file using 16-bit Unicode characters,
> 32-bit characters, or a multi-byte encoding such as UTF-8?
> If they are 16-bit or 32-bit characters, then some of the
> bytes that form a character could have the value of an
> end-of-record character. If the file contains multi-byte
> encodings, that is not the problem.
>
> Bob Corbett

I believe it is UTF-8. Please also see my reply to Arjen.