Reading messy files with Fortran [Fortran]

From: analyst41 on 27 Jan 2010 18:51

I posted on this topic before and this is my latest take on it:

(1) In my case the messy files are CSV extracts from a database (whose
character encoding is Unicode - I don't know if that has anything to do
with the problem).

(2) I discovered that Fortran sees spurious EOR (end-of-record) markers
within character fields, and I couldn't see any rhyme or reason why.

(3) But since I control the input, I inserted row numbers at the
beginning and end of each row extracted from the database, and I added
2000000000 to the row number to make sure it's unlikely that this data
would show up naturally.

(4) I then read each record and make sure that it has at least 18
characters (if not, it is simply concatenated to cum_buffer - see
below).

I use the statement (adapted from Cooper Redwine's book)

read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, &
      size = num_chars) buffer

Each read must end in EOR, EOF, or an error (a nonzero read_stat) -
otherwise the buffer is too small and the program has to be halted.
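
As an illustration only, the read-and-classify step described above might
look like the sketch below. It assumes a Fortran 2003 compiler (for the
is_iostat_end and is_iostat_eor intrinsics); the scratch file, its two
sample lines, and the 4096-character buffer are stand-ins invented for
the example, not part of the original program.

```fortran
program read_pieces
  implicit none
  integer, parameter :: nn = 10          ! unit number (assumed)
  character(len=4096) :: buffer          ! buffer size is an assumption
  integer :: read_stat, num_chars

  ! Stand-in for the real CSV extract: a scratch file holding one
  ! database row that was split in two by a spurious end of record.
  open (unit=nn, status='scratch', action='readwrite', form='formatted')
  write (nn, '(A)') '2000000127stu'
  write (nn, '(A)') 'ff2000000127'
  rewind (nn)

  do
    read (unit=nn, fmt='(A)', advance='no', iostat=read_stat, &
          size=num_chars) buffer
    if (is_iostat_end(read_stat)) then
      exit                               ! end of file: nothing more to read
    else if (is_iostat_eor(read_stat)) then
      ! End of record reached; only buffer(1:num_chars) came from the file.
      print '(I0,1X,A)', num_chars, buffer(1:num_chars)
    else if (read_stat == 0) then
      ! The buffer filled up before any end of record was seen.
      stop 'buffer too small for this record'
    else
      stop 'read error'
    end if
  end do
  close (nn)
end program read_pieces
```

A zero read_stat here means the whole buffer was filled without reaching
an end of record, which is the "buffer is too small" case.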

I then check whether the record number showing up at the end is the
same as the one on the left. If yes, you have a complete record - if
not, you have a spurious EOR and simply concatenate the buffer to
another buffer called cum_buffer.

When cum_buffer looks like

2000000127stuff2000000127

you have a facsimile of row 127 from the database.
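
A minimal sketch of that completeness test and the concatenation into
cum_buffer, under stated assumptions: the marker width of 10 digits is
taken from the example 2000000127, the name is_complete and the sample
pieces are hypothetical, and trim() stands in for the num_chars count a
real read would supply.

```fortran
program reassemble
  implicit none
  integer, parameter :: marker_len = 10  ! width of 2000000127 (assumed)
  ! Hypothetical pieces, as the reads might deliver them.
  character(len=*), parameter :: pieces(3) = [ character(len=24) :: &
       '2000000127stu', 'ff2000000127', '2000000128x2000000128' ]
  character(len=:), allocatable :: cum_buffer
  integer :: i

  cum_buffer = ''
  do i = 1, size(pieces)
    ! In a real reader, append buffer(1:num_chars) instead of trim(...).
    cum_buffer = cum_buffer // trim(pieces(i))
    if (is_complete(cum_buffer)) then
      ! Print the payload between the two markers, then start over.
      print '(A)', cum_buffer(marker_len+1:len(cum_buffer)-marker_len)
      cum_buffer = ''
    end if
  end do

contains

  ! True when s starts and ends with the same all-digit marker.
  logical function is_complete(s)
    character(len=*), intent(in) :: s
    integer :: n
    n = len(s)
    is_complete = .false.
    if (n < 2*marker_len) return
    if (verify(s(1:marker_len), '0123456789') /= 0) return
    is_complete = (s(1:marker_len) == s(n-marker_len+1:n))
  end function is_complete

end program reassemble
```

With these sample pieces the first two are joined into one row whose
payload is "stuff", and the third is already complete with payload "x".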

You might still have to struggle with separating 'stuff' into fields -
but that's a purely programming task having nothing to do with the file
system, operating system, or character encoding schemes.

I hope others find this useful; suggestions for improvements are
welcome.

From: Arjen Markus on 28 Jan 2010 03:15

I do not remember your previous postings, but I am curious about these
end-of-records. Can you send me an example? (I want to look at CSV
files more closely, as I recently was confronted with some of their
nastier aspects in the context of my Flibs project -
http://flibs.sf.net).

Regards,

Arjen

From: robert.corbett on 28 Jan 2010 03:28

On Jan 27, 3:51 pm, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
wrote:
> I posted on this topic before and this is my latest take on it:
>
> (1) In my case the messy files are csv extracts from a database (whose
> character encoding is Unicode - I don't know if it has anything to do
> with the problem).

It might.

> (2) I discovered that Fortran sees spurious EOR markers within
> character fields and I couldn't see a rhyme or reason why.

Are the characters in the file using 16-bit Unicode characters,
32-bit characters, or a multi-byte encoding such as UTF-8?
If they are 16-bit or 32-bit characters, then some of the
bytes that form a character could have the value of an
end-of-record character. If the file contains multi-byte
encodings, that is not the problem.

Bob Corbett
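
To make the point about 16-bit characters concrete, here is a small
sketch that splits a few code points into their two UTF-16 bytes and
flags any byte equal to LF (10) or CR (13). The three code points are
chosen only for illustration and are limited to values that fit in one
16-bit unit. In UTF-8, by contrast, every byte of a multi-byte sequence
is 128 or larger, so bytes 10 and 13 only ever appear as real line
terminators.

```fortran
program utf16_bytes
  implicit none
  ! Illustrative code points: U+0041 ('A'), U+010A, U+0D0A.
  integer, parameter :: code_points(3) = [ 65, 266, 3338 ]
  integer :: i, lo, hi

  do i = 1, size(code_points)
    lo = iand(code_points(i), 255)              ! low-order byte
    hi = iand(ishft(code_points(i), -8), 255)   ! high-order byte
    if (lo == 10 .or. lo == 13 .or. hi == 10 .or. hi == 13) then
      print '(A,Z4.4,A)', 'U+', code_points(i), ' contains an LF or CR byte'
    else
      print '(A,Z4.4,A)', 'U+', code_points(i), ' has no LF or CR byte'
    end if
  end do
end program utf16_bytes
```

A reader that treats the file as a stream of single bytes would see a
record break inside the second and third characters.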

From: analyst41 on 28 Jan 2010 18:39

I'd love to give you actual files that show fake EORs - but it is
copyrighted/proprietary data and I don't have the time to clean it up
from that standpoint.

But here are three cases (the occurrence of these strings causes
Fortran to see a fake EOR - LF95 running on Windows):

<br />

</STRONG>

</B>

These seem to be terminators of HTML phrases - I don't know why
Fortran thinks these are EORs. Excel would trip up similarly as would
the language R - in fact, Fortran, R and Excel may see a different
number of rows in the same csv file.

From: analyst41 on 28 Jan 2010 18:54

On Jan 28, 3:28 am, robert.corb...(a)sun.com wrote:
> Are the characters in the file using 16-bit Unicode characters,
> 32-bit characters, or a multi-byte encoding such as UTF-8?
> If they are 16-bit or 32-bit characters, then some of the
> bytes that form a character could have the value of an
> end-of-record character. If the file contains multi-byte
> encodings, that is not the problem.
>
> Bob Corbett

I believe it is UTF-8 - please also see my reply to Arjen.
