From: Arjen Markus on 29 Jan 2010 10:49

On 29 jan, 15:33, nos...(a)see.signature (Richard Maine) wrote:
> Arjen Markus <arjen.markus...(a)gmail.com> wrote:
>
> > But a / not enclosed in ' or " in the input for a list-directed read is
> > defined to stop the input! That may be the cause for the Fortran program
> > to indicate an end-of-record.
>
> No, that would not constitute an end-of-record to Fortran. In fact, you
> can't get an end-of-record with list-directed input at all. Might be a
> good guess for user confusions with sloppy terminology, but not for a
> literal end-of-record.
>
> I'm not going to try to speculate about the problem from the data given.
>
> --
> Richard Maine                    | Good judgment comes from experience;
> email: last name at domain . net | experience comes from bad judgment.
> domain: summertriangle           |  -- Mark Twain

Oops, you are right. It was the slash and a discussion about using
list-directed reads to read incomplete CSV records plus the unusual
appearance of HTML tags that must have put me off guard.

It remains a mystery then - perhaps it is a UTF-8 character sequence
that does this ...

Regards,

Arjen
From: analyst41 on 29 Jan 2010 21:15

On Jan 29, 9:44 am, dpb <n...(a)non.net> wrote:
> analys...(a)hotmail.com wrote:
> > On Jan 28, 3:15 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> >> On 28 jan, 00:51, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
> >> wrote:
> >
> >>> I posted on this topic before and this is my latest take on it:
> >>>
> >>> (1) In my case the messy files are csv extracts from a database (whose
> >>> character encoding is Unicode - I don't know if it has anything to do
> >>> with the problem).
> >>>
> >>> (2) I discovered that Fortran sees spurious EOR markers within
> >>> character fields and I couldn't see a rhyme or reason why.
> >>>
> >>> (3) But since I control the input - I inserted row numbers at the
> >>> beginning and end of each row extracted from the database and I added
> >>> 2000000000 to the row number to make sure it's unlikely that this data
> >>> would show up naturally.
> >>>
> >>> (4) I then read each record and make sure that it has at least 18
> >>> characters (if not it is simply concatenated to cum_buffer - see
> >>> below).
> >>>
> >>> I use the statement (adapted from Cooper Redwine's book)
> >>>
> >>> read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, size
> >>> = num_chars) buffer
> >>>
> >>> you must have EOR or EOF or error on each read - otherwise the buffer
> >>> is too small and the program has to be halted.
> >>>
> >>> I then check if the record number is showing up at the end which is
> >>> the same as the one on the left. If yes, you have a complete record -
> >>> if not - you have a spurious EOR and simply concatenate the buffer
> >>> to another buffer called cum_buffer.
> >>>
> >>> when cum_buffer looks like
> >>>
> >>> 2000000127stuff2000000127
> >>>
> >>> You have a facsimile of a row 127 from the database.
> >>>
> >>> You might still have to struggle with separating 'stuff' into fields -
> >>> but that's a purely programming task having nothing to do with the file
> >>> system or operating system or character encoding schemes.
> >>>
> >>> I hope others find this useful and suggestions for improvements would
> >>> be good.
> >
> >> I do not remember your previous postings, but I am curious about these
> >> end-of-records. Can you send me an example? (I want to look at CSV
> >> files more closely, as I recently was confronted with some of their
> >> nastier aspects in the context of my Flibs project -
> >> http://flibs.sf.net).
> >
> >> Regards,
> >
> >> Arjen
> >
> > I'd love to give you actual files that show fake EORs - but it is
> > copyright/proprietary data and I didn't have the time to clean it up
> > from that standpoint.
> >
> > But here are three cases (the occurrence of these strings causes
> > Fortran to see a fake EOR - LF95 running on Windows):
> >
> > <br />
> >
> > </STRONG>
> >
> > </B>
> >
> > These seem to be terminators of HTML phrases - I don't know why
> > Fortran thinks these are EORs. Excel would trip up similarly as would
> > the language R - in fact, Fortran, R and Excel may see a different
> > number of rows in the same csv file.
>
> Can you post a short section of the file surrounding the offending
> characters as seen by a hex dump program so can see what's actually in
> the data stream?
>
> Do these strings fail when read on their own in any length record or
> only in the generated output file from the database?
>
> If you can make it fail repeatedly it should be quite simple to at least
> figure out what is the root cause and whether that is a data problem or
> a bug in the particular compiler i/o library.
>
> Which raises a point of what happens w/ another compiler?
>
> --

I can tell you that it's not a Fortran issue. Notepad, Excel and the R
language are unable to split the file up into records so that the
records correspond to rows in the database.

I actually don't know the Windows/DOS command to produce a hex dump -
if someone knows it - please post it. I have reduced the problem
row set to a few rows - it should be possible to post the entire data
here as a hex dump.
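
Since other tools also fail to split the file into rows, the marker scheme quoted above (the row number plus 2000000000 at both ends of each row) can be tried outside Fortran too. What follows is only a minimal sketch, not the poster's code. It assumes a POSIX shell with awk, markers that are exactly 10 digits long, and that a spurious break can simply be glued back together without restoring the lost byte. The function name rejoin_rows is hypothetical.

```shell
#!/bin/sh
# rejoin_rows FILE
# Glue physical lines back into logical rows. Assumption: every logical
# row begins and ends with the same 10-digit marker, for example
# 2000000127stuff2000000127.
rejoin_rows() {
    awk '
    {
        sub(/\r$/, "")            # drop the CR of a CR/LF line ending
        buf = buf $0              # append this physical line
        n = length(buf)
        head = substr(buf, 1, 10)
        tail = substr(buf, n - 9)
        # A row is complete when both ends carry the same all-digit marker.
        if (n >= 20 && head ~ /^[0-9]+$/ && head == tail) {
            print buf
            buf = ""
        }
    }
    END {
        # Anything left over never saw its closing marker.
        if (buf != "") print "incomplete: " buf | "cat 1>&2"
    }' "$1"
}
```

Called as `rejoin_rows extract.csv > rows.csv` (both file names are placeholders), it writes one line per database row and reports any unfinished remainder on standard error. Splitting 'stuff' into fields is still a separate step.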
From: Dr Ivan D. Reid on 30 Jan 2010 07:15

On Fri, 29 Jan 2010 18:15:06 -0800 (PST), analyst41(a)hotmail.com
<analyst41(a)hotmail.com> wrote in
<a31f5cab-b0b1-4cdf-ab66-ed1432409861(a)g39g2000vba.googlegroups.com>:

> I actually don't know the Windows/DOS command to produce a hex dump -
> if someone knows it - please post it. I have reduced the problem
> row set to a few rows - it should be possible to post the entire data
> here as a hex dump.

Use debug in a DOS window. Example (on the first short file I saw):

C:\cygwin\home\Compaq_Owner>cat undupe
#! /bin/bash
export DUPE=$1
export WIN_NT='$WIN_NT'
find $DUPE -type f -ls | gawk -f finddupe.awk

C:\cygwin\home\Compaq_Owner>debug undupe
-d
1554:0100  23 21 20 2F 62 69 6E 2F-62 61 73 68 0A 65 78 70   #! /bin/bash.exp
1554:0110  6F 72 74 20 44 55 50 45-3D 24 31 0A 65 78 70 6F   ort DUPE=$1.expo
1554:0120  72 74 20 57 49 4E 5F 4E-54 3D 27 24 57 49 4E 5F   rt WIN_NT='$WIN_
1554:0130  4E 54 27 0A 66 69 6E 64-20 24 44 55 50 45 20 2D   NT'.find $DUPE -
1554:0140  74 79 70 65 20 66 20 2D-6C 73 20 7C 20 67 61 77   type f -ls | gaw
1554:0150  6B 20 2D 66 20 66 69 6E-64 64 75 70 65 2E 61 77   k -f finddupe.aw
1554:0160  6B 0A 0A 00 00 00 00 00-00 00 00 00 00 00 00 00   k...............
1554:0170  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00   ................
-

--
Ivan Reid, School of Engineering & Design, _____________  CMS Collaboration,
Brunel University.    Ivan.Reid@[brunel.ac.uk|cern.ch]    Room 40-1-B12, CERN
        KotPT -- "for stupidity above and beyond the call of duty".
From: glen herrmannsfeldt on 30 Jan 2010 13:46

analyst41(a)hotmail.com <analyst41(a)hotmail.com> wrote:
(big snip)

> I actually don't know the Windows/DOS command to produce a hex dump -
> if someone knows it - please post it. I have reduced the problem
> row set to a few rows - it should be possible to post the entire data
> here as a hex dump.

The DOS DEBUG command is still available, but only for files that
it can fit into memory.

A better choice is the port of the GNU file utilities, including
the od command (with the -x option) or xd. I believe if you
search for UNXUTILS at sourceforge you can find them. That
includes some very useful utilities such as grep and diff.

-- glen
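
For hunting stray record terminators, a byte-by-byte dump is usually easier to read than od -x, which prints two-byte units in host byte order, so on x86 each pair of bytes appears swapped. The following is a minimal sketch, assuming a POSIX-style od, tr and wc are on the path (for instance from Cygwin or the UnxUtils port mentioned above). The function names dump_bytes and count_byte are hypothetical.

```shell
#!/bin/sh
# dump_bytes FILE
# Hex offsets on the left, then one byte per column (-t x1).
# -v prints every line instead of collapsing repeated ones into "*".
dump_bytes() {
    od -A x -t x1 -v "$1"
}

# count_byte ESCAPE FILE
# Count occurrences of a single byte given as a tr escape,
# for example '\r' (0D, carriage return) or '\n' (0A, line feed).
count_byte() {
    tr -cd "$1" < "$2" | wc -c
}
```

If `count_byte '\n' file` and `count_byte '\r' file` disagree on a file that is supposed to use CR/LF line endings, some fields probably contain a bare 0A or 0D, and `dump_bytes file` shows where. `od -c file` is an alternative that prints \r and \n symbolically.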
From: robin on 31 Jan 2010 06:35

"Arjen Markus" <arjen.markus895(a)gmail.com> wrote in message
news:89ef5ea7-4e37-4232-bf9c-3e4c446777ee(a)g1g2000yqi.googlegroups.com...
On 29 jan, 00:39, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
wrote:
> On Jan 28, 3:15 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
>> > > read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, size
>> > > = num_chars) buffer

>But a / not enclosed in ' or " in the input for a list-directed read
>is defined to stop the input! That may be the cause for the Fortran
>program to indicate an end-of-record.

No, because he is using a formatted READ with an explicit format,
fmt = '(A)', not a list-directed one (see his READ statement).