From: Tom Serface on 23 Jan 2010 02:46

Yes, after some time I have a parser that I like, but it has a lot of hand coding in it. I agree that how the strings are formed is a matter of taste, but unfortunately I sometimes don't have much control over the input to our program. I'm not a big fan of the \ escape thing in CSV files, since that seems odd to uninitiated users. Not having the separator should be considered a syntax error, though. That much seems fair.

We've mostly gone to XML for input and output these days, and that's solved a lot of issues, but raised a whole lot of other ones, of course.

Tom

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:ef#g5K6mKHA.5692(a)TK2MSFTNGP04.phx.gbl...
> Tom Serface wrote:
>
>> One thing most parsers don't handle correctly, that I've seen, is
>> double double quotes for strings if you want to have a quote as part of
>> the string like:
>>
>> "This is my string "Tom" that I am using", "Next token", "Next token"
>>
>> In the above, from my perspective, the parser should read the entire
>> first string since we didn't come to a delimiter yet, but a lot of
>> tokenizers choke on this sort of thing.
>
> Often, it takes two to tango. A writer needs to escape tokens in order to
> reach some level of sanity, i.e., borrowing a C slash for \".
>
> "This is my string \"Tom\" that I am using"
>
> Or use some encoding method, e.g. HTTP Escape! :)
>
> The above is simple if just delimiting by comma. So watching for an
> embedded comma is required. For example:
>
> "This is my string "Tom, Hector" that I am using"
>
> That can be easily handled if the design assumption is each field is
> double quoted. The first token:
>
> "This is my string "Tom,
>
> does not end in double quote, so you continue with a concatenation of the
> next token.
>
> Hector" that I am using"
>
> to complete the first field.
>
> But overall, I found unless it's really simple, it helps if you have field
> type definitions known beforehand.
>
> --
> HLS
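
The token-joining idea Hector describes (split on the comma, then keep concatenating tokens until one ends in a double quote) might look roughly like the sketch below. It is written in Python only for brevity. The function name `split_quoted_fields` and its `sep` parameter are made up for illustration, and it assumes, as Hector does, that every field is double quoted.

```python
def split_quoted_fields(line, sep=","):
    # Assumption: every field is wrapped in double quotes, so a piece that
    # does not yet end in a double quote was cut at an embedded separator.
    fields = []
    pending = None  # partial field still waiting for its closing quote
    for token in line.split(sep):
        # Re-attach the separator that split() removed from inside the field.
        piece = token if pending is None else pending + sep + token
        stripped = piece.strip()
        if len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"'):
            fields.append(stripped[1:-1])  # drop the outer quotes only
            pending = None
        else:
            pending = piece
    if pending is not None:
        raise ValueError("unterminated quoted field: " + pending.strip())
    return fields


line = '"This is my string "Tom, Hector" that I am using", "Next token", "Next token"'
for field in split_quoted_fields(line):
    print(field)
```

This is a heuristic, not a full CSV parser: it cannot tell an embedded quote that happens to sit right before an embedded comma from the real end of a field, which is one reason known field definitions or an agreed escaping rule still help.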
From: Tom Serface on 23 Jan 2010 02:48

Yes, that's become particularly important to me in recent years since I've had to work with files from other platforms (like Mac or other Unix-based systems). I guess that's why we get to keep working. So many things to consider.

Tom

"David Wilkinson" <no-reply(a)effisols.com> wrote in message news:#lNeGP8mKHA.5552(a)TK2MSFTNGP05.phx.gbl...
> Tom Serface wrote:
>> One thing most parsers don't handle correctly, that I've seen, is
>> double double quotes for strings if you want to have a quote as part of
>> the string like:
>>
>> "This is my string "Tom" that I am using", "Next token", "Next token"
>>
>> In the above, from my perspective, the parser should read the entire
>> first string since we didn't come to a delimiter yet, but a lot of
>> tokenizers choke on this sort of thing.
>
> Another thing is tolerating files that have \n or \r line endings rather
> than \r\n.
>
> --
> David Wilkinson
> Visual C++ MVP
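
As a rough illustration of the tolerance David mentions, a reader can accept \r\n, a lone \n, or a lone \r as the end of a line. The sketch below is in Python for brevity, and the function name `split_lines` is invented for this example rather than taken from any library.

```python
def split_lines(text):
    # Accept \r\n (DOS), \n (Unix), or a lone \r (pre-OS X Mac) as a line ending.
    lines = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "\r" or ch == "\n":
            lines.append(text[start:i])
            # Treat \r\n as one line ending, not as two.
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1
            start = i + 1
        i += 1
    if start < n:
        lines.append(text[start:])  # last line had no terminator
    return lines


print(split_lines("dos line\r\nunix line\nold mac line\rlast line"))
```

Python's built-in `str.splitlines()` does something similar but also splits on several other characters (form feed, for example), so the explicit loop keeps the behavior limited to the three conventions discussed here.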
From: Tom Serface on 23 Jan 2010 02:50

Well, you could have used Xerces and spent 8 days getting it to work instead :o)

Tom

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message news:99vkl5p0cdc2ngvsqpdn0h9rhr5sn8fnal(a)4ax.com...
> I can generally write an FSM parser in an hour or so, depending on the
> syntax. I wrote an XML parser, recursive descent, in eight hours, start
> to finish. The constraints were strange, and involved "no public source
> code, ever", which I thought was foolish, but they were paying. I did
> tell them there were a number of cheats, such as it did not handle all
> possible encodings of XML files, a constraint they found acceptable.
> joe
From: Tom Serface on 23 Jan 2010 02:58

I think Joe is saying it is meaningless these days because there is no carriage to return any longer. I think most of us consider \n synonymous with Enter, and that implies the start of a new line. A lot of this is carryover from the days of teletype and paper terminals, and we're just stuck with it as part of ASCII.

Tom

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message news:uqDAH$$mKHA.1548(a)TK2MSFTNGP04.phx.gbl...
>
> Joseph M. Newcomer wrote:
>
>> One of the rules we developed about forty years ago (1968) is that \r is
>> meaningless noise treated as whitespace, and \n is a newline. This works
>> until you import a text file created on a pre-OS X Mac, where \r is the
>> newline character.
>> joe
>
> Don't confuse raw vs cooked vs display/print device vs storage systems!
>
> \r\n has its basis as hardware device codes for the hardware devices of
> the day: printers, teletypes, dumb terminals, etc.
>
> \r <CR> is what it is - a carriage return (move it to the first column) of
> the printer head! Note the operative word - Carriage!
>
> \n <LF> is what it is - a line feed (move carriage head down one line) of
> the printer head!
>
> When the consoles came, the printer head was now your cursor. That is why
> it is paired, whether or not there are translations.
>
> Now, your Terminal and Printer could have OPTIONAL translation for an
> automatic line feed (\n) with each carriage return (\r), which means it
> APPEARS as if it were a line delimiter, as in the unix wienie world. In
> the Mac world, a \r is the line delimiter. DOS of course uses \r\n
> (<CR><LF>) pairs.
>
> But it is your terminal or printer providing the illusion with
> translations (which may be the default depending on the OS it is
> connected to). So if you dumped a unix file or mac file to a printer, it
> did the proper translation for you. The printer or carriage or laser
> point did not change, you still need to tell it to go left, right, up or
> down!
>
> Geez, Meaningless?
>
> This again is an example of insane revisionist comments.
>
> --
> HLS
From: Hector Santos on 23 Jan 2010 03:44

Not so, Tom. It is all still the same! Trust me! It's what we do! This is my business (http://www.santronics.com). It is what we do as one of the early pioneers in the telecommunications market.

It is all still the same. It is a natural part of our framework and of everyone else's in the same market. It is a fundamental understanding in this market. If you don't follow it, you will not be compatible with the rest of the world. Our software covers every aspect of the communications market, from mail readers, telecommunication programs, mail/file distribution and hosting, dialup vs internet, you name it. Your mail post here is guaranteed to be read by some users in the world with one of our mail reading devices. Your mail is guaranteed to be stored and forwarded (gated) to servers using our product, and honestly, if you recently saw a doctor and a health claim was filed on your behalf, the chances are really good our software was somewhere in the network loop in getting that claim collected, processed and the doctor paid!

When you hit ENTER, depending on the device and the OS, it will do the translation for you. If you are going to display a text file on the screen or send it to a printer, the device is doing the translation for you or not.

Storage is different because the OS may use one EOL (END OF LINE) character or two. Sure, one can say that is a "WASTE", but you also have to think of the consequences for overall global portability and interfacing with other software and hardware devices.

Ultimately, regardless of how it is stored, a translation needs to take place if you are going to display or print it correctly. If that were not the case, then I am sure, Tom, you have seen times where a printout was all one black line or jagged across a page.

Now, Internet-based mail protocols use CRLF for many historical reasons. When Mac or Unix mail software sends email or news, it must implement translations, otherwise it is broken.
Same with FTP: a well-designed server and client need to take this into account. Same with the HTTP protocol: CRLF is the standard.

So that means that if you are in the Mac/Unix world, the interface software MUST do translations.

For some kinds of user software, like a mail reader, most good ones need to be DOS/Unix/Mac ready in reading a text file, and such software generally has sound, solid logic for reading such files. This is an example where, as Joe indicated, a "\n" may be read as a NEWLINE (EOL is my preferred terminology), but only if there is no \r that precedes it.

It is not old, it is still here, it is fundamental in telecommunications, and there is no way we can live without it. But the software and devices today are so highly engineered to deal with all situations that it is all transparent to users. :)

--

Tom Serface wrote:
> I think Joe is saying it is meaningless these days because there is no
> carriage to return any longer. I think most of us consider \n
> synonymous with Enter and that implies the start of a new line. A lot
> of this is carry over from the days of teletype and paper terminals and
> we're just stuck with it as part of ASCII.
>
> Tom
>
> "Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
> news:uqDAH$$mKHA.1548(a)TK2MSFTNGP04.phx.gbl...
>>
>> Joseph M. Newcomer wrote:
>>
>>> One of the rules we developed about forty years ago (1968) is that \r
>>> is meaningless noise treated as whitespace, and \n is a newline. This
>>> works until you import a text file created on a pre-OS X Mac, where
>>> \r is the newline character.
>>> joe
>>
>> Don't confuse raw vs cooked vs display/print device vs storage systems!
>>
>> \r\n has its basis as hardware device codes for the hardware devices
>> of the day: printers, teletypes, dumb terminals, etc.
>>
>> \r <CR> is what it is - a carriage return (move it to the first
>> column) of the printer head! Note the operative word - Carriage!
>>
>> \n <LF> is what it is - a line feed (move carriage head down one line)
>> of the printer head!
>>
>> When the consoles came, the printer head was now your cursor. That is
>> why it is paired, whether or not there are translations.
>>
>> Now, your Terminal and Printer could have OPTIONAL translation for an
>> automatic line feed (\n) with each carriage return (\r), which means it
>> APPEARS as if it were a line delimiter, as in the unix wienie world. In
>> the Mac world, a \r is the line delimiter. DOS of course uses \r\n
>> (<CR><LF>) pairs.
>>
>> But it is your terminal or printer providing the illusion with
>> translations (which may be the default depending on the OS it is
>> connected to). So if you dumped a unix file or mac file to a printer,
>> it did the proper translation for you. The printer or carriage or laser
>> point did not change, you still need to tell it to go left, right, up
>> or down!
>>
>> Geez, Meaningless?
>>
>> This again is an example of insane revisionist comments.
>>
>> --
>> HLS
>

--
HLS
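
The translation step Hector describes for Mac/Unix software that talks to CRLF-based protocols can be sketched as a small normalizer that runs before text goes on the wire. This is only an illustration in Python: the name `to_crlf` is hypothetical, and a real mail, FTP, or HTTP implementation has to follow the rules of its own protocol specification.

```python
import re

# Match any of the three conventions; \r\n is listed first so a CRLF pair
# is consumed as a single line ending instead of as \r followed by \n.
_ANY_EOL = re.compile(r"\r\n|\r|\n")


def to_crlf(text):
    # Rewrite every line ending as CRLF, whatever the source platform used.
    return _ANY_EOL.sub("\r\n", text)


print(repr(to_crlf("unix\nold mac\rdos\r\nend")))
```

Because an existing \r\n is matched as a unit, running the function on text that is already CRLF-terminated leaves it unchanged, so it should be safe to apply to input of unknown origin.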