From: Terence on 28 Oct 2009 05:19

On Oct 26, 8:23 am, Ron Shepard <ron-shep...(a)NOSPAM.comcast.net> wrote:
> In article <PVYEm.49864$ze1.16...(a)news-server.bigpond.net.au>,
>

This all seems to be about parsing (ASCII?) symbols off an input medium.

I've worked in this area for a long time, and I've found that the best way is to read the input medium in "suitable" chunks in "binary" or "transparent" unformatted sequential mode, and to use a "next byte" routine that either returns the next byte or causes a new data "chunk" to be obtained. Any error signal is to be treated as a cue for special treatment of the last "chunk" to locate a credible EOF symbol, knowing that the last chunk, read into the same buffer, does NOT fully overwrite it, so the later characters in the buffer are left over from the previous chunk and will appear to be duplicated.

The data captured with this method consists of eight-bit characters whose numeric values can range from 0 to 255.

My preferred technique is to pick up each 8-bit character into the lower byte of a 16-bit word (zeroed once, beforehand) and then determine which section of the ASCII table (#00-#1F, #20-#7F, or #80-#FF) the symbol falls in, as the first step in parsing for sense or for an EOR/EOF symbol. Dividing the problem up into control, ASCII text and accented text reduces the size of the action tables, which are indexed by the numeric value of the symbol.

This allows parsing any language that can use 8-bit characters, and it can be extended to 16-bit character sets. I've used this for very many left-to-right languages, including romaji and Greek.
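The chunked "next byte" scheme and the three-way split described above could be sketched roughly as follows. This is only an illustration in Python, not the code actually used: the names `ByteReader`, `section` and `count_sections`, and the chunk size of 4096, are all hypothetical. Python's `read()` returns only the bytes actually read, so the stale tail of a partly overwritten buffer does not arise here, and Python integers make the pre-zeroed 16-bit word unnecessary.

```python
import io

CHUNK_SIZE = 4096  # hypothetical; the post only says "suitable" chunks


class ByteReader:
    """Hands out one byte at a time, refilling from the stream in chunks."""

    def __init__(self, stream, chunk_size=CHUNK_SIZE):
        self.stream = stream          # must be opened in binary mode
        self.chunk_size = chunk_size
        self.chunk = b""
        self.pos = 0

    def next_byte(self):
        """Return the next byte as an int in 0..255, or None at end of input."""
        if self.pos >= len(self.chunk):
            # Current chunk is used up: fetch a new one.
            self.chunk = self.stream.read(self.chunk_size)
            self.pos = 0
            if not self.chunk:
                return None           # empty read means end of input
        value = self.chunk[self.pos]
        self.pos += 1
        return value


def section(value):
    """Classify a byte value into the three ranges named in the post."""
    if value <= 0x1F:
        return "control"              # #00-#1F
    if value <= 0x7F:
        return "ascii"                # #20-#7F
    return "accented"                 # #80-#FF


def count_sections(stream, chunk_size=CHUNK_SIZE):
    """Count how many bytes of the stream fall into each section."""
    reader = ByteReader(stream, chunk_size)
    counts = {"control": 0, "ascii": 0, "accented": 0}
    while True:
        value = reader.next_byte()
        if value is None:
            break
        counts[section(value)] += 1
    return counts


if __name__ == "__main__":
    sample = io.BytesIO(b"caf\xe9\r\n")
    print(count_sections(sample, chunk_size=4))
```

In a real parser, each section would select its own smaller action table indexed by the byte value, instead of the simple counter used here.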