From: Walter Roberson on 10 Aug 2010 19:00

syam kumar medandrao wrote:
> Can u please explain how to use regexp to divide a text into words..i
> tried textscan also but that is alright,

It depends upon what you mean by "words" in this context. For example, do you need to separate out punctuation marks that might be immediately beside a word, or do you need to keep them? If you need to separate them out, then what about the cases where the marks are _part_ of the word, such as "etc." or "seas'" or "o'clock"? What about the syntactical blunder in the Chicago Manual of Style (CMS), where you cannot determine with certainty whether a period before a closing quote mark indicates an abbreviation or the end of the quoted sentence, because CMS requires that both be rendered the same way?

> i wanted to know how
> efficiently this can work.

Straightforward non-contextual splitting is O(N); that is, it can be done in time proportional to the length of the string. When you start getting into some of the more complex cases, it can be O(e^N), an exponential-time process (but would more likely be O(N^2) or so). If you start dealing with nested quotations with context, then you can no longer use Deterministic Finite Automata (DFA) techniques, and unfortunately I do not presently recall the efficiency analysis of techniques such as Push-Down Automata (PDA).
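
To make the "what is a word?" question concrete, here is a minimal sketch of the straightforward, non-contextual case. It is written in Python rather than in the original poster's environment, and the names `split_on_whitespace`, `split_words_and_marks`, and the `TOKEN` pattern are my own assumptions for illustration, not anything from the question. Each function makes one left-to-right pass over the string, which is the roughly linear-time case mentioned above.

```python
import re

# Assumed definition of a "word" (hypothetical, for illustration only):
# a run of letters/digits, optionally continued by internal apostrophes
# (so "o'clock" stays whole). Any other non-space character becomes its
# own token.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def split_on_whitespace(text):
    """Keep punctuation attached to the neighbouring word."""
    return text.split()

def split_words_and_marks(text):
    """Separate punctuation marks from words, keeping internal apostrophes."""
    return TOKEN.findall(text)

sample = "It's five o'clock, etc."
print(split_on_whitespace(sample))    # ["It's", 'five', "o'clock,", 'etc.']
print(split_words_and_marks(sample))  # ["It's", 'five', "o'clock", ',', 'etc', '.']
```

Note that the second function splits "etc." into "etc" and ".", and it would likewise split "seas'" into "seas" and "'". A context-free pattern like this cannot tell an abbreviation period or a possessive apostrophe from ordinary punctuation, which is exactly the ambiguity raised above.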