From: Walter Roberson on
syam kumar medandrao wrote:

> Can you please explain how to use regexp to divide a text into words? I
> tried textscan also, but that is alright.

It depends upon what you mean by "words" in this context. For example, do you
need to separate out punctuation marks that are immediately beside a
word, or do you need to keep them? If you need to separate them out, then what
about the cases where the marks are _part_ of the word, such as "etc." or
"seas'" or "o'clock"? What about the syntactical blunder in the Chicago
Manual of Style (CMS), where you cannot determine with certainty whether a period
before a closing quote mark indicates an abbreviation or the end of the
quoted sentence, because CMS requires that both be rendered the same way?
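
To make the choice concrete, here is a minimal sketch of two possible
definitions of "word". It is written in Python only so that it is easy to run;
MATLAB's regexp with the 'match' option plays the same role as re.findall here.
The sample sentence and both patterns are my own assumptions for illustration,
not a recommendation for your data.

```python
import re

# Sample sentence invented for this illustration.
text = 'He said "wait until 5 o\'clock, etc." and left.'

# Option 1: keep only runs of letters and digits. All punctuation is
# dropped, so "o'clock" is broken in two and "etc." loses its period.
plain_words = re.findall(r"[A-Za-z0-9]+", text)

# Option 2: allow apostrophes *inside* a word, and emit every other
# punctuation mark as its own token.
tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

print(plain_words)
print(tokens)
```

Note that neither pattern can decide whether the period after "etc" belongs to
the abbreviation or ends the sentence; Option 2 simply splits it off, and a
trailing apostrophe as in "seas'" would be split off as well.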

> I wanted to know how
> efficiently this can work.

Straightforward non-contextual splitting is O(N) -- that is, it can be done in
time proportional to the length of the string. When you start getting into
some of the more complex cases, it can be O(e^N), an exponential-time
process (but would more likely be O(N^2) or so). If you start dealing with
nested quotations with context, then you can no longer use Deterministic
Finite Automata (DFA) techniques, and unfortunately I do not presently recall
the efficiency analysis of techniques such as Push-Down Automata (PDA).
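
For what it is worth, the simplest nested case can be handled in a single pass
with an explicit stack. The following is only a sketch (in Python, for ease of
running) under a strong assumption: opening and closing quote marks are
distinct characters, so their direction is never ambiguous. The helper name
words_with_depth and the choice of marks are hypothetical. This particular
deterministic scan does constant work per character; it says nothing about the
cost of more general push-down techniques.

```python
# Hypothetical helper: tag each word with its quotation nesting depth.
# Assumes distinct opening/closing marks, so direction is unambiguous.
PAIRS = {"“": "”", "«": "»"}
CLOSERS = set(PAIRS.values())

def words_with_depth(text):
    stack = []   # expected closing marks, innermost last
    out = []     # (word, depth) pairs
    word = []    # characters of the word currently being built

    def flush():
        # Emit the pending word, tagged with the current nesting depth.
        if word:
            out.append(("".join(word), len(stack)))
            word.clear()

    for ch in text:  # one pass over the string
        if ch.isalnum():
            word.append(ch)
            continue
        flush()
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not stack or stack[-1] != ch:
                raise ValueError("unbalanced quotation mark: " + ch)
            stack.pop()
    flush()
    if stack:
        raise ValueError("unclosed quotation")
    return out

result = words_with_depth("She wrote “he said «no» twice” today")
print(result)
```

With plain ASCII quote marks the direction of each mark has to be guessed from
context (and an apostrophe looks like a single quote), which is where the
harder cases begin.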