Byte Offsets of Tokens, Ngrams and Sentences? [Python]

From: Muhammad Adeel on 6 Aug 2010 05:07

Hi,

Does anyone know how to tokenize a string in Python so that it returns
the tokens together with their byte offsets? Also, is there a sentence
splitter that returns the sentences with their byte offsets? Finally, I
need n-grams returned with byte offsets.

Input:
This is a string.

Output:
This 0
is 5
a 8
string. 10

Thanks

From: Gabriel Genellina on 6 Aug 2010 05:49

On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <nawabadeel(a)gmail.com>
wrote:

> Does anyone know how to tokenize a string in Python so that it returns
> the tokens together with their byte offsets? Also, is there a sentence
> splitter that returns the sentences with their byte offsets? Finally, I
> need n-grams returned with byte offsets.
>
> Input:
> This is a string.
>
> Output:
> This 0
> is 5
> a 8
> string. 10

Like this?

py> import re
py> s = "This is a string."
py> for g in re.finditer(r"\S+", s):
...     print(g.group(), g.start())
...
This 0
is 5
a 8
string. 10
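
Note that g.start() is an offset in characters, which matches the byte
offset only for single-byte text such as the ASCII example above. Below is
a minimal sketch, assuming Python 3 and UTF-8 (the thread does not name an
encoding), that converts each character offset to a byte offset by encoding
the text that precedes the token. The name token_byte_offsets is a
hypothetical helper, not a library function.

```python
import re

def token_byte_offsets(text, encoding="utf-8"):
    """Yield (token, byte_offset) pairs for whitespace-separated tokens."""
    for match in re.finditer(r"\S+", text):
        # The byte offset is the encoded length of everything before the token.
        byte_offset = len(text[:match.start()].encode(encoding))
        yield match.group(), byte_offset

for token, offset in token_byte_offsets("This is a string."):
    print(token, offset)
```

Re-encoding the prefix for every token is quadratic in the worst case, so
for very long inputs it may be better to keep a running byte count instead.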

--
Gabriel Genellina

From: Muhammad Adeel on 6 Aug 2010 06:06

Hi,

Thanks. Can you please tell me how to do the same for n-grams and
sentences as well?
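
For reference, one possible approach reuses the re.finditer idea from the
earlier reply. This is only a sketch under stated assumptions: the function
names ngrams_with_offsets and sentences_with_offsets are hypothetical,
tokens are taken to be whitespace-separated, and a sentence is naively taken
to end at ".", "!" or "?" followed by whitespace or the end of the text. The
offsets are character offsets, which equal byte offsets only for single-byte
text such as ASCII.

```python
import re

def ngrams_with_offsets(text, n):
    """Return (ngram, start_offset) pairs; each n-gram is a slice of text."""
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    result = []
    for i in range(len(spans) - n + 1):
        start = spans[i][0]        # start of the first token in the n-gram
        end = spans[i + n - 1][1]  # end of the last token in the n-gram
        result.append((text[start:end], start))
    return result

def sentences_with_offsets(text):
    """Naive splitter: return (sentence, start_offset) pairs."""
    # Lazily match up to terminal punctuation followed by whitespace or the
    # end of the text; otherwise run to the end of the text.
    pattern = r"\S.*?(?:[.!?]+(?=\s|$)|$)"
    return [(m.group(), m.start())
            for m in re.finditer(pattern, text, re.DOTALL)]

print(ngrams_with_offsets("This is a string.", 2))
# [('This is', 0), ('is a', 5), ('a string.', 8)]
print(sentences_with_offsets("This is a string. Here is another one."))
# [('This is a string.', 0), ('Here is another one.', 18)]
```

The sentence pattern will split wrongly after abbreviations such as "Dr.",
so real text would likely need a proper sentence tokenizer.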
