Splitting a sequence into pieces with identical elements [Python]


From: candide on 10 Aug 2010 20:37

Suppose you have a sequence s, say a string, for instance this one :

spppammmmegggssss

We want to split s into the following parts :

['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

i.e. each part is a word consisting of a single repeated character.

What is the pythonic way to answer this question?

A naive solution would be the following :

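# -------------------------------
z='spppammmmegggssss'

zz=[]
while z:
    # grow k until the prefix z[:k] is no longer k copies of z[0]
    k=1
    while z[:k]==k*z[0]:
        k+=1
    # the run has length k-1: store it and drop it from z
    zz+=[z[:k-1]]
    z=z[k-1:]

print(zz)
# -------------------------------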

but I guess this code is not very idiomatic :(

From: Tim Chase on 10 Aug 2010 21:18

On 08/10/10 19:37, candide wrote:
> Suppose you have a sequence s , a string for say, for instance this one :
>
> spppammmmegggssss
>
> We want to split s into the following parts :
>
> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>
> ie each part is a single repeated character word.

While I'm not sure it's idiomatic, the overabuse of regexps in
Python certainly seems prevalent enough to be idiomatic ;-)

As such, you can use:

import re
r = re.compile(r'((.)\1*)')
#r = re.compile(r'((\w)\1*)')
s = 'spppammmmegggssss'
results = [m.group(0) for m in r.finditer(s)]

Additionally, you have all the properties of the match object
(including the start/end) available too, should you need them.

You don't specify what you want to have happen with non-letters
(whitespace, punctuation, etc). The above just treats them like
any other character, finding repeats. If you just want "word"
characters, you can use the 2nd ("\w") version, or adjust
accordingly.
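
A minimal sketch of the two points above (match positions, and "." versus "\w"), offered as an illustration rather than as part of the original post. It assumes the single-group pattern `(.)\1*` that appears later in this thread, and the input string `sample` is a made-up example:

```python
import re

# Single-group form of the pattern: one character, then repeats of it.
any_char = re.compile(r'(.)\1*')
# Same idea restricted to "word" characters (letters, digits, underscore).
word_char = re.compile(r'(\w)\1*')

sample = 'aa  bb!!'  # hypothetical input with spaces and punctuation

# '.' treats spaces and punctuation like any other character.
runs_any = [m.group(0) for m in any_char.finditer(sample)]
# '\w' skips over anything that is not a word character.
runs_word = [m.group(0) for m in word_char.finditer(sample)]

# Each match object also carries its position in the input.
spans = [(m.group(0), m.start(), m.end()) for m in any_char.finditer(sample)]

print(runs_any)   # ['aa', '  ', 'bb', '!!']
print(runs_word)  # ['aa', 'bb']
print(spans)      # [('aa', 0, 2), ('  ', 2, 4), ('bb', 4, 6), ('!!', 6, 8)]
```

Note that "." does not match a newline unless re.DOTALL is given, so with the first pattern newlines are skipped rather than grouped.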

-tkc

From: Chris Rebert on 10 Aug 2010 21:11

On Tue, Aug 10, 2010 at 5:37 PM, candide <candide(a)free.invalid> wrote:
> Suppose you have a sequence s , a string for say, for instance this one :
>
> spppammmmegggssss
>
> We want to split s into the following parts :
>
> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>
> ie each part is a single repeated character word.
>
> What is the pythonic way to answer this question?

If you're doing an operation on an iterable, always leaf thru itertools first:
http://docs.python.org/library/itertools.html

from itertools import groupby
def split_into_runs(seq):
    return ["".join(run) for letter, run in groupby(seq)]

If itertools didn't exist:

def split_into_runs(seq):
    if not seq: return []

    iterator = iter(seq)
    letter = next(iterator)
    count = 1
    words = []
    for c in iterator:
        if c == letter:
            count += 1
        else:
            word = letter * count
            words.append(word)
            letter = c
            count = 1
    words.append(letter*count)
    return words
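
Both versions above assume string input: the first joins each run with "", and the second rebuilds a run as letter * count. For an arbitrary sequence (a list of numbers, say), a possible variation is to return each run as a list instead. This is only a sketch, and the name `split_runs_generic` is made up for illustration:

```python
from itertools import groupby

def split_runs_generic(seq):
    """Split any iterable into lists of consecutive equal items."""
    # groupby yields (key, iterator-over-run) pairs; the key is not needed here.
    return [list(run) for _, run in groupby(seq)]

print(split_runs_generic([1, 1, 2, 3, 3, 3]))  # [[1, 1], [2], [3, 3, 3]]
```

Applied to a string, it returns lists of single characters, which can be joined afterwards if strings are wanted.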

Cheers,
Chris
--
http://blog.rebertia.com

From: MRAB on 10 Aug 2010 21:30

Tim Chase wrote:
> On 08/10/10 19:37, candide wrote:
>> Suppose you have a sequence s , a string for say, for instance this
>> one :
>>
>> spppammmmegggssss
>>
>> We want to split s into the following parts :
>>
>> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>
>> ie each part is a single repeated character word.
>
> While I'm not sure it's idiomatic, the overabuse of regexps in Python
> certainly seems prevalent enough to be idiomatic ;-)
>
> As such, you can use:
>
> import re
> r = re.compile(r'((.)\1*)')
> #r = re.compile(r'((\w)\1*)')

That should be \2, not \1.

Alternatively:

r = re.compile(r'(.)\1*')
#r = re.compile(r'(\w)\1*')

> s = 'spppammmmegggssss'
> results = [m.group(0) for m in r.finditer(s)]
>
> Additionally, you have all the properties of the match-object (which
> includes the start/end) available too if you need).
>
> You don't specify what you want to have happen with non-letters
> (whitespace, punctuation, etc). The above just treats them like any
> other character, finding repeats. If you just want "word" characters,
> you can use the 2nd ("\w") version, or adjust accordingly.
>

From: Tim Chase on 10 Aug 2010 22:31

On 08/10/10 20:30, MRAB wrote:
> Tim Chase wrote:
>> r = re.compile(r'((.)\1*)')
>> #r = re.compile(r'((\w)\1*)')
>
> That should be \2, not \1.
>
> Alternatively:
>
> r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination
of them into one malfunctioning regexp. My original trouble with
the 2nd one was that r.findall() (not .finditer) was only
returning the first letter of each because that's what was
matched. Wrapping it in the extra set of parens and using "\2"
returned the actual data in sub-tuples:

>>> s = 'spppammmmegggssss'
>>> import re
>>> r = re.compile(r'(.)\1*')
>>> r.findall(s) # no repeated text, just the initial letter
['s', 'p', 'a', 'm', 'e', 'g', 's']
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>> r = re.compile(r'((.)\2*)')
>>> r.findall(s)
[('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'),
('ggg', 'g'), ('ssss', 's')]
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

By then changing to .finditer() it made them both work the way I
wanted.

Thanks for catching my mistranscription.

-tkc
