From: candide on 10 Aug 2010 20:37

Suppose you have a sequence s, a string, say, for instance this one:

spppammmmegggssss

We want to split s into the following parts:

['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

i.e. each part is a single repeated character word.

What is the pythonic way to answer this question?

A naive solution would be the following:

# -------------------------------
z='spppammmmegggssss'
zz=[]
while z:
    k=1
    # grow k while the first k characters are all equal to z[0]
    while z[:k]==k*z[0]:
        k+=1
    zz+=[z[:k-1]]
    z=z[k-1:]

print(zz)
# -------------------------------

but I guess this code is not very idiomatic :(

From: Tim Chase on 10 Aug 2010 21:18

On 08/10/10 19:37, candide wrote:
> Suppose you have a sequence s, a string, say, for instance this one:
>
> spppammmmegggssss
>
> We want to split s into the following parts:
>
> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>
> i.e. each part is a single repeated character word.

While I'm not sure it's idiomatic, the overabuse of regexps in Python
certainly seems prevalent enough to be idiomatic ;-)

As such, you can use:

  import re
  r = re.compile(r'((.)\1*)')
  #r = re.compile(r'((\w)\1*)')
  s = 'spppammmmegggssss'
  results = [m.group(0) for m in r.finditer(s)]

Additionally, you have all the properties of the match-object (which
includes the start/end) available too if you need them.

You don't specify what you want to have happen with non-letters
(whitespace, punctuation, etc). The above just treats them like any
other character, finding repeats. If you just want "word" characters,
you can use the 2nd ("\w") version, or adjust accordingly.

-tkc

From: Chris Rebert on 10 Aug 2010 21:11

On Tue, Aug 10, 2010 at 5:37 PM, candide <candide(a)free.invalid> wrote:
> Suppose you have a sequence s , a string for say, for instance this one :
>
> spppammmmegggssss
>
> We want to split s into the following parts :
>
> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>
> ie each part is a single repeated character word.
>
> What is the pythonic way to answer this question?

If you're doing an operation on an iterable, always leaf thru itertools
first:
http://docs.python.org/library/itertools.html

  from itertools import groupby

  def split_into_runs(seq):
      return ["".join(run) for letter, run in groupby(seq)]

If itertools didn't exist:

  def split_into_runs(seq):
      if not seq:
          return []

      iterator = iter(seq)
      letter = next(iterator)
      count = 1
      words = []
      for c in iterator:
          if c == letter:
              count += 1
          else:
              word = letter * count
              words.append(word)
              letter = c
              count = 1
      words.append(letter*count)
      return words

Cheers,
Chris
--
http://blog.rebertia.com
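
A short usage sketch of the groupby approach above, written for Python 3. The helper `run_lengths` is hypothetical and not from the thread; it only illustrates that the same grouping can also report how long each run is. Note that the `"".join(...)` step assumes the input is a string.

```python
from itertools import groupby

def split_into_runs(seq):
    # groupby yields (key, iterator over the run) for each run of equal items
    return ["".join(run) for _, run in groupby(seq)]

def run_lengths(seq):
    # Hypothetical helper (not from the thread): (item, run length) pairs
    return [(key, sum(1 for _ in run)) for key, run in groupby(seq)]

print(split_into_runs("spppammmmegggssss"))
# ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
print(run_lengths("spppammmmegggssss"))
# [('s', 1), ('p', 3), ('a', 1), ('m', 4), ('e', 1), ('g', 3), ('s', 4)]
```

An empty input simply produces an empty list, since groupby yields no groups.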

From: MRAB on 10 Aug 2010 21:30

Tim Chase wrote:
> On 08/10/10 19:37, candide wrote:
>> Suppose you have a sequence s , a string for say, for instance this
>> one :
>>
>> spppammmmegggssss
>>
>> We want to split s into the following parts :
>>
>> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>
>> ie each part is a single repeated character word.
>
> While I'm not sure it's idiomatic, the overabuse of regexps in Python
> certainly seems prevalent enough to be idiomatic ;-)
>
> As such, you can use:
>
> import re
> r = re.compile(r'((.)\1*)')
> #r = re.compile(r'((\w)\1*)')

That should be \2, not \1.

Alternatively:

  r = re.compile(r'(.)\1*')
  #r = re.compile(r'(\w)\1*')

> s = 'spppammmmegggssss'
> results = [m.group(0) for m in r.finditer(s)]
>
> Additionally, you have all the properties of the match-object (which
> includes the start/end) available too if you need).
>
> You don't specify what you want to have happen with non-letters
> (whitespace, punctuation, etc). The above just treats them like any
> other character, finding repeats. If you just want "word" characters,
> you can use the 2nd ("\w") version, or adjust accordingly.
>
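
To illustrate the correction, here is a minimal Python 3 sketch. It assumes current `re` module behavior, where a backreference to a group that is still open is rejected at compile time; the names `compiled_ok` and `spans` are illustrative only. It also collects the start/end offsets of each match, as mentioned in the quoted post.

```python
import re

s = 'spppammmmegggssss'

# The mis-transcribed pattern refers to group 1 from inside group 1,
# which the re module refuses to compile.
try:
    re.compile(r'((.)\1*)')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Corrected single-group form: \1 refers to the one captured character.
r = re.compile(r'(.)\1*')
spans = [(m.group(0), m.start(), m.end()) for m in r.finditer(s)]

print(compiled_ok)
print(spans)
```

Each tuple in `spans` holds the run itself plus its half-open `[start, end)` position in `s`.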

From: Tim Chase on 10 Aug 2010 22:31

On 08/10/10 20:30, MRAB wrote:
> Tim Chase wrote:
>> r = re.compile(r'((.)\1*)')
>> #r = re.compile(r'((\w)\1*)')
>
> That should be \2, not \1.
>
> Alternatively:
>
> r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination of them
into one malfunctioning regexp. My original trouble with the 2nd one was
that r.findall() (not .finditer) was only returning the first letter of
each because that's what was matched. Wrapping it in the extra set of
parens and using "\2" returned the actual data in sub-tuples:

  >>> s = 'spppammmmegggssss'
  >>> import re
  >>> r = re.compile(r'(.)\1*')
  >>> r.findall(s) # no repeated text, just the initial letter
  ['s', 'p', 'a', 'm', 'e', 'g', 's']
  >>> [m.group(0) for m in r.finditer(s)]
  ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
  >>> r = re.compile(r'((.)\2*)')
  >>> r.findall(s)
  [('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'),
  ('ggg', 'g'), ('ssss', 's')]
  >>> [m.group(0) for m in r.finditer(s)]
  ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

By then changing to .finditer() it made them both work the way I wanted.

Thanks for catching my mistranscription.

-tkc
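
One caveat worth sketching, as an editorial aside rather than anything stated in the thread: by default `.` does not match a newline, so the regex approach silently skips runs of newlines unless `re.DOTALL` is passed, while groupby treats them like any other character. The sample string and variable names below are illustrative assumptions.

```python
import re
from itertools import groupby

text = 'aa\n\nbb'

# groupby treats the newline like any other character.
by_groupby = ["".join(run) for _, run in groupby(text)]

# Without DOTALL, '.' never matches '\n', so that run is dropped.
by_regex = [m.group(0) for m in re.finditer(r'(.)\1*', text)]

# With DOTALL, '.' also matches '\n' and the two approaches agree.
by_regex_dotall = [m.group(0)
                   for m in re.finditer(r'(.)\1*', text, re.DOTALL)]

print(by_groupby)        # ['aa', '\n\n', 'bb']
print(by_regex)          # ['aa', 'bb']
print(by_regex_dotall)   # ['aa', '\n\n', 'bb']
```

If the input can contain line breaks and every character should land in some run, the groupby version or the `re.DOTALL` variant is probably the safer choice.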