From: Steven D'Aprano on 7 Apr 2010 22:51

On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:

> BTW, although I find it annoying when people say "don't do that" when
> "that" is a perfectly good thing to do, and although I also find it
> annoying when people tell you what not to do without telling you what
> *to* do,

Grant did give a perfectly good solution.

> and although I find the regex solution to this problem to be
> quite clean, the equivalent non-regex solution is not terrible, so I
> will present it as well, for your viewing pleasure:
>
> >>> [x for x in '# 1 Short offline Completed without error 00%'.split(' ') if x.strip()]
> ['# 1', 'Short offline', ' Completed without error', ' 00%']

This is one of the reasons we're so often suspicious of re solutions:

>>> s = '# 1 Short offline Completed without error 00%'
>>> tre = Timer("re.split(' {2,}', s)",
... "import re; from __main__ import s")
>>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
... "from __main__ import s")
>>>
>>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
True
>>>
>>>
>>> min(tre.repeat(repeat=5))
6.1224789619445801
>>> min(tsplit.repeat(repeat=5))
1.8338048458099365

Even when they are correct and not unreadable line-noise, regexes tend to
be slow. And they get worse as the size of the input increases:

>>> s *= 1000
>>> min(tre.repeat(repeat=5, number=1000))
2.3496899604797363
>>> min(tsplit.repeat(repeat=5, number=1000))
0.41538596153259277
>>>
>>> s *= 10
>>> min(tre.repeat(repeat=5, number=1000))
23.739185094833374
>>> min(tsplit.repeat(repeat=5, number=1000))
4.6444299221038818

And this isn't even one of the pathological O(N**2) or O(2**N) regexes.

Don't get me wrong -- regexes are a useful tool. But if your first
instinct is to write a regex, you're doing it wrong.

[quote]
A related problem is Perl's over-reliance on regular expressions
that is exaggerated by advocating regex-based solution in almost
all O'Reilly books. The latter until recently were the most
authoritative source of published information about Perl.

While simple regular expression is a beautiful thing and can
simplify operations with string considerably, overcomplexity in
regular expressions is extremly dangerous: it cannot serve a basis
for serious, professional programming, it is fraught with pitfalls,
a big semantic mess as a result of outgrowing its primary purpose.
Diagnostic for errors in regular expressions is even weaker then
for the language itself and here many things are just go unnoticed.
[end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml

Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html

--
Steven
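
For readers who want to rerun the comparison above, here is a minimal standalone sketch. It is not Steven's exact session: the archived copy lost the original runs of spaces, so the sample string `s` below is rebuilt with assumed gap widths, and the helper names `split_re`, `split_str` and `benchmark` are made up for illustration. Timings depend on the machine and Python version, so none are shown.

```python
import re
from timeit import Timer

# Assumed sample line: gap widths (2, 4 and 4 spaces) are illustrative only,
# because the archived post collapsed the original spacing.
s = ('# 1' + ' ' * 2 + 'Short offline' + ' ' * 4
     + 'Completed without error' + ' ' * 4 + '00%')


def split_re(text):
    """Split on runs of two or more spaces using the re module."""
    return re.split(' {2,}', text)


def split_str(text):
    """Split on double spaces and discard empty or blank fragments."""
    return [x for x in text.split(' ' * 2) if x.strip()]


def benchmark(number=1000, repeat=5):
    """Return the best (minimum) time in seconds for each approach."""
    results = {}
    for func in (split_re, split_str):
        # Bind func as a default argument so each lambda keeps its own target.
        timer = Timer(lambda f=func: f(s))
        results[func.__name__] = min(timer.repeat(repeat=repeat, number=number))
    return results


# Check that both approaches agree on this input before timing them.
assert split_re(s) == split_str(s)
```

Calling `benchmark()` returns the best time for each approach; `min()` over the repeats is used because it is the figure least disturbed by other activity on the machine. Note that the two functions only agree when every gap has an even number of spaces: with an odd gap, the `str.split` version leaves a leading space on the next field, which is what Pat's quoted output shows.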
From: J on 7 Apr 2010 23:01

On Wed, Apr 7, 2010 at 22:45, Patrick Maupin <pmaupin(a)gmail.com> wrote:

> When I saw "And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes" I
> assumed that, not only could that string change, but so could the one
> before it.
>
> I guess my base assumption that anything with words in it could
> change. I was looking at the OP's attempt at a solution, and he
> obviously felt he needed to see two or more spaces as an item
> delimiter.

I apologize for the confusion, Pat...

I could have worded that better, but at that point I was A: Frustrated,
B: starving, and C: had my wife nagging me to stop working to come get
something to eat ;-)

What I meant was, in that output string, the phrase in the middle could
change in length...

After looking at the source code for smartctl (part of the smartmontools
package for you linux people) I found the switch that creates those
status messages.... they vary in character length, some with non-text
characters like ( and ) and /, and have either 3 or 4 words...

The spaces between each column, instead of being a fixed number of
spaces each, were seemingly arbitrarily created... there may be 4 spaces
between two columns or there may be 9, or 7 or who knows what, and since
they were all being treated as individual spaces instead of tabs or
something, I was having trouble splitting the output into something that
was easy to parse (at least in my mind it seemed that way).

Anyway, that's that... and I do apologize if my original post was
confusing at all...

Cheers
Jeff
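
A rough sketch of the splitting problem Jeff describes, assuming columns are always separated by at least two spaces and that no field contains two consecutive spaces. The sample lines and the names `COLUMN_GAP` and `status_column` are hypothetical and are not copied from real smartctl output.

```python
import re

# Two or more spaces separate columns; single spaces stay inside a field.
COLUMN_GAP = re.compile(r' {2,}')


def status_column(line):
    """Return the third column (the test status) of one output line."""
    fields = COLUMN_GAP.split(line.strip())
    return fields[2]


# Hypothetical sample lines with uneven gaps (2, 7, 9 and 2, 4, 3 spaces).
lines = [
    '# 1' + ' ' * 2 + 'Short offline' + ' ' * 7
    + 'Completed without error' + ' ' * 9 + '00%',
    '# 2' + ' ' * 2 + 'Extended offline' + ' ' * 4
    + 'Interrupted (host reset)' + ' ' * 3 + '90%',
]

for line in lines:
    print(status_column(line))
```

Because the pattern only treats runs of two or more spaces as a delimiter, the status text can have three or four words and contain characters such as `(`, `)` or `/` without affecting which column it lands in.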
From: Patrick Maupin on 7 Apr 2010 23:04

On Apr 7, 9:51 pm, Steven D'Aprano <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
> On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:
> > BTW, although I find it annoying when people say "don't do that" when
> > "that" is a perfectly good thing to do, and although I also find it
> > annoying when people tell you what not to do without telling you what
> > *to* do,
>
> Grant did give a perfectly good solution.

Yeah, I noticed later and apologized for that. What he gave will work
perfectly if the only data that changes the number of words is the data
the OP is looking for. This may or may not be true. I don't know
anything about the program generating the data, but I did notice that
the OP's attempt at an answer indicated that the OP felt (rightly or
wrongly) he needed to split on two or more spaces.

> > and although I find the regex solution to this problem to be
> > quite clean, the equivalent non-regex solution is not terrible, so I
> > will present it as well, for your viewing pleasure:
> >
> > >>> [x for x in '# 1 Short offline Completed without error 00%'.split(' ') if x.strip()]
> > ['# 1', 'Short offline', ' Completed without error', ' 00%']
>
> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1 Short offline Completed without error 00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ... "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
> ... "from __main__ import s")
>
> >>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
> True
>
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365
>
> Even when they are correct and not unreadable line-noise, regexes tend to
> be slow. And they get worse as the size of the input increases:
>
> >>> s *= 1000
> >>> min(tre.repeat(repeat=5, number=1000))
> 2.3496899604797363
> >>> min(tsplit.repeat(repeat=5, number=1000))
> 0.41538596153259277
>
> >>> s *= 10
> >>> min(tre.repeat(repeat=5, number=1000))
> 23.739185094833374
> >>> min(tsplit.repeat(repeat=5, number=1000))
> 4.6444299221038818
>
> And this isn't even one of the pathological O(N**2) or O(2**N) regexes.
>
> Don't get me wrong -- regexes are a useful tool. But if your first
> instinct is to write a regex, you're doing it wrong.
>
> [quote]
> A related problem is Perl's over-reliance on regular expressions
> that is exaggerated by advocating regex-based solution in almost
> all O'Reilly books. The latter until recently were the most
> authoritative source of published information about Perl.
>
> While simple regular expression is a beautiful thing and can
> simplify operations with string considerably, overcomplexity in
> regular expressions is extremly dangerous: it cannot serve a basis
> for serious, professional programming, it is fraught with pitfalls,
> a big semantic mess as a result of outgrowing its primary purpose..
> Diagnostic for errors in regular expressions is even weaker then
> for the language itself and here many things are just go unnoticed.
> [end quote]
>
> http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml
>
> Even Larry Wall has criticised Perl's regex culture:
>
> http://dev.perl.org/perl6/doc/design/apo/A05.html

Bravo!!! Good data, quotes, references, all good stuff!

I absolutely agree that regex shouldn't always be the first thing you
reach for, but I was reading way too much unsubstantiated "this is bad.
Don't do it." on the subject recently. In particular, when people say
"Don't use regex. Use PyParsing!" It may be good advice in the right
context, but it's a bit disingenuous not to mention that PyParsing will
use regex under the covers...

Regards,
Pat
From: Grant Edwards on 7 Apr 2010 23:10

On 2010-04-08, Patrick Maupin <pmaupin(a)gmail.com> wrote:

> Sorry, my eyes completely missed your one-liner, so my criticism about
> not posting a solution was unwarranted. I don't think you and I read
> the problem the same way (which is probably why I didn't notice your
> solution -- because it wasn't solving the problem I thought I saw).

No worries.

> When I saw "And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes" I
> assumed that, not only could that string change, but so could the one
> before it.

If that's the case, my solution won't work right.

> I guess my base assumption that anything with words in it could
> change. I was looking at the OP's attempt at a solution, and he
> obviously felt he needed to see two or more spaces as an item
> delimiter.

If the requirement is indeed two or more spaces as a delimiter with
spaces allowed in any field, then a regular expression split is probably
the best solution.

--
Grant
From: Patrick Maupin on 7 Apr 2010 23:26

On Apr 7, 9:51 pm, Steven D'Aprano <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:

> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1 Short offline Completed without error 00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ... "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
> ... "from __main__ import s")
>
> >>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
> True
>
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365

I will confess that, in my zeal to defend re, I gave a simple one-liner,
rather than the more optimized version:

>>> from timeit import Timer
>>> s = '# 1 Short offline Completed without error 00%'
>>> tre = Timer("splitter(s)",
... "import re; from __main__ import s; splitter = re.compile(' {2,}').split")
>>> tsplit = Timer("[x for x in s.split(' ') if x.strip()]",
... "from __main__ import s")
>>> min(tre.repeat(repeat=5))
1.893190860748291
>>> min(tsplit.repeat(repeat=5))
2.0661051273345947

You're right that if you have an 800K byte string, re doesn't perform as
well as split, but the delta is only a few percent.

>>> s *= 10000
>>> min(tre.repeat(repeat=5, number=1000))
15.331652164459229
>>> min(tsplit.repeat(repeat=5, number=1000))
14.596404075622559

Regards,
Pat
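
The optimization Pat describes is to compile the pattern once and reuse its bound `split` method instead of calling `re.split` each time. A minimal sketch follows, with an assumed sample string (the archived one lost its original spacing) and a hypothetical helper `best_time`; no timings are shown because they vary by machine.

```python
import re
from timeit import Timer

# Compile once; `splitter` is the compiled pattern's bound split method.
splitter = re.compile(' {2,}').split

# Assumed sample line with gaps of 2, 4 and 6 spaces (illustrative only).
s = ('# 1' + ' ' * 2 + 'Short offline' + ' ' * 4
     + 'Completed without error' + ' ' * 6 + '00%')


def best_time(func, number=10000, repeat=5):
    """Return the fastest of `repeat` runs of `number` calls to func(s)."""
    return min(Timer(lambda: func(s)).repeat(repeat=repeat, number=number))


# Both spellings produce the same list of fields.
assert splitter(s) == re.split(' {2,}', s)
```

`best_time(splitter)` can then be compared against `best_time(lambda text: re.split(' {2,}', text))`. The module-level function has to look the pattern up in the `re` module's internal cache on every call, which is the overhead the precompiled version avoids.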