From: Patrick Maupin on 8 Apr 2010 00:57

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...(a)REMOVE.THIS.cybersource.com.au> wrote:

BTW, I don't know how you got 'True' here.

> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
> True

You must not have s set up to be the string given by the OP. I just
realized there was an error in my non-regexp example, that actually
manifests itself with the test data:

>>> import re
>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s)
['# 1', 'Short offline', 'Completed without error', '00%']
>>> [x for x in s.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

To fix it requires something like:

[x.strip() for x in s.split('  ') if x.strip()]

or:

[x for x in [x.strip() for x in s.split('  ')] if x]

I haven't timed either one of these, but given that the broken
original one was slower than the simpler:

splitter = re.compile(' {2,}').split
splitter(s)

on strings of "normal" length, and given that nobody noticed this bug
right away (even though it was in the printout on my first message,
heh), I think that this shows that (here, let me qualify this
carefully), at least in some cases, the first regexp that comes to my
mind can be prettier, shorter, faster, less bug-prone, etc. than the
first non-regexp that comes to my mind...

Regards,
Pat

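For reference, a minimal Python 3 sketch of the comparison discussed in this post. It assumes the sample line has two spaces after "# 1" and seven spaces between the remaining columns; that layout is reconstructed from the colon-substituted output quoted elsewhere in the thread, not taken from real smartctl output. The names by_regex, by_split and broken are made up for illustration.

```python
import re

# Assumed sample line: 2 spaces after "# 1", 7 spaces between the other columns.
s = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
     "Completed without error" + " " * 7 + "00%")

# Regex approach: split on any run of two or more spaces.
by_regex = re.split(r" {2,}", s)

# Non-regex approach: split on exactly two spaces, then strip each piece,
# because an odd-length run of spaces leaves one stray space on the next field.
by_split = [x.strip() for x in s.split("  ") if x.strip()]

# The unfixed version filters out empty pieces but keeps the stray space.
broken = [x for x in s.split("  ") if x.strip()]

print(by_regex)
print(by_split == by_regex)
print(broken == by_regex)
```

The stray leading space only shows up when a run of spaces has odd length, which is why the bug is easy to miss on other test data.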
From: Kushal Kumaran on 8 Apr 2010 01:16

On Thu, Apr 8, 2010 at 3:10 AM, J <dreadpiratejeff(a)gmail.com> wrote:
> Can someone make me un-crazy?
>
> I have a bit of code that right now, looks like this:
>
> status = getoutput('smartctl -l selftest /dev/sda').splitlines()[6]
> status = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":",status)
> print status
>
> Basically, it pulls the first actual line of data from the return you
> get when you use smartctl to look at a hard disk's selftest log.
>
> The raw data looks like this:
>
> # 1  Short offline       Completed without error       00%       679         -
>
> Unfortunately, all that whitespace is arbitrary single space
> characters.  And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes.  So
> in the example, "Completed without error"
>
> The regex I have up there doesn't quite work, as it seems to be
> subbing EVERY space (or at least in instances of more than one space)
> to a ':' like this:
>
> # 1: Short offline:::::: Completed without error:::::: 00%:::::: 679:::::::: -
>
> Ultimately, what I'm trying to do is either replace any space that
> is > one space with a delimiter, then split the result into a list
> and get the third item.
>
> OR, if there's a smarter, shorter, or better way of doing it, I'd love
> to know.
>
> The end result should pull the whole string in the middle of that
> output line, and then I can use that to compare to a list of possible
> output strings to determine if the test is still running, has
> completed successfully, or failed.
>

Is there any particular reason you absolutely must extract the status
message?  If you already have a list of possible status messages, you
could just test which one of those is present in the line...

> Unfortunately, my google-fu fails right now, and my Regex powers were
> always rather weak anyway...
>
> So any ideas on what the best way to proceed with this would be?

--
regards,
kushal

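The suggestion in this post (test which known status message is present, rather than parsing columns) might look roughly like the Python 3 sketch below. The phrase list uses only the example phrases mentioned in this thread and is an assumption, not a verified list of smartctl messages; find_status and KNOWN_STATUSES are hypothetical names.

```python
# Assumed phrase list, taken from examples in this thread; the real set of
# messages depends on the smartctl version and the drive.
KNOWN_STATUSES = [
    "Completed without error",
    "Completed: Electrical error",
    "Completed: Bad Sectors Found",
    "Aborted by user",
]

def find_status(line, statuses=KNOWN_STATUSES):
    """Return the first known status phrase contained in line, or None."""
    for status in statuses:
        if status in line:
            return status
    return None

# Sample line with an assumed column layout (2 and 7+ spaces between fields).
line = "# 1  Short offline       Completed without error       00%       679         -"
print(find_status(line))
```

A plain substring test does not care how many spaces separate the columns, but it returns None for any phrase that is missing from the list.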
From: Steven D'Aprano on 8 Apr 2010 03:07

On Wed, 07 Apr 2010 21:57:31 -0700, Patrick Maupin wrote:

> On Apr 7, 9:51 pm, Steven D'Aprano
> <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
>
> BTW, I don't know how you got 'True' here.
>
>> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
>> True

It was a copy and paste from the interactive interpreter. Here it is,
in a fresh session:

[steve(a)wow-wow ~]$ python
Python 2.5 (r25:51908, Nov 6 2007, 16:54:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>>

Now I copy-and-paste from your latest post to do it again:

>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

Weird, huh? And here's the answer: somewhere along the line, something
changed the whitespace in the string into non-spaces:

>>> s
'# 1 \xc2\xa0Short offline \xc2\xa0 \xc2\xa0 \xc2\xa0 Completed without error \xc2\xa0 \xc2\xa0 \xc2\xa0 00%'

I blame Google. I don't know how they did it, but I'm sure it was
them! *wink*

By the way, let's not forget that the string could be fixed-width
fields padded with spaces, in which case the right solution almost
certainly will be:

s = '# 1  Short offline       Completed without error       00%'
result = s[25:55].rstrip()

Even in 2010, there are plenty of programs that export data using
fixed width fields.

--
Steven

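A minimal Python 3 sketch of the two points in this post: non-breaking spaces defeating a split on ordinary spaces, and the fixed-width alternative. The session above used Python 2 byte strings, where the non-breaking space appears as the UTF-8 bytes \xc2\xa0; here it is the single character U+00A0. normalize_spaces and STATUS_SLICE are hypothetical names, and the 25:55 column range is taken from the slice shown in the post.

```python
import re

# Assumed layout: the status column occupies characters 25..54, as in s[25:55].
STATUS_SLICE = slice(25, 55)

def normalize_spaces(text):
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return text.replace("\u00a0", " ")

# The pasted string, with ordinary spaces and NBSPs alternating as in the repr.
pasted = ("# 1 \u00a0Short offline \u00a0 \u00a0 \u00a0 "
          "Completed without error \u00a0 \u00a0 \u00a0 00%")

# With NBSPs present there is no run of two ordinary spaces, so nothing splits.
print(re.split(r" {2,}", pasted) == [pasted])

clean = normalize_spaces(pasted)

# After normalization both approaches recover the status column.
print(re.split(r" {2,}", clean))
print(clean[STATUS_SLICE].rstrip())
```

Slicing by position only works if the producer really emits fixed-width fields; if the column offsets vary between smartctl versions, the split-based approach is the safer choice.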
From: J on 8 Apr 2010 09:49

On Thu, Apr 8, 2010 at 01:16, Kushal Kumaran
<kushal.kumaran+python(a)gmail.com> wrote:
>
> Is there any particular reason you absolutely must extract the status
> message?  If you already have a list of possible status messages, you
> could just test which one of those is present in the line...

Yes and no...

Mostly, it's for the future. Right now, this particular test script
(and I mean test script in the sense it's part of a testing framework,
not in the sense that I'm being tested on it ;-) ) is fully automated.

Once the self-test on the HDD is complete, the script will return
either a 0 or 1 for PASS or FAIL respectively.

However, in the future, it may need to be changed to, or also be,
handled manually instead of automatically. And if we end up wanting it
to be automatic, then having that phrase would be important for
logging or problem determination. We don't so much care about the rest
of the string I want to parse as the data it gives is mostly
meaningless, but having it pull things like:

Completed: Electrical error

or

Completed: Bad Sectors Found

could be as useful as

Completed without error

or

Aborted by user

So that's why I was focusing on just extracting that phrase from the
output. I could just pull the entire string and do a search for the
phrases in question, and that's probably the simplest thing to do:

re.search("Search Phrase",outputString)

but I do have a tendency to overthink things sometimes and besides
which, having just that phrase for the logs, or for use in a future
change would be cool, and this way, I've already got that much of it
done for later on.
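
One way the extraction and the 0/1 result described in this post could fit together, as a Python 3 sketch under stated assumptions: columns are separated by runs of two or more ordinary spaces, the status is the third column, and only "Completed without error" counts as a pass. selftest_status, exit_code and PASS_PHRASES are hypothetical names, and a test that is still running is not handled separately here.

```python
import re

# Assumption: only this phrase counts as PASS; everything else is FAIL.
PASS_PHRASES = {"Completed without error"}

def selftest_status(line):
    """Return the third column of a self-test log line, or None if absent."""
    fields = re.split(r" {2,}", line.strip())
    return fields[2] if len(fields) > 2 else None

def exit_code(status):
    """Return 0 for PASS and 1 for FAIL, following the convention above."""
    return 0 if status in PASS_PHRASES else 1

# Sample line with an assumed column layout.
line = "# 1  Short offline       Completed without error       00%       679         -"
status = selftest_status(line)
print(status, exit_code(status))
```

Keeping the extracted phrase as its own value means it can be written to a log as-is, while the pass/fail decision stays in one small function that is easy to change later.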