Regex driving me crazy... [Python]

From: Patrick Maupin on 8 Apr 2010 00:57

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...(a)REMOVE.THIS.cybersource.com.au> wrote:

BTW, I don't know how you got 'True' here.

> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
> True

You must not have s set up to be the string given by the OP. I just
realized there was an error in my non-regexp example, that actually
manifests itself with the test data:

>>> import re
>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s)
['# 1', 'Short offline', 'Completed without error', '00%']
>>> [x for x in s.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

To fix it requires something like:

[x.strip() for x in s.split('  ') if x.strip()]

or:

[x for x in [x.strip() for x in s.split('  ')] if x]

I haven't timed either one of these, but given that the broken
original one was slower than the simpler:

splitter = re.compile(' {2,}').split
splitter(s)

on strings of "normal" length, and given that nobody noticed this bug
right away (even though it was in the printout on my first message,
heh), I think that this shows that (here, let me qualify this
carefully), at least in some cases, the first regexp that comes to my
mind can be prettier, shorter, faster, less bug-prone, etc. than the
first non-regexp that comes to my mind...
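
For anyone who wants to time the two approaches, here is a minimal sketch. It assumes the sample line has two spaces after '# 1' and seven between the other columns (the spacing implied by the output above); the function names are made up for illustration, and the timings will vary by machine and Python version.

```python
import re
import timeit

# Sample line; the column spacing (2 and 7 spaces) is an assumption
# reconstructed from the outputs quoted in this thread.
s = '# 1  Short offline       Completed without error       00%'

splitter = re.compile(' {2,}').split


def split_regex(line):
    # Split on any run of two or more spaces.
    return splitter(line)


def split_plain(line):
    # Split on exactly two spaces, strip leftovers, drop empty pieces.
    return [x.strip() for x in line.split('  ') if x.strip()]


# Both approaches should agree on the sample line.
assert split_regex(s) == split_plain(s)

for fn in (split_regex, split_plain):
    elapsed = timeit.timeit(lambda: fn(s), number=10000)
    print(fn.__name__, elapsed)
```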

Regards,
Pat

From: Kushal Kumaran on 8 Apr 2010 01:16

On Thu, Apr 8, 2010 at 3:10 AM, J <dreadpiratejeff(a)gmail.com> wrote:
> Can someone make me un-crazy?
>
> I have a bit of code that right now, looks like this:
>
> status = getoutput('smartctl -l selftest /dev/sda').splitlines()[6]
> status = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":",status)
> print status
>
> Basically, it pulls the first actual line of data from the return you
> get when you use smartctl to look at a hard disk's selftest log.
>
> The raw data looks like this:
>
> # 1  Short offline       Completed without error       00%       679         -
>
> Unfortunately, all that whitespace is arbitrary single space
> characters. And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes. So
> in the example, "Completed without error"
>
> The regex I have up there doesn't quite work, as it seems to be
> subbing EVERY space (or at least in instances of more than one space)
> to a ':' like this:
>
> # 1: Short offline:::::: Completed without error:::::: 00%:::::: 679:::::::: -
>
> Ultimately, what I'm trying to do is either replace any run of more
> than one space with a delimiter, then split the result into a list and
> get the third item.
>
> OR, if there's a smarter, shorter, or better way of doing it, I'd love to know.
>
> The end result should pull the whole string in the middle of that
> output line, and then I can use that to compare to a list of possible
> output strings to determine if the test is still running, has
> completed successfully, or failed.
>

Is there any particular reason you absolutely must extract the status
message? If you already have a list of possible status messages, you
could just test which one of those is present in the line...
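
A rough sketch of that idea follows. The status list is illustrative only (phrases mentioned in this thread, not a complete or verified list of what smartctl prints), and `find_status` is a hypothetical helper name.

```python
# Illustrative list of status phrases taken from this thread;
# NOT a verified or complete list of smartctl messages.
KNOWN_STATUSES = [
    'Completed without error',
    'Aborted by user',
    'Completed: Electrical error',
    'Completed: Bad Sectors Found',
]


def find_status(line, statuses=KNOWN_STATUSES):
    """Return the first known status phrase found in line, or None."""
    for status in statuses:
        if status in line:
            return status
    return None


# Sample line; the column spacing is assumed.
line = '# 1  Short offline       Completed without error       00%       679         -'
print(find_status(line))
```

This avoids parsing the columns at all, at the cost of only recognising phrases that are already in the list.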

> Unfortunately, my google-fu fails right now, and my Regex powers were
> always rather weak anyway...
>
> So any ideas on what the best way to proceed with this would be?

--
regards,
kushal

From: Steven D'Aprano on 8 Apr 2010 03:07

On Wed, 07 Apr 2010 21:57:31 -0700, Patrick Maupin wrote:

> On Apr 7, 9:51 pm, Steven D'Aprano
> <ste...(a)REMOVE.THIS.cybersource.com.au> wrote:
>
> BTW, I don't know how you got 'True' here.
>
>> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
>> True

It was a copy and paste from the interactive interpreter. Here it is, in
a fresh session:

[steve(a)wow-wow ~]$ python
Python 2.5 (r25:51908, Nov 6 2007, 16:54:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s = '# 1 Short offline Completed without error 00%'
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>>

Now I copy-and-paste from your latest post to do it again:

>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

Weird, huh?

And here's the answer: somewhere along the line, something changed the
whitespace in the string into non-spaces:

>>> s
'# 1 \xc2\xa0Short offline \xc2\xa0 \xc2\xa0 \xc2\xa0 Completed without error \xc2\xa0 \xc2\xa0 \xc2\xa0 00%'

I blame Google. I don't know how they did it, but I'm sure it was them!
*wink*
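
If text pasted from a web page might carry non-breaking spaces like this, one way to guard against it is to normalise them before splitting. This is only a sketch, written for Python 3, where the character is '\u00a0' in a str (in the Python 2 session above it shows up as the UTF-8 bytes '\xc2\xa0'). The variable names are made up.

```python
import re

# The string as it arrived via copy-and-paste: every second "space"
# is really a non-breaking space (U+00A0).
raw = ('# 1 \u00a0Short offline \u00a0 \u00a0 \u00a0 '
       'Completed without error \u00a0 \u00a0 \u00a0 00%')

# There are no runs of two ordinary spaces, so the regex finds nothing to split on.
assert re.split(' {2,}', raw) == [raw]

# Replace non-breaking spaces with ordinary ones, then split as before.
clean = raw.replace('\u00a0', ' ')
fields = re.split(' {2,}', clean)
print(fields)
```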

By the way, let's not forget that the string could be fixed-width fields
padded with spaces, in which case the right solution almost certainly
will be:

s = '# 1  Short offline       Completed without error       00%'
result = s[25:55].rstrip()

Even in 2010, there are plenty of programs that export data using fixed
width fields.
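
Spelled out a little more, the fixed-width approach might look like the sketch below. The offsets 25 and 55 come from the slice above; whether smartctl really keeps the status column at those offsets on every line is an assumption that would need checking against real output. `STATUS_COLUMN` is a made-up name.

```python
# Column boundaries taken from the s[25:55] slice above; treating them
# as fixed for all lines is an assumption.
STATUS_COLUMN = slice(25, 55)

# Sample line; the column spacing is assumed.
line = '# 1  Short offline       Completed without error       00%       679         -'

# Cut out the column and drop the padding on the right.
status = line[STATUS_COLUMN].rstrip()
print(status)
```

Naming the slice keeps the magic numbers in one place if the layout turns out to differ.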

--
Steven

From: J on 8 Apr 2010 09:49

On Thu, Apr 8, 2010 at 01:16, Kushal Kumaran
<kushal.kumaran+python(a)gmail.com> wrote:
>
> Is there any particular reason you absolutely must extract the status
> message? If you already have a list of possible status messages, you
> could just test which one of those is present in the line...

Yes and no...

Mostly, it's for the future. Right now, this particular test script
(and I mean test script in the sense it's part of a testing framework,
not in the sense that I'm being tested on it ;-) ) is fully
automated.

Once the self-test on the HDD is complete, the script will return
either a 0 or 1 for PASS or FAIL respectively.

However, in the future, it may need to be changed so that it is handled
manually, instead of or as well as automatically. And if we end up wanting it to be
automatic, then having that phrase would be important for logging or
problem determination. We don't so much care about the rest of the
string I want to parse as the data it gives is mostly meaningless, but
having it pull things like:

Completed: Electrical error

or

Completed: Bad Sectors Found

could be as useful as

Completed without error

or

Aborted by user

So that's why I was focusing on just extracting that phrase from the
output. I could just pull the entire string and do a search for the
phrases in question, and that's probably the simplest thing to do:

re.search("Search Phrase",outputString)

but I do have a tendency to overthink things sometimes and besides
which, having just that phrase for the logs, or for use in a future
change would be cool, and this way, I've already got that much of it
done for later on.
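
Putting the pieces from this thread together, a sketch of pulling out the third column and turning it into a 0/1 result could look like this. It assumes columns are always separated by at least two spaces and that the status phrase never contains a double space; `selftest_status` and `exit_code` are hypothetical names, and treating only 'Completed without error' as a pass is an assumption.

```python
import re


def selftest_status(line):
    """Return the third column of a selftest log line, or None if missing."""
    # Columns are assumed to be separated by runs of two or more spaces.
    fields = re.split(' {2,}', line.strip())
    return fields[2] if len(fields) > 2 else None


def exit_code(status):
    """0 for PASS, 1 for FAIL; only one phrase is treated as a pass here."""
    return 0 if status == 'Completed without error' else 1


# Sample line; the column spacing is assumed.
line = '# 1  Short offline       Completed without error       00%       679         -'
status = selftest_status(line)
print(status, exit_code(status))
```

The extracted phrase can then be written to the log as-is, whatever it turns out to be.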