Behavior of re.split on empty strings is unexpected [Python]

From: John Nagle on 2 Aug 2010 13:34

The regular expression "split" behaves slightly differently than string
split:

>>> import re
>>> kresplit2 = re.compile(r'[^\w\&]+',re.UNICODE)

>>> kresplit2.split(" HELLO THERE ")
['', 'HELLO', 'THERE', '']

>>> kresplit2.split("VERISIGN INC.")
['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but
it will.

The regular string split operation doesn't yield empty strings:

>>> " HELLO THERE ".split()
['HELLO', 'THERE']

If I try to get the functionality of string split with re:

>>> s2 = " HELLO THERE "
>>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']

I still get empty strings.

The documentation just describes re.split as "Split string by the
occurrences of pattern", which is not too helpful.

John Nagle
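
[Editor's sketch] The empty strings come from where the pattern matches: re.split returns the text between matches, so a match at the very start or end of the string leaves an empty piece on that side. A minimal illustration, assuming Python 3 (where str patterns are Unicode-aware by default, so re.UNICODE is not needed); the variable names are illustrative only:

```python
import re

pattern = re.compile(r'\W+')
text = " HELLO THERE "

# Where the separator pattern matches, as (start, end) offsets.
matches = [m.span() for m in pattern.finditer(text)]
print(matches)   # [(0, 1), (6, 7), (12, 13)]

# re.split returns the text *between* those matches. The first match
# starts at offset 0 and the last one ends at len(text), so the pieces
# before the first and after the last match are both empty strings.
parts = pattern.split(text)
print(parts)     # ['', 'HELLO', 'THERE', '']
```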

From: Peter Otten on 2 Aug 2010 14:01

John Nagle wrote:

> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']

Note that invocation without separator argument (or None as the separator)
is special in that respect:

>>> " hello there ".split(" ")
['', 'hello', 'there', '']

Peter

From: MRAB on 2 Aug 2010 14:02

John Nagle wrote:
> The regular expression "split" behaves slightly differently than string
> split:
>
> >>> import re
> >>> kresplit2 = re.compile(r'[^\w\&]+',re.UNICODE)
>
> >>> kresplit2.split(" HELLO THERE ")
> ['', 'HELLO', 'THERE', '']
>
> >>> kresplit2.split("VERISIGN INC.")
> ['VERISIGN', 'INC', '']
>
> I'd thought that "split" would never produce an empty string, but
> it will.
>
> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']
>
Yes it does.

>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']

> If I try to get the functionality of string split with re:
>
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.
>
> The documentation just describes re.split as "Split string by the
> occurrences of pattern", which is not too helpful.
>
It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

2. it discards leading and trailing sequences of whitespace.

Compare:

>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']

with:

>>> "  A  B  ".split()
['A', 'B']

It just happens that the unusual one is the most commonly used one, if
you see what I mean! :-)
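
[Editor's sketch] Both special rules of the no-argument str.split() can be reproduced with re by collecting the runs of non-whitespace instead of splitting on the whitespace. This is only one possible approach, assuming Python 3; split_like_str is a hypothetical helper name, not something from the thread or the standard library:

```python
import re

def split_like_str(text):
    """Return the whitespace-separated words of text.

    Collecting runs of non-whitespace (\\S+) treats a run of spaces as
    one separator and never produces an empty leading or trailing item.
    """
    return re.findall(r'\S+', text)

print(split_like_str("  A  B  "))   # ['A', 'B']
print(split_like_str(""))           # []
```

For ordinary ASCII whitespace this gives the same result as text.split().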

From: John Nagle on 2 Aug 2010 15:41

On 8/2/2010 11:02 AM, MRAB wrote:
> John Nagle wrote:
>> The regular expression "split" behaves slightly differently than
>> string split:
>> [...]
>> The documentation just describes re.split as "Split string by the
>> occurrences of pattern", which is not too helpful.
>>
> It's the plain str.split() which is unusual in that:
>
> 1. it splits on sequences of whitespace instead of one per occurrence;

That can be emulated with the obvious regular expression:

re.compile(r'\W+')
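
[Editor's sketch] One caveat on the pattern: \W+ matches runs of non-word characters, which includes punctuation, while str.split() with no argument separates on whitespace only, so \s+ is the closer equivalent. A small comparison on the string from the first post, assuming Python 3:

```python
import re

text = "VERISIGN INC."

by_nonword = re.split(r'\W+', text)   # the '.' is treated as a separator
by_space = re.split(r'\s+', text)     # only whitespace separates
plain = text.split()                  # no-argument str.split()

print(by_nonword)   # ['VERISIGN', 'INC', '']
print(by_space)     # ['VERISIGN', 'INC.']
print(plain)        # ['VERISIGN', 'INC.']
```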

> 2. it discards leading and trailing sequences of whitespace.

But that can't, or at least I can't figure out how to do it.

> It just happens that the unusual one is the most commonly used one, if
> you see what I mean! :-)

The no-argument form of "split" shouldn't be that much of a special
case.

John Nagle

From: Thomas Jollans on 2 Aug 2010 15:52

On 08/02/2010 09:41 PM, John Nagle wrote:
> On 8/2/2010 11:02 AM, MRAB wrote:
>> John Nagle wrote:
>>> The regular expression "split" behaves slightly differently than
>>> string split:
>>> [...]
>>> The documentation just describes re.split as "Split string by the
>>> occurrences of pattern", which is not too helpful.
>>>
>> It's the plain str.split() which is unusual in that:
>>
>> 1. it splits on sequences of whitespace instead of one per occurrence;
>
> That can be emulated with the obvious regular expression:
>
> re.compile(r'\W+')
>
>> 2. it discards leading and trailing sequences of whitespace.
>
> But that can't, or at least I can't figure out how to do it.

[s for s in rexp.split(long_s) if s]
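
[Editor's sketch] As a complete, runnable example, the filter is a list comprehension that keeps only the non-empty pieces. The names rexp and long_s are taken from the line above; the pattern and sample string are assumed from earlier in the thread:

```python
import re

rexp = re.compile(r'\W+')
long_s = " HELLO THERE "

# Empty strings are falsy, so "if s" drops the empty first and last
# pieces that re.split produces for leading/trailing separators.
words = [s for s in rexp.split(long_s) if s]
print(words)   # ['HELLO', 'THERE']
```

This removes every empty piece, not just the first and last, which is harmless here because a + pattern cannot produce empty pieces in the middle.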

>
>> It just happens that the unusual one is the most commonly used one, if
>> you see what I mean! :-)
>
> The no-argument form of "split" shouldn't be that much of a special
> case.
>
> John Nagle
>
