From: John Nagle on 2 Aug 2010 13:34

The regular expression "split" behaves slightly differently than string split:

    >>> import re
    >>> kresplit2 = re.compile(r'[^\w\&]+', re.UNICODE)
    >>> kresplit2.split(" HELLO THERE ")
    ['', 'HELLO', 'THERE', '']
    >>> kresplit2.split("VERISIGN INC.")
    ['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but it will.

The regular string split operation doesn't yield empty strings:

    >>> " HELLO THERE ".split()
    ['HELLO', 'THERE']

If I try to get the functionality of string split with re:

    >>> s2 = " HELLO THERE "
    >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
    >>> kresplit4.split(s2)
    ['', 'HELLO', 'THERE', '']

I still get empty strings.

The documentation just describes re.split as "Split string by the occurrences of pattern", which is not too helpful.

John Nagle
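The empty strings in the post above appear whenever the pattern matches at the very start or very end of the input: the text before (or after) that match is empty, and re.split returns it as a piece. A minimal sketch of that behaviour, assuming Python 3 (where re.UNICODE is already the default for str patterns); the names `non_word` and `show_split` are hypothetical and not from the thread:

```python
import re

# Hypothetical name; the pattern is the r'\W+' used in the post.
non_word = re.compile(r'\W+')

def show_split(text):
    """Return the pieces re.split produces, including empty edge pieces."""
    return non_word.split(text)

edges = show_split(" HELLO THERE ")     # separator matches at both ends
trailing = show_split("VERISIGN INC.")  # separator matches only at the end
inner = show_split("HELLO THERE")       # separator matches only in the middle

print(edges)
print(trailing)
print(inner)
```

Only the input with no separator at either edge comes back without empty strings.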
From: Peter Otten on 2 Aug 2010 14:01

John Nagle wrote:

> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']

Note that invocation without a separator argument (or with None as the separator) is special in that respect:

    >>> " hello there ".split(" ")
    ['', 'hello', 'there', '']

Peter
From: MRAB on 2 Aug 2010 14:02

John Nagle wrote:
> The regular expression "split" behaves slightly differently than string
> split:
>
> >>> import re
> >>> kresplit2 = re.compile(r'[^\w\&]+', re.UNICODE)
>
> >>> kresplit2.split(" HELLO THERE ")
> ['', 'HELLO', 'THERE', '']
>
> >>> kresplit2.split("VERISIGN INC.")
> ['VERISIGN', 'INC', '']
>
> I'd thought that "split" would never produce an empty string, but
> it will.
>
> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']
>
Yes it does.

    >>> "   HELLO    THERE   ".split(" ")
    ['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']

> If I try to get the functionality of string split with re:
>
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.
>
> The documentation just describes re.split as "Split string by the
> occurrences of pattern", which is not too helpful.
>
It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

2. it discards leading and trailing sequences of whitespace.

Compare:

    >>> "  A  B  ".split(" ")
    ['', '', 'A', '', 'B', '', '']

with:

    >>> "  A  B  ".split()
    ['A', 'B']

It just happens that the unusual one is the most commonly used one, if you see what I mean! :-)
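The two modes of str.split described above can be put side by side. This is a small sketch rather than anything posted in the thread; the variable names are hypothetical, and the sample string assumes three leading, four inner and three trailing spaces:

```python
# Sample input: 3 leading, 4 inner and 3 trailing spaces (an assumption
# chosen to make the difference between the two modes visible).
text = "   HELLO    THERE   "

# No argument: runs of whitespace count as one separator and
# leading/trailing whitespace is discarded.
by_whitespace = text.split()

# Passing None explicitly selects the same whitespace mode.
by_none = text.split(None)

# Explicit separator: every single space is a separator, so adjacent
# spaces and spaces at the edges produce empty strings.
by_space = text.split(" ")

print(by_whitespace)
print(by_space)
```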
From: John Nagle on 2 Aug 2010 15:41

On 8/2/2010 11:02 AM, MRAB wrote:
> John Nagle wrote:
>> The regular expression "split" behaves slightly differently than
>> string split:
occurrences of pattern", which is not too helpful.
>>
> It's the plain str.split() which is unusual in that:
>
> 1. it splits on sequences of whitespace instead of one per occurrence;

That can be emulated with the obvious regular expression:

    re.compile(r'\W+')

> 2. it discards leading and trailing sequences of whitespace.

But that can't, or at least I can't figure out how to do it.

> It just happens that the unusual one is the most commonly used one, if
> you see what I mean! :-)

The no-argument form of "split" shouldn't be that much of a special case.

John Nagle
From: Thomas Jollans on 2 Aug 2010 15:52

On 08/02/2010 09:41 PM, John Nagle wrote:
> On 8/2/2010 11:02 AM, MRAB wrote:
>> John Nagle wrote:
>>> The regular expression "split" behaves slightly differently than
>>> string split:
> occurrences of pattern", which is not too helpful.
>>>
>> It's the plain str.split() which is unusual in that:
>>
>> 1. it splits on sequences of whitespace instead of one per occurrence;
>
> That can be emulated with the obvious regular expression:
>
> re.compile(r'\W+')
>
>> 2. it discards leading and trailing sequences of whitespace.
>
> But that can't, or at least I can't figure out how to do it.

    [ s for s in rexp.split(long_s) if s ]

>
>> It just happens that the unusual one is the most commonly used one, if
>> you see what I mean! :-)
>
> The no-argument form of "split" shouldn't be that much of a special
> case.
>
> John Nagle
>
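The filtering idea above is one of a few ways to get str.split()-like results out of the re module. The sketch below shows three, under some assumptions: it uses r'\s+' (whitespace) rather than the thread's r'\W+' (non-word characters), since whitespace is what the no-argument str.split() splits on; the function names are hypothetical; and the equivalence is only claimed for ordinary whitespace such as spaces, tabs and newlines.

```python
import re

# Whitespace runs, as opposed to the thread's r'\W+' (non-word characters).
whitespace = re.compile(r'\s+')

def split_filtered(text):
    """Split on whitespace runs, then drop the empty edge pieces."""
    return [piece for piece in whitespace.split(text) if piece]

def split_stripped(text):
    """Strip the edges first so no separator can match there."""
    stripped = text.strip()
    # Splitting an empty string would still yield [''], so guard it.
    return whitespace.split(stripped) if stripped else []

def split_findall(text):
    """Collect the non-whitespace runs instead of splitting on separators."""
    return re.findall(r'\S+', text)

samples = ["   HELLO    THERE   ", "VERISIGN INC.", "a\tb\nc", "   ", ""]
for sample in samples:
    print(repr(sample), split_filtered(sample))
```

The findall variant avoids the empty-string question entirely, because it never produces the text between matches.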