Prev: using the netflix api
Next: Another Regexp Question
From: John Nagle on 5 Jul 2010 18:19

I'm working on street address parsing again, and I'm trying to deal with some of the harder cases.

Here's a subparser, intended to take in things like "N MAIN" and "SOUTH", and break out the "directional" from the street name.

    from pyparsing import (CaselessKeyword, Combine, MatchFirst, OneOrMore,
                           Optional, Word, alphanums)

    directionals = ['southeast', 'northeast', 'north', 'northwest',
                    'west', 'east', 'south', 'southwest',
                    'SE', 'NE', 'N', 'NW', 'W', 'E', 'S', 'SW']

    direction = Combine(MatchFirst(map(CaselessKeyword, directionals))
                        + Optional(".").suppress())

    streetNameParser = (Optional(direction.setResultsName("predirectional"))
                        + Combine(OneOrMore(Word(alphanums)),
                                  adjacent=False,
                                  joinString=" ").setResultsName("streetname"))

This parses something like "N WEBB" fine; "N" is the "predirectional", and "WEBB" is the street name.

"SOUTH" (which, when not followed by another word, is a streetname, not a predirectional) raises a parsing exception:

    Street address line parse failed for SOUTH : Expected W:(abcd...) (at char 5), (line:1, col:6)

The problem is that "direction" matched SOUTH, and even though "direction" is within an "Optional" and followed by another word, the parser didn't back up when it hit the end of the expression without satisfying the OneOrMore clause.

Pyparsing does some backup, but I'm not clear on how much, or how to force it to happen. There's some discussion at "http://www.mail-archive.com/python-list@python.org/msg169559.html". Apparently the "Or" operator will force some backup, but it's not clear how much lookahead and backtracking is supported.

John Nagle
From: John Nagle on 6 Jul 2010 00:45

On 7/5/2010 3:19 PM, John Nagle wrote:
> I'm working on street address parsing again, and I'm trying to deal
> with some of the harder cases.

The approach below works for the cases given. The "Or" operator ("^") supports backtracking, but "Optional()" apparently does not.

    direction = Combine(MatchFirst(map(CaselessKeyword, directionals))
                        + Optional(".").suppress())

    streetNameOnly = Combine(OneOrMore(Word(alphanums)),
                             adjacent=False,
                             joinString=" ").setResultsName("streetname")

    streetNameParser = ((direction.setResultsName("predirectional") + streetNameOnly)
                        ^ streetNameOnly)

John Nagle
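For anyone who wants to try the two variants side by side, here is a minimal, self-contained sketch. It assumes pyparsing is installed and keeps the camelCase method names used in the thread. The names `optional_parser`, `or_parser` and `try_parse` are made up for this illustration and are not part of pyparsing. One way to read the difference: `^` evaluates each alternative in full from the same starting position, so a failure inside the first alternative still leaves the second one to be tried, whereas `Optional()` does not get revisited once its inner expression has matched.

```python
from pyparsing import (CaselessKeyword, Combine, MatchFirst, OneOrMore,
                       Optional, ParseException, Word, alphanums)

directionals = ['southeast', 'northeast', 'north', 'northwest',
                'west', 'east', 'south', 'southwest',
                'SE', 'NE', 'N', 'NW', 'W', 'E', 'S', 'SW']

# A directional keyword, optionally followed by a period that is dropped.
direction = Combine(MatchFirst([CaselessKeyword(d) for d in directionals])
                    + Optional(".").suppress())

# One or more words, joined with single spaces into one "streetname" token.
street_name_only = Combine(OneOrMore(Word(alphanums)), adjacent=False,
                           joinString=" ").setResultsName("streetname")

# First attempt from the thread: the directional is wrapped in Optional().
optional_parser = (Optional(direction.setResultsName("predirectional"))
                   + street_name_only)

# Second attempt from the thread: two complete alternatives joined with "^".
or_parser = ((direction.setResultsName("predirectional") + street_name_only)
             ^ street_name_only)


def try_parse(parser, text):
    """Hypothetical helper: named results as a dict, or None on failure."""
    try:
        return parser.parseString(text).asDict()
    except ParseException:
        return None


if __name__ == "__main__":
    for text in ("N WEBB", "SOUTH"):
        print(text, try_parse(optional_parser, text), try_parse(or_parser, text))
```

Under these assumptions, "N WEBB" parses the same way with both parsers, while "SOUTH" fails with the `Optional()` version and comes back as a bare street name with the `^` version, which matches what is reported above.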
From: Thomas Jollans on 6 Jul 2010 05:02

On 07/06/2010 04:21 AM, Dennis Lee Bieber wrote:
> On Mon, 05 Jul 2010 15:19:53 -0700, John Nagle <nagle(a)animats.com>
> declaimed the following in gmane.comp.python.general:
>
>> I'm working on street address parsing again, and I'm trying to deal
>> with some of the harder cases.
>>
>
> Hasn't it been suggested before, that the sanest method to parse
> addresses is from the end backwards...
>
> So that:
>
>     123 N South St.
>
> is parsed as
>
>     St. South N 123

You will of course need some trickery for that to work with

    Hauptstr. 12
From: Cousin Stanley on 7 Jul 2010 19:46

> I'm working on street address parsing again,
> and I'm trying to deal with some of the harder cases.
> ....

For yet another test case, my actual address includes ....

    ... East South Mountain Avenue

Sometimes written as ....

    ... E. South Mtn Ave

--
Stanley C. Kitching
Human Being
Phoenix, Arizona
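As a rough illustration of the "parse from the end" idea raised earlier in the thread, here is a plain-Python sketch (no pyparsing) that peels the street suffix off the end first, then the house number and a leading directional off the front, and treats whatever remains as the street name. Everything here is an assumption made for the example: the function name `parse_from_end`, the tiny `SUFFIXES` table, and the rule that a leading directional only counts when at least one more word follows it. It only targets US-style addresses; something like "Hauptstr. 12" would need different handling.

```python
# Hypothetical lookup tables for this sketch; real data would be much larger.
DIRECTIONALS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
                "north", "south", "east", "west",
                "northeast", "northwest", "southeast", "southwest"}
SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road",
            "blvd", "boulevard"}


def parse_from_end(address):
    """Split a US-style street line, starting with the suffix at the end."""
    # Drop trailing periods and commas so "St." and "E." compare cleanly.
    tokens = [t.rstrip(".,") for t in address.split()]
    parts = dict.fromkeys(("number", "predirectional", "streetname", "suffix"))

    # 1. The last token is the suffix if it is a known one.
    if tokens and tokens[-1].lower() in SUFFIXES:
        parts["suffix"] = tokens.pop()

    # 2. A leading all-digit token is the house number.
    if tokens and tokens[0].isdigit():
        parts["number"] = tokens.pop(0)

    # 3. A leading directional counts only if a street name still follows,
    #    so a lone "SOUTH" stays a street name.
    if len(tokens) > 1 and tokens[0].lower() in DIRECTIONALS:
        parts["predirectional"] = tokens.pop(0)

    # 4. Whatever is left is the street name.
    parts["streetname"] = " ".join(tokens) or None
    return parts


if __name__ == "__main__":
    for line in ("123 N South St.", "East South Mountain Avenue",
                 "E. South Mtn Ave", "SOUTH"):
        print(line, "->", parse_from_end(line))
```

Because the suffix is removed before the directional check, "South" in "123 N South St." and in "East South Mountain Avenue" ends up in the street name instead of being taken as a directional.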