From: Tim Roberts on 20 Apr 2010 02:53

John Nagle <nagle(a)animats.com> wrote:
>
> Unfortunately, now it won't run with the released
>version of "pyparsing" (1.5.2, from April 2009), because it uses
>"originalTextFor", a feature introduced since then. I worked around that,
>but discovered that the new version is case-sensitive. Changed
>"Keyword" to "CaselessKeyword" where appropriate.
>
> I put in the full list of USPS street types, and discovered
>that "1500 DEER CREEK LANE" still parses with a street name
>of "DEER", and a street type of "CREEK", because "CREEK" is a
>USPS street type. Need to do something to pick up the last street
>type, not the first. I'm not sure how to do that with pyparsing.
>Maybe if I buy the book...
>
> There's still a problem with: "2081 N Webb Rd", where the street name
>comes out as "N WEBB".
>Addresses like "1234 5th St. S." yield a street name of "5 TH",
>but if the directional is before the name, it ends up with the name.
>
> Getting closer, though. If I can get to 95% of common cases, I'll
>be happy.

This is a very tricky problem. Consider Salem, Oregon, which puts the
direction after the street:

    3340 Astoria Way NE
    Salem, OR 97303

Consider northern Los Angeles County, which uses directions both before and
after. I used to live at:

    44720 N 2nd St E
    Lancaster, CA 93534

Consider much of Utah, which is both easy (because of its very neat grid)
and a pain, because of addresses like:

    389 W 1700 S
    Salt Lake City, UT 84115

--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.
From: John Yeung on 20 Apr 2010 03:24

My response is similar to John Roth's. It's mainly just sympathy. ;)

I deal with addresses a lot, and I know that a really good parser is
both rare/expensive to find and difficult to write yourself. We have
commercial, USPS-certified products where I work, and even with those
I've written a good deal of pre-processing and post-processing code,
consisting almost entirely of very silly-looking fixes for special
cases.

I don't have any experience whatsoever with pyparsing, but I will say
I agree that you should try to get the street type from the end of the
line. Just be aware that it can be valid to leave off the street type
completely. And of course it's a plus if you can handle suites that
are on the same line as the street (which is where the USPS prefers
them to be).

I would take the approach which John R. seems to be suggesting, which
is to tokenize and then write a whole bunch of very hairy,
special-case-laden logic. ;) I'm almost positive this is what all the
commercial packages are doing, and I have a tough time imagining what
else you could do. Addresses inherently have a high degree of
irregularity.

Good luck!

John Y.
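The "street type from the end of the line" idea above can be sketched in plain Python, without pyparsing. This is only an illustration: `STREET_TYPES` is a tiny hypothetical subset of the USPS list, and `split_on_last_street_type` is a made-up helper name, not anything from the thread or from pyparsing. It tokenizes the line and scans from the right, so "DEER CREEK LANE" yields LANE rather than CREEK, and it allows for a missing street type.

```python
# Illustrative subset only; the real USPS list of street types is much longer.
STREET_TYPES = {"ST", "RD", "AVE", "LANE", "LN", "WAY", "CREEK", "DR", "BLVD"}


def split_on_last_street_type(line):
    """Return (tokens_before_type, street_type, tokens_after_type).

    street_type is None when no known type is found, which can be valid.
    """
    tokens = line.upper().replace(".", "").split()
    # Scan right to left, skipping token 0 (assumed to be the street number).
    for i in range(len(tokens) - 1, 0, -1):
        if tokens[i] in STREET_TYPES:
            return tokens[:i], tokens[i], tokens[i + 1:]
    return tokens, None, []


print(split_on_last_street_type("1500 DEER CREEK LANE"))
print(split_on_last_street_type("3340 Astoria Way NE"))
print(split_on_last_street_type("389 W 1700 S"))
```

Whatever follows the street type (a directional, or a suite on the same line) is returned separately, so the special-case logic can deal with it afterwards.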
From: Iain King on 20 Apr 2010 05:23

On Apr 20, 8:24 am, John Yeung <gallium.arsen...(a)gmail.com> wrote:
> My response is similar to John Roth's. It's mainly just sympathy. ;)
>
> I deal with addresses a lot, and I know that a really good parser is
> both rare/expensive to find and difficult to write yourself. We have
> commercial, USPS-certified products where I work, and even with those
> I've written a good deal of pre-processing and post-processing code,
> consisting almost entirely of very silly-looking fixes for special
> cases.
>
> I don't have any experience whatsoever with pyparsing, but I will say
> I agree that you should try to get the street type from the end of the
> line. Just be aware that it can be valid to leave off the street type
> completely. And of course it's a plus if you can handle suites that
> are on the same line as the street (which is where the USPS prefers
> them to be).
>
> I would take the approach which John R. seems to be suggesting, which
> is to tokenize and then write a whole bunch of very hairy,
> special-case-laden logic. ;) I'm almost positive this is what all the
> commercial packages are doing, and I have a tough time imagining what
> else you could do. Addresses inherently have a high degree of
> irregularity.
>
> Good luck!
>
> John Y.

Not sure on the volume of addresses you're working with, but as an
alternative you could try grabbing the zip code, looking up all
addresses in that zip code, and then finding whatever one of those
address strings most closely resembles your address string (smallest
Levenshtein distance?).

Iain
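A minimal sketch of the closest-match idea above, assuming the list of known addresses for the zip code has already been fetched from somewhere (that lookup is not shown, and the sample candidates are invented). `levenshtein` and `closest_address` are illustrative names; the distance function is the standard dynamic-programming edit distance, written out so the example has no dependencies.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_address(query, candidates):
    """Return the candidate nearest to query by edit distance, or None."""
    query = query.upper()
    return min(candidates,
               key=lambda c: levenshtein(query, c.upper()),
               default=None)


# Hypothetical candidates standing in for "all addresses in that zip code".
known = ["1500 DEER CREEK LANE", "1500 DEER PARK RD", "15 CREEK LN"]
print(closest_address("1500 Deer Creek Ln", known))
```

Comparing against every address in a zip code costs one distance computation per candidate, so this is more plausible for small batches than for very large volumes.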
From: Grant Edwards on 20 Apr 2010 09:41

On 2010-04-20, Tim Roberts <timr(a)probo.com> wrote:
> This is a very tricky problem. Consider Salem, Oregon, which puts the
> direction after the street:
>
> 3340 Astoria Way NE
> Salem, OR 97303

In Minneapolis, the direction comes before the street in some
quadrants and after it in others. I used to live on W 43rd Street.
Now I live on 24th Ave NE.

And just to be more inconsistent, only the "NE" section uses two
directions, everywhere else it's just W, S, N, or E.

--
Grant Edwards               grant.b.edwards at gmail.com
Yow! Is it NOUVELLE CUISINE when 3 olives are struggling with a
scallop in a plate of SAUCE MORNAY?
From: John Nagle on 20 Apr 2010 13:16

Iain King wrote:
> Not sure on the volume of addresses you're working with, but as an
> alternative you could try grabbing the zip code, looking up all
> addresses in that zip code, and then finding whatever one of those
> address strings most closely resembles your address string (smallest
> Levenshtein distance?).

The parser doesn't have to be perfect, but it should reliably report
when it fails. Then I can run the hard cases through one of the
commercial online address standardizers. I'd like to be able to
knock off the easy cases cheaply.

What I want to do is to first extract the street number and
undecorated street name only, match that to a large database of US
businesses stored in MySQL, and then find the best match from the
database hits. So I need reliable extraction of undecorated street
name and number. The other fields are less important.

John Nagle
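One way the "street number plus undecorated street name, with explicit failure" requirement above might look, as a rough sketch rather than the poster's actual parser. `STREET_TYPES` and `DIRECTIONALS` are small hypothetical subsets, and `extract_number_and_name` is an invented name. The function returns `None` whenever it does not understand a line, so those cases can be sent on to another standardizer.

```python
# Illustrative subsets; real USPS street-type and directional lists are longer.
STREET_TYPES = {"ST", "RD", "AVE", "LANE", "LN", "WAY", "CREEK", "DR", "BLVD"}
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def extract_number_and_name(line):
    """Return (street_number, undecorated_name), or None on failure."""
    tokens = line.upper().replace(".", "").split()
    if len(tokens) < 2 or not tokens[0].isdigit():
        return None  # no leading street number: report failure
    number, rest = tokens[0], tokens[1:]
    # Drop trailing directionals ("Way NE"), then one trailing street type.
    while len(rest) > 1 and rest[-1] in DIRECTIONALS:
        rest.pop()
    if len(rest) > 1 and rest[-1] in STREET_TYPES:
        rest.pop()
    # Drop one leading directional ("N Webb"), but only if a name remains.
    if len(rest) > 1 and rest[0] in DIRECTIONALS:
        rest = rest[1:]
    if all(token in DIRECTIONALS for token in rest):
        return None  # only directionals left: nothing usable
    return number, " ".join(rest)


for sample in ["2081 N Webb Rd", "1500 DEER CREEK LANE",
               "44720 N 2nd St E", "PO Box 12"]:
    print(sample, "->", extract_number_and_name(sample))
```

Stripping decorations from the outside in keeps "DEER CREEK" intact, because only the final street type is removed. Suite numbers, fractional house numbers, and similar cases are not handled here and would need their own rules.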