Prev: Python Learning Environment
Next: Python at BerkeleyTIP-Global meeting on Sunday April 18 12N-3P,& April 27
From: John Nagle on 17 Apr 2010 15:23

Is there a usable street address parser available? There are some bad ones out there, but nothing good that I've found other than commercial products with large databases. I don't need 100% accuracy, but I'd like to be able to extract street name and street number for at least 98% of US mailing addresses.

There's pyparsing, of course. There's a street address parser as an example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py". It's not very good. It gets all of the following wrong:

    1500 Deer Creek Lane (Parses "Creek" as a street type)
    186 Avenue A (NYC street)
    2081 N Webb Rd (Parses N Webb as a street name)
    2081 N. Webb Rd (Parses N as street name)
    1515 West 22nd Street (Parses "West" as name)
    2029 Stierlin Court (Street names starting with "St" misparse.)

Some special cases that don't work, unsurprisingly.

    P.O. Box 33170
    The Landmark @ One Market, Suite 200
    One Market, Suite 200
    One Market

Much of the problem is that this parser starts at the beginning of the string. US street addresses are best parsed from the end, says the USPS. That's why things like "Deer Creek Lane" are mis-parsed. It's not clear that regular expressions are the right tool for this job.

There must be something out there a little better than this.

John Nagle
From: John Roth on 18 Apr 2010 17:08

On Apr 17, 1:23 pm, John Nagle <na...(a)animats.com> wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
> It's not very good. It gets all of the following wrong:
>
> 1500 Deer Creek Lane (Parses "Creek" as a street type)
> 186 Avenue A (NYC street)
> 2081 N Webb Rd (Parses N Webb as a street name)
> 2081 N. Webb Rd (Parses N as street name)
> 1515 West 22nd Street (Parses "West" as name)
> 2029 Stierlin Court (Street names starting with "St" misparse.)
>
> Some special cases that don't work, unsurprisingly.
> P.O. Box 33170
> The Landmark @ One Market, Suite 200
> One Market, Suite 200
> One Market
>
> Much of the problem is that this parser starts at the beginning of the string.
> US street addresses are best parsed from the end, says the USPS. That's why
> things like "Deer Creek Lane" are mis-parsed. It's not clear that regular
> expressions are the right tool for this job.
>
> There must be something out there a little better than this.
>
> John Nagle

You have my sympathy. I used to work on the address parser module at Trans Union, and I've never seen another piece of code that had as many special cases, odd rules and stuff that absolutely didn't make any sense until one of the old hands showed you the situation it was supposed to handle. And most of those files were supposed to be up to USPS mass mailing standards.

When the USPS says that addresses are best parsed from the end, they aren't talking about the street address; they're talking about the address as a whole, where it's easiest if you look for a zip first, then the state, etc. The best approach I know of for the street address is simply to tokenize the thing, and then do some pattern matching. Trying to use any kind of deterministic parser is going to fail big time.

IMO, 98% is way too high for any module except one that's been given a lot of love by a company that does this as part of their core business. There's a reason why commercial products come with huge data bases -- it's impossible to parse everything correctly with a single set of rules. Those data bases also contain the actual street names and address ranges by zip code, so that direct marketing files can be cleansed to USPS standards.

That said, I don't see any reason why any of the examples in your first group should be misparsed by a competent parser.

Sorry I don't have any real help for you.

John Roth
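The two ideas in the reply above (take the ZIP and state off the end of the whole address, then tokenize the street line and match on token kinds) might be sketched roughly as follows. This is only an illustration, not code from any poster or product: the names `split_tail`, `tokenize` and `pattern`, the regular expression, the tiny word lists, and the made-up Springfield address are all assumptions.

```python
import re

# Assumed tail format: a two-letter state followed by a ZIP or ZIP+4
# at the very end of the address string.
TAIL_RE = re.compile(r"[,\s]+([A-Za-z]{2})[,\s]+(\d{5}(?:-\d{4})?)\s*$")

# Deliberately tiny, illustrative vocabularies (not the full USPS lists).
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "STREET", "RD", "ROAD", "AVE", "AVENUE",
                "LN", "LANE", "CT", "COURT", "CREEK"}


def split_tail(address):
    """Peel the state and ZIP off the end of a whole address.

    Returns (rest, state, zip); state and zip are None if no tail is found.
    """
    match = TAIL_RE.search(address)
    if match is None:
        return address.strip(), None, None
    rest = address[:match.start()].strip(" ,")
    return rest, match.group(1).upper(), match.group(2)


def tokenize(street_line):
    """Split a street line into (kind, token) pairs."""
    tokens = []
    for raw in street_line.replace(",", " ").split():
        word = raw.rstrip(".").upper()
        if word.isdigit():
            kind = "NUM"
        elif word in DIRECTIONALS:
            kind = "DIR"
        elif word in STREET_TYPES:
            kind = "TYPE"
        else:
            kind = "WORD"
        tokens.append((kind, word))
    return tokens


def pattern(tokens):
    """Collapse tokens into a string of kinds that rules can be matched against."""
    return " ".join(kind for kind, _ in tokens)


if __name__ == "__main__":
    rest, state, zip_code = split_tail("1500 Deer Creek Lane, Springfield, IL 62704")
    print(rest, state, zip_code)
    print(pattern(tokenize("1500 Deer Creek Lane")))
```

With these word lists, "1500 Deer Creek Lane" collapses to the pattern `NUM WORD TYPE TYPE`, which makes the ambiguity explicit: a rule set has to decide which `TYPE` token is really the street type, and a real implementation would need far more rules and data than this.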
From: Paul McGuire on 19 Apr 2010 02:11

On Apr 17, 2:23 pm, John Nagle <na...(a)animats.com> wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
> It's not very good. It gets all of the following wrong:
>
> 1500 Deer Creek Lane (Parses "Creek" as a street type)
> 186 Avenue A (NYC street)
> 2081 N Webb Rd (Parses N Webb as a street name)
> 2081 N. Webb Rd (Parses N as street name)
> 1515 West 22nd Street (Parses "West" as name)
> 2029 Stierlin Court (Street names starting with "St" misparse.)
>
> Some special cases that don't work, unsurprisingly.
> P.O. Box 33170
> The Landmark @ One Market, Suite 200
> One Market, Suite 200
> One Market
>

Please take a look at the updated form of this parser. It turns out there actually *were* some bugs in the old form, plus there was no provision for PO Boxes, avenues that start with "Avenue" instead of ending with it, or house numbers spelled out as words. The only one I consider a "special case" is the support for "Avenue X" instead of "X Avenue" - support for the rest was added in a fairly general way. With these bug fixes, I hope this improves your hit rate. (There are also some simple attempts at adding apt/suite numbers, and APO and AFP in addition to PO boxes - if not exactly what you need, the means to extend to support other options should be pretty straightforward.)

-- Paul
From: Stefan Behnel on 19 Apr 2010 02:28

John Nagle, 17.04.2010 21:23:
> Is there a usable street address parser available?

What kind of street address are you talking about? Only US-American ones?

Because street addresses are spelled differently all over the world. Some have house numbers, some use letters or a combination, some have no house numbers at all. Some use ordinal numbers, others use regular numbers. Some put the house number before the street name, some after it. And this is neither a comprehensive list, nor is this topic finished after parsing the line that gives you the street (assuming there is such a thing in the first place).

Stefan
From: John Nagle on 20 Apr 2010 01:12

John Nagle wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at
> "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".

The author of that module has changed the code, and it has some new features. This is much better.

Unfortunately, now it won't run with the released version of "pyparsing" (1.5.2, from April 2009), because it uses "originalTextFor", a feature introduced since then. I worked around that, but discovered that the new version is case-sensitive. Changed "Keyword" to "CaselessKeyword" where appropriate.

I put in the full list of USPS street types, and discovered that "1500 DEER CREEK LANE" still parses with a street name of "DEER", and a street type of "CREEK", because "CREEK" is a USPS street type. Need to do something to pick up the last street type, not the first. I'm not sure how to do that with pyparsing. Maybe if I buy the book...

There's still a problem with: "2081 N Webb Rd", where the street name comes out as "N WEBB".

Addresses like "1234 5th St. S." yield a street name of "5 TH", but if the directional is before the name, it ends up with the name.

Getting closer, though. If I can get to 95% of common cases, I'll be happy.

John Nagle
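One way to "pick up the last street type, not the first" without pyparsing is to work inward from both ends of the token list, treating only the final remaining token as a candidate street type. The sketch below is a plain-Python illustration under that assumption, not the pyparsing example's actual logic; `parse_street`, the result keys, and the abbreviated word lists are all hypothetical.

```python
# Abbreviated, illustrative vocabularies; the real USPS lists are much longer.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "STREET", "RD", "ROAD", "AVE", "AVENUE",
                "LN", "LANE", "CT", "COURT", "CREEK"}


def parse_street(line):
    """Split a US street line into number, directionals, name and type.

    Tokens are consumed from the ends inward, so a street-type word in the
    middle of the name (the CREEK in DEER CREEK LANE) stays part of the name.
    """
    words = [w.rstrip(".,").upper() for w in line.split()]
    result = {"number": None, "pre_dir": None, "name": None,
              "type": None, "post_dir": None}

    # Leading house number, if the first token is all digits.
    if words and words[0].isdigit():
        result["number"] = words.pop(0)

    # Trailing directional, as in "1234 5th St. S."
    if len(words) > 1 and words[-1] in DIRECTIONALS:
        result["post_dir"] = words.pop()

    # Only the last remaining token may be the street type.
    if len(words) > 1 and words[-1] in STREET_TYPES:
        result["type"] = words.pop()

    # Leading directional, but only if a name would still be left over.
    if len(words) > 1 and words[0] in DIRECTIONALS:
        result["pre_dir"] = words.pop(0)

    result["name"] = " ".join(words) or None
    return result


if __name__ == "__main__":
    for sample in ("1500 DEER CREEK LANE", "2081 N Webb Rd",
                   "1234 5th St. S.", "186 Avenue A"):
        print(sample, "->", parse_street(sample))
```

Under these rules "1500 DEER CREEK LANE" comes out with name `DEER CREEK` and type `LANE`, and "2081 N Webb Rd" with a leading directional `N` and name `WEBB`. "186 Avenue A" is left with the whole of `AVENUE A` as the name and no type, so the "Avenue X" form would still need a rule of its own, as would P.O. boxes and spelled-out house numbers.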