Usable street address parser in Python? [Python]

From: John Nagle on 17 Apr 2010 15:23

Is there a usable street address parser available? There are some
bad ones out there, but nothing good that I've found other than commercial
products with large databases. I don't need 100% accuracy, but I'd like
to be able to extract street name and street number for at least 98% of
US mailing addresses.

There's pyparsing, of course. There's a street address parser as an
example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
It's not very good. It gets all of the following wrong:

1500 Deer Creek Lane (Parses "Creek" as a street type)
186 Avenue A (NYC street)
2081 N Webb Rd (Parses N Webb as a street name)
2081 N. Webb Rd (Parses N as street name)
1515 West 22nd Street (Parses "West" as name)
2029 Stierlin Court (Street names starting with "St" misparse.)

Some special cases that don't work, unsurprisingly.
P.O. Box 33170
The Landmark @ One Market, Suite 200
One Market, Suite 200
One Market

Much of the problem is that this parser starts at the beginning of the string.
US street addresses are best parsed from the end, says the USPS. That's why
things like "Deer Creek Lane" are mis-parsed. It's not clear that regular
expressions are the right tool for this job.

There must be something out there a little better than this.

John Nagle

From: John Roth on 18 Apr 2010 17:08

On Apr 17, 1:23 pm, John Nagle <na...(a)animats.com> wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
> It's not very good. It gets all of the following wrong:
>
> 1500 Deer Creek Lane (Parses "Creek" as a street type)
> 186 Avenue A (NYC street)
> 2081 N Webb Rd (Parses N Webb as a street name)
> 2081 N. Webb Rd (Parses N as street name)
> 1515 West 22nd Street (Parses "West" as name)
> 2029 Stierlin Court (Street names starting with "St" misparse.)
>
> Some special cases that don't work, unsurprisingly.
> P.O. Box 33170
> The Landmark @ One Market, Suite 200
> One Market, Suite 200
> One Market
>
> Much of the problem is that this parser starts at the beginning of the string.
> US street addresses are best parsed from the end, says the USPS. That's why
> things like "Deer Creek Lane" are mis-parsed. It's not clear that regular
> expressions are the right tool for this job.
>
> There must be something out there a little better than this.
>
> John Nagle

You have my sympathy. I used to work on the address parser module at
Trans Union, and I've never seen another piece of code that had as
many special cases, odd rules and stuff that absolutely didn't make
any sense until one of the old hands showed you the situation it was
supposed to handle.

And most of those files were supposed to be up to USPS mass mailing
standards.

When the USPS says that addresses are best parsed from the end, they
aren't talking about the street address; they're talking about the
address as a whole, where it's easiest if you look for a zip first,
then the state, etc. The best approach I know of for the street
address is simply to tokenize the thing, and then do some pattern
matching. Trying to use any kind of deterministic parser is going to
fail big time.
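
A minimal sketch of the tokenize-then-pattern-match idea, not the Trans Union code. Every token is reduced to a one-letter class, and the resulting "signature" string is matched against a short list of patterns. The names (`tokenize`, `classify`, `parse_street`, `PATTERNS`) and the tiny word lists are assumptions made for illustration; a real parser would need the full USPS tables and many more patterns.

```python
import re

# Deliberately tiny vocabularies for illustration (assumed, not the USPS lists).
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
                "LN", "LANE", "CT", "COURT", "CREEK"}


def tokenize(address):
    """Drop periods, upper-case, and split on whitespace and commas."""
    cleaned = address.replace(".", "").upper()
    return [t for t in re.split(r"[\s,]+", cleaned) if t]


def classify(token):
    """One-letter class: N(umber), D(irectional), T(ype) or W(ord)."""
    if re.fullmatch(r"\d+", token):
        return "N"
    if token in DIRECTIONALS:
        return "D"
    if token in STREET_TYPES:
        return "T"
    return "W"


# Each pattern is matched against the signature, one character per token.
PATTERNS = [
    # number, optional directional, name, street type, optional directional.
    # The lazy name group plus the full match makes the LAST type token win.
    re.compile(r"(?P<number>N)(?P<pre>D?)(?P<name>.+?)(?P<type>T)(?P<post>D?)"),
    # "Avenue A" style: the type comes before a one-word name.
    re.compile(r"(?P<number>N)(?P<type>T)(?P<name>W)"),
]


def parse_street(address):
    """Return a dict of address parts, or None if no pattern fits."""
    tokens = tokenize(address)
    signature = "".join(classify(t) for t in tokens)
    for pattern in PATTERNS:
        m = pattern.fullmatch(signature)
        if m:
            # Group spans in the signature are also token index ranges.
            return {name: " ".join(tokens[slice(*m.span(name))])
                    for name in m.groupdict()}
    return None
```

With these lists, "1500 Deer Creek Lane" has the signature NWTT and comes out with name "DEER CREEK" and type "LANE", while "P.O. Box 33170" matches nothing and returns None, which is the cue to try a different family of patterns.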

IMO, 98% is way too high for any module except one that's been given a
lot of love by a company that does this as part of their core
business. There's a reason why commercial products come with huge
databases -- it's impossible to parse everything correctly with a single
set of rules. Those databases also contain the actual street names
and address ranges by zip code, so that direct marketing files can be
cleansed to USPS standards.

That said, I don't see any reason why any of the examples in your
first group should be misparsed by a competent parser.

Sorry I don't have any real help for you.

John Roth

From: Paul McGuire on 19 Apr 2010 02:11

On Apr 17, 2:23 pm, John Nagle <na...(a)animats.com> wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
> It's not very good. It gets all of the following wrong:
>
> 1500 Deer Creek Lane (Parses "Creek" as a street type)
> 186 Avenue A (NYC street)
> 2081 N Webb Rd (Parses N Webb as a street name)
> 2081 N. Webb Rd (Parses N as street name)
> 1515 West 22nd Street (Parses "West" as name)
> 2029 Stierlin Court (Street names starting with "St" misparse.)
>
> Some special cases that don't work, unsurprisingly.
> P.O. Box 33170
> The Landmark @ One Market, Suite 200
> One Market, Suite 200
> One Market
>

Please take a look at the updated form of this parser. It turns out
there actually *were* some bugs in the old form, plus there was no
provision for PO Boxes, avenues that start with "Avenue" instead of
ending with it, or house numbers spelled out as words. The only one
I consider a "special case" is the support for "Avenue X" instead of
"X Avenue"; support for the rest was added in a fairly general
way. With these bug fixes, I hope this improves your hit rate. (There
are also some simple attempts at adding apt/suite numbers, and APO and
AFP in addition to PO boxes - if not exactly what you need, the means
to extend to support other options should be pretty straightforward.)
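
For readers without pyparsing to hand, here is a rough plain-`re` sketch of the three cases named above: PO boxes, "Avenue X" names, and house numbers spelled out as words. It is not the updated pyparsing example. The names `PO_BOX`, `AVENUE_X`, `NUMBER_WORDS`, `house_number` and `special_case` are made up for illustration, and the number-word table only goes up to ten.

```python
import re

# Assumed, deliberately short table of spelled-out house numbers.
NUMBER_WORDS = {"ONE": 1, "TWO": 2, "THREE": 3, "FOUR": 4, "FIVE": 5,
                "SIX": 6, "SEVEN": 7, "EIGHT": 8, "NINE": 9, "TEN": 10}

# "PO Box 12", "P.O. Box 33170", "P O Box 7" ...
PO_BOX = re.compile(r"P\.?\s*O\.?\s*BOX\s+(?P<box>\w+)", re.IGNORECASE)

# "186 Avenue A", "One Avenue B": the type precedes a single-letter name.
AVENUE_X = re.compile(r"(?P<number>\w+)\s+AVENUE\s+(?P<name>[A-Z])",
                      re.IGNORECASE)


def house_number(token):
    """Return an int for digits or a known number word, else None."""
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token.upper())


def special_case(line):
    """Return a dict for a recognised special form, or None otherwise."""
    line = line.strip()
    m = PO_BOX.fullmatch(line)
    if m:
        return {"kind": "po_box", "box": m.group("box")}
    m = AVENUE_X.fullmatch(line)
    if m:
        number = house_number(m.group("number"))
        if number is not None:
            return {"kind": "street", "number": number,
                    "name": m.group("name").upper(), "type": "AVENUE"}
    return None
```

Running such checks before the general street grammar keeps the oddities out of the main rules. Anything that returns None here (for example "One Market") simply falls through to the ordinary parser.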

-- Paul

From: Stefan Behnel on 19 Apr 2010 02:28

John Nagle, 17.04.2010 21:23:
> Is there a usable street address parser available?

What kind of street address are you talking about? Only US-American ones?

Because street addresses are spelled differently all over the world. Some
have house numbers, some use letters or a combination, some have no house
numbers at all. Some use ordinal numbers, others use regular numbers. Some
put the house number before the street name, some after it. And this is
neither a comprehensive list, nor is this topic finished after parsing the
line that gives you the street (assuming there is such a thing in the first
place).

Stefan

From: John Nagle on 20 Apr 2010 01:12

John Nagle wrote:
> Is there a usable street address parser available? There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases. I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
> There's pyparsing, of course. There's a street address parser as an
> example at
> "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".

The author of that module has changed the code, and it has some
new features. This is much better.

Unfortunately, it now won't run with the released
version of "pyparsing" (1.5.2, from April 2009), because it uses
"originalTextFor", a feature introduced since then. I worked around that,
but discovered that the new version is case-sensitive, so I changed
"Keyword" to "CaselessKeyword" where appropriate.

I put in the full list of USPS street types, and discovered
that "1500 DEER CREEK LANE" still parses with a street name
of "DEER", and a street type of "CREEK", because "CREEK" is a
USPS street type. Need to do something to pick up the last street
type, not the first. I'm not sure how to do that with pyparsing.
Maybe if I buy the book...

There's still a problem with "2081 N Webb Rd", where the street name
comes out as "N WEBB".
Addresses like "1234 5th St. S." yield a street name of "5 TH", and
a directional placed before the name ends up as part of the name.
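
One way to get "last street type, not first" outside pyparsing is to peel tokens off the right-hand end before looking at the front. The following is only a sketch under that assumption: `split_street_line` is a hypothetical helper, and the two word sets are short stand-ins for the full USPS lists.

```python
# Short stand-ins for the USPS tables (assumed for illustration).
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
                "LN", "LANE", "CT", "COURT", "CREEK"}
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}


def split_street_line(line):
    """Split a street line into number, directionals, name and type."""
    tokens = line.replace(".", "").replace(",", " ").upper().split()
    parts = {"number": "", "pre": "", "name": "", "type": "", "post": ""}

    if tokens and tokens[0].isdigit():
        parts["number"] = tokens.pop(0)

    # Work inwards from the right: trailing directional, then street type.
    # Always leave at least one token behind to serve as the name.
    if len(tokens) > 1 and tokens[-1] in DIRECTIONALS:
        parts["post"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parts["type"] = tokens.pop()

    # Only now treat a leading directional as separate from the name.
    if len(tokens) > 1 and tokens[0] in DIRECTIONALS:
        parts["pre"] = tokens.pop(0)

    parts["name"] = " ".join(tokens)
    return parts
```

Because only the final token is ever tested against the street types, "CREEK" in "1500 DEER CREEK LANE" stays in the name. Ordinals such as "5th" are kept as single tokens, so "1234 5th St. S." gives name "5TH", type "ST" and a trailing directional "S". Forms like "186 Avenue A" are not handled here and come back with an empty type.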

Getting closer, though. If I can get to 95% of common cases, I'll
be happy.

John Nagle
