From: BK on 29 Aug 2006 08:23

Without using the SPEDIS and SOUNDEX functions, I've gotten it down to <2% of the non-missing values not being assigned a state. I did do some testing with those functions and did indeed get MANY incorrect results, some of which I would never have expected, as they were SO far off.

The data is read in using the $UpperW. informat, so that's been taken care of. SCAN by default treats consecutive delimiters as one, and in my multi-word passes I force a single-space delimiter before testing, so that's taken care of too (though I'll probably take the suggestion, as it's good for permanently cleaning the original field).

I think the biggest issue I have left is the non-standard but traditional abbreviations for the states.

Thanks to all!!!!

Byron

> Start by standardizing your string. Change everything to upper case. Replace
> non-alpha characters with blanks. Replace multiple blanks with single ones.
> Change "NORTH", "SOUTH", and "WEST" to one-letter abbreviations.
>
> There are three types of abbreviations to consider: the 2-letter postal
> codes (Florida=FL), traditionally recognized ones (Florida=FLA), and
> arbitrary truncations (FLORIDA=FLOR, etc.). Missouri and Mississippi require
> 5-letter truncations to be differentiated from each other; other names can
> be distinguished with fewer letters. You may want to take a few minutes to
> build a table for this purpose.
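For readers who want to see the standardizing steps from the quoted advice spelled out, here is a minimal sketch. It is written in Python rather than SAS, purely as an illustration, and the names `DIRECTIONS` and `standardize` are made up for this example; nothing here is code from the thread.

```python
import re

# Direction words collapsed to one letter, as the quoted advice suggests.
# (Hypothetical table name; the thread does not give one.)
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "WEST": "W"}


def standardize(raw: str) -> str:
    """Upper-case, blank out non-alpha characters, collapse blanks,
    and shorten NORTH/SOUTH/WEST to one letter."""
    text = raw.upper()
    # Replace every character that is not A-Z with a blank.
    text = re.sub(r"[^A-Z]", " ", text)
    # split() with no argument drops leading/trailing blanks and treats
    # runs of blanks as a single delimiter.
    words = [DIRECTIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(standardize("  north-Carolina "))  # N CAROLINA
print(standardize("W.  Virginia"))       # W VIRGINIA
```

Only whole words are replaced, so a value such as "NORTHERN" would be left alone.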
From: "Howard Schreier <hs AT dc-sug DOT org>" on 29 Aug 2006 22:36
On Tue, 29 Aug 2006 05:23:42 -0700, BK <byronkirby(a)GMAIL.COM> wrote:

>Without using the SPEDIS and SOUNDEX functions, I've gotten it down to
><2% of the non-missing values not being assigned a state. I did do
>some testing with those functions and did indeed get MANY incorrect
>results, some of which I would never have expected, as they were SO far
>off.

If you have V. 9, try COMPGED instead of SPEDIS. One advantage is that you can use CALL COMPCOST to tune the coefficients used by COMPGED.

>
>The data is read in using the $UpperW. informat, so that's been taken care
>of. SCAN by default treats consecutive delimiters as one, and in my
>multi-word passes I force a single-space delimiter before testing, so
>that's taken care of too (though I'll probably take the suggestion, as it's
>good for permanently cleaning the original field).
>
>I think the biggest issue I have left is the non-standard but
>traditional abbreviations for the states.

Most of those are what I called truncations (e.g., TENN for TENNESSEE). If you systematically handle the truncations, there should be only a handful of other traditional abbreviations (e.g., PENNA for PENNSYLVANIA).

>Thanks to all!!!!
>
>Byron
>
>> Start by standardizing your string. Change everything to upper case. Replace
>> non-alpha characters with blanks. Replace multiple blanks with single ones.
>> Change "NORTH", "SOUTH", and "WEST" to one-letter abbreviations.
>>
>> There are three types of abbreviations to consider: the 2-letter postal
>> codes (Florida=FL), traditionally recognized ones (Florida=FLA), and
>> arbitrary truncations (FLORIDA=FLOR, etc.). Missouri and Mississippi require
>> 5-letter truncations to be differentiated from each other; other names can
>> be distinguished with fewer letters. You may want to take a few minutes to
>> build a table for this purpose.
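One way to handle truncations systematically is to accept a value when it is a prefix of exactly one state name, and to keep a small side table for the traditional abbreviations that are not prefixes. The sketch below is in Python rather than SAS and is only an illustration under stated assumptions: the names `STATES`, `TRADITIONAL`, `shortest_unique_prefix_len`, and `match_state` are hypothetical, the state names are assumed to be already standardized (so "N CAROLINA", not "NORTH CAROLINA"), and 2-letter postal codes are assumed to be handled by a separate lookup, since many of them are ambiguous as prefixes.

```python
# State names in standardized form (upper case, NORTH/SOUTH/WEST shortened).
STATES = [
    "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO",
    "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO",
    "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA",
    "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA",
    "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA",
    "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "N CAROLINA",
    "N DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND",
    "S CAROLINA", "S DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT",
    "VIRGINIA", "WASHINGTON", "W VIRGINIA", "WISCONSIN", "WYOMING",
]

# Traditional abbreviations that are NOT simple truncations. Only the two
# mentioned in the thread are listed; a real table would need more entries.
TRADITIONAL = {"FLA": "FLORIDA", "PENNA": "PENNSYLVANIA"}


def shortest_unique_prefix_len(name):
    """Smallest n such that name[:n] is a prefix of no other state name."""
    for n in range(1, len(name) + 1):
        prefix = name[:n]
        if sum(s.startswith(prefix) for s in STATES) == 1:
            return n
    return None  # the whole name is itself a prefix of another name


def match_state(value, min_len=3):
    """Return the state for a traditional abbreviation or an unambiguous
    truncation, or None when the value is too short or ambiguous."""
    if value in TRADITIONAL:
        return TRADITIONAL[value]
    if len(value) < min_len:
        return None  # 2-letter postal codes are assumed handled elsewhere
    hits = [s for s in STATES if s.startswith(value)]
    return hits[0] if len(hits) == 1 else None


print(shortest_unique_prefix_len("MISSOURI"))  # 5
print(match_state("TENN"))                     # TENNESSEE
print(match_state("MISS"))                     # None (ambiguous)
```

Returning None for ambiguous values such as MISS leaves them for manual review or a later fuzzy-matching pass, rather than guessing.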