From: BK on 26 Aug 2006 00:09 hey, I'm trying to extract US States from a character string... true to from, the dataentry inclueds many misspellings and both full names and appriviations. So far, I'm using a few of the built in SAS functions ( & mods of them) for testing the SCANs to see if they are valid, and they work fairly well. Does anyone have any other ideas? I've thought about using the SOUNDEX function with SPEDIS to determin if its close, but am unsure of a good implememtation. I was thinking about incorporating this kind of logic {score(i)=spedis(soundex(schst),soundex(StateArray(i)))} into an array taking the minimum score, but it seems ineffecient; I'd like to have the code stay structured so that the match between the origional value and Actual state is used first and then move toward more fuzzy logic. Thanks in advanced; Byron Here's the code so far: *************************************************** data fmt(drop=i); do i=0 to 56; label=fipstate(i); start=fipname(i); fmtname="$FSfip"; if label not in ('--', ' ') then output fmt; end; start="OTHER"; label=" "; OUTPUT; run; proc sort data=fmt; by start; run; proc format cntlin=fmt; run; data test(drop=l); set MyData; l=(length(schst)-length(compress(schst," ,-'")))+1; nschst=compress(schst,".'`"); do i=1 to l; ** orig. text **; if stfips(scan(schst,i)) ne . then do; nstate=scan(schst,i); i=l; goto el; end; ** compressed version **; if nstate=' ' and stfips(scan(nschst,i)) ne . then do; nstate=scan(nschst,i); i=l; goto el; end; ** single word states **; if nstate=' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do; nstate=put(scan(nschst,i),$FSfip.); i=l; goto el; end; ** for two word states**; if nstate=' ' and i < l and put((scan(nschst,i)||" "||scan(nschst,i+1)),$FSfip.) ne ' ' then do; nstate=put((scan(nschst,i)||" "||scan(nschst,i+1)),$FSfip.); i=l; goto el; end; ** for District of Columbia **; if nstate=' ' and i+1 < l and put((scan(nschst,i)||" "||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' ' then do; nstate=put((scan(nschst,i)||" "||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=l; goto el; end; el: end; run; /** excert of names left out of the above statements ***; MASS MASSACHUETTS MASSACHUSETTE MASSACHUSSETTS MASSSCHUSETTE ** it does get these **; MASSACHUSETTS SPRINGFIELD MA SPRINGFIELD, MA SPRINGFIELD, MA 01107 TYNGSBORO MA W BARNSTALL, MA W BARSTABLE MA WELLESLEY HILLS MA WELLESLEY, MASSACHUSETTS WEST BARNSTABLE MA WORCESTER MA *******************************************************/;
From: Sigurd Hermansen on 26 Aug 2006 18:02 BK: I have to wonder whether any pattern matching method will work that well = in this situation. Two alternatives may supplement pattern matching and = give you acceptable results. First, an equivalent of soundex() for state = names and abbreviations could help by removing vowels and limiting = typographic patterns to variations of phonetic spellings. Second, take = advantage of the context. In your example the state name either appears = alone or as the second element. Sig ________________________________ From: owner-sas-l(a)listserv.uga.edu on behalf of BK Sent: Sat 8/26/2006 12:09 AM To: sas-l(a)uga.edu Subject: Help: fuzzy match in string hey, I'm trying to extract US States from a character string... true to from, the dataentry inclueds many misspellings and both full names and appriviations. So far, I'm using a few of the built in SAS functions ( & mods of them) for testing the SCANs to see if they are valid, and they work fairly well. Does anyone have any other ideas? I've thought about using the SOUNDEX function with SPEDIS to determin if its close, but am unsure of a good implememtation. I was thinking about incorporating this kind of logic {score(i)=3Dspedis(soundex(schst),soundex(StateArray(i)))} into an array taking the minimum score, but it seems ineffecient; I'd like to have the code stay structured so that the match between the origional value and Actual state is used first and then move toward more fuzzy logic. Thanks in advanced; Byron Here's the code so far: *************************************************** data fmt(drop=3Di); do i=3D0 to 56; label=3Dfipstate(i); start=3Dfipname(i); fmtname=3D"$FSfip"; if label not in ('--', ' ') then output fmt; end; start=3D"OTHER"; label=3D" "; OUTPUT; run; proc sort data=3Dfmt; by start; run; proc format cntlin=3Dfmt; run; data test(drop=3Dl); set MyData; l=3D(length(schst)-length(compress(schst," ,-'")))+1; nschst=3Dcompress(schst,".'`"); do i=3D1 to l; ** orig. text **; if stfips(scan(schst,i)) ne . then do; nstate=3Dscan(schst,i); = i=3Dl; goto el; end; ** compressed version **; if nstate=3D' ' and stfips(scan(nschst,i)) ne . then do; nstate=3Dscan(nschst,i); i=3Dl; goto el; end; ** single word states **; if nstate=3D' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do; nstate=3Dput(scan(nschst,i),$FSfip.); i=3Dl; goto el; end; ** for two word states**; if nstate=3D' ' and i < l and put((scan(nschst,i)||" "||scan(nschst,i+1)),$FSfip.) ne ' ' then do; nstate=3Dput((scan(nschst,i)||" "||scan(nschst,i+1)),$FSfip.); i=3Dl; goto el; end; ** for District of Columbia **; if nstate=3D' ' and i+1 < l and put((scan(nschst,i)||" "||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' ' then do; nstate=3Dput((scan(nschst,i)||" "||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=3Dl; goto el; end; el: end; run; /** excert of names left out of the above statements ***; MASS MASSACHUETTS MASSACHUSETTE MASSACHUSSETTS MASSSCHUSETTE ** it does get these **; MASSACHUSETTS SPRINGFIELD MA SPRINGFIELD, MA SPRINGFIELD, MA 01107 TYNGSBORO MA W BARNSTALL, MA W BARSTABLE MA WELLESLEY HILLS MA WELLESLEY, MASSACHUSETTS WEST BARNSTABLE MA WORCESTER MA *******************************************************/;
From: BK on 26 Aug 2006 21:24 Thanks to you both... For a little clarification what I'm trying to end up with is a new variable with the two letter abbreviation for the state found in the string. I've never used regular expressions so I'll have to look them, so thanks for the reference assistance. Also, I'll probably try the first solution you mention Sig of compressing the vowels, that may catch a few more. The method I've got already searches (nearly) all possible segments of the string; it's just making the automating determination of which of the 50 it the value is... The example only shows MA, but there are may other state mispellings (ie:"South Crlina" for "South Carolina") and also countries mixed in. After this, I may be getting about as good as it gets without having a few "false positives" (so to speak) in my results. Byron
From: David L Cassell on 28 Aug 2006 01:10 byronkirby(a)GMAIL.COM wrote: >hey, >I'm trying to extract US States from a character string... true to >from, the dataentry inclueds many misspellings and both full names and >appriviations. So far, I'm using a few of the built in SAS functions ( >& mods of them) for testing the SCANs to see if they are valid, and >they work fairly well. Does anyone have any other ideas? I've thought >about using the SOUNDEX function with SPEDIS to determin if its close, >but am unsure of a good implememtation. > >I was thinking about incorporating this kind of logic >{score(i)=spedis(soundex(schst),soundex(StateArray(i)))} into an array >taking the minimum score, but it seems ineffecient; > >I'd like to have the code stay structured so that the match between the >origional value and Actual state is used first and then move toward >more fuzzy logic. > >Thanks in advanced; >Byron > >Here's the code so far: > >*************************************************** > >data fmt(drop=i); > do i=0 to 56; > label=fipstate(i); > start=fipname(i); > fmtname="$FSfip"; > if label not in ('--', ' ') then output fmt; > end; > start="OTHER"; label=" "; OUTPUT; >run; >proc sort data=fmt; by start; run; >proc format cntlin=fmt; run; > > >data test(drop=l); > set MyData; > l=(length(schst)-length(compress(schst," ,-'")))+1; > nschst=compress(schst,".'`"); > > do i=1 to l; >** orig. text **; > if stfips(scan(schst,i)) ne . then do; nstate=scan(schst,i); i=l; >goto el; end; >** compressed version **; > if nstate=' ' and stfips(scan(nschst,i)) ne . then do; >nstate=scan(nschst,i); i=l; goto el; end; >** single word states **; > if nstate=' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do; >nstate=put(scan(nschst,i),$FSfip.); i=l; goto el; end; >** for two word states**; > if nstate=' ' and i < l and put((scan(nschst,i)||" >"||scan(nschst,i+1)),$FSfip.) ne ' ' > then do; nstate=put((scan(nschst,i)||" >"||scan(nschst,i+1)),$FSfip.); i=l; goto el; end; >** for District of Columbia **; > if nstate=' ' and i+1 < l and put((scan(nschst,i)||" >"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' ' > then do; nstate=put((scan(nschst,i)||" >"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=l; goto el; >end; > el: end; >run; > > >/** excert of names left out of the above statements ***; > MASS > MASSACHUETTS > MASSACHUSETTE > MASSACHUSSETTS > MASSSCHUSETTE >** it does get these **; > MASSACHUSETTS > SPRINGFIELD MA > SPRINGFIELD, MA > SPRINGFIELD, MA 01107 > TYNGSBORO MA > W BARNSTALL, MA > W BARSTABLE MA > WELLESLEY HILLS MA > WELLESLEY, MASSACHUSETTS > WEST BARNSTABLE MA > WORCESTER MA >*******************************************************/; First, note that SOUNDEX() and SPEDIS() are two very different critters. And note that there are some very nice tools like COMPGED() and COMPLEV() which are also worth looking into here. SOUNDEX() might actually work in your context. It oguht to catch common phonetic misspellings, since what it does is compress out the vowels and then treat similar-sounding consonants as being identical. You could build a format or hash that holds all the soundex-compressed versions of the full state names, and try to match against that. However, tools like COMPLEV() yield an edit distance for a pair of words. You could compute COMPLEV(yourword,statename{i}) for all 50 states and then look at anything which has at least one edit distance below some reasonable cutoff that you would assign by experimentation. You can't get away from at least *some* false positives and false negatives. Expect that the more you try to avoid one, the more of the other you will get. HTH, David -- David L. Cassell mathematical statistician Design Pathways 3115 NW Norwood Pl. Corvallis OR 97330 _________________________________________________________________ Windows Live Spaces is here! It?s easy to create your own personal Web site. http://spaces.live.com/signup.aspx
From: "Howard Schreier <hs AT dc-sug DOT org>" on 28 Aug 2006 21:48
Can it be presumed that the string expresses a state name? Or is it possible that it's just noise? I wouldn't use SOUNDEX *with* SPEDIS (or withone of the new v. 9 edit distance functions). Perhaps they can be used in parallel, one approach serving to confirm the other or to serve as a "Plan B". Start by standardizing your string. Change everything to upper case. Replace non-alpha characters with blanks. Replace multiple blanks with single ones. Change "NORTH", "SOUTH", and "WEST" to one-letter abbreviations. There are three types of abbreviations to consider: the 2-letter postal codes (Florida=FL), traditionally recognized ones (Florida=FLA), and arbitrary truncations (FLORIDA=FLOR, etc.). Missouri and Mississippi require 5-letter truncations to be differentiated from each other; other names can be distinguished with fewer letters. You many want to take a few minutes to build a table for this purpose. These bits of logic should yield a lot of matches before you have to start with fuzzy approaches. On Fri, 25 Aug 2006 21:09:05 -0700, BK <byronkirby(a)GMAIL.COM> wrote: >hey, >I'm trying to extract US States from a character string... true to >from, the dataentry inclueds many misspellings and both full names and >appriviations. So far, I'm using a few of the built in SAS functions ( >& mods of them) for testing the SCANs to see if they are valid, and >they work fairly well. Does anyone have any other ideas? I've thought >about using the SOUNDEX function with SPEDIS to determin if its close, >but am unsure of a good implememtation. > >I was thinking about incorporating this kind of logic >{score(i)=spedis(soundex(schst),soundex(StateArray(i)))} into an array >taking the minimum score, but it seems ineffecient; > >I'd like to have the code stay structured so that the match between the >origional value and Actual state is used first and then move toward >more fuzzy logic. > >Thanks in advanced; >Byron > >Here's the code so far: > >*************************************************** > >data fmt(drop=i); > do i=0 to 56; > label=fipstate(i); > start=fipname(i); > fmtname="$FSfip"; > if label not in ('--', ' ') then output fmt; > end; > start="OTHER"; label=" "; OUTPUT; >run; >proc sort data=fmt; by start; run; >proc format cntlin=fmt; run; > > >data test(drop=l); > set MyData; > l=(length(schst)-length(compress(schst," ,-'")))+1; > nschst=compress(schst,".'`"); > > do i=1 to l; >** orig. text **; > if stfips(scan(schst,i)) ne . then do; nstate=scan(schst,i); i=l; >goto el; end; >** compressed version **; > if nstate=' ' and stfips(scan(nschst,i)) ne . then do; >nstate=scan(nschst,i); i=l; goto el; end; >** single word states **; > if nstate=' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do; >nstate=put(scan(nschst,i),$FSfip.); i=l; goto el; end; >** for two word states**; > if nstate=' ' and i < l and put((scan(nschst,i)||" >"||scan(nschst,i+1)),$FSfip.) ne ' ' > then do; nstate=put((scan(nschst,i)||" >"||scan(nschst,i+1)),$FSfip.); i=l; goto el; end; >** for District of Columbia **; > if nstate=' ' and i+1 < l and put((scan(nschst,i)||" >"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' ' > then do; nstate=put((scan(nschst,i)||" >"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=l; goto el; >end; > el: end; >run; > > >/** excert of names left out of the above statements ***; > MASS > MASSACHUETTS > MASSACHUSETTE > MASSACHUSSETTS > MASSSCHUSETTE >** it does get these **; > MASSACHUSETTS > SPRINGFIELD MA > SPRINGFIELD, MA > SPRINGFIELD, MA 01107 > TYNGSBORO MA > W BARNSTALL, MA > W BARSTABLE MA > WELLESLEY HILLS MA > WELLESLEY, MASSACHUSETTS > WEST BARNSTABLE MA > WORCESTER MA >*******************************************************/; |