From: Sierra Information Services on 11 Mar 2010 17:12

I am glad a solution to this problem has already been identified, but I
wanted to point out that the SOUNDEX function, and the underlying
Soundex algorithm it implements, is a fairly weak way of trying to
measure similarity or dissimilarity between two text strings. It has a
lot of limitations, especially when used on family names.

You may want to explore the SPEDIS (spelling distance), COMPGED
(generalized edit distance) and COMPLEV (Levenshtein edit distance)
functions as more powerful tools for your project. SPEDIS was added in
V8 and the other two were added in SAS 9.0. You can also use the CALL
COMPCOST routine in SAS 9 to assign your own "penalty costs" if you
don't like the ones that the COMPGED function uses by default.

There are examples of how to use SOUNDEX, SPEDIS, COMPGED and COMPLEV
in the PDF of my paper "Becoming More FUNCTIONal in SAS 9 Software,"
available for free download at http://www.sierrainformation.com. From
the home page click on "Free Downloads" and take things from there.

Hope this helps,

Andrew Karp
Sierra Information Services
http://www.sierrainformation.com

On Mar 11, 11:42 am, Nancy <nancy0...(a)gmail.com> wrote:
> Yes, that is the reason.
> I checked the code again and found that I used the soundex function
> before I separated the first name and middle name for some names.
>
> Thank you very much!
>
> Xiaohong
>
> On Mar 11, 2:16 pm, "data _null_;" <datan...(a)gmail.com> wrote:
> > On Mar 11, 12:58 pm, Nancy <nancy0...(a)gmail.com> wrote:
> > > I used
> > > IDF=soundex(first_name)
> > > to get the soundex ID for the first name.
> > > Is there anything wrong?
> > > Thank you!
> > >
> > > On Mar 11, 12:46 pm, "Lou" <lpog...(a)hotmail.com> wrote:
> > > > From the description of the function in the documentation,
> > > > "TAMARI" should encode as T56 - if you're getting anything
> > > > else, it would appear that you have a problem. But whether
> > > > it's a problem with your installation or your code, we can't
> > > > tell. It might be helpful if you posted an example of your
> > > > code.
> > > >
> > > > "Nancy" <nancy0...(a)gmail.com> wrote in message
> > > > news:c6d64752-83c8-4db6-972f-138850277d73(a)a18g2000yqc.googlegroups.com...
> > > > > Hello All,
> > > > >
> > > > > I just found a problem when using the soundex function. It
> > > > > assigned different values for the same names in my data
> > > > > sets. I am wondering whether there is something wrong with
> > > > > my operation or something else.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Please see the examples:
> > > > >
> > > > > Obs    First_name    IDF
> > > > >
> > > > >   1    TAMARI        T5623
> > > > >   2    TAMARI        T56
> > > > >   3    DEVIN         D151
> > > > >   4    DEVIN         D15
> > > > >   5    JULIO         J42
> > > > >   6    JULIO         J4
> > > > >   7    NGOC          N221
> > > > >   8    NGOC          N22
> > > > >   9    TAMARI        T5623
> > > > >  10    TAMARI        T562
> >
> > I ran the data you posted and got the same soundex values for each
> > pair. So I don't think the problem is SOUNDEX. But what could it
> > be? Different soundex values imply that some of the words were
> > longer, even though you see the names as being equal. I can think
> > of one way that could happen; I'm sure others can think of other
> > ways. Perhaps the NAMES are formatted with a format that does not
> > display the entire value, as in this example.
> >
> > data test;
> >    /* read a 16-character name field, then the IDF value as posted */
> >    input First_name $16. IDF $;
> >    s = soundex(first_name);               /* encodes the full stored value */
> >    s2 = soundex(scan(first_name,1,' '));  /* encodes the first word only */
> >    format First_name $6.;                 /* display only the first 6 characters */
> >    Name = First_name;                     /* unformatted copy shows the full value */
> >    cards;
> > TAMARI J          T5623
> > TAMARI            T56
> > DEVIN  S          D151
> > DEVIN             D15
> > JULIO  H          J42
> > JULIO             J4
> > NGOC   P          N221
> > NGOC              N22
> > TAMARI            T5623
> > TAMARI            T562
> > ;;;;
> >    run;
> > proc print;
> >    run;
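
For anyone who wants to try the functions mentioned in the first post, here is a minimal sketch that puts SOUNDEX next to SPEDIS, COMPGED and COMPLEV. The data set name (compare), the variable names and the three name pairs are all made up for illustration and do not come from the thread. No expected output is shown, since the scores are best checked on your own installation.

```sas
/* Illustrative only: data set, variables and name pairs are hypothetical */
data compare;
   length name1 name2 $16;
   /* column input keeps the embedded blank in "TAMARI J" */
   input name1 $ 1-16 name2 $ 17-32;

   /* SOUNDEX encodes its whole argument, so an extra word such as a
      middle initial can lengthen the code */
   sdx_full1  = soundex(name1);
   /* taking the first word with SCAN before encoding avoids that */
   sdx_first1 = soundex(scan(name1, 1, ' '));
   sdx_first2 = soundex(scan(name2, 1, ' '));

   /* distance-style measures: 0 means identical strings,
      larger values mean the strings are less alike */
   spedis_score = spedis(strip(name1), strip(name2));
   ged          = compged(strip(name1), strip(name2));
   lev          = complev(strip(name1), strip(name2));
   datalines;
TAMARI J        TAMARI
DEVIN           DEVEN
JULIO           JULIA
;
run;

proc print data=compare;
run;
```

Comparing sdx_full1 with sdx_first1 in the first row should show the same kind of mismatch Nancy reported, while the three distance columns give a graded measure instead of a simple match or no match.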