From: Muthia Kachirayan on
Mike,

It is a good tweak !

Thanks.

On Mon, Jan 18, 2010 at 3:07 PM, Mike Zdeb <msz03(a)albany.edu> wrote:

> hi ... nice
>
> also, same answer, slight tweak of code in the hash routine ...
>
> data have;
> do key = 1 to 10;
> sat = put(key * 100 + key, z4.);
> output;
> end;
> run;
>
> %let m=1000;
>
> data want;
> declare hash rnd ();
> rnd.definekey ('uid');
> rnd.definedone ();
> do until (done);
> set have end=done nobs=n;
> do until (not rc);
> uid = ceil(ranuni(123) * n * &m);
> rc=rnd.add();
> end;
> output;
> end;
> stop;
> drop key rc;
> run;
>
> proc print data=want;
> run;
>
> Obs sat uid
> 1 0101 7504
> 2 0202 3210
> 3 0303 1784
> 4 0404 9061
> 5 0505 3572
> 6 0606 2212
> 7 0707 7865
> 8 0808 3981
> 9 0909 1247
> 10 1010 1877
>
>
> --
> Mike Zdeb
> U(a)Albany School of Public Health
> One University Place
> Rensselaer, New York 12144-3456
> P/518-402-6479 F/630-604-1475
>
> > Aihua,
> >
> > Your clarification to Dan makes your need understood. My earlier array
> > solution permutes the observation numbers in random order to get the
> > urands. There is some chance that both of them may be same. The urands
> will
> > suggest the similarity between observation numbers and urands. This can
> be
> > rectified by choosing a multiplier to the number of observations to get a
> > random sequence of urands.
> >
> > The test data set is:
> >
> > data have;
> > do key = 1 to 10;
> > sat = put(key * 100 + key, z4.);
> > output;
> > end;
> > run;
> >
> > The data set, HAVE, has KEY as the primary key in ascending order with
> some
> > SAT data.
> >
> > key sat
> > 1 0101
> > 2 0202
> > 3 0303
> > 4 0404
> > 5 0505
> > 6 0606
> > 7 0707
> > 8 0808
> > 9 0909
> > 10 1010
> >
> > The following program creates new UNIQUE IDs(uid) based on the number of
> > observations in the data set and note that the KEY is not used in the
> > process. Let us use a number, NUM_FOLD, to multiply the observation
> number,
> > say 1000.
> >
> > %let num_fold = 1000;
> >
> > data need;
> > if _n_ = 1 then do;
> > declare hash h(hashexp:16);
> > h.definekey('uid');
> > h.definedata('RID','key','uid');
> > h.definedone();
> > end;
> > do RID = 1 to num;
> > set have nobs = num ;
> > uid = ceil(ranuni(123) * num * &num_fold);
> > do rc = h.check() by 0 while (rc = 0);
> > uid = ceil(ranuni(123) * num * &num_fold); ** Try another random
> > number ;
> > rc = h.check();
> > end;
> > h.add();
> > output;
> > end;
> > h.output(dataset:'LOOKUP');
> > stop;
> > drop rc key;
> > run;
> >
> > proc print data = need;
> > run;
> >
> > The data set, NEED, gives the UID for the corresponding Record ID(RID)
> and
> > the KEY is dropped to keep the secrecy of data set.
> >
> > RID sat uid
> > 1 0101 7504
> > 2 0202 3210
> > 3 0303 1784
> > 4 0404 9061
> > 5 0505 3572
> > 6 0606 2212
> > 7 0707 7865
> > 8 0808 3981
> > 9 0909 1247
> > 10 1010 1877
> >
> > My earlier array solution is a special case when NUM_FOLD = 1.
> >
> > This program also gives another data set, LOOKUP, giving links to
> RID/KEY.
> > When it is sorted by UID, the reverse process of getting RID/KEY becomes
> > easy.
> >
> > There is another possiblity of creating UIDs based on the KEYs and not
> based
> > on the observation number. The following program does it based on MOD()
> > function. However, KEYs, have to nemeric but this restriction can be
> removed
> > in some circumstances when the character-type KEYs can be changed to
> numeric
> > by the use of functions like, PIBw. For further details refer to
> Dorfman(Key
> > indexing, Bitmapping and Hashing).
> >
> > data need;
> > if _n_ = 1 then do;
> > declare hash h(hashexp:16);
> > h.definekey('uid');
> > h.definedata('RID','key','uid');
> > h.definedone();
> > end;
> > do RID = 1 to num;
> > set have nobs = num ;
> > uid = mod(key, num) + 1;
> > do rc = h.check() by 0 while (rc = 0);
> > ** Try another random number ;
> > uid = uid + 1;
> > if uid > num then uid = 1;
> > rc = h.check();
> > end;
> > h.add();
> > output;
> > end;
> > h.output(dataset:'LOOKUP');
> > stop;
> > drop rc key;
> > run;
> >
> > Do you find this program useful to solve your issue ? Your feedback will
> be
> > useful to SAS-Lers to give alternate solutions.
> >
> > Kind regards,
> > Muthia Kachirayan
> >
> >
> >
> >
> > On Sat, Jan 16, 2010 at 5:30 PM, Ai Hua Wang <aihuawang(a)yahoo.com>
> wrote:
> >
> >> Hi Dan:
> >>
> >> Thank you very much for your thoughtful follow up. Please see my answers
> >> below.
> >>
> >> Why does your multiplier need to be proportional to dataset size?
> >> That is just my thought after I tried. Because when I use the smaller
> >> multiplier I got much more duplicates. When I increase it I got less.
> >> Eventually I found that it should be at least propotional to the size of
> the
> >> data set.
> >>
> >> Why do you want random integers assigned to your data?
> >> I need to use the assigned random integers as the unique id to allow the
> >> data users to identify each unique record. I thought it is better to use
> the
> >> integer than the decimal numbers.
> >>
> >> And why do they need to be unique?
> >> See above description and plus:
> >> It is used as the replacement of the sensitive information (unique id)
> for
> >> the privacy and confidentialiy concern.
> >>
> >> I hope this is helpful when you provide more insightful answers.
> >>
> >> Best Regards,
> >> Aihua
> >>
> >>
> >
>
>