Merge on multiple variables [SAS]

Prev: SASAUTOS= Question
Next: Top Few Solutions from Proc LP

From: Gary on 4 May 2010 08:54

What I would like to do is merge two data sets by a unique id from
dsn1 and any one of four ids from dsn2 by examining the four alternate
IDs in order. If the first one does not match the second one is
examined and so on until a match is found or all alternates have been
used. In other words if there is no match for dsn1 UniqID and dsn2
UniqID then match dsn1 UniqID with dsn2 alt_id1, then if no match is
found try dsn2 alt_id2, still no match try dsn1 UniqID and alt_id3.
The data sets are fairly large so I would like to do this in one pass
if possible. Since not all observations have values for the all the
alternate IDs, (all alternate IDs may be missing and they are
populated in order so there is an alt_id2 only if there is an alt_id2
and an alt_id3 only if there is an alt_id1 and alt_id2) so the only
thing I can come up with is to subset the data and make three passes.
Does anyone have any other suggesions?

dsn1:
UniqID x1 x2 x3...

dsn2:
UniqID y1 y2 y3... alt_id1 alt_id2 alt_id3

Thanks for looking,

-Gary

From: Ya on 4 May 2010 12:04

On May 4, 5:54 am, Gary <garywe...(a)gmail.com> wrote:
> What I would like to do is merge two data sets by a unique id from
> dsn1 and any one of four ids from dsn2 by examining the four alternate
> IDs in order. If the first one does not match the second one is
> examined and so on until a match is found or all alternates have been
> used. In other words if there is no match for dsn1 UniqID and dsn2
> UniqID then match dsn1 UniqID with dsn2 alt_id1, then if no match is
> found try dsn2 alt_id2, still no match try dsn1 UniqID and alt_id3.
> The data sets are fairly large so I would like to do this in one pass
> if possible. Since not all observations have values for the all the
> alternate IDs, (all alternate IDs may be missing and they are
> populated in order so there is an alt_id2 only if there is an alt_id2
> and an alt_id3 only if there is an alt_id1 and alt_id2) so the only
> thing I can come up with is to subset the data and make three passes.
> Does anyone have any other suggesions?
>
> dsn1:
> UniqID x1 x2 x3...
>
> dsn2:
> UniqID y1 y2 y3... alt_id1 alt_id2 alt_id3
>
> Thanks for looking,
>
> -Gary

data ds1;
input id $ x;
cards;
1 2
2 4
3 6
4 8
5 9
;

data ds2;
input id $ aid1 $ aid2 $ aid3 $ y;
cards;
1 8 7 6 23
8 2 3 5 43
10 5 1 3 77
;

proc sql;
create table ds3 as
select ds1.id as ds1id,ds1.x, ds2.*,
case when ds1.id=ds2.id then 1
when ds1.id=ds2.aid1 then 2
when ds1.id=ds2.aid2 then 3
when ds1.id=ds2.aid3 then 4
else . end as order
from ds1 left join ds2
on indexw(compbl(ds2.id||' '||ds2.aid1||' '||ds2.aid2||' '||
ds2.aid2),trim(ds1.id))
order by ds1id, order
;

proc print;
run;

data ds3;
set ds3;
by ds1id order;
if first.ds1id;
keep ds1id x y;
run;

proc print;
run;

Obs ds1id x id aid1 aid2 aid3 y order

1 1 2 1 8 7 6 23 1
2 1 2 10 5 1 3 77 3
3 2 4 8 2 3 5 43 2
4 3 6 8 2 3 5 43 3
5 4 8 . .
6 5 9 10 5 1 3 77 2

As you can see, your requirement may result in multiple records for
each of the original ds1 id.
The order variable gives the priority based on which id match first.
The next data step
is used to filter out those low priority match. Eventually, you get
this:

Obs ds1id x y

1 1 2 23
2 2 4 43
3 3 6 43
4 4 8 .
5 5 9 77

Ya Huang

From: Gary on 5 May 2010 09:35

On May 4, 12:04 pm, Ya <huang8...(a)gmail.com> wrote:
> On May 4, 5:54 am, Gary <garywe...(a)gmail.com> wrote:
>
>
>
>
>
> > What I would like to do is merge two data sets by a unique id from
> > dsn1 and any one of four ids from dsn2 by examining the four alternate
> > IDs in order. If the first one does not match the second one is
> > examined and so on until a match is found or all alternates have been
> > used. In other words if there is no match for dsn1 UniqID and dsn2
> > UniqID then match dsn1 UniqID with dsn2 alt_id1, then if no match is
> > found try dsn2 alt_id2, still no match try dsn1 UniqID and alt_id3.
> > The data sets are fairly large so I would like to do this in one pass
> > if possible. Since not all observations have values for the all the
> > alternate IDs, (all alternate IDs may be missing and they are
> > populated in order so there is an alt_id2 only if there is an alt_id2
> > and an alt_id3 only if there is an alt_id1 and alt_id2) so the only
> > thing I can come up with is to subset the data and make three passes.
> > Does anyone have any other suggesions?
>
> > dsn1:
> > UniqID x1 x2 x3...
>
> > dsn2:
> > UniqID y1 y2 y3... alt_id1 alt_id2 alt_id3
>
> > Thanks for looking,
>
> > -Gary
>
> data ds1;
> input id $ x;
> cards;
> 1 2
> 2 4
> 3 6
> 4 8
> 5 9
> ;
>
> data ds2;
> input id $ aid1 $ aid2 $ aid3 $ y;
> cards;
> 1 8 7 6 23
> 8 2 3 5 43
> 10 5 1 3 77
> ;
>
> proc sql;
> create table ds3 as
> select ds1.id as ds1id,ds1.x, ds2.*,
> case when ds1.id=ds2.id then 1
> when ds1.id=ds2.aid1 then 2
> when ds1.id=ds2.aid2 then 3
> when ds1.id=ds2.aid3 then 4
> else . end as order
> from ds1 left join ds2
> on indexw(compbl(ds2.id||' '||ds2.aid1||' '||ds2.aid2||' '||
> ds2.aid2),trim(ds1.id))
> order by ds1id, order
> ;
>
> proc print;
> run;
>
> data ds3;
> set ds3;
> by ds1id order;
> if first.ds1id;
> keep ds1id x y;
> run;
>
> proc print;
> run;
>
> Obs ds1id x id aid1 aid2 aid3 y order
>
> 1 1 2 1 8 7 6 23 1
> 2 1 2 10 5 1 3 77 3
> 3 2 4 8 2 3 5 43 2
> 4 3 6 8 2 3 5 43 3
> 5 4 8 . .
> 6 5 9 10 5 1 3 77 2
>
> As you can see, your requirement may result in multiple records for
> each of the original ds1 id.
> The order variable gives the priority based on which id match first.
> The next data step
> is used to filter out those low priority match. Eventually, you get
> this:
>
> Obs ds1id x y
>
> 1 1 2 23
> 2 2 4 43
> 3 3 6 43
> 4 4 8 .
> 5 5 9 77
>
> Ya Huang- Hide quoted text -
>
> - Show quoted text -

Thanks Ya Huang, it is what I needed.

|
Pages: 1
Prev: SASAUTOS= Question
Next: Top Few Solutions from Proc LP