From: Gary on
Hi all,

I've done some reading up on hash objects and I'm really keen on
picking up this tool for merging large datasets. I was wondering if
someone could point me in the right direction on the following:

1. Is it possible to merge on keys with different variable names? I
have tried renaming the keys through a dataset rename option, but the
program complains that the key cannot be found. For example:

data merge;
if 0 then set lookup (rename=(oldkeyname=newkeyname));

etc...

run;

2. Is there any documentation which shows how to mimic SQL joins using
hash object, especially with duplicate keys?

Thanks,
Gary
From: Proc Me on
Gary,

Check out Judy Loren's excellent paper:

How Do I Love Hash Tables? Let Me Count The Ways!
http://www2.sas.com/proceedings/forum2008/029-2008.pdf

Paul M Dorfman's papers are also worth checking out, especially his more
recent ones such as:
The Hash Object in Action
http://support.sas.com/resources/papers/proceedings09/153-2009.pdf

There is also a wealth of info on SAS-L,

Proc Me
From: Paul Dorfman on
Gary,

If you decide to rename a key variable, you should rename it both in the
place where you read the descriptor of the file being hashed for parameter
matching (the IF 0 THEN SET part) and where you actually load it. If you
load it explicitly, i.e. reading it in a loop, add the rename statement
there. If you are loading it via the dataset: parameter, rename it there,
as in

dcl hash h (dataset: "to_hash (rename = (old=new))") ;

Unfortunately, the possibility to specify data set options in this manner
is available in 9.2 but not in 9.1.3. If you are running the latter, you
can create a view where the key is renamed and then reference the view in
your DATA step without worrying about renaming the key in two places.

However, since there exists parameter tag KEY:, which accepts any
expression as long as it is of the expected type (i.e. numeric or
variable), none of the above is actually necessary. You can define and
load the object with the old key, but then when calling .check or .find
methods against it, specify the new key to the key: tag, as in

rc = h.find (key: old) ;

Once again, remember that what you specify to KEY: (or, in fact, any other
parameter tag) can be any expression of the correct type.

As far as mimicking SQL with hash code is concerned, it may not be as
beneficial as you expect given the latest improvements in the SQL
optimizer. If you are after an equijoin - expecially with many-to-many key
relationship - SQL will use its inner hash algorithm (SQXJHSH access
method) implicitly if you prompt it to do so with high enough buffersize=
option, and it is going to be faster than its possible hash object
replacement. A non-equijoin may be a different story, because for some
inexplicable reason, SAS SQL does not use the SQXJHSH access method for
non-equijoins. In these cases, follow the good advice and check Judy
Loren's paper.

Duplicate hash keys can be handled in both 9.1 and 9.2. In 9.2 it is
canned functionality with the availability of multi-hash and corresponding
methods; in 9.1 it can be handled programmatically in a number of
different ways, all of them relying on the addition of a unique
discriminating key to the hash in question. If you are interested in more
details, try:

http://www2.sas.com/proceedings/sugi31/232-31.pdf

where dupe hash keys are handled in 9.1.3 via a second hash table.

Kind regards
------------
Paul Dorfman
Jax, FL
------------

On Sat, 5 Dec 2009 23:30:57 -0800, Gary <gshyoung(a)GMAIL.COM> wrote:

>Hi all,
>
>I've done some reading up on hash objects and I'm really keen on
>picking up this tool for merging large datasets. I was wondering if
>someone could point me in the right direction on the following:
>
>1. Is it possible to merge on keys with different variable names? I
>have tried renaming the keys through a dataset rename option, but the
>program complains that the key cannot be found. For example:
>
>data merge;
> if 0 then set lookup (rename=(oldkeyname=newkeyname));
>
>etc...
>
>run;
>
>2. Is there any documentation which shows how to mimic SQL joins using
>hash object, especially with duplicate keys?
>
>Thanks,
>Gary