From: adam kortright on
Hi people,

I was doing some benchmarking of a hash data step vs. PROC SQL, both
summing two variables BY one variable, and noticed that the elapsed
times followed a distinct, and weird, pattern. Running the two pieces
of code two at a time, i.e. SQL, SQL, hash, hash, SQL, SQL, etc., the
following pattern emerged: the first time the SQL was run it would
take, on average, 9.71 seconds elapsed time (ET). However, its second
run would average 16.27. Weirdly, the hash step had it the other way
round: its first run averaged 7.41 and its second run averaged 3.37.

This pattern was 100% consistent on a standalone PC with nothing else
running, a notebook, and a networked PC (all XP machines running SAS
9.1).

Can anyone explain why this is happening? Or, even better, how to get
the fastest result the first time?


Adam
From: adam kortright on

Follow up:

Prior to every run of the code I was deleting the output tables with a
PROC SQL DROP TABLE statement to clean everything out. The DROP
appears to be the problem. Changing the DROP to a PROC DATASETS ...
DELETE fixed the weird ET pattern and produced the quickest ET
consistently. It appears that DROP deletes a dataset in much the same
way that Windows 'deletes' a file.


cheers
From: adam kortright on

Nope. Ignore the above. The original oddity has returned under PROC
DATASETS.
From: adam kortright on

OK, using a PROC DATASETS ... DELETE prior to running each piece of
code, and running A and B in pairs, the ET figures are below. As you
can see, it's weird. This happens on all three PCs I've tested on.

Seq    Meth A    Meth B
  1      6.90
  2     33.95
  3                25.85
  4                 2.90
  5      7.79
  6     32.56
  7                25.06
  8                 2.90
  9      7.96
 10     34.53
 11                24.84
 12                 2.90
 13      7.57
 14     34.06
 15                25.20
 16                 2.89
 17      7.84
 18     35.04
 19                25.98
 20                 2.90

Avg ALL  20.82     14.14
Avg 1st   7.61     25.39
Avg 2nd  34.03      2.90



fwiw, these are the two bits of code I'm trying to benchmark.

* Method A - SQL *;
Proc datasets library=work;
    delete MethodA MethodB;
run;
quit;

Proc SQL;
    Create table MethodA as
    Select Item, sum(Num) as TotSrvs, sum(BenPd) as TotBen
    From Testdata
    Group by Item;
quit;


*** Method B - Hash ***;
Proc datasets library=work;
    delete MethodA MethodB;
run;
quit;

Data _Null_;
    Set Testdata end=end;

    If _N_ = 1 Then Do;
        Declare hash sum(Hashexp:16);
        sum.definekey('Item');
        sum.definedata('Item','TotBen','TotSrvs');
        sum.definedone();
    End;

    If sum.find() = 0 Then Do;  /* key already in the hash: accumulate */
        TotBen + BenPd;
        TotSrvs + Num;
    End;
    Else Do;                    /* new key: initialise the totals */
        TotBen = BenPd;
        TotSrvs = Num;
    End;
    sum.replace();

    If end Then sum.output(Dataset: 'MethodB');
run;
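For readers less familiar with the SAS hash object, Method B is a
single-pass keyed accumulation. A minimal sketch of the same idea in
Python (the row layout (Item, Num, BenPd) mirrors the Testdata
variables; the data values here are made up, not the actual test data):

```python
from collections import defaultdict

def sum_by_item(rows):
    """One pass over (Item, Num, BenPd) rows, accumulating two sums
    per Item key in an in-memory table, as the hash step does."""
    totals = defaultdict(lambda: [0, 0])  # Item -> [TotSrvs, TotBen]
    for item, num, benpd in rows:
        totals[item][0] += num
        totals[item][1] += benpd
    return dict(totals)

rows = [("A", 2, 10.0), ("B", 1, 5.0), ("A", 3, 7.5)]
# sum_by_item(rows) -> {"A": [5, 17.5], "B": [1, 5.0]}
```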


Any help would be seriously appreciated.


adam
From: Richard A. DeVenezia on

Varying elapsed times are typically due to operations going on in the
I/O system.

There are probably a number of contributing factors:
- how many rows and columns are in Testdata
- the types of the columns
- laptop hardware: RAM, disk, disk cache
- XP virtual memory, cache, and delayed-write settings
- virus checkers
- corporate services (backups, monitoring agents, etc.)

SAS on XP does provide many diagnostics you can observe through
Windows perfmon. Check out
http://www2.sas.com/proceedings/sugi30/022-30.pdf

You could also try using the SASFILE statement to force the table into
memory before using it.
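As a rough, hypothetical illustration of the file-cache effect (not
SAS-specific): timing two sequential reads of the same file will
usually show the second pass served from the OS cache rather than the
disk, which is one reason first-run and second-run elapsed times can
differ so much.

```python
import os
import tempfile
import time

def timed_read(path):
    """Read a file sequentially in 1 MB chunks; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return time.perf_counter() - start

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 << 20))  # 4 MB of throwaway data

cold = timed_read(tmp.name)  # may have to hit the disk
warm = timed_read(tmp.name)  # usually served from the OS file cache
os.remove(tmp.name)
```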

--
Richard A. DeVenezia
http://www.devenezia.com