From: GuyA on
Hi.

I'm using PROC SQL to summarise a 100 million observation dataset. It
takes about 7 hours. The system CPU time is only about 30 minutes.

I'm running on a pretty powerful Windows system.

What could speed this up? What bottlenecks is it likely to be running
into?

It's doing quite a lot of different types of calculation for the
output summary (including plenty of references to "calculated"
variables). Do people think it would probably be best to do a simple
PROC SUMMARY and then use a data step to calculate the necessary extra
fields?

Thanks.
From: Patrick on
Hi

You haven't posted the SQL code or the log, so this is only
guesswork.

The bottleneck is most likely I/O, and the remedy is to reduce the
number of passes through the data to a minimum.

Using calculated columns will most likely result in backmerging data
(meaning an additional pass through the data). The log should tell you
this.
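
As an illustration of what I mean by backmerging (made-up table and
column names, since I haven't seen your code): a query like the one
below mixes a detail column with a group summary, so SAS has to merge
the summary statistic back onto every row (an extra pass), and the
log tells you so.

proc sql;
  select group
        ,balance
        ,balance / sum(balance) as share_of_group
  from have
  group by group;
quit;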

If your data is stored within SAS then a PROC SUMMARY / data step
combination could help. Writing the SQL differently might also have a
positive impact.
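
A rough sketch of what I mean (again, just guessing at dataset and
variable names): let PROC SUMMARY make the single pass over the big
table, then derive the extra fields in a data step on the much smaller
summary table.

proc summary data=have nway;
  class group;
  var amount1 amount2;
  output out=stats(drop=_type_ _freq_) sum=;
run;

data want;
  set stats;
  ratio = amount1 / amount2;  /* extra fields derived from the small summary table */
run;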

HTH
Patrick
From: GuyA on
Thanks. I/O is highly likely to be part of the problem: simply
creating 100 million observation datasets takes a good 15-20 minutes
to write to the drive, so I guess it's having the same issues when
reading.

I should have mentioned that this SQL is not doing any joins; I know
that joining large datasets in SQL on a non-optimised platform can
have catastrophic results...

What is a backmerge? Is it the same as a remerge? If so, it's not
having to remerge the data.

I've always wondered how efficient calculated columns are. For
example:

group
,sum(case when something then 1 else 0 end) as count
,sum(balance1) / calculated count as balance1
,sum(balance2) / calculated count as balance2

I would like to believe that this is more efficient than re-evaluating
the "count" expression multiple times. Does anybody know whether
that's actually the case?
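
For what it's worth, the PROC SUMMARY route I have in mind would look
roughly like this (column names as in the snippet above, dataset names
made up; the view is just so that building the flag for the CASE
condition doesn't cost a separate physical pass):

data have_v / view=have_v;
  set have;
  flag = (something);  /* 1 when the condition holds, 0 otherwise */
run;

proc summary data=have_v nway;
  class group;
  var flag balance1 balance2;
  output out=stats(drop=_type_ _freq_) sum=count sum_bal1 sum_bal2;
run;

data want;
  set stats;
  balance1 = sum_bal1 / count;
  balance2 = sum_bal2 / count;
  drop sum_bal1 sum_bal2;
run;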