From: GuyA on
Hi.

I'm using PROC SQL to summarise a 100 million observation dataset. It
takes about 7 hours. The system CPU time is only about 30 minutes.

I'm running on a pretty powerful Windows system.

What could speed this up? What bottlenecks is it likely to be running
into?

It's doing quite a lot of different types of calculation for the
output summary (including plenty of references to "calculated"
variables). Do people think it would probably be best to do a simple
PROC SUMMARY and then use a data step to calculate the necessary extra
fields?

Thanks.
From: Patrick on
Hi

You haven't posted the SQL code or the log, so this is only
guesswork.

The bottleneck is most likely I/O, and the remedy is to reduce the
number of passes through the data to a minimum.

Using calculated columns will most likely result in backmerging data
(meaning an additional pass through the data). The log should tell you
this.
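
As an illustration of what I mean by backmerging (made-up table and
column names, since I haven't seen your code): a query like the one
below mixes a detail column with a group summary, so SAS has to merge
the summary statistic back onto every row (an extra pass), and the
log tells you so.

proc sql;
  select group
        ,balance
        ,balance / sum(balance) as share_of_group
  from have
  group by group;
quit;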

If your data is stored within SAS then a PROC SUMMARY / data step
combination could help. Writing the SQL differently might also have a
positive impact.
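
A rough sketch of what I mean (again, just guessing at dataset and
variable names): let PROC SUMMARY make the single pass over the big
table, then derive the extra fields in a data step on the much smaller
summary table.

proc summary data=have nway;
  class group;
  var amount1 amount2;
  output out=stats(drop=_type_ _freq_) sum=;
run;

data want;
  set stats;
  ratio = amount1 / amount2;  /* extra fields derived from the small summary table */
run;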

HTH
Patrick
From: GuyA on
Thanks. I/O is highly likely to be part of the problem: simply
creating 100 million observation datasets takes a good 15-20 minutes
to write to the drive, so I guess it's having the same issues when
reading.

I should have mentioned that this SQL is not doing any joins; I know
that joining large datasets in SQL on a non-optimised platform can
have catastrophic results...

What is a backmerge? Is it the same as a remerge? If so, it's not
having to remerge the data.

I've always wondered how efficient calculated columns are. For
example:

group
,sum(case when something then 1 else 0 end) as count
,sum(balance1) / calculated count as balance1
,sum(balance2) / calculated count as balance2

I would like to believe that this is more efficient than re-evaluating
the "count" expression multiple times. Does anybody know whether
that's actually the case?
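
For what it's worth, the PROC SUMMARY route I have in mind would look
roughly like this (column names as in the snippet above, dataset names
made up; the view is just so that building the flag for the CASE
condition doesn't cost a separate physical pass):

data have_v / view=have_v;
  set have;
  flag = (something);  /* 1 when the condition holds, 0 otherwise */
run;

proc summary data=have_v nway;
  class group;
  var flag balance1 balance2;
  output out=stats(drop=_type_ _freq_) sum=count sum_bal1 sum_bal2;
run;

data want;
  set stats;
  balance1 = sum_bal1 / count;
  balance2 = sum_bal2 / count;
  drop sum_bal1 sum_bal2;
run;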