Prev: Help Desk Software by PerlDesk
Next: A link to a collection of tutorials and videos on Data-Applied.
From: GuyA on 30 Mar 2010 04:06 Hi. I'm using PROC SQL to summarise a 100 million observation dataset. It takes about 7 hours. The system CPU time is only about 30 minutes. I'm running on a pretty powerful Windows system. What could speed this up? What bottlenecks is it likely to be running into? It's doing quite a lot of different types of calculation for the output summary (including using plenty of references to "calculated" variables). Do people think it would probably be best to do a simple PROC SUMMARY and then use a data step to calculate the necessary extra fields? Thanks.
From: Patrick on 30 Mar 2010 06:50 Hi You haven't posted the SQL code and log. Therefore only some guesswork. The bottleneck is most likely I\O. And the remedy is to reduce passes through the data to a minimum. Using calculated columns will most likely result in backmerging data (meaning an additional pass through the data). The log should tell you this. If your data is stored withing SAS then a PROC SUMMARY / Data Step combination could help. May be also writing the SQL code different could have a positive impact. HTH Patrick
From: GuyA on 30 Mar 2010 07:02
Thanks. I/O is highly likely to be part of the problem: simply creating 100 million observation datasets takes a good 15-20 minutes to write to the drive, so I guess it's having the same issues when reading. I should have mentioned that this SQL is not doing any joins, which I know SQL joining large datasets on a non-optimised platform can have catastrophic results... What is a backmerge? Is it the same as a remerge? If so, it's not having to remerge the data. I've always wondered how efficient calculated columns are. For example: group ,sum(case when something then 1 else 0 end) as count ,sum(balance1) / calculated count as balance1 ,sum(balance2) / calculated count as balance2 I would rationally want to believe that this is more efficient than re- performing the calculation for "count" multiple times. Does anybody know if this is actually not the case? |