From: Gerhard Hellriegel on 12 Nov 2009 11:06

Some other things about that. First, it seems the squeeze macro also reduces
the lengths of numeric variables. That is not always a good idea, especially
if you are on a mainframe and plan to move to other platforms, or use SAS on
two or more platforms. The behaviour of numeric lengths and precision differs
between platforms, and you might get problems with that. One of my customers
did while transferring a data warehouse from mainframe to UNIX: some of the
key variables there were numeric, and suddenly duplicates occurred! I'd leave
all numerics at length 8 (the maximum). A sketch of how this can happen
follows at the end of this post.

Other thing: there are two kinds of compression. One is a simple one,
activated with

options compress=yes;

The other is a more complex one:

options compress=binary;

BINARY works with a binary compression routine, RDC ("Ross Data Compression"),
something like ZIP under Windows and others. It is more efficient than the
simple one and might have advantages if there are well-filled datasets with
many numerics. That compression also does not work at the observation level
but at the block level, which means that repeats across several observations
are also recognized. The disadvantage, for sure, is that the CPU costs are
high.

The simple compression is not too good, as Mike wrote, if you have datasets
with small numbers of variables (try it with a dataset with 0 variables and
you will see that it increases the size by 200% or 300%) or with many numeric
variables. Character variables can be treated very well, and especially if
you have long observations (many variables and/or long variables) it is very
good. SAS also tests the number of observations and switches compression off
if a dataset is too small. Normally character variables are defined with a
certain "buffer", longer than actually needed, just to be sure... Those are
the best candidates for simple compression.

In normal cases COMPRESS=YES is very good, and in some cases it just does not
bring much, so there is no "danger" in adding it; the CPU costs for it are
small. On mainframes the CPU costs could even be reduced with compression,
because the overhead of doing it is small and the I/O CPU costs might be an
important part of the total CPU consumption there. The cases where datasets
increase with compression are very rare and very exotic (who has a dataset
with 0 variables?).

A good paper about that and more:
http://www.sas.com/offices/asiapacific/sp/usergroups/smug/archive/2009/presentations/BillGibsonPerformanceQ22009.pdf

Gerhard

On Thu, 12 Nov 2009 07:53:16 -0500, Michael Raithel
<michaelraithel(a)WESTAT.COM> wrote:

>Dear SAS-L-ers,
>
>Myra posted the following:
>
>> I found this macro a number of years ago, and use it all the time to
>> reduce file sizes. I'm not sure if this is where I got it, but you can
>> see it here: http://www.nesug.org/Proceedings/nesug06/io/io18.pdf
>>
>> I only started using SAS 9.2 in the last few months. When I run the
>> macro in 9.2, I get an error message I never got before:
>>
>> WARNING: Multiple lengths were specified for the variable V1 by input
>> data set(s). This may cause truncation of data.
>>
>> I believe this message comes after this data step:
>>
>> data &DSNOUT ;
>>   &RETAIN ;
>>   %if &N_CHAR > 0 %then %str( &SQZ_CHAR ; ) ;     /* optimize char var lengths */
>>   %if &N_NUM  > 0 %then %str( &SQZ_NUM ; ) ;      /* optimize numeric var lengths */
>>   %if &N_CHAR > 0 %then %str( &SQZ_CHAR_FMT ; ) ; /* adjust char var format lengths */
>>   set &DSNIN ;
>> run ;
>>
>> I get the message for every variable. Does anyone know why this would
>> happen?
>
>Myra, ah yes, the old %SQUEEZE macro; it's good to see an old friend still
>going strong in the workforce!
>
>I can't help you with that particular problem. However, I did note that you
>"...use it all the time to reduce file sizes..." and thought that I would
>offer an alternative that you may decide to use if my SAS-L brethren and
>sistren do not come up with the answer that you need.
>
>SAS data set compression is also a good tool for reducing the size of SAS
>data sets. It is easy to use and pretty effective when your data sets have
>a lot of redundant adjacent data in them. There is a CPU time "penalty" for
>processing compressed SAS data sets, due to SAS having to decompress the
>data, but that is often made up for by faster transfer times of more
>observations per I/O. If CPU time is not an issue for you and disk space
>is, then you might want to consider SAS data set compression.
>
>Here is a link to the documentation under SAS V9.2:
>
>http://support.sas.com/documentation/cdl/en/lrcon/61722/PDF/default/lrcon.
>
>...go to page 528 and read from the bottom onward.
>
>One final note: In some cases, if you attempt to compress a SAS data set
>that does not have enough adjacent redundancy in its observations, you can
>end up with a compressed SAS data set bigger than the original. So, as
>always (at this point, you know what I am going to say, right?), check the
>SAS log after running your program! It will tell you the amount of space
>saved... and sometimes lost :-)
>
>Myra, best of luck in all of your SAS endeavors!
>
>I hope that this suggestion proves helpful now, and in the future!
>
>Of course, all of these opinions and insights are my own, and do not
>reflect those of my organization or my associates. All SAS code and/or
>methodologies specified in this posting are for illustrative purposes only
>and no warranty is stated or implied as to their accuracy or applicability.
>People deciding to use information in this posting do so at their own risk.
>
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Michael A. Raithel
>"The man who wrote the book on performance"
>E-mail: MichaelRaithel(a)westat.com
>
>Author: Tuning SAS Applications in the MVS Environment
>
>Author: Tuning SAS Applications in the OS/390 and z/OS Environments, Second Edition
>http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=58172
>
>Author: The Complete Guide to SAS Indexes
>http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=60409
>
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Anyone who lives within their means suffers from a lack of
>imagination. - Oscar Wilde
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
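A minimal sketch of the duplicate-key problem described above, assuming a
Windows/UNIX host, where a 3-byte numeric can hold integers exactly only up
to 8,192; the dataset and variable names are invented for illustration:

   /* Two distinct key values collide once the numeric length is cut
      below what the values need. In the PDV every numeric is a full
      8-byte double; truncation happens when the value is written to
      the dataset with LENGTH 3. */
   data collide;
      length short_key 3 full_key 8;
      do full_key = 8191 to 8193;    /* 8192 = 2**13, the 3-byte limit */
         short_key = full_key;
         output;
      end;
   run;

   proc print data=collide;          /* short_key shows 8192 twice:    */
   run;                              /* 8193 lost its low-order bit    */

Any merge or deduplication keyed on short_key would now treat 8192 and 8193
as the same value, which is exactly the kind of silent collision that can
turn unique mainframe keys into duplicates after a platform move.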
From: My on 12 Nov 2009 12:58

First: thanks for your comments. It's been a while since I've posted to
SAS-L, but it's great to know everyone's still here.

Art, the SAS link to which you refer is the version I've been using.

My company has a lot of proprietary data in a format that can be exported to
SPSS but not SAS. So I export to SPSS and then import into SAS. Nearly all
the variables are binary (1/0), but in SPSS they have a length of 8. There
are thousands of variables, so using %squeeze saves a huge amount of space.
I then often use COMPRESS=BINARY in addition, which saves further space. If
the file has 2,500 variables, the savings are tremendous. But as Mike pointed
out (thanks Mike!), if the log says that compressing actually increased the
size, I redo it. (I work exclusively on a PC at this time.)

I'll check out that system option, Chang. Thanks.

Myra
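The system option Myra mentions is presumably VARLENCHK, new in SAS 9.2,
which controls exactly this "multiple lengths" warning; a minimal sketch,
where the input dataset name HAVE is hypothetical:

   /* Suppress the 9.2 length-check warning around an intentional
      length reduction, then restore the default behaviour. */
   options varlenchk=nowarn;

   data want;
      length v1 $ 1;   /* shorter than v1's length in HAVE  */
      set have;        /* no "multiple lengths" warning now */
   run;

   options varlenchk=warn;

Suppressing the warning does not remove the truncation risk, so this is only
safe when, as with %SQUEEZE, the new lengths are known to accommodate every
value.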
From: Jack Hamilton on 12 Nov 2009 19:34

I question two of your statements.

On Nov 12, 2009, at 8:06 am, Gerhard Hellriegel wrote:

> BINARY works with a binary compression routine, RDC ("Ross Data
> Compression"), something like ZIP under Windows and others. It is more
> efficient than the simple one and might have advantages if there are
> well-filled datasets with many numerics. That compression also does not
> work at the observation level but at the block level, which means that
> repeats across several observations are also recognized.

The documentation at
http://support.sas.com/onlinedoc/913/getDoc/en/lrdict.hlp/a001288760.htm
says:

"This method is highly effective for compressing medium to large (several
hundred bytes or larger) blocks of binary data (numeric variables). Because
the compression function operates on a single record at a time, the record
length needs to be several hundred bytes or larger for effective
compression."

Why do you think that across-obs repeats are compressed? A quick way to test
this is sketched below.

> The cases where datasets increase with compression are very rare and very
> exotic (who has a dataset with 0 variables?).

My estimate is that about 10% of my data sets would be increased by
compression - so for me, not rare at all.

--
Jack Hamilton
jfh(a)alumni.stanford.org
Caelum non animum mutant qui trans mare currunt.
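One way to probe the single-record vs. block-level question empirically is
to compare the two COMPRESS= methods on data whose only redundancy lies
across observations; a rough sketch with invented data, not a definitive
benchmark:

   /* Each record is short and internally non-repetitive, but all
      records are identical, so any savings would have to come from
      cross-observation repeats. */
   data test_char(compress=char) test_bin(compress=binary);
      length s $ 20;
      s = 'abcdefghijklmnopqrst';   /* no repeated bytes within a record */
      do i = 1 to 100000;
         output test_char;
         output test_bin;
      end;
   run;

   /* The log notes for each output dataset whether compression
      decreased or increased its size; PROC CONTENTS shows the page
      counts for a direct comparison. */
   proc contents data=test_char; run;
   proc contents data=test_bin;  run;

If the log reports that both methods increased the size, that supports the
single-record reading of the documentation quoted above; substantial savings
here would instead suggest some cross-observation effect.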