From: Gerhard Hellriegel on
Some other things about that.
First: it seems the squeeze macro also reduces the lengths of numeric
variables. That is not always a good idea, especially if you are on a
mainframe and plan to move to other platforms, or use SAS on two or more
platforms. The behaviour of numeric lengths and precision differs between
platforms, and you might get problems with that. One of my customers did,
while transferring a data warehouse from mainframe to UNIX: some key
variables there were numeric, and suddenly duplicates occurred!
I'd leave all numerics at length 8 (the maximum).
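
To see why, here is a minimal sketch (the dataset and key values are made
up for illustration): at a reduced numeric length, SAS truncates the
stored floating-point value, so two distinct large keys can round to the
same number.

```sas
/* Hypothetical sketch: two distinct 9-digit keys stored at
   length 4 on Windows/UNIX can collapse to the same value.
   At length 4, integers are exact only up to about 2,097,152. */
data collide;
   length key 4;
   key = 123456789; output;
   key = 123456788; output;   /* differs by 1 */
run;

proc sort data=collide nodupkey;
   by key;
run;
/* Check the log: it may report a duplicate observation deleted,
   because both keys rounded to the same stored value. */
```
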
Another thing: there are two kinds of compression. One is simple,
activated with

options compress=yes;

The other is a more complex one:

options compress=binary;

Binary compression uses a binary compression routine, RDC ("Ross Data
Compression"), similar in spirit to ZIP on Windows and others. It is more
efficient than the simple method and can have advantages with well-filled
datasets containing many numerics. It also does not work at the
observation level but at the block level, which means repeats across
several observations are recognized as well. The disadvantage, for sure,
is that the CPU cost is high.
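
For anyone who wants to compare the two methods on their own data, a
quick sketch (the dataset and values are made up for illustration):

```sas
/* Generate a wide test dataset with padded character values,
   then write it out once with each compression method. */
data big;
   length pad $200;
   do i = 1 to 100000;
      pad = "short value in a long buffer";
      output;
   end;
run;

data big_yes (compress=yes) big_bin (compress=binary);
   set big;
run;
/* The log reports the percentage by which each output dataset
   was reduced, so the two methods can be compared directly. */
```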

The simple compression is not too good, as Mike wrote, if you have
datasets with few variables (try it on a dataset with 0 variables and you
will see that it increases the size by 200% or 300%) or with many numeric
variables. Character variables can be handled very well, and it works
especially well if you have long observations (many variables and/or long
variables). SAS also checks the number of observations and switches
compression off if a dataset is too small. Character variables are
normally defined with a certain "buffer", longer than actually needed, to
be on the safe side... Those are the best candidates for simple
compression.
In normal cases compress=yes is very good, and in some cases it does not
bring much, so there is no "danger" in adding it. Its CPU cost is small.
On mainframes, total CPU cost can even be reduced by compression, because
the overhead of doing it is small and the I/O CPU cost can be an
important part of the total CPU consumption there. The cases where
datasets grow with compression are very rare and very exotic (who has a
dataset with 0 variables?).

A good paper about that and more:
http://www.sas.com/offices/asiapacific/sp/usergroups/smug/archive/2009/presentations/BillGibsonPerformanceQ22009.pdf

Gerhard




On Thu, 12 Nov 2009 07:53:16 -0500, Michael Raithel
<michaelraithel(a)WESTAT.COM> wrote:

>Dear SAS-L-ers,
>
>Myra posted the following:
>
>>
>> I found this macro a number of years ago, and use it all the time to
>> reduce file sizes. I'm not sure if this is where I got it, but you can
>> see it here: http://www.nesug.org/Proceedings/nesug06/io/io18.pdf
>>
>> I only started using SAS 9.2 in the last few months. When I run the
>> macro in 9.2, I get an error message I never got before:
>>
>> WARNING: Multiple lengths were specified for the variable V1 by input
>> data set(s). This may cause truncation of data.
>>
>> I believe this message comes after this data step:
>>
>> data &DSNOUT ;
>> &RETAIN ;
>>
>> %if &N_CHAR > 0 %then %str( &SQZ_CHAR ; ) ; /* optimize char
>> var lengths */
>>
>> %if &N_NUM > 0 %then %str( &SQZ_NUM ; ) ; /* optimize
>> numeric var lengths */
>>
>> %if &N_CHAR > 0 %then %str( &SQZ_CHAR_FMT ; ) ; /* adjust char
>> var format lengths */
>>
>> set &DSNIN ;
>> run ;
>>
>> I get the message for every variable. Does anyone know why this would
>> happen?
>>
>Myra, ah yes, the old %SQUEEZE macro; it's good to see an old friend
still going strong in the workforce!
>
>I can't help you with that particular problem. However, I did note that
you "...use it all the time to reduce file sizes..." and thought that I
would offer an alternative that you may decide to use if my SAS-L brethren
and sisteren do not come up with the answer that you need.
>
>SAS data set compression is also a good tool for reducing the size of SAS
data sets. It is easy to use and pretty effective when your data sets
have a lot of redundant adjacent data in them. There is a CPU
Time "penalty" for processing compressed SAS data sets, due to SAS having
to decomp the data, but that is often made up for in faster transfer times
of more observations per I/O. If CPU Time is not an issue for you and
disk space is, then you might want to consider SAS data set compression.
>
>Here is a link to the documentation under SAS V9.2:
>
>http://support.sas.com/documentation/cdl/en/lrcon/61722/PDF/default/lrcon.pdf
>
>...go to page 528 and read from the bottom onward.
>
>One final note: In some cases, if you attempt to compress a SAS data set
that does not have enough adjacent redundancy in its observations, you can
end up with a compressed SAS data set bigger than the original. So, as
always (at this point, you know what I am going to say, right?) check the
SAS log after running your program! It will tell you the amount of space
saved... and sometimes lost:-)
>
>Myra, best of luck in all of your SAS endeavors!
>
>
>I hope that this suggestion proves helpful now, and in the future!
>
>Of course, all of these opinions and insights are my own, and do not
reflect those of my organization or my associates. All SAS code and/or
methodologies specified in this posting are for illustrative purposes only
and no warranty is stated or implied as to their accuracy or
applicability. People deciding to use information in this posting do so at
their own risk.
>
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Michael A. Raithel
>"The man who wrote the book on performance"
>E-mail: MichaelRaithel(a)westat.com
>
>Author: Tuning SAS Applications in the MVS Environment
>
>Author: Tuning SAS Applications in the OS/390 and z/OS Environments,
Second Edition
>http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=58172
>
>Author: The Complete Guide to SAS Indexes
>http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=60409
>
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Anyone who lives within their means suffers from a lack of
>imagination. - Oscar Wilde
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
From: Myra on
First: Thanks for your comments. It's been a while since I've posted
to SAS-L, but it's great to know everyone's still here.

Art, the SAS link to which you refer is the version I've been using.

My company has a lot of proprietary data in a format that can be
exported to SPSS but not to SAS. So, I export to SPSS and then import
into SAS. Nearly all the variables are binary (1/0), but in SPSS they
have a length of 8. There are thousands of variables, so using %squeeze
saves a huge amount of space. I then often use compress=binary in
addition, which saves further space. If the file has 2500 variables,
the savings are tremendous. But as Mike pointed out (thanks, Mike!),
if the log says that compressing actually increased the size, I redo
it. (I work exclusively on a PC at this time.)
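
A rough sketch of that squeeze-then-compress pipeline (the file name and
variable names are placeholders, not my real data):

```sas
/* Import the SPSS export, then store the 0/1 flags at the
   minimum numeric length and apply binary compression. */
proc import datafile="c:\data\flags.sav" out=work.flags
            dbms=spss replace;
run;

data work.squeezed (compress=binary);
   length v1-v2500 3;   /* 3 bytes is the PC minimum; plenty for 0/1 */
   set work.flags;
run;
```

Note this is the same pattern %squeeze generates (LENGTH before SET), so
SAS 9.2 will print the "Multiple lengths" warning here too; for 0/1
flags the shortening is harmless.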

I'll check out that system option, Chang. Thanks.

Myra


From: Jack Hamilton on
I question two of your statements.

On Nov 12, 2009, at 8:06 am, Gerhard Hellriegel wrote:

> Binary compression uses a binary compression routine, RDC ("Ross Data
> Compression"), similar in spirit to ZIP on Windows and others. It is more
> efficient than the simple method and can have advantages with well-filled
> datasets containing many numerics. It also does not work at the
> observation level but at the block level, which means repeats across
> several observations are recognized as well.


The documentation at

http://support.sas.com/onlinedoc/913/getDoc/en/lrdict.hlp/a001288760.htm

says:
"This method is highly effective for compressing medium to large (several hundred bytes or larger) blocks of binary data (numeric variables). Because the compression function operates on a single record at a time, the record length needs to be several hundred bytes or larger for effective compression."

Why do you think that across-obs repeats are compressed?

> The cases where datasets grow with compression are very rare and very
> exotic (who has a dataset with 0 variables?).

My estimate is that about 10% of my data sets would be increased by compression - so for me, not rare at all.


--
Jack Hamilton
jfh(a)alumni.stanford.org
Caelum non animum mutant qui trans mare currunt.
