tricky sum [SAS]

Prev: 1)if and if-else 2)if and where 3)keep=option and keep statement4)input function and input statement 5)scan and substr6)if and if then do7)informat and format i had good knowledge about those difference with examples.
Next: Spectral Analysis

From: "Data _null_;" on 23 Feb 2010 09:41

So, each ID is a lower triangluar matric of size equal to the rows for each ID.

This seems to work.

data have;
input id $ d0 d1 d2 d3;
cards;
1 10 0 0 0
1 8 5 0 0
1 6 10 6 0
1 4 7 4 3
2 8 0 0 0
2 4 9 0 0
2 7 8 5 0
2 10 3 10 6
4 8 0 0 0
4 4 9 0 0
4 7 8 5 0
;;;;
run;
data have;
do k = 1 by 1 until(last.id);
set have;
by id;
array d[*] d:;
array _x[4];
drop _x:;
do _n_ = 1 to k; * lower triangluar short circuit;
_x[_n_] = sum(_x[_n_],d[_n_]);
end;
end;
drop k;
*put the sums on the rows and output;
do _n_ = 1 by 1 until(last.id);
set have;
by id;
x = _x[_n_];
output;
end;
run;
proc print;
run;

From: Mike Zdeb on 23 Feb 2010 10:17

hi ... if Jack's interpretation of the "problem" is correct, this modified
version of what I posted yesterday will also work

use an INDEX, NOT a MERGE ... the index will begin at the first value within
an id_bb group and work its way though the group with repeats of the
same value of the index

here's a data set with 6 variables (d0-d5) and lots of different group sizes ...

groups 1, 3, 4, and 5 have fewer members than the number of variables

group 2 has 6 members

group 6 has more members than variables ... what should be done here ???
this method just repeats the last sum

* data set with 6 variables to SUM and varying group size;
data have;
input id_bb d0-d5;
datalines;
1 10 0 0 9 0 0
1 8 5 0 9 0 0
2 6 10 6 9 6 0
2 4 7 4 9 4 3
2 8 0 0 9 0 0
2 4 9 0 9 0 0
2 7 8 5 9 5 0
2 10 3 10 9 10 6
3 4 1 0 9 0 0
3 2 8 0 9 0 0
4 7 8 5 9 5 0
5 10 0 0 9 0 0
5 8 5 0 9 0 0
5 6 10 6 9 6 0
6 10 0 0 9 0 0
6 4 7 4 9 4 3
6 4 7 4 9 4 3
6 4 7 4 9 4 3
6 4 7 4 9 4 3
6 4 7 4 9 4 3
6 8 5 0 9 0 0
6 6 10 6 9 6 0
;
run;

* find variable sums within groups;
proc summary data=have nway;
var d0-d5;
class id_bb;
output out=stats (drop=_:) sum=;
run;

* transpose the data and add an INDEX to resulting data set TSTATS;
proc transpose data=stats out=tstats (keep=id_bb col1 rename=(col1=x) index=(id_bb));
by id_bb;
run;

* add the sum to the appropriate observations using the INDEX, not a merge;
data both;
set have;
set tstats key=id_bb;
run;

* what happened;
title 'SUMS FOR VARIABLES WITHIN ID_BB';
proc print data=stats;
run;

title 'SUMS MATCHED TO APPROPRIATE OBSERVATIONS';
proc print data=both;
run;

output ...

SUMS FOR VARIABLES WITHIN ID_BB
Obs id_bb d0 d1 d2 d3 d4 d5
1 1 18 5 0 18 0 0
2 2 39 37 25 54 25 9
3 3 6 9 0 18 0 0
4 4 7 8 5 9 5 0
5 5 24 15 6 27 6 0
6 6 44 50 26 72 26 15

SUMS MATCHED TO APPROPRIATE OBSERVATIONS
Obs id_bb d0 d1 d2 d3 d4 d5 x
1 1 10 0 0 9 0 0 18
2 1 8 5 0 9 0 0 5
3 2 6 10 6 9 6 0 39
4 2 4 7 4 9 4 3 37
5 2 8 0 0 9 0 0 25
6 2 4 9 0 9 0 0 54
7 2 7 8 5 9 5 0 25
8 2 10 3 10 9 10 6 9
9 3 4 1 0 9 0 0 6
10 3 2 8 0 9 0 0 9
11 4 7 8 5 9 5 0 7
12 5 10 0 0 9 0 0 24
13 5 8 5 0 9 0 0 15
14 5 6 10 6 9 6 0 6
15 6 10 0 0 9 0 0 44
16 6 4 7 4 9 4 3 50
17 6 4 7 4 9 4 3 26
18 6 4 7 4 9 4 3 72
19 6 4 7 4 9 4 3 26
20 6 4 7 4 9 4 3 15
21 6 8 5 0 9 0 0 15
22 6 6 10 6 9 6 0 15

--
Mike Zdeb
U(a)Albany School of Public Health
One University Place
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475

> Chris,
>
>
>
> I understand that your data may have fewer than 126 observations per ID.
> From your description below, it sounds like the number of observations
> for a particular ID will be the number of numeric variables (D0-D125)
> which need to be summed. Is this correct?
>
>
>
> Will the numeric variables always be populated starting with D0 and
> going up toward D125?
>
>
>
> For example, when an ID has 8 observations does this mean that D0 - D07
> will always be the variables which need to be summed? Or could the
> variables needing to be summed be spread out across the D0 - D125 range?
>
>
>
> If the range of variables to be summed always starts with D0, D1, D2,
> etc. and contains a number of numeric variables to be summed equivalent
> to the number of observations for that ID, a slight modification to my
> solution from yesterday may work. The main logic modification is in the
> final merge, where a subsetting IF statement is added to only keep
> observations from the original data. I have added an additional ID
> (which does not contain 4 observations) to the sample data for testing.
>
>
>
>
>
>
>
>
>
> *(using your sample data);
>
> data have;
>
> infile cards missover;
>
> input id_bb d0 d1 d2 d3;
>
> cards;
>
> 1 10 0 0 0
>
> 1 8 5 0 0
>
> 1 6 10 6 0
>
> 1 4 7 4 3
>
> 2 8 0 0 0
>
> 2 4 9 0 0
>
> 2 7 8 5 0
>
> 2 10 3 10 6
>
> 3 4 1 0 0
>
> 3 2 8 0 0
>
> ;
>
> run;
>
>
>
> proc print data = have;
>
> run;
>
>
>
>
>
> * create SUM_VAR containing variable name of summed variable ;
>
> data next (drop=arr_cnt);
>
> set have;
>
> by id_bb;
>
> array dvars(*) d0-d3;
>
> if first.id_bb then arr_cnt = 1;
>
> sum_var = vname(dvars(arr_cnt));
>
> arr_cnt + 1;
>
> run;
>
>
>
> proc print data = next;
>
> run;
>
>
>
> * sum numeric fields by ID_BB ;
>
> proc summary data = have nway missing;
>
> class id_bb;
>
> var d0-d3;
>
> output out = have_sum (drop=_type_ _freq_) sum=; run;
>
>
>
> proc print data = have_sum;
>
> run;
>
>
>
> * transpose summed data - rename fields ;
>
> proc transpose data = have_sum out = have_sumt (rename=(_name_=sum_var
> col1=x));
>
> by id_bb;
>
> run;
>
>
>
> proc print data = have_sumt;
>
> run;
>
>
>
>
>
> * merge summed data to original data ;
>
> data need;
>
> merge next (in=a) have_sumt;
>
> by id_bb sum_var;
>
> if a;
>
> run;
>
>
>
> proc print data = need;
>
> run;
>
>
>
>
>
>
>
> Jack
>
>
> Jack Clark
> Senior Research Analyst
> phone: 410-455-6256
> fax: 410-455-6850
> jclark(a)hilltop.umbc.edu ________________________________
>
> From: Christophe Godlewski [mailto:chris.godlewski(a)gmail.com]
> Sent: Monday, February 22, 2010 2:46 PM
> To: Jack Clark
> Cc: SAS-L(a)listserv.uga.edu
> Subject: Re: tricky sum
>
>
>
> Hello Jack
>
> Thank you very much for your help
> it works fine with the data for the example
> unfortunately, in the full dataset, each id_bb can have less than 126
> observations...
> it can be 1, 3, 89 or 126...
> 126 is the maximum actually
> more precisely, the "matrix" has 126 terms for each id_bb
> but for instance if an id_bb has only 89 observations, then for d89 to
> d125, there will be 0 in the cells...
> now i am thinking of a trick to make your code applicable to my data...
> but if you have a smart idea im taking it !
>
> thank you again for your help
> best
> CG
>
> On Mon, Feb 22, 2010 at 8:18 PM, Jack Clark <jclark(a)hilltop.umbc.edu>
> wrote:
>
> Chris,
>
> If I understand the question correctly, this means that each ID_BB value
> in your data has exactly 126 observations? This is because you will
> need 126 observations per ID to hold the sums of D0-D125. Is that
> correct?
>
> This solution includes the creation of a new variable called SUM_VAR
> which has the value of the variable name being summed as the value of X.
> You can drop it at the end if you don't want it. The data are
> summarized by ID_BB to get the sum of each numeric variable. That
> summarized data is transposed and then merged to the original data by
> ID_BB and SUM_VAR.
>
> Let me know if this meets your needs.
>
>
> *(using your sample data);
> data have;
> infile cards missover;
> input id_bb d0 d1 d2 d3;
> cards;
>
> 1 10 0 0 0
> 1 8 5 0 0
> 1 6 10 6 0
> 1 4 7 4 3
> 2 8 0 0 0
> 2 4 9 0 0
> 2 7 8 5 0
> 2 10 3 10 6
>
> ;
> run;
>
>
> * create SUM_VAR containing variable name of summed variable ;
> data next (drop=arr_cnt);
> set have;
> by id_bb;
> array dvars(*) d0-d3;
> if first.id_bb then arr_cnt = 1;
> sum_var = vname(dvars(arr_cnt));
> arr_cnt + 1;
> run;
>
> proc print data = next;
> run;
>
> * sum numeric fields by ID_BB ;
> proc summary data = have nway missing;
> class id_bb;
> var d0-d3;
> output out = have_sum (drop=_type_ _freq_) sum=;
> run;
>
> proc print data = have_sum;
> run;
>
> * transpose summed data - rename fields ;
> proc transpose data = have_sum out = have_sumt (rename=(_name_=sum_var
> col1=x));
> by id_bb;
> run;
>
> proc print data = have_sumt;
> run;
>
>
> * merge summed data to original data ;
> data need;
> merge next have_sumt;
> by id_bb sum_var;
> run;
>
> proc print data = need;
> run;
>
>
>
>
> Jack
>
>
>
>
>
>
> Jack Clark
> Senior Research Analyst
> phone: 410-455-6256
> fax: 410-455-6850
> jclark(a)hilltop.umbc.edu
>
> University of Maryland, Baltimore County
> Sondheim Hall, 3rd Floor
> 1000 Hilltop Circle
> Baltimore, MD 21250
>
>
>
>
> Confidentiality Notice: This e-mail may contain information that is
> legally privileged and that is intended only for the use of the
> addressee(s) named above. If you are not the intended recipient, you are
> hereby notified that any disclosure, copying of this e-mail,
> distribution, or action taken in reliance on the contents of this e-mail
> and/or documents attributed to this e-mail is strictly prohibited. If
> you have received this information in error, please notify the sender
> immediately by phone and delete this entire e-mail. Thank
> you.-----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of
> ChrisG
> Sent: Monday, February 22, 2010 11:07 AM
> To: SAS-L(a)LISTSERV.UGA.EDU
> Subject: tricky sum
>
>
> Hi everyone
>
> Here is my tricky sum problem
> I have such a dataset
>
> id_bb d0 d1 d2 d3
> 1 10 0 0 0
> 1 8 5 0 0
> 1 6 10 6 0
> 1 4 7 4 3
> 2 8 0 0 0
> 2 4 9 0 0
> 2 7 8 5 0
> 2 10 3 10 6
>
> id_bb = identifier, then some numeric variables d0-d3
> in reality i have more than 50000 lines, more than 3000 unique
> identifiers and it goes from d0 until d125 !
>
> now my dream is to have this:
>
> id_bb d0 d1 d2 d3 x
> 1 10 0 0 0 28
> 1 8 5 0 0 22
> 1 6 10 6 0 10
> 1 4 7 4 3 3
> 2 8 0 0 0 29
> 2 4 9 0 0 20
> 2 7 8 5 0 15
> 2 10 3 10 6 6
>
> this x variable is the trick sum i need
> and it goes like this:
> for the first id_bb:
> x in the first line = sum(d0), i.e. 10+8+6+4
> x in the second line = sum(d1), i.e. 0+5+10+7
> so on until the last id_bb=1, i.e. x = sum(d3) = 0+0+0+3
> then again for the second id_bb:
> x in the first line = sum(d0), ie. x = 8+4+7+10
> and so on until the last variable, ie. d3
>
> and at the end, in reality, this should happen for every id_bb (more
> than 3000 unique id_bb) and for the d0 until d125...
>
> i have no clue how to do it in SAS...
> in Excel this will surely be a pain as well and i don't know how to
> use VBA...
>
> So if someone could help me out with this, i would be very greatful !
> thank you in advance
> Best
> CG
>
>
>
>
> --
> Christophe J. GODLEWSKI
> Strasbourg, France
>

From: ChrisG on 23 Feb 2010 17:02

On 23 fév, 16:17, ms...(a)albany.edu (Mike Zdeb) wrote:
> hi ... if Jack's interpretation of the "problem" is correct, this modified
> version of what I posted yesterday will also work
>
> use an INDEX, NOT a MERGE ... the index will begin at the first value within
> an id_bb group and work its way though the group with repeats of the
> same value of the index
>
> here's a data set with 6 variables (d0-d5) and lots of different group sizes ...
>
> groups 1, 3, 4, and 5 have fewer members than the number of variables
>
> group 2 has 6 members
>
> group 6 has more members than variables ... what should be done here ???
> this method just repeats the last sum
>
> * data set with 6 variables to SUM and varying group size;
> data have;
> input id_bb d0-d5;
> datalines;
> 1 10 0 0 9 0 0
> 1 8 5 0 9 0 0
> 2 6 10 6 9 6 0
> 2 4 7 4 9 4 3
> 2 8 0 0 9 0 0
> 2 4 9 0 9 0 0
> 2 7 8 5 9 5 0
> 2 10 3 10 9 10 6
> 3 4 1 0 9 0 0
> 3 2 8 0 9 0 0
> 4 7 8 5 9 5 0
> 5 10 0 0 9 0 0
> 5 8 5 0 9 0 0
> 5 6 10 6 9 6 0
> 6 10 0 0 9 0 0
> 6 4 7 4 9 4 3
> 6 4 7 4 9 4 3
> 6 4 7 4 9 4 3
> 6 4 7 4 9 4 3
> 6 4 7 4 9 4 3
> 6 8 5 0 9 0 0
> 6 6 10 6 9 6 0
> ;
> run;
>
> * find variable sums within groups;
> proc summary data=have nway;
> var d0-d5;
> class id_bb;
> output out=stats (drop=_:) sum=;
> run;
>
> * transpose the data and add an INDEX to resulting data set TSTATS;
> proc transpose data=stats out=tstats (keep=id_bb col1 rename=(col1=x) index=(id_bb));
> by id_bb;
> run;
>
> * add the sum to the appropriate observations using the INDEX, not a merge;
> data both;
> set have;
> set tstats key=id_bb;
> run;
>
> * what happened;
> title 'SUMS FOR VARIABLES WITHIN ID_BB';
> proc print data=stats;
> run;
>
> title 'SUMS MATCHED TO APPROPRIATE OBSERVATIONS';
> proc print data=both;
> run;
>
> output ...
>
> SUMS FOR VARIABLES WITHIN ID_BB
> Obs id_bb d0 d1 d2 d3 d4 d5
> 1 1 18 5 0 18 0 0
> 2 2 39 37 25 54 25 9
> 3 3 6 9 0 18 0 0
> 4 4 7 8 5 9 5 0
> 5 5 24 15 6 27 6 0
> 6 6 44 50 26 72 26 15
>
> SUMS MATCHED TO APPROPRIATE OBSERVATIONS
> Obs id_bb d0 d1 d2 d3 d4 d5 x
> 1 1 10 0 0 9 0 0 18
> 2 1 8 5 0 9 0 0 5
> 3 2 6 10 6 9 6 0 39
> 4 2 4 7 4 9 4 3 37
> 5 2 8 0 0 9 0 0 25
> 6 2 4 9 0 9 0 0 54
> 7 2 7 8 5 9 5 0 25
> 8 2 10 3 10 9 10 6 9
> 9 3 4 1 0 9 0 0 6
> 10 3 2 8 0 9 0 0 9
> 11 4 7 8 5 9 5 0 7
> 12 5 10 0 0 9 0 0 24
> 13 5 8 5 0 9 0 0 15
> 14 5 6 10 6 9 6 0 6
> 15 6 10 0 0 9 0 0 44
> 16 6 4 7 4 9 4 3 50
> 17 6 4 7 4 9 4 3 26
> 18 6 4 7 4 9 4 3 72
> 19 6 4 7 4 9 4 3 26
> 20 6 4 7 4 9 4 3 15
> 21 6 8 5 0 9 0 0 15
> 22 6 6 10 6 9 6 0 15
>
> --
> Mike Zdeb
> U(a)Albany School of Public Health
> One University Place
> Rensselaer, New York 12144-3456
> P/518-402-6479 F/630-604-1475
>
> > Chris,
>
> > I understand that your data may have fewer than 126 observations per ID..
> > From your description below, it sounds like the number of observations
> > for a particular ID will be the number of numeric variables (D0-D125)
> > which need to be summed. Is this correct?
>
> > Will the numeric variables always be populated starting with D0 and
> > going up toward D125?
>
> > For example, when an ID has 8 observations does this mean that D0 - D07
> > will always be the variables which need to be summed? Or could the
> > variables needing to be summed be spread out across the D0 - D125 range?
>
> > If the range of variables to be summed always starts with D0, D1, D2,
> > etc. and contains a number of numeric variables to be summed equivalent
> > to the number of observations for that ID, a slight modification to my
> > solution from yesterday may work. The main logic modification is in the
> > final merge, where a subsetting IF statement is added to only keep
> > observations from the original data. I have added an additional ID
> > (which does not contain 4 observations) to the sample data for testing.
>
> > *(using your sample data);
>
> > data have;
>
> > infile cards missover;
>
> > input id_bb d0 d1 d2 d3;
>
> > cards;
>
> > 1 10 0 0 0
>
> > 1 8 5 0 0
>
> > 1 6 10 6 0
>
> > 1 4 7 4 3
>
> > 2 8 0 0 0
>
> > 2 4 9 0 0
>
> > 2 7 8 5 0
>
> > 2 10 3 10 6
>
> > 3 4 1 0 0
>
> > 3 2 8 0 0
>
> > ;
>
> > run;
>
> > proc print data = have;
>
> > run;
>
> > * create SUM_VAR containing variable name of summed variable ;
>
> > data next (drop=arr_cnt);
>
> > set have;
>
> > by id_bb;
>
> > array dvars(*) d0-d3;
>
> > if first.id_bb then arr_cnt = 1;
>
> > sum_var = vname(dvars(arr_cnt));
>
> > arr_cnt + 1;
>
> > run;
>
> > proc print data = next;
>
> > run;
>
> > * sum numeric fields by ID_BB ;
>
> > proc summary data = have nway missing;
>
> > class id_bb;
>
> > var d0-d3;
>
> > output out = have_sum (drop=_type_ _freq_) sum=; run;
>
> > proc print data = have_sum;
>
> > run;
>
> > * transpose summed data - rename fields ;
>
> > proc transpose data = have_sum out = have_sumt (rename=(_name_=sum_var
> > col1=x));
>
> > by id_bb;
>
> > run;
>
> > proc print data = have_sumt;
>
> > run;
>
> > * merge summed data to original data ;
>
> > data need;
>
> > merge next (in=a) have_sumt;
>
> > by id_bb sum_var;
>
> > if a;
>
> > run;
>
> > proc print data = need;
>
> > run;
>
> > Jack
>
> > Jack Clark
> > Senior Research Analyst
> > phone: 410-455-6256
> > fax: 410-455-6850
> > jcl...(a)hilltop.umbc.edu ________________________________
>
> > From: Christophe Godlewski [mailto:chris.godlew...(a)gmail.com]
> > Sent: Monday, February 22, 2010 2:46 PM
> > To: Jack Clark
> > Cc: SA...(a)listserv.uga.edu
> > Subject: Re: tricky sum
>
> > Hello Jack
>
> > Thank you very much for your help
> > it works fine with the data for the example
> > unfortunately, in the full dataset, each id_bb can have less than 126
> > observations...
> > it can be 1, 3, 89 or 126...
> > 126 is the maximum actually
> > more precisely, the "matrix" has 126 terms for each id_bb
> > but for instance if an id_bb has only 89 observations, then for d89 to
> > d125, there will be 0 in the cells...
> > now i am thinking of a trick to make your code applicable to my data...
> > but if you have a smart idea im taking it !
>
> > thank you again for your help
> > best
> > CG
>
> > On Mon, Feb 22, 2010 at 8:18 PM, Jack Clark <jcl...(a)hilltop.umbc.edu>
> > wrote:
>
> > Chris,
>
> > If I understand the question correctly, this means that each ID_BB value
> > in your data has exactly 126 observations? This is because you will
> > need 126 observations per ID to hold the sums of D0-D125. Is that
> > correct?
>
> > This solution includes the creation of a new variable called SUM_VAR
> > which has the value of the variable name being summed as the value of X..
> > You can drop it at the end if you don't want it. The data are
> > summarized by ID_BB to get the sum of each numeric variable. That
> > summarized data is transposed and then merged to the original data by
> > ID_BB and SUM_VAR.
>
> > Let me know if this meets your needs.
>
> > *(using your sample data);
> > data have;
> > infile cards missover;
> > input id_bb d0 d1 d2 d3;
> > cards;
>
> > 1 10 0 0 0
> > 1 8 5 0 0
> > 1 6 10 6 0
> > 1 4 7 4 3
> > 2 8 0 0 0
> > 2 4 9 0 0
> > 2 7 8 5 0
> > 2 10 3 10 6
>
> > ;
> > run;
>
> > * create SUM_VAR containing variable name of summed variable ;
> > data next (drop=arr_cnt);
> > set have;
> > by id_bb;
> > array dvars(*) d0-d3;
> > if first.id_bb then arr_cnt = 1;
> > sum_var = vname(dvars(arr_cnt));
> > arr_cnt + 1;
> > run;
>
> > proc print data = next;
> > run;
>
> > * sum numeric fields by ID_BB ;
> > proc summary data = have nway missing;
> > class id_bb;
> > var d0-d3;
> > output out = have_sum (drop=_type_ _freq_) sum=;
> > run;
>
> > proc print data = have_sum;
> > run;
>
> > * transpose summed data - rename fields ;
> > proc transpose data = have_sum out = have_sumt (rename=(_name_=sum_var
> > col1=x));
> > by id_bb;
> > run;
>
> > proc print data = have_sumt;
> > run;
>
> > * merge summed data to original data ;
> > data need;
> > merge next have_sumt;
> > by id_bb sum_var;
> > run;
>
> > proc print data = need;
> > run;
>
> > Jack
>
> > Jack Clark
> > Senior Research Analyst
> > phone: 410-455-6256
> > fax: 410-455-6850
> > jcl...(a)hilltop.umbc.edu
>
> > University of Maryland, Baltimore County
> > Sondheim Hall, 3rd Floor
> > 1000 Hilltop Circle
>
> ...
>
> plus de détails »

Hello to all

thank you very much for your kind help
i am afraid i am not a good pedagog ... so i apologize if i have
mislead you with my unprecise explanations
i will test your proposition and get back to you soon
my example was maybe to simple
in fact variables go from d0 to d125 for each of the 3125 (if i
remember well) id (each id is a bank actually)
but as i am not a sas expert (i guess you figured that out already) my
code was written in a not very flexible way, i.e. d125 is a "maximum"
and might be true for a few id (for these id d0 until d125 won't be
null) but for other id (most of them) d0 until d34 or d89 and so on
wil have non 0 values and these are of interest for me
so to make a long story short each id has a d0 but not every id will
have some relevant values until d125, most of them will stop before,
for instance d3, d34 or d89
the problem is taht the full dataset is quite big (in my opnion, it
has more than 50.000 obs.) and i get lost quite fast when i try to
spot something by simply visualizing it
so as i said i will get back to that issue tomorrow morning (local
time in France) and will test your propositions

again thank you very much for your kind help
best
CG

From: ChrisG on 24 Feb 2010 15:22

On 23 fév, 23:02, ChrisG <chris.godlew...(a)gmail.com> wrote:
> On 23 fév, 16:17, ms...(a)albany.edu (Mike Zdeb) wrote:
>
> > hi ... if Jack's interpretation of the "problem" is correct, this modified
> > version of what I posted yesterday will also work
>
> > use an INDEX, NOT a MERGE ... the index will begin at the first value within
> > an id_bb group and work its way though the group with repeats of the
> > same value of the index
>
> > here's a data set with 6 variables (d0-d5) and lots of different group sizes ...
>
> > groups 1, 3, 4, and 5 have fewer members than the number of variables
>
> > group 2 has 6 members
>
> > group 6 has more members than variables ... what should be done here ???
> > this method just repeats the last sum
>
> > * data set with 6 variables to SUM and varying group size;
> > data have;
> > input id_bb d0-d5;
> > datalines;
> > 1 10 0 0 9 0 0
> > 1 8 5 0 9 0 0
> > 2 6 10 6 9 6 0
> > 2 4 7 4 9 4 3
> > 2 8 0 0 9 0 0
> > 2 4 9 0 9 0 0
> > 2 7 8 5 9 5 0
> > 2 10 3 10 9 10 6
> > 3 4 1 0 9 0 0
> > 3 2 8 0 9 0 0
> > 4 7 8 5 9 5 0
> > 5 10 0 0 9 0 0
> > 5 8 5 0 9 0 0
> > 5 6 10 6 9 6 0
> > 6 10 0 0 9 0 0
> > 6 4 7 4 9 4 3
> > 6 4 7 4 9 4 3
> > 6 4 7 4 9 4 3
> > 6 4 7 4 9 4 3
> > 6 4 7 4 9 4 3
> > 6 8 5 0 9 0 0
> > 6 6 10 6 9 6 0
> > ;
> > run;
>
> > * find variable sums within groups;
> > proc summary data=have nway;
> > var d0-d5;
> > class id_bb;
> > output out=stats (drop=_:) sum=;
> > run;
>
> > * transpose the data and add an INDEX to resulting data set TSTATS;
> > proc transpose data=stats out=tstats (keep=id_bb col1 rename=(col1=x) index=(id_bb));
> > by id_bb;
> > run;
>
> > * add the sum to the appropriate observations using the INDEX, not a merge;
> > data both;
> > set have;
> > set tstats key=id_bb;
> > run;
>
> > * what happened;
> > title 'SUMS FOR VARIABLES WITHIN ID_BB';
> > proc print data=stats;
> > run;
>
> > title 'SUMS MATCHED TO APPROPRIATE OBSERVATIONS';
> > proc print data=both;
> > run;
>
> > output ...
>
> > SUMS FOR VARIABLES WITHIN ID_BB
> > Obs id_bb d0 d1 d2 d3 d4 d5
> > 1 1 18 5 0 18 0 0
> > 2 2 39 37 25 54 25 9
> > 3 3 6 9 0 18 0 0
> > 4 4 7 8 5 9 5 0
> > 5 5 24 15 6 27 6 0
> > 6 6 44 50 26 72 26 15
>
> > SUMS MATCHED TO APPROPRIATE OBSERVATIONS
> > Obs id_bb d0 d1 d2 d3 d4 d5 x
> > 1 1 10 0 0 9 0 0 18
> > 2 1 8 5 0 9 0 0 5
> > 3 2 6 10 6 9 6 0 39
> > 4 2 4 7 4 9 4 3 37
> > 5 2 8 0 0 9 0 0 25
> > 6 2 4 9 0 9 0 0 54
> > 7 2 7 8 5 9 5 0 25
> > 8 2 10 3 10 9 10 6 9
> > 9 3 4 1 0 9 0 0 6
> > 10 3 2 8 0 9 0 0 9
> > 11 4 7 8 5 9 5 0 7
> > 12 5 10 0 0 9 0 0 24
> > 13 5 8 5 0 9 0 0 15
> > 14 5 6 10 6 9 6 0 6
> > 15 6 10 0 0 9 0 0 44
> > 16 6 4 7 4 9 4 3 50
> > 17 6 4 7 4 9 4 3 26
> > 18 6 4 7 4 9 4 3 72
> > 19 6 4 7 4 9 4 3 26
> > 20 6 4 7 4 9 4 3 15
> > 21 6 8 5 0 9 0 0 15
> > 22 6 6 10 6 9 6 0 15
>
> > --
> > Mike Zdeb
> > U(a)Albany School of Public Health
> > One University Place
> > Rensselaer, New York 12144-3456
> > P/518-402-6479 F/630-604-1475
>
> > > Chris,
>
> > > I understand that your data may have fewer than 126 observations per ID.
> > > From your description below, it sounds like the number of observations
> > > for a particular ID will be the number of numeric variables (D0-D125)
> > > which need to be summed. Is this correct?
>
> > > Will the numeric variables always be populated starting with D0 and
> > > going up toward D125?
>
> > > For example, when an ID has 8 observations does this mean that D0 - D07
> > > will always be the variables which need to be summed? Or could the
> > > variables needing to be summed be spread out across the D0 - D125 range?
>
> > > If the range of variables to be summed always starts with D0, D1, D2,
> > > etc. and contains a number of numeric variables to be summed equivalent
> > > to the number of observations for that ID, a slight modification to my
> > > solution from yesterday may work. The main logic modification is in the
> > > final merge, where a subsetting IF statement is added to only keep
> > > observations from the original data. I have added an additional ID
> > > (which does not contain 4 observations) to the sample data for testing.
>
> > > *(using your sample data);
>
> > > data have;
>
> > > infile cards missover;
>
> > > input id_bb d0 d1 d2 d3;
>
> > > cards;
>
> > > 1 10 0 0 0
>
> > > 1 8 5 0 0
>
> > > 1 6 10 6 0
>
> > > 1 4 7 4 3
>
> > > 2 8 0 0 0
>
> > > 2 4 9 0 0
>
> > > 2 7 8 5 0
>
> > > 2 10 3 10 6
>
> > > 3 4 1 0 0
>
> > > 3 2 8 0 0
>
> > > ;
>
> > > run;
>
> > > proc print data = have;
>
> > > run;
>
> > > * create SUM_VAR containing variable name of summed variable ;
>
> > > data next (drop=arr_cnt);
>
> > > set have;
>
> > > by id_bb;
>
> > > array dvars(*) d0-d3;
>
> > > if first.id_bb then arr_cnt = 1;
>
> > > sum_var = vname(dvars(arr_cnt));
>
> > > arr_cnt + 1;
>
> > > run;
>
> > > proc print data = next;
>
> > > run;
>
> > > * sum numeric fields by ID_BB ;
>
> > > proc summary data = have nway missing;
>
> > > class id_bb;
>
> > > var d0-d3;
>
> > > output out = have_sum (drop=_type_ _freq_) sum=; run;
>
> > > proc print data = have_sum;
>
> > > run;
>
> > > * transpose summed data - rename fields ;
>
> > > proc transpose data = have_sum out = have_sumt (rename=(_name_=sum_var
> > > col1=x));
>
> > > by id_bb;
>
> > > run;
>
> > > proc print data = have_sumt;
>
> > > run;
>
> > > * merge summed data to original data ;
>
> > > data need;
>
> > > merge next (in=a) have_sumt;
>
> > > by id_bb sum_var;
>
> > > if a;
>
> > > run;
>
> > > proc print data = need;
>
> > > run;
>
> > > Jack
>
> > > Jack Clark
> > > Senior Research Analyst
> > > phone: 410-455-6256
> > > fax: 410-455-6850
> > > jcl...(a)hilltop.umbc.edu ________________________________
>
> > > From: Christophe Godlewski [mailto:chris.godlew...(a)gmail.com]
> > > Sent: Monday, February 22, 2010 2:46 PM
> > > To: Jack Clark
> > > Cc: SA...(a)listserv.uga.edu
> > > Subject: Re: tricky sum
>
> > > Hello Jack
>
> > > Thank you very much for your help
> > > it works fine with the data for the example
> > > unfortunately, in the full dataset, each id_bb can have less than 126
> > > observations...
> > > it can be 1, 3, 89 or 126...
> > > 126 is the maximum actually
> > > more precisely, the "matrix" has 126 terms for each id_bb
> > > but for instance if an id_bb has only 89 observations, then for d89 to
> > > d125, there will be 0 in the cells...
> > > now i am thinking of a trick to make your code applicable to my data....
> > > but if you have a smart idea im taking it !
>
> > > thank you again for your help
> > > best
> > > CG
>
> > > On Mon, Feb 22, 2010 at 8:18 PM, Jack Clark <jcl...(a)hilltop.umbc.edu>
> > > wrote:
>
> > > Chris,
>
> > > If I understand the question correctly, this means that each ID_BB value
> > > in your data has exactly 126 observations? This is because you will
> > > need 126 observations per ID to hold the sums of D0-D125. Is that
> > > correct?
>
> > > This solution includes the creation of a new variable called SUM_VAR
> > > which has the value of the variable name being summed as the value of X.
> > > You can drop it at the end if you don't want it. The data are
> > > summarized by ID_BB to get the sum of each numeric variable. That
> > > summarized data is transposed and then merged to the original data by
> > > ID_BB and SUM_VAR.
>
> > > Let me know if this meets your
>
> ...
>
> plus de détails »

Hello everybody

all the solutions (by Jack, Mike, Data_null) work
and i thank you all for your help
the example dataset that best describes the original dataset is
perhaps this one :
data have;

infile cards missover;

input id_bb d0 d1 d2 d3;

cards;
1 10 0 0 0
1 8 5 0 0
1 6 10 6 0
1 4 7 4 3
2 8 0 0 0
2 4 9 0 0
2 7 8 5 0
2 10 3 10 6
3 4 0 0 0
3 2 8 0 0
;
run;

for id 1 and 2 we have a triangular matrix for each, i.e. number of
obs = number of variables
but the other cases can happen and actually they happen quite often in
the original data
for id 3 we have a number of observations < number of variables, i.e.
there are 0 for the d2 and d3 variables
cases like this are the most frequent in the original database
actually the d125 is filled with a number > 0 in maybe 1 or 2 cases
(i.e. for 1 or 2 id)
for the rest, there is always number of obs.< number of variables and
so many 0 around !

thank you again very very much for your jind help
best regards
CG

From: ChrisG on 26 Feb 2010 05:58

On 24 fév, 21:22, ChrisG <chris.godlew...(a)gmail.com> wrote:
> On 23 fév, 23:02, ChrisG <chris.godlew...(a)gmail.com> wrote:
>
> > On 23 fév, 16:17, ms...(a)albany.edu (Mike Zdeb) wrote:
>
> > > hi ... if Jack's interpretation of the "problem" is correct, this modified
> > > version of what I posted yesterday will also work
>
> > > use an INDEX, NOT a MERGE ... the index will begin at the first value within
> > > an id_bb group and work its way though the group with repeats of the
> > > same value of the index
>
> > > here's a data set with 6 variables (d0-d5) and lots of different group sizes ...
>
> > > groups 1, 3, 4, and 5 have fewer members than the number of variables
>
> > > group 2 has 6 members
>
> > > group 6 has more members than variables ... what should be done here ???
> > > this method just repeats the lastsum
>
> > > * data set with 6 variables toSUMand varying group size;
> > > data have;
> > > input id_bb d0-d5;
> > > datalines;
> > > 1 10 0 0 9 0 0
> > > 1 8 5 0 9 0 0
> > > 2 6 10 6 9 6 0
> > > 2 4 7 4 9 4 3
> > > 2 8 0 0 9 0 0
> > > 2 4 9 0 9 0 0
> > > 2 7 8 5 9 5 0
> > > 2 10 3 10 9 10 6
> > > 3 4 1 0 9 0 0
> > > 3 2 8 0 9 0 0
> > > 4 7 8 5 9 5 0
> > > 5 10 0 0 9 0 0
> > > 5 8 5 0 9 0 0
> > > 5 6 10 6 9 6 0
> > > 6 10 0 0 9 0 0
> > > 6 4 7 4 9 4 3
> > > 6 4 7 4 9 4 3
> > > 6 4 7 4 9 4 3
> > > 6 4 7 4 9 4 3
> > > 6 4 7 4 9 4 3
> > > 6 8 5 0 9 0 0
> > > 6 6 10 6 9 6 0
> > > ;
> > > run;
>
> > > * find variable sums within groups;
> > > proc summary data=have nway;
> > > var d0-d5;
> > > class id_bb;
> > > output out=stats (drop=_:) sum=;
> > > run;
>
> > > * transpose the data and add an INDEX to resulting data set TSTATS;
> > > proc transpose data=stats out=tstats (keep=id_bb col1 rename=(col1=x) index=(id_bb));
> > > by id_bb;
> > > run;
>
> > > * add thesumto the appropriate observations using the INDEX, not a merge;
> > > data both;
> > > set have;
> > > set tstats key=id_bb;
> > > run;
>
> > > * what happened;
> > > title 'SUMS FOR VARIABLES WITHIN ID_BB';
> > > proc print data=stats;
> > > run;
>
> > > title 'SUMS MATCHED TO APPROPRIATE OBSERVATIONS';
> > > proc print data=both;
> > > run;
>
> > > output ...
>
> > > SUMS FOR VARIABLES WITHIN ID_BB
> > > Obs id_bb d0 d1 d2 d3 d4 d5
> > > 1 1 18 5 0 18 0 0
> > > 2 2 39 37 25 54 25 9
> > > 3 3 6 9 0 18 0 0
> > > 4 4 7 8 5 9 5 0
> > > 5 5 24 15 6 27 6 0
> > > 6 6 44 50 26 72 26 15
>
> > > SUMS MATCHED TO APPROPRIATE OBSERVATIONS
> > > Obs id_bb d0 d1 d2 d3 d4 d5 x
> > > 1 1 10 0 0 9 0 0 18
> > > 2 1 8 5 0 9 0 0 5
> > > 3 2 6 10 6 9 6 0 39
> > > 4 2 4 7 4 9 4 3 37
> > > 5 2 8 0 0 9 0 0 25
> > > 6 2 4 9 0 9 0 0 54
> > > 7 2 7 8 5 9 5 0 25
> > > 8 2 10 3 10 9 10 6 9
> > > 9 3 4 1 0 9 0 0 6
> > > 10 3 2 8 0 9 0 0 9
> > > 11 4 7 8 5 9 5 0 7
> > > 12 5 10 0 0 9 0 0 24
> > > 13 5 8 5 0 9 0 0 15
> > > 14 5 6 10 6 9 6 0 6
> > > 15 6 10 0 0 9 0 0 44
> > > 16 6 4 7 4 9 4 3 50
> > > 17 6 4 7 4 9 4 3 26
> > > 18 6 4 7 4 9 4 3 72
> > > 19 6 4 7 4 9 4 3 26
> > > 20 6 4 7 4 9 4 3 15
> > > 21 6 8 5 0 9 0 0 15
> > > 22 6 6 10 6 9 6 0 15
>
> > > --
> > > Mike Zdeb
> > > U(a)Albany School of Public Health
> > > One University Place
> > > Rensselaer, New York 12144-3456
> > > P/518-402-6479 F/630-604-1475
>
> > > > Chris,
>
> > > > I understand that your data may have fewer than 126 observations per ID.
> > > > From your description below, it sounds like the number of observations
> > > > for a particular ID will be the number of numeric variables (D0-D125)
> > > > which need to be summed. Is this correct?
>
> > > > Will the numeric variables always be populated starting with D0 and
> > > > going up toward D125?
>
> > > > For example, when an ID has 8 observations does this mean that D0 - D07
> > > > will always be the variables which need to be summed? Or could the
> > > > variables needing to be summed be spread out across the D0 - D125 range?
>
> > > > If the range of variables to be summed always starts with D0, D1, D2,
> > > > etc. and contains a number of numeric variables to be summed equivalent
> > > > to the number of observations for that ID, a slight modification to my
> > > > solution from yesterday may work. The main logic modification is in the
> > > > final merge, where a subsetting IF statement is added to only keep
> > > > observations from the original data. I have added an additional ID
> > > > (which does not contain 4 observations) to the sample data for testing.
>
> > > > *(using your sample data);
>
> > > > data have;
>
> > > > infile cards missover;
>
> > > > input id_bb d0 d1 d2 d3;
>
> > > > cards;
>
> > > > 1 10 0 0 0
>
> > > > 1 8 5 0 0
>
> > > > 1 6 10 6 0
>
> > > > 1 4 7 4 3
>
> > > > 2 8 0 0 0
>
> > > > 2 4 9 0 0
>
> > > > 2 7 8 5 0
>
> > > > 2 10 3 10 6
>
> > > > 3 4 1 0 0
>
> > > > 3 2 8 0 0
>
> > > > ;
>
> > > > run;
>
> > > > proc print data = have;
>
> > > > run;
>
> > > > * create SUM_VAR containing variable name of summed variable ;
>
> > > > data next (drop=arr_cnt);
>
> > > > set have;
>
> > > > by id_bb;
>
> > > > array dvars(*) d0-d3;
>
> > > > if first.id_bb then arr_cnt = 1;
>
> > > > sum_var = vname(dvars(arr_cnt));
>
> > > > arr_cnt + 1;
>
> > > > run;
>
> > > > proc print data = next;
>
> > > > run;
>
> > > > *sumnumeric fields by ID_BB ;
>
> > > > proc summary data = have nway missing;
>
> > > > class id_bb;
>
> > > > var d0-d3;
>
> > > > output out = have_sum (drop=_type_ _freq_)sum=; run;
>
> > > > proc print data = have_sum;
>
> > > > run;
>
> > > > * transpose summed data - rename fields ;
>
> > > > proc transpose data = have_sum out = have_sumt (rename=(_name_=sum_var
> > > > col1=x));
>
> > > > by id_bb;
>
> > > > run;
>
> > > > proc print data = have_sumt;
>
> > > > run;
>
> > > > * merge summed data to original data ;
>
> > > > data need;
>
> > > > merge next (in=a) have_sumt;
>
> > > > by id_bb sum_var;
>
> > > > if a;
>
> > > > run;
>
> > > > proc print data = need;
>
> > > > run;
>
> > > > Jack
>
> > > > Jack Clark
> > > > Senior Research Analyst
> > > > phone: 410-455-6256
> > > > fax: 410-455-6850
> > > > jcl...(a)hilltop.umbc.edu ________________________________
>
> > > > From: Christophe Godlewski [mailto:chris.godlew...(a)gmail.com]
> > > > Sent: Monday, February 22, 2010 2:46 PM
> > > > To: Jack Clark
> > > > Cc: SA...(a)listserv.uga.edu
> > > > Subject: Re:trickysum
>
> > > > Hello Jack
>
> > > > Thank you very much for your help
> > > > it works fine with the data for the example
> > > > unfortunately, in the full dataset, each id_bb can have less than 126
> > > > observations...
> > > > it can be 1, 3, 89 or 126...
> > > > 126 is the maximum actually
> > > > more precisely, the "matrix" has 126 terms for each id_bb
> > > > but for instance if an id_bb has only 89 observations, then for d89 to
> > > > d125, there will be 0 in the cells...
> > > > now i am thinking of a trick to make your code applicable to my data...
> > > > but if you have a smart idea im taking it !
>
> > > > thank you again for your help
> > > > best
> > > > CG
>
> > > > On Mon, Feb 22, 2010 at 8:18 PM, Jack Clark <jcl...(a)hilltop.umbc.edu>
> > > > wrote:
>
> > > > Chris,
>
> > > > If I understand the question correctly, this means that each ID_BB value
> > > > in your data has exactly 126 observations? This is because you will
> > > > need 126 observations per ID to hold the sums of D0-D125. Is that
> > > > correct?
>
> > > > This solution includes the creation of a new variable called SUM_VAR
> > > > which has the value of the variable name being summed as the value of X.
> > > > You can drop it at the end if you don't want it. The data are
> > > > summarized by ID_BB to get thesumof each numeric variable. That
> > > > summarized data is transposed and then merged to the original data by
> > > > ID_BB and SUM_VAR.
>
> > > > Let me know if this meets your
>
> > ...
>
> > plus de détails »
>
> Hello everybody
>
> all the solutions (by Jack, Mike, Data_null) work
> and i thank you all for your help
> the example dataset that best describes the original dataset is
> perhaps this one :
> data have;
>
> infile cards missover;
>
> input id_bb d0 d1 d2 d3;
>
> cards;
> 1 10 0 0 0
> 1 8 5 0 0
> 1 6 10 6 0
> 1 4 7 4 3
> 2 8 0 0 0
> 2 4 9 0 0
> 2 7 8 5 0
> 2 10 3 10 6
> 3 4 0 0 0
> 3 2 8 0 0
> ;
> run;
>
> for id 1 and 2 we have a triangular matrix for each, i.e. number of
> obs = number of variables
> but the other cases can happen and actually they happen quite often in
> the original data
> for id 3 we have a number of observations < number of variables, i.e.
> there are 0 for the d2 and d3 variables
> cases like this are the most frequent in the original database
> actually the d125 is filled with a number > 0 in maybe 1 or 2 cases
> (i.e. for 1 or 2 id)
> for the rest, there is always number of obs.< number of variables and
> so many 0 around !
>
> thank you again very very much for your jind help
> best regards
> CG

Hi again

i am very sorry to annoy you again but as the full dataset is quite
large we discover it step by step and the associated problems as well

here is a snapshot of how the things look like:

id_bb d0 d1 d2 d3 d4 d5
1 10 0 0 0 0 0
1 8 5 0 0 0 0
1 6 10 6 0 0 0
1 4 7 4 3 0 0
2 8 0 0 0 0 0
2 4 9 0 0 0 0
2 7 8 5 0 0 0
2 10 3 10 6 0 0
3 4 0 0 0 0 0
3 2 8 0 0 0 0
3 5 6 0 0 0 0
4 9 0 0 0 0 0
4 7 8 0 0 0 0
4 4 3 1 0 0 0
4 5 6 4 0 0 0
4 2 6 7 0 0 0
4 8 9 7 2 0 0
5 3 0 0 0 0 0
6 1 0 0 0 0 0
6 5 8 0 0 0 0

as yo ucan see there are some kinds of "jumps" in the sense that the
matrix are not that triangular for each id
for instance for id = 4 there are 3 lines forming a 3x3 block as here:
4 4 3 1 0 0 0
4 5 6 4 0 0 0
4 2 6 7 0 0 0
this is what i call a "jump"

and of course, the expected result is the sum by id of the terms in
each d0, d1, and so on as here :

id_bb d0 d1 d2 d3 d4 d5 x
1 10 0 0 0 0 0 28
1 8 5 0 0 0 0 22
1 6 10 6 0 0 0 10
1 4 7 4 3 0 0 3
2 8 0 0 0 0 0 29
2 4 9 0 0 0 0 20
2 7 8 5 0 0 0 15
2 10 3 10 6 0 0 6
3 4 0 0 0 0 0 11
3 2 8 0 0 0 0 14
3 5 6 0 0 0 0 0
4 9 0 0 0 0 0 35
4 7 8 0 0 0 0 32
4 4 3 1 0 0 0 19
4 5 6 4 0 0 0 2
4 2 6 7 0 0 0 2
4 8 9 7 2 0 0 2
5 3 0 0 0 0 0 3
6 1 0 0 0 0 0 6
6 5 8 0 0 0 0 8

do you have any clue to solve that problem ?

thank you in advance
best
CG

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: 1)if and if-else 2)if and where 3)keep=option and keep statement4)input function and input statement 5)scan and substr6)if and if then do7)informat and format i had good knowledge about those difference with examples.
Next: Spectral Analysis