From: Lance Smith on
Dear all,

I have a database of 50 SNP variables. Each SNP variable has 3 levels
let’s say AA, AG, GG. The levels vary with different SNPs, so another
one may be CC CT and TT and still another may be AA AC and CC.

I also have levels of four markers that are on a continuous scale.
I need to do univariate linear regression to predict the level of
biomarkers using each SNP separately.
Thus I need to do 50*4 = 200 univariate linear regressions.
The SNPs need to be recoded to 0,1,2 for the regression as we want to
treat them as a continuous variable with the heterozygotes (AG or CT
or AC) coded as 1.

Is there a way to efficiently do the recoding to 0,1,2 in SAS without
having to recode all the 50 SNPs separately? Or is there a way to tell
SAS to treat them as continuous variables even though they are coded
as character variables?

Thank you
From: Richard A. DeVenezia on
On Mar 10, 6:53 pm, Lance Smith <medicaltr...(a)gmail.com> wrote:
> Is there a way to efficiently do the recoding to 0,1,2 in SAS without
> having to recode all the 50 SNPs separately? Or is there a way to tell
> SAS to treat them as continuous variables even though they are coded
> as character variables?

Yes, there is a way.

Q: How many rows are in the database? You might want to transpose the
entire kaboodle (one row per sample and SNP) in order to be able to use
BY or CLASS statements.

If the allowed levels of each SNP variable are specified in a separate
table, you can use that table to create a view to map the textual
level value to a numeric value.
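
A minimal sketch of the lookup-table idea, assuming the study data has
already been transposed to one row per sample and SNP. The names
snp_levels, study_long and study_coded, and their columns, are made up
for illustration; they are not from the original data.

--------------------
/* hypothetical lookup table: one row per SNP and allowed level */
data snp_levels;
  length snp $32 level_value $2;
  input snp $ level_value $ code;
datalines;
SNP1 AA 0
SNP1 AG 1
SNP1 GG 2
SNP2 CC 0
SNP2 CT 1
SNP2 TT 2
;
run;

/* hypothetical long-form study data: one row per sample and SNP */
data study_long;
  length snp $32 level_value $2;
  input sampleid snp $ level_value $;
datalines;
1 SNP1 AG
1 SNP2 TT
2 SNP1 GG
2 SNP2 CC
;
run;

/* view that maps each textual level to its numeric code;     */
/* a level missing from the lookup table gets a missing code  */
proc sql;
  create view study_coded as
  select s.sampleid, s.snp, s.level_value, l.code
  from study_long as s
       left join snp_levels as l
       on  upcase(s.snp) = upcase(l.snp)
       and upcase(s.level_value) = upcase(l.level_value);
quit;
--------------------

The left join keeps every study row, so unexpected level values show up
as missing codes rather than silently disappearing.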

If the allowed levels are not known a priori, a pass through the
collected data _can_ extract the observed level values and map based
on that. However, if some SNP variables have fewer than 3 different
level values, the 0,1,2 codes no longer line up with the genotypes, so
the regression might be misleading or require closer examination.

There is an unfortunate side effect of mapping to 0,1,2: you can't
use a single format to map a 0,1,2 back to its original level value
(because each SNP variable has a different set of levels).

This sample code will pass over a study's collected data to determine
the level values and compute an appropriate recode value. The recode
data is used to create a custom informat that is applied to each SNP
variable to create an SNPX variable. The regressions would use
SNPX.

Note: A hash table approach could also perform the same type of
recoding.

--------------------
* fake snp level values are as such
* AA, AB, BB
* BB, BC, CC
* aa, ab, bb
*;

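data fake_study;
  length sampleid biomarker 4;

  array snp $2 snp1-snp50 ;

  do sampleid = 1 to 100;
    biomarker = ceil(10*ranuni(1234));
    do _n_ = 1 to dim(snp);
      /* x = 0, 1 or 2 picks one of the three levels */
      x = floor(3*ranuni(1234));
      /* first 25 SNPs use upper case letters, the rest lower case */
      if _n_ < 26 then
        code = rank('A') + _n_ - 1 ;
      else
        code = rank('a') + _n_ - 26;

      /* floor() makes the integer character codes explicit */
      snp(_n_) = byte(code + floor(x/2)) || byte(code + floor((x+1)/2));
    end;
    output;
  end;
  drop code x;
run;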

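/* one row per sample and SNP; _name_ holds the SNP variable name */
proc transpose data=fake_study
               out=level_values(rename=(col1=level_value));
  by sampleid;
  var snp:;
run;

/* keep one row per distinct SNP / level pair */
proc sort data=level_values nodupkey;
  by _name_ level_value;
run;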

/* number the sorted levels 0,1,2 within each SNP */
data level_informat_data;
  set level_values;
  by _name_;
  if first._name_ then label=0; else label+1;

  /* informat key, e.g. SNP1_AA */
  start = catx ('_', upcase(_name_), upcase(level_value));

  fmtname = 'SNP_LEVEL_NUM';
  type = 'I';

  keep start label fmtname type;
run;

proc format cntlin = level_informat_data;
run;

data fake_study_snpX / view = fake_study_snpX;
  set fake_study;
  array snp snp1-snp50;
  array snpx snpx1-snpx50; format snpx: 1.;

  do _n_ = 1 to dim(snp);
    /* build the same key that was used for the informat */
    name_cat_level = catx
      ( '_'
      , upcase(VNAME(snp(_n_)))
      , upcase(snp(_n_))
      );

    snpx(_n_) = input (name_cat_level, SNP_LEVEL_NUM.);
  end;

  drop name_cat_level;
run;
--------------------
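
Before trusting the codes, it may be worth listing the SNPs that do not
have exactly three observed levels, or that have a blank level. A blank
sorts first and would be given code 0 by the step above. This is only a
sketch against the level_values table as it stands after the PROC SORT
NODUPKEY step; the names snp_level_check, n_levels and n_blank are made
up here.

--------------------
proc sql;
  create table snp_level_check as
  select _name_ as snp,
         count(*) as n_levels,
         sum(missing(level_value)) as n_blank
  from level_values
  group by _name_
  having count(*) ne 3
      or sum(missing(level_value)) > 0;
quit;
--------------------

An empty snp_level_check table means every SNP showed three non-blank
levels in the collected data.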

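For the regressions themselves, one possible sketch is to transpose the
recoded view and let a BY statement run one simple regression per SNP.
It assumes the fake_study_snpX view above; snpx_long, snp_code and
snp_estimates are names invented for this example.

--------------------
/* one row per sample and SNP; snp_code holds the 0,1,2 value */
proc transpose data=fake_study_snpX
               out=snpx_long(rename=(col1=snp_code));
  by sampleid biomarker;
  var snpx:;
run;

proc sort data=snpx_long;
  by _name_;
run;

/* one univariate linear regression per SNP, estimates saved */
ods output ParameterEstimates=snp_estimates;
proc reg data=snpx_long;
  by _name_;
  model biomarker = snp_code;
run;
quit;
--------------------

The fake study has a single biomarker. With four of them, the extra
variables would need to be carried through the BY statement of the
transpose and listed as additional dependent variables in the MODEL
statement.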

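The hash table approach mentioned in the note might look roughly like
this. It is a sketch, not tested against real data, and it reuses
fake_study and the de-duplicated level_values table from the code
above; level_codes, key, code and fake_study_hash are made-up names.

--------------------
/* recode table: one row per SNP / level pair with its 0,1,2 code */
data level_codes;
  set level_values;
  by _name_;
  if first._name_ then code = 0; else code + 1;
  length key $40;
  key = catx('_', upcase(_name_), upcase(level_value));
  keep key code;
run;

data fake_study_hash;
  /* define KEY and CODE in the PDV without reading any rows */
  if 0 then set level_codes;
  if _n_ = 1 then do;
    /* load the recode table into memory once */
    declare hash h(dataset: 'level_codes');
    h.defineKey('key');
    h.defineData('code');
    h.defineDone();
  end;

  set fake_study;
  array snp  snp1-snp50;
  array snpx snpx1-snpx50;

  do i = 1 to dim(snp);
    key = catx('_', upcase(vname(snp(i))), upcase(snp(i)));
    /* find() returns 0 on a match and fills CODE */
    if h.find() = 0 then snpx(i) = code;
    else snpx(i) = .;
  end;

  drop i key code;
run;
--------------------

Unlike the informat, a level that is not in the recode table simply
yields a missing snpx value.
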
Richard A. DeVenezia
http://www.devenezia.com