From: Lance Smith on 10 Mar 2010 18:53

Dear all,

I have a database of 50 SNP variables. Each SNP variable has 3 levels, let's say AA, AG, GG. The levels vary between SNPs, so another one may be CC, CT and TT and still another may be AA, AC and CC.

I also have levels of four biomarkers that are on a continuous scale. I need to do univariate linear regression to predict the level of the biomarkers using each SNP separately. Thus I need to do 50*4 = 200 univariate linear regressions. The SNPs need to be recoded to 0,1,2 for the regression, as we want to treat them as a continuous variable with the heterozygotes (AG or CT or AC) coded as 1.

Is there a way to efficiently do the recoding to 0,1,2 in SAS without having to recode all 50 SNPs separately? Or is there a way to tell SAS to treat them as continuous variables even though they are coded as character variables?

Thank you
From: Richard A. DeVenezia on 11 Mar 2010 12:54
On Mar 10, 6:53 pm, Lance Smith <medicaltr...(a)gmail.com> wrote:
> Is there a way to efficiently do the recoding to 0,1,2 in SAS without
> having to recode all the 50 SNPs separately? Or is there a way to tell
> SAS to treat them as continuous variables even though they are coded
> as character variables?

Yes, there is a way.

Q: How many rows are in the database?

You might want to transpose the entire kaboodle in order to be able to use BY or CLASS statements.

If the allowed levels of each SNP variable are specified in a separate table, you can use that table to create a view to map the textual level value to a numeric value. If the allowed levels are not known a priori, a pass through the collected data _can_ extract the observed level values and map based on that. However, if some SNP variables have fewer than 3 different level values, the regression might be misleading or require closer examination. For example, if only AG and GG are observed and the observed levels are numbered from 0 in sorted order, AG becomes 0 and GG becomes 1 instead of 1 and 2.

There is an unfortunate side effect of mapping to 0,1,2: you can't use a single format to reverse-map a 0,1,2 to its original level value (because each SNP variable has a different set of levels).

This sample code will pass over a study's collected data to determine the level values and compute an appropriate recode value.
The recode data is used to create a custom informat that is applied to each SNP variable to create an SNPX variable. The regressions would use SNPX.

Note: A hash table approach could also perform the same type of recoding.

--------------------
* fake snp level values are as such
* AA, AB, BB
* BB, BC, CC
* aa, ab, bb
*;

* Step 1 - fake a study of 100 samples, 50 SNPs and one biomarker;
data fake_study;
  length sampleid biomarker 4;
  array snp $2 snp1-snp50;

  do sampleid = 1 to 100;
    biomarker = ceil(10*ranuni(1234));
    do _n_ = 1 to dim(snp);
      x = floor(3*ranuni(1234));   * 0, 1 or 2;
      if _n_ < 26
        then code = rank('A') + _n_ - 1;
        else code = rank('a') + _n_ - 26;
      * floor makes the integer truncation explicit: x=0,1,2 gives AA,AB,BB;
      snp(_n_) = byte(code + floor(x/2)) || byte(code + floor((x+1)/2));
    end;
    output;
  end;

  drop code x;
run;

* Step 2 - one row per sample and SNP, then the distinct levels per SNP;
proc transpose data=fake_study out=level_values(rename=(col1=level_value));
  by sampleid;
  var snp:;
run;

proc sort data=level_values nodupkey;
  by _name_ level_value;
run;

* Step 3 - number the sorted levels 0,1,2 within each SNP and
  key the informat on NAME_LEVEL, for example SNP1_AB maps to 1;
data level_informat_data;
  set level_values;
  by _name_;

  if first._name_ then label = 0;
  else label + 1;

  start   = catx('_', upcase(_name_), upcase(level_value));
  fmtname = 'SNP_LEVEL_NUM';
  type    = 'I';   * numeric informat;

  keep start label fmtname type;
run;

proc format cntlin=level_informat_data;
run;

* Step 4 - view that adds the recoded SNPX1-SNPX50 variables;
data fake_study_snpX / view=fake_study_snpX;
  set fake_study;
  array snp  snp1-snp50;
  array snpx snpx1-snpx50;
  format snpx: 1.;

  do _n_ = 1 to dim(snp);
    name_cat_level = catx('_', upcase(vname(snp(_n_))), upcase(snp(_n_)));
    snpx(_n_) = input(name_cat_level, SNP_LEVEL_NUM.);
  end;

  drop name_cat_level;
run;
--------------------

Richard A. DeVenezia
http://www.devenezia.com
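
Following up on the transposing suggestion: a minimal sketch, not part of the sample code above, of how the recoded values might be fed to one regression per SNP with a BY statement. It assumes the fake_study_snpX view has been created as shown. The names snpx_long, snp_name, snp_code and snp_estimates are made up for illustration, and the fake study has only one biomarker.

```sas
/* Sketch only: reshape the recoded view to one row per sample and SNP */
data snpx_long;
  set fake_study_snpX;
  array snpx snpx1-snpx50;
  length snp_name $32;

  do _n_ = 1 to dim(snpx);
    snp_name = vname(snpx(_n_));   /* SNPX1 ... SNPX50 */
    snp_code = snpx(_n_);          /* recoded level 0, 1 or 2 */
    output;
  end;

  keep sampleid biomarker snp_name snp_code;
run;

/* BY processing needs the rows grouped by SNP */
proc sort data=snpx_long;
  by snp_name;
run;

/* One simple linear regression per SNP.
   OUTEST= with TABLEOUT stores estimates, standard errors and
   p-values; the _TYPE_ variable identifies each row. */
proc reg data=snpx_long outest=snp_estimates tableout noprint;
  by snp_name;
  model biomarker = snp_code;
run;
quit;
```

With the real data, the four biomarkers could presumably be listed together on the left side of the MODEL statement, so that a single PROC REG step produces all 200 fits.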