From: SAS Techies on 2 Dec 2009 11:55 Every SAS programmer would know that Proc Sort NODUPRECS would remove exact duplicates but often forgets / misuses it and alas gets into trouble. It is very important that the programmer should properly code the variables in the BY statement. For the procedure to actually remove the duplicates, the duplicate records should be brought together by sorting the data with ALL the variables in the BY statement. Let me explain it through an example here Watch the lines in RED data prdsal3; set sashelp.prdsal3; where product="SOFA" and date between "01mar1998"d and "30apr1998"d and state="California"; keep State Date Actual; run; proc print data=prdsal3; run; Output: 09:29 Tuesday, December 1, 2009 17 Obs STATE ACTUAL DATE 1 California $973.50 MAR98 2 California $566.50 APR98 3 California $845.90 MAR98 4 California $2,196.70 APR98 5 California $973.50 MAR98 6 California $313.50 APR98 7 California $1,428.90 MAR98 8 California $1,446.50 APR98 9 California $1,144.00 MAR98 10 California $1,577.40 APR98 11 California $856.90 MAR98 12 California $1,380.50 APR98 Here the input data has 2 duplicates obs 1 and 5.. When a Programmer mistakenly codes like this..i.e. omitting the variable actual he get the output like the one below proc sort data=prdsal3 out=sortedrandomly noduprecs; by state date; run; proc print data=sortedrandomly; run; Output: 09:29 Tuesday, December 1, 2009 18 Obs STATE ACTUAL DATE 1 California $973.50 MAR98 2 California $845.90 MAR98 3 California $973.50 MAR98 4 California $1,428.90 MAR98 5 California $1,144.00 MAR98 6 California $856.90 MAR98 7 California $566.50 APR98 8 California $2,196.70 APR98 9 California $313.50 APR98 10 California $1,446.50 APR98 11 California $1,577.40 APR98 12 California $1,380.50 APR98 The output shows the sorted order by state and date the duplicates moved to 1 and 3. In the Output with the proc sort NODUPRECS did not eliminate the duplicates because the original duplicate records (1 and 5) were NOT sorted properly for NODUPRECS to eliminate consecutive Duplicate records The SAS documentation says that NODUPRECS checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set. Note: See NODUPKEY for information about eliminating observations with duplicate BY values. Alias : NODUP Interaction: When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed. When the Programmer codes it correctly with all the variables the desired result is achieved proc sort data=prdsal3 out=prdsal3_sorted noduprecs; by state actual date; run; proc print data=prdsal3_sorted; run; 09:29 Tuesday, December 1, 2009 19 Obs STATE ACTUAL DATE 1 California $313.50 APR98 2 California $566.50 APR98 3 California $845.90 MAR98 4 California $856.90 MAR98 5 California $973.50 MAR98 6 California $1,144.00 MAR98 7 California $1,380.50 APR98 8 California $1,428.90 MAR98 9 California $1,446.50 APR98 10 California $1,577.40 APR98 11 California $2,196.70 APR98 The best possible to achieve this result is to use the _all_ auto variable in the BY statement instead of specifying all the variables. proc sort data=prdsal3 out=prdsal3_byall noduprecs; by _all_; run; proc print data=prdsal3_byall; run; 09:29 Tuesday, December 1, 2009 19 Obs STATE ACTUAL DATE 1 California $313.50 APR98 2 California $566.50 APR98 3 California $845.90 MAR98 4 California $856.90 MAR98 5 California $973.50 MAR98 6 California $1,144.00 MAR98 7 California $1,380.50 APR98 8 California $1,428.90 MAR98 9 California $1,446.50 APR98 10 California $1,577.40 APR98 11 California $2,196.70 APR98
|
Pages: 1 Prev: Proc GLMSELECT Next: How to check if a file exists on webdav? |