Common Programming Mistake with Proc Sort NODUPRECS [SAS]

Prev: Proc GLMSELECT
Next: How to check if a file exists on webdav?

From: SAS Techies on 2 Dec 2009 11:55

Every SAS programmer would know that Proc Sort NODUPRECS would remove
exact duplicatesbut often forgets / misuses it and alas gets into
trouble.

It is very important that the programmer should properly code the
variables in the BY statement. For the procedure to actually remove
the duplicates, the duplicate records should be brought together by
sorting the data with ALL the variables in the BY statement.

Let me explain it through an example hereWatch the lines in RED

data prdsal3;
set sashelp.prdsal3;
where product="SOFA"
and date between "01mar1998"d and "30apr1998"d
and state="California";
keep State Date Actual;
run;

proc print data=prdsal3;
run;
Output:
09:29
Tuesday, December 1, 2009 17

Obs STATE ACTUAL
DATE

1 California $973.50
MAR98
2 California $566.50
APR98
3 California $845.90
MAR98
4 California $2,196.70
APR98
5 California $973.50
MAR98
6 California $313.50
APR98
7 California $1,428.90
MAR98
8 California $1,446.50
APR98
9 California $1,144.00
MAR98
10 California $1,577.40
APR98
11 California $856.90
MAR98
12 California $1,380.50
APR98

Here the input data has 2 duplicates obs 1 and 5..
When a Programmer mistakenly codes like this..i.e. omitting the
variable actual he get the output like the one below

proc sort data=prdsal3 out=sortedrandomly noduprecs;
by state date;
run;

proc print data=sortedrandomly;
run;
Output:
09:29
Tuesday, December 1, 2009 18

Obs STATE ACTUAL
DATE

1 California $973.50
MAR98
2 California $845.90
MAR98
3 California $973.50
MAR98
4 California $1,428.90
MAR98
5 California $1,144.00
MAR98
6 California $856.90
MAR98
7 California $566.50
APR98
8 California $2,196.70
APR98
9 California $313.50
APR98
10 California $1,446.50
APR98
11 California $1,577.40
APR98
12 California $1,380.50
APR98

The output shows the sorted order by state and datethe duplicates
moved to 1 and 3.
In the Output with the proc sort NODUPRECS did not eliminate the
duplicates because the original duplicate records (1 and 5) were NOT
sorted properly for NODUPRECS to eliminate consecutive Duplicate
records

The SAS documentation says that

NODUPRECS
checks for and eliminates duplicate observations. If you specify this
option, then PROC SORT compares all variable values for each
observation to those for the previous observation that was written to
the output data set. If an exact match is found, then the observation
is not written to the output data set.

Note: See NODUPKEY for information about eliminating observations
with duplicate BY values.
Alias : NODUP
Interaction: When you are removing consecutive duplicate observations
in the output data set with NODUPRECS, the choice of EQUALS or
NOEQUALS can have an effect on which observations are removed.

When the Programmer codes it correctly with all the variablesthe
desired result is achieved

proc sort data=prdsal3 out=prdsal3_sorted noduprecs;
by state actual date;
run;

proc print data=prdsal3_sorted;
run;

09:29
Tuesday, December 1, 2009 19

Obs STATE ACTUAL
DATE

1 California $313.50
APR98
2 California $566.50
APR98
3 California $845.90
MAR98
4 California $856.90
MAR98
5 California $973.50
MAR98
6 California $1,144.00
MAR98
7 California $1,380.50
APR98
8 California $1,428.90
MAR98
9 California $1,446.50
APR98
10 California $1,577.40
APR98
11 California $2,196.70
APR98

The best possible to achieve this result is to use the _all_ auto
variable in the BY statement instead of specifying all the variables.

proc sort data=prdsal3 out=prdsal3_byall noduprecs;
by _all_;
run;

proc print data=prdsal3_byall;
run;
09:29
Tuesday, December 1, 2009 19

Obs STATE ACTUAL
DATE

1 California $313.50
APR98
2 California $566.50
APR98
3 California $845.90
MAR98
4 California $856.90
MAR98
5 California $973.50
MAR98
6 California $1,144.00
MAR98
7 California $1,380.50
APR98
8 California $1,428.90
MAR98
9 California $1,446.50
APR98
10 California $1,577.40
APR98
11 California $2,196.70
APR98

|
Pages: 1
Prev: Proc GLMSELECT
Next: How to check if a file exists on webdav?