From: Joe Matise on
If you have macros defined for it already, then a non-programmer can do it
trivially.
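
A minimal sketch of the kind of macro I have in mind (untested; the dataset
and variable names VITALS, SUBJID, SYSBP, and the 80-200 limits are just
placeholders):

%macro rangechk(ds=, id=, var=, low=, high=);
   /* List records where &var is missing or outside [&low, &high]. */
   title "Range check: &var outside &low-&high in &ds";
   proc print data=&ds;
      where missing(&var) or &var < &low or &var > &high;
      var &id &var;
   run;
   title;
%mend rangechk;

/* Example call against a hypothetical vital-signs table */
%rangechk(ds=vitals, id=subjid, var=sysbp, low=80, high=200)

A non-programmer only has to supply the parameters; the listing goes back to
data management for review.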

However, I would disagree that it is a waste; a data-savvy programmer can
be highly useful in data cleaning, since it is not necessarily trivial to
make the decisions, or to spot the issues, that require additional cleaning
steps. Trivial data cleaning is, well, trivial, and shouldn't take an
appreciable amount of a programmer's actual time; data cleaning that is not
truly trivial, but instead requires analysis, should be done by a
programmer, in my book.

-Joe

On Mon, Jan 4, 2010 at 2:43 PM, Jonathan Goldberg
<jgoldberg(a)biomedsys.com>wrote:

> We are currently using SAS to do validation. We write programs to check
> things like ranges, all fields present, etc.
> Since this is a clinical trials environment, it is also necessary to check
> across records for visit sequence, missing visits, etc.
>
> While we have a lot of this packaged into macros, it seems to me that
> there should be tools available that allow non-programmers to do a lot
> (preferably, all) of this. It seems a waste to need programmers to do
> something so low-level.
>
> Anyone have suggestions for products that might fill the bill?
>
> TIA.
>
> Jonathan
>
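
For the cross-record checks Jonathan describes (visit sequence, missing
visits), the usual pattern is a sort followed by BY-group processing. A
rough, untested sketch, assuming a VISITS dataset keyed by SUBJID with a
numeric VISITNUM that is supposed to run 1, 2, 3, ...:

proc sort data=visits; by subjid visitnum; run;

data visit_issues;
   set visits;
   by subjid visitnum;
   length problem $40;
   prev = lag(visitnum);            /* visit number from the previous record */
   if first.subjid then do;
      if visitnum ne 1 then problem = 'Subject does not start at visit 1';
   end;
   else do;
      if visitnum = prev then problem = 'Duplicate visit number';
      else if visitnum ne prev + 1 then problem = 'Gap in visit sequence';
   end;
   if problem ne ' ' then output;
   drop prev;
run;

Anything that lands in VISIT_ISSUES then goes back to the data managers for
review.
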
From: Seth StJames on
Have to agree with the others here, but if you ever move to SDD, then SAS
does an excellent job of supporting this for both programmers and
non-programmers.



>
From: Joe Whitehurst on
In fact, data cleaning is so non-trivial that SAS bought DataFlux (
http://www.sas.com/news/preleases/GartnerVisionariesQuadrantDITools.html).
So, stop your whining and learn what SAS offers to help with the monumental,
decidedly non-trivial task of cleaning data. Garbage in, garbage out! Let's
start the new decade the right way: let's insist on CLEAN data!

From: "Kevin F. Spratt" on
I second Joe's comments.

Data cleaning can be particularly non-trivial when the data is gathered
according to various normalization rules across a number of tables.

The "trivial" part is documenting the variable names and creating formats.
The non-trivial part is making sure that the various "joins" you often need
in order to structure the data for particular analyses are merged correctly.
Handling missing data can also be a major issue when the database is coded
with different numeric values indicating missing, since you need to convert
these to valid SAS missing values.
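
For what it's worth, neither of those steps needs anything exotic: IN= flags
to verify the join, and a blanket recode of the study's missing codes into
SAS special missing values. A rough, untested sketch; the table names
(DEMOG, LABS), the SUBJID key, and the -9/999 codes are stand-ins for
whatever the study actually uses:

data merged;
   /* assumes both inputs are already sorted by SUBJID */
   merge demog(in=in_dm) labs(in=in_lb);
   by subjid;

   /* flag records that did not match across the two tables */
   if not (in_dm and in_lb) then
      put "Unmatched SUBJID: " subjid= in_dm= in_lb=;

   /* convert study-specific missing codes to SAS special missings;
      only safe if -9 and 999 are never legitimate values */
   array nums {*} _numeric_;
   do _i = 1 to dim(nums);
      if nums{_i} = -9 then nums{_i} = .A;        /* e.g., refused  */
      else if nums{_i} = 999 then nums{_i} = .B;  /* e.g., not done */
   end;
   drop _i;
run;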

In my experience, even when attempting to get this done in a coherent way,
some preliminary analyses often end up identifying additional cleaning
problems, which, of course, is much better than having some late-stage
analysis turn up such problems.

The biggest problem I tend to have is when an extract comes my way and the
comma-delimited file has the response string in a cell rather than the
corresponding numeric code.

For example "Much of the time" rather than 4. This is especially
troublesome when the
forms have multiple versions and the version to version documentation
does not make
if clear that in version 1 "Much of the time" corresponds to 4, but
in version 2 is corresponds
to 3.
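
Once the versions are finally documented, one way to handle it is a pair of
user-written informats, one per form version. Again just a sketch: the
response wordings other than "Much of the time", the other code values, and
the names RESP_TEXT, FORM_VERSION, and RAW are made up for illustration:

proc format;
   /* version 1: "Much of the time" keys to 4 (other lines are placeholders) */
   invalue respva (upcase just)
      'NONE OF THE TIME'       = 1
      'SOME OF THE TIME'       = 2
      'A GOOD BIT OF THE TIME' = 3
      'MUCH OF THE TIME'       = 4
      other                    = .;   /* flag anything unrecognized */
   /* version 2: the same wording now keys to 3 */
   invalue respvb (upcase just)
      'NONE OF THE TIME'       = 1
      'SOME OF THE TIME'       = 2
      'MUCH OF THE TIME'       = 3
      'ALL OF THE TIME'        = 4
      other                    = .;
run;

data coded;
   set raw;   /* RAW holds the text responses and a form version flag */
   if form_version = 1 then score = input(resp_text, respva.);
   else if form_version = 2 then score = input(resp_text, respvb.);
run;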

When these kinds of things get discovered, the programmer who made the
version 1 to version 2 changes is often following the instructions of a PI,
who wants the change but has not actually consulted the study methodologist
and/or statistician who would typically argue against such a mid-stream
change.

PIs often seem surprised that such a "little" thing can cause so much angst.
The worst of it is that, after you explain why this is a problem, and one
that potentially is not easily corrected, the same PI does it again on the
next study.

All I can say is that it's good when you get to a point in your career
where you can just say no when asked to work with someone.


______________________________________________________________________

Kevin F. Spratt, Ph.D.
Department of Orthopaedic Surgery
Dartmouth Medical School
One Medical Center Drive
DHMC
Lebanon, NH USA 03756
(603) 653-6012 (voice)
(603) 653-6013 (fax)
Kevin.F.Spratt(a)Dartmouth.Edu (e-mail)
_______________________________________________________________________
From: Dennis Fisher on
It would seem as though a reference to the following book might be in order
here:

Cody, R. (1999). Cody's Data Cleaning Techniques Using SAS Software. Cary,
NC: SAS Institute Inc.

HTH
Dennis Fisher

Dennis G. Fisher, Ph.D.
Professor and Director
Center for Behavioral Research and Services
California State University, Long Beach
1090 Atlantic Avenue
Long Beach, CA 90813
tel: 562-495-2330 x121
fax: 562-983-1421
http://www.csulb.edu/centers/cbrs


-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of Seth
StJames
Sent: Monday, January 04, 2010 1:41 PM
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: Re: Data Validation/Cleansing Tool Query
