From: Joe Matise on 4 Jan 2010 16:02

If you have macros defined for it already, then a non-programmer can do it trivially.

I would, however, disagree about it being a waste; a data-savvy programmer can be highly useful in data cleaning, as it's not necessarily trivial to make decisions and/or see issues that require additional cleaning steps. Trivial data cleaning is, well, trivial, and shouldn't take an appreciable amount of a programmer's actual physical time; data cleaning that is not truly trivial, but instead requires analysis, should be done by a programmer, in my book.

-Joe

On Mon, Jan 4, 2010 at 2:43 PM, Jonathan Goldberg <jgoldberg(a)biomedsys.com> wrote:
> We are currently using SAS to do validation. We write programs to check
> things like ranges, all fields present, etc.
> Since this is a clinical trials environment, it is also necessary to check
> across records for visit sequence, missing visits, etc.
>
> While we have a lot of this packaged into macros, it seems to me that
> there should be tools available that allow non-programmers to do a lot
> (preferably, all) of this. It seems a waste to need programmers to do
> something so low-level.
>
> Anyone have suggestions for products that might fill the bill?
>
> TIA.
>
> Jonathan
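The checks Jonathan lists (ranges, required fields, visit sequence across records) can be driven by a small rule table, which is roughly what lets a non-programmer maintain them. What follows is a minimal sketch in Python rather than SAS, and it is not anyone's production macro. The field names, limits, and visit numbers are all hypothetical.

```python
# Hypothetical rule table: field name -> (minimum, maximum).
# The names and limits are invented for illustration only.
RANGE_RULES = {
    "age": (18, 90),
    "systolic_bp": (60, 260),
}
# Hypothetical list of fields that must be present on every record.
REQUIRED_FIELDS = ("subject_id", "visit", "age", "systolic_bp")


def check_record(record):
    """Return a list of problems found in one record (a dict)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if record.get(name) is None:
            problems.append(f"{name}: missing")
    for name, (low, high) in RANGE_RULES.items():
        value = record.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


def check_visit_sequence(records, expected_visits):
    """Cross-record check: missing and out-of-order visits per subject."""
    by_subject = {}
    for rec in records:
        visit = rec.get("visit")
        if visit is None:
            continue  # already reported by check_record
        by_subject.setdefault(rec.get("subject_id"), []).append(visit)
    problems = {}
    for subject, visits in by_subject.items():
        issues = []
        missing = [v for v in expected_visits if v not in visits]
        if missing:
            issues.append(f"missing visits: {missing}")
        if visits != sorted(visits):
            issues.append(f"visits out of order: {visits}")
        if issues:
            problems[subject] = issues
    return problems
```

Keeping the limits in a table means a change to a range is a data edit, not a program edit. Deciding what to do with a flagged record is still the non-trivial part Joe describes.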
From: Seth StJames on 4 Jan 2010 16:40

Have to agree with the others here, but if you ever move to SDD, then SAS does an excellent job of doing this for programmers and non-programmers.
From: Joe Whitehurst on 4 Jan 2010 16:38

In fact, data cleaning is so non-trivial that SAS bought DataFlux (http://www.sas.com/news/preleases/GartnerVisionariesQuadrantDITools.html). So, stop your whining and learn what SAS offers to help with the monumental, decidedly non-trivial task of cleaning data. Garbage in, garbage out! Let's start the new decade the right way: let's insist on CLEAN data!

On Mon, Jan 4, 2010 at 4:21 PM, Kevin F. Spratt <Kevin.F.Spratt(a)dartmouth.edu> wrote:
> I second Joe's comments.
>
> Data cleaning can be particularly non-trivial when the data is gathered
> according to various normalization rules across a number of tables.
> [...]
From: "Kevin F. Spratt" on 4 Jan 2010 16:21

At 04:02 PM 1/4/2010, Joe Matise wrote:
>If you have macros defined for it already, then a non-programmer can do it
>trivially.
>
>I however would disagree about it being a waste; a data-savvy programmer can
>be highly useful in data cleaning, as it's not necessarily trivial to make
>decisions and/or see issues that require additional cleaning steps. Trivial
>data cleaning is, well, trivial, and shouldn't take an appreciable amount of
>a programmer's actual physical time; data cleaning that is not truly
>trivial, but instead requires analysis, should be done by a programmer, in
>my book.
>
>-Joe

I second Joe's comments.

Data cleaning can be particularly non-trivial when the data is gathered according to various normalization rules across a number of tables.

The "trivial" part is documenting the variable names and creating formats. The non-trivial part is making sure that the various "joins" that you often need to do to structure the data for particular analyses are merged correctly. Handling missing data can also be a major issue when the database is coded with different numeric values indicating missing, as you need to convert these to valid SAS missing data values.

In my experience, even when attempting to get this done in a coherent way, some preliminary analyses often result in identifying some additional cleaning problems, which, of course, is much better than when some late-stage analysis results in identifying such problems.

The biggest problem I tend to have when some extract comes my way is when the comma-delimited file has the response string in a cell rather than the response's numeric code.

For example, "Much of the time" rather than 4. This is especially troublesome when the forms have multiple versions and the version-to-version documentation does not make it clear that in version 1 "Much of the time" corresponds to 4, but in version 2 it corresponds to 3.

When these kinds of things get discovered, the programmer who made the version 1 to version 2 changes is often following the instructions of a PI, who wants this change but has not actually consulted with the study methodologist and/or statistician, who would typically argue against such a mid-stream change.

PIs often seem so surprised that such a "little" thing can cause so much angst. The worst of it is, after explaining why this is a problem and one that potentially is not easily corrected, the same PI on the next study does it again.

All I can say is that it's good when you get to a point in your career when you can just say no when asked to work with someone.

______________________________________________________________________

Kevin F. Spratt, Ph.D.
Department of Orthopaedic Surgery
Dartmouth Medical School
One Medical Center Drive
DHMC
Lebanon, NH USA 03756
(603) 653-6012 (voice)
(603) 653-6013 (fax)
Kevin.F.Spratt(a)Dartmouth.Edu (e-mail)
______________________________________________________________________
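Two of the problems Kevin describes, response strings whose numeric code depends on the form version and numeric sentinel values that stand for "missing", can be illustrated with an explicit lookup. This is only a sketch in Python, not SAS, and not Kevin's own method. The two "Much of the time" entries follow his example; every other codebook entry and the sentinel values are assumptions made up for illustration.

```python
# Hypothetical codebook: (form version, response text) -> numeric code.
# Only the two "much of the time" rows come from the post; the rest
# are invented for illustration.
CODEBOOK = {
    (1, "much of the time"): 4,
    (2, "much of the time"): 3,
    (1, "all of the time"): 5,
    (2, "all of the time"): 4,
}

# Hypothetical sentinel values that a source database uses for "missing".
# In practice these differ per variable and must come from documentation.
MISSING_CODES = {-9, 99, 999}


def recode_response(version, text):
    """Map a response string to its numeric code for a given form version.

    Raises KeyError instead of guessing when the (version, text) pair
    is not documented in the codebook.
    """
    key = (version, text.strip().lower())
    return CODEBOOK[key]


def clean_numeric(value):
    """Convert a sentinel missing code to None; pass other values through."""
    return None if value in MISSING_CODES else value
```

Failing loudly on an undocumented pair is a deliberate choice here: a silent default would hide exactly the version 1 versus version 2 mismatch described above.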
From: Dennis Fisher on 4 Jan 2010 16:45

It would seem as though a reference to the following book might be in order here:

Cody, R. (1999). Cody's data cleaning techniques using SAS software. Cary, NC: SAS Institute Inc.

HTH

Dennis Fisher

Dennis G. Fisher, Ph.D.
Professor and Director
Center for Behavioral Research and Services
California State University, Long Beach
1090 Atlantic Avenue
Long Beach, CA 90813
tel: 562-495-2330 x121
fax: 562-983-1421
http://www.csulb.edu/centers/cbrs