From: amw5gster on 29 Nov 2006 11:50

Howdy,

Silly question that's likely to show I'm overlooking something simple, but I'm stumped. I have a dset of approx 8M observations and I'm trying to grow an EMiner decision tree on a binary target variable. There are about 20 independent variables, mostly interval (dates), but some nominal, a few binary and one ordinal. The proportion of true events is about 12%. I have not set any prior probabilities, nor profit/cost values.

The tree runs, but returns no splits. It just won't grow. I've tried dropping the signif value to .00001, using upwards of 11 maximum branches and my max depth to 10. I also tried having the tree build on as few as 2 IVs.

I was able to build a tree when I took a sample of 100K records and forced the %age of true events in the sample to be 50%. Naturally I don't want to misrepresent the proportion, and I figured that 12% wasn't terribly rare for a d-tree.

Am I outright doing something wrong or is this expected behavior?
From: Sigurd Hermansen on 29 Nov 2006 16:01

Dropping the significance value may have the opposite of the effect that you expect. Generally it takes a 'purer' separation of true events from others to attain a 1% as opposed to a 5% Type I error 'significance' level. A small proportion of true events in the data makes it even harder (due to 'long-tail' distributions of errors).

SAS/EM takes statistical significance seriously and won't produce results in some situations unless the user explicitly increases the acceptable level of Type I error. I would prefer a 'decision cost' basis for exploratory data analyses that do not pretend to conduct a hypothesis test, but I understand why SAS implements decision trees this way.

Sig

-----Original Message-----
From: owner-sas-l(a)listserv.uga.edu [mailto:owner-sas-l(a)listserv.uga.edu] On Behalf Of amw5gster(a)gmail.com
Sent: Wednesday, November 29, 2006 11:50 AM
To: sas-l(a)uga.edu
Subject: Decision Tree refuses to grow
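To make the point about thresholds concrete, here is a minimal sketch in Python (not SAS). It assumes a plain Pearson chi-square test on a hypothetical two-way split of 1,000 records; the counts are invented for illustration, and Enterprise Miner's actual splitting criterion and adjustments are not reproduced here. The same candidate split passes at 0.05 but is rejected at 0.00001, so lowering the significance value makes the tree harder, not easier, to grow.

```python
import math

def chi2_pvalue_2x2(events_left, n_left, events_right, n_right):
    """Pearson chi-square p-value (1 df) for a two-branch split of a binary target."""
    n = n_left + n_right
    events = events_left + events_right
    chi2 = 0.0
    for obs_events, size in ((events_left, n_left), (events_right, n_right)):
        exp_events = size * events / n      # expected events if the split were useless
        exp_non = size - exp_events         # expected non-events
        obs_non = size - obs_events
        chi2 += (obs_events - exp_events) ** 2 / exp_events
        chi2 += (obs_non - exp_non) ** 2 / exp_non
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical split: 80 events in 500 records on the left, 40 in 500 on the right.
p = chi2_pvalue_2x2(80, 500, 40, 500)
for alpha in (0.05, 0.00001):
    verdict = "split accepted" if p < alpha else "split rejected"
    print(f"alpha={alpha}: p={p:.2e} -> {verdict}")
```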
From: Peter Flom on 29 Nov 2006 16:13

I don't know how trees work in SAS, but in other software, this can easily happen. It could be that none of the IVs are very good at separating the DV.

Peter
From: Vadim Pliner on 30 Nov 2006 11:22

I'm afraid you set too many branches for your independent variables when some of those variables apparently have a lot of distinct values. It looks like the number of competing splits at each level must be astronomical in your case. The decision tree node of EMiner adjusts all p-values for multiple comparisons, and since the number of those comparisons looks to be huge from what you wrote, it may produce very large adjusted p-values.

To grow your tree, try to decrease the "maximum number of branches from a node" to, say, 2 or 3 and increase your p-value to, say, 0.05 or even 0.1.

HTH,
Vadim Pliner
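A rough Python sketch of why the branch setting matters so much. It assumes a plain Bonferroni multiplier as a stand-in for whatever adjustment Enterprise Miner actually applies (the thread does not specify it), and it uses a hypothetical date variable with 1,000 distinct values and an invented raw p-value of 1e-6. The function names are illustrative only. The count relies on a standard fact: an ordered variable with k distinct values can be cut into b contiguous branches in C(k-1, b-1) ways.

```python
import math

def ordered_split_count(distinct_values, max_branches):
    """Number of ways to cut an ordered variable into 2..max_branches contiguous groups."""
    total = 0
    for branches in range(2, max_branches + 1):
        # Choose (branches - 1) cut points among the (distinct_values - 1) gaps.
        total += math.comb(distinct_values - 1, branches - 1)
    return total

def bonferroni(p_raw, n_comparisons):
    """Bonferroni-adjusted p-value, computed in log space so huge counts cannot overflow."""
    log_adjusted = math.log(p_raw) + math.log(n_comparisons)
    return math.exp(min(0.0, log_adjusted))  # cap at 1.0

p_raw = 1e-6  # hypothetical raw p-value of the best split
for max_branches in (2, 3, 11):
    m = ordered_split_count(1000, max_branches)
    print(f"max branches={max_branches}: {m:.3e} candidate splits, "
          f"adjusted p={bonferroni(p_raw, m):.3g}")
```

Under these assumptions the two-branch setting leaves the adjusted p-value well under 0.05, while the 11-branch setting pushes it to the cap of 1.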
From: David L Cassell on 1 Dec 2006 01:44
In addition to the excellent advice from Sig and Vadim, let me add a note about taking subsamples. SAS EM uses the same underlying protocols as PROC SURVEYSELECT when it does a sample (the first step in the SEMMA model of data mining). It treats your data as if you requested a stratified sample, with your 0/1 DV providing the strata, and it does a simple random sample of your data within each stratum. [This may or may NOT be ideal for your setting.] And it can (in theory) handle the weights that accrue as a result of this sampling.

So cutting the data down to 2 million (1 million 'true' and 1 million 'false') will give you that 50% while keeping all the 'true' events. Or subset even further. But the smaller you cut these pieces, the more you run into problems with probabilities of missing rare events that may be related to independent variables. Which can really be a source of misrepresentation, in a different way.
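The stratified-sample-plus-weights idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what SAS EM or PROC SURVEYSELECT runs internally: the population is a synthetic 10,000 records with a 12% event rate, and the function name, field names, and seed are hypothetical. Each sampled record carries a weight equal to stratum size divided by stratum sample size, so the balanced sample still reproduces the original event rate when weighted.

```python
import random

def stratified_sample(records, target_key, n_per_stratum, seed=2006):
    """Simple random sample within each target level, with sampling weights attached."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(rec[target_key], []).append(rec)
    sample = []
    for members in strata.values():
        n = min(n_per_stratum, len(members))
        weight = len(members) / n  # each sampled record stands for this many records
        for rec in rng.sample(members, n):
            sample.append({**rec, "weight": weight})
    return sample

# Synthetic population: 1,200 events out of 10,000 records (12%).
population = [{"id": i, "target": 1 if i < 1200 else 0} for i in range(10_000)]
sample = stratified_sample(population, "target", 1000)

raw_rate = sum(r["target"] for r in sample) / len(sample)
weighted_rate = (sum(r["weight"] * r["target"] for r in sample)
                 / sum(r["weight"] for r in sample))
print(f"raw event rate in sample: {raw_rate:.2f}")
print(f"weighted event rate:      {weighted_rate:.2f}")
```

The raw rate in the sample is 50% by construction; the weighted rate comes back to the population's 12%, which is the sense in which the weights keep the sample from misrepresenting the proportion.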
At this point, you need to think about how the sampling should really be done.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330