From: dougarobertson on 28 Dec 2006 16:43

Hi,

I've got a large dataset (350,000,000 records) containing transactions, from which I want to select a subset of around 250,000,000 records and then dedupe on id, place and date, which takes it down to around 15,000,000 records. It's (just, given memory constraints) possible to do this using a simple proc sort but, partly because some further data manipulation may be desirable and partly to get to grips with them for the first time, I've tried using hash tables.

On the first couple of attempts it ran out of memory, but with a bit of tinkering I got it to just below the limit; the (slightly edited) log is below.

I am now trying to understand why so much memory is used for this datastep. I realise 15,000,000 records is a lot to hold as a hash table in memory, but there are only 3 variables (all numeric, total length 11) and the final data set that is output only uses about 160MB on the disk, so why does the datastep need 919MB of memory? Is there any way I can reduce this?

I'll be rerunning the process with several similar sized data sets so I can't really afford to be teetering so close to the edge of our available memory.

I am working with 1GB of memory on a UNIX platform, working remotely on a Windows based PC through Enterprise Guide 4.

Any comments would be very welcome.

Thanks,

Doug


21   options fullstimer;
22   data _null_;
23      length hhd_id 4;
24      if 0 then set bigdata (rename=(hhd_id=c_hi));
25      declare hash hh (hashexp:16 /*, ordered: 'a'*/) ;
26      hh.definekey ('id','place','date') ;
27      hh.definedata ('id','place','date') ;
28      hh.definedone () ;
29      do until (eof) ;
30         set bigdata (where=(flag='Y') keep=id place date flag rename=(id=c_id)) end = eof ;
31         id=input(c_id,10.);
32         hh.replace () ;
33      end ;
34      rc = hh.output (dataset: "visits") ;
35
36   run;

NOTE: The data set VISITS has 14374264 observations and 3 variables.
NOTE: There were 243729816 observations read from the data set bigdata
      WHERE flag='Y';
NOTE: DATA statement used (Total process time):
      real time                     11:50.98
      user cpu time                 7:05.57
      system cpu time               1:03.51
      Memory                        918868k
      Page Faults                   185
      Page Reclaims                 224845
      Page Swaps                    0
      Voluntary Context Switches    0
      Involuntary Context Switches  0
      Block Input Operations        0
      Block Output Operations       0
From: yzg9 on 28 Dec 2006 16:54

Check this: http://support.sas.com/techsup/unotes/SN/016/016920.html

SN-V9-016920
"FATAL: Insufficient memory to execute data step program" message when creating a large hash object

The following messages will be written to the SAS log when you run out of memory when adding keys and associated data items to a hash object.

FATAL: Insufficient memory to execute data step program. Aborted during the EXECUTION phase.
NOTE: The SAS System stopped processing this step because of insufficient memory.

In some cases, these messages may be generated even though there is still memory available. These types of failures may occur on machines which have an XMRLMEM value that will overflow a 4-byte integer. This causes an excessively large memory allocation request, which results in the inaccurate 'out of memory' condition.

The code below illustrates how to get the value of XMRLMEM.

data _null_;
   amt = getoption('xmrlmem');
   put amt=;
run;

If the value of amt is 2,147,483,647 or greater then you may be getting the insufficient memory error incorrectly. This problem is fixed in SAS 9.1.3 Service Pack 4.

Also beginning with SP4, when the system option MSGLEVEL=I is specified, a message is written to the SAS log reporting the number of items added to the hash object when the memory shortage occurs.

*****

John Gerstle, MS
Biostatistician
Northrop Grumman
CDC Information Technological Support Contract (CITS)
NCHSTP \DHAP \HICSB \Research, Analysis, and Evaluation Section
Centers for Disease Control and Prevention

"Boss. We've got cats."  "Meow"

"All truth passes through three stages:
First, it is ridiculed;
Second, it is violently opposed;
Third, it is accepted as being self-evident."
- Arthur Schopenhauer (1830)

>>-----Original Message-----
>>From: owner-sas-l(a)listserv.uga.edu [mailto:owner-sas-l(a)listserv.uga.edu]
>>On Behalf Of dougarobertson(a)gmail.com
>>Sent: Thursday, December 28, 2006 4:44 PM
>>To: sas-l(a)uga.edu
>>Subject: Hash Table Memory Usage
From: dougarobertson on 28 Dec 2006 17:11

John,

Thanks for getting back to me so quickly. The XMRLMEM value on my system is 858,962,944, so I don't think this problem applies in my case. Thanks for the suggestion though.

Doug

"Gerstle, John (CDC/CCID/NCHHSTP) (CTR)" wrote:
> Check this: http://support.sas.com/techsup/unotes/SN/016/016920.html
>
> SN-V9-016920
> "FATAL: Insufficient memory to execute data step program" message when
> creating a large hash object
From: Ken Borowiak on 28 Dec 2006 17:08

On Thu, 28 Dec 2006 13:43:31 -0800, dougarobertson(a)GMAIL.COM wrote:

>I am now trying to understand why so much memory is used for this
>datastep. I realise 15,000,000 records is a lot to hold as a hash table
>in memory, but there are only 3 variables (all numeric, total length
>11) and the final data set that is output only uses about 160MB on the
>disk, so why does the datastep need 919MB of memory? Is there any way I
>can reduce this?

Doug,

You could avoid gobbling up as much memory by not putting all the keys in the data portion of the hash table. Check out this thread (and reference):

http://listserv.uga.edu/cgi-bin/wa?A2=ind0611C&L=sas-l&P=R37784&m=217688

HTH,
Ken
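To make that concrete, here is a minimal sketch of that kind of restructuring (the variable names, WHERE clause and character-to-numeric conversion are carried over from Doug's step; the rest is an untested outline, not code from the referenced thread). The three variables stay in the key portion only, the data portion shrinks to a single item because hh.output is no longer used, and deduplicated rows are written with an ordinary OUTPUT statement via check()/add():

options fullstimer;

data visits (keep=id place date);
   declare hash hh (hashexp:16);
   hh.definekey  ('id','place','date');
   hh.definedata ('id');               /* minimal data portion - the keys are not duplicated */
   hh.definedone ();
   do until (eof);
      set bigdata (where=(flag='Y') keep=id place date flag rename=(id=c_id)) end=eof;
      id = input(c_id, 10.);           /* same conversion as in the original step   */
      if hh.check() ne 0 then do;      /* key combination not seen yet...           */
         output;                       /* ...write it straight to VISITS...         */
         rc = hh.add();                /* ...and remember it in the hash            */
      end;
   end;
   stop;
run;

Since each distinct key is then stored once rather than twice, the per-item payload should be roughly halved, but the hash object's own per-item overhead remains, so it is still worth testing against the 1GB limit before relying on it.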
From: Ken Borowiak on 29 Dec 2006 08:54

On Thu, 28 Dec 2006 13:43:31 -0800, dougarobertson(a)GMAIL.COM wrote:

>It's (just, given memory constraints) possible to
>do this using a simple proc sort but, partly because some further data
>manipulation may be desirable and partly to get to grips with them for
>the first time, I've tried using hash tables.

Doug,

I am still pondering a better hash-ish solution than the one I already gave, but consider this index solution. IIRC, Richard Devenezia provided such a solution to a post earlier this year.

data has_dupkeys ;
   input a b c d ;
   datalines ;
1 2 3 1
1 2 3 2
2 2 2 1
2 2 2 2
1 2 3 3
2 2 2 4
;
run ;

/*-- Define table with an index with unique keys --*/
proc sql ;
   create table nodupkeys ( a num, b num, c num ) ;
   create unique index abc on nodupkeys(a, b, c) ;
quit ;

proc append base=nodupkeys data=has_dupkeys force ;
run ;

/* ... can then drop the index from NODUPKEYS */

It won't run as fast, but it will alleviate your memory concerns (at the expense of disk space).

HTH,
Ken
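For the clean-up hinted at in that last comment, one way to drop the index afterwards is a PROC DATASETS step; a sketch, assuming the table sits in the WORK library as in the example:

proc datasets lib=work nolist ;
   modify nodupkeys ;
      index delete abc ;
quit ;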