From: Tom Abernathy on 27 Jan 2010 19:41

Art -
I looked at this a little for Elvira earlier this week. It looks like the first 2 bytes are the record length and the third is a record type. Try running this program against the whole file.

    data records;
      infile tmpfile1 recfm=n;
      length len 4 rec $1 hex $400 string $200 ;
      input len ib2.;
      input string $varying200. len;
      rec=substr(string,1,1);
      string=substr(string,2);
      hex=putc(string,'$hex'||compress(put((len-1)*2,3.)));
    run;
    proc freq ;
      tables rec*len / list;
    run;
    proc print data=records (obs=100);
      var len rec hex;
    run;

In the subset you sent I see 16 records: 5 H, 8 R, 1 S and 2 T. Each record type seems to have a consistent length.
The T value looks like an integer. Not sure what S is. The H and R records look like they have substructures. Notice the pattern of where the spaces ('20'x) and nulls ('00'x) are located.

    Obs  len  rec  hex
      1    5   T   00005471
      2    6   S   23767F7F4F
      3    5   T   0000587F
      4   18   R   221807004120202020205420000000644E
      5   18   R   221807004120202020205420000000644E
      6   18   R   222749404141202020205420000000644E
      7   17   H   22367F7F412020202020542020202020
      8   18   R   22367F7F41412D2020205420000000644E
      9   17   H   22367F7F412020202020542020202020
     10   18   R   22457F7F4141432020205420000000644E
     11   17   H   22457F7F414120202020542020202020
     12   18   R   22457F7F4141434320204720000000644E
     13   17   H   22457F7F41412D202020542020202020
     14   18   R   22457F7F4141492020205420000000644E
     15   17   H   22457F7F414143202020542020202020
     16   18   R   22457F7F41414D45202047202020202020

- Tom

On Jan 27, 1:41 pm, art...(a)NETSCAPE.NET (Arthur Tabachneck) wrote:
> Dan,
>
> Yes it is that same file.
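The integer reading of the T payload can be checked straight from the hex column of the listing. The following is only a minimal sketch, assuming the values are unsigned big-endian integers; the variable names `t_value` and `r_tail` are illustrative and do not come from the thread.

```sas
data _null_;
  /* Payload of record 1 (type T) in the listing: 00005471 */
  t_value = input('00005471', hex8.);
  /* The four bytes just before the final '4E'x of an R payload: 00000064 */
  r_tail = input('00000064', hex8.);
  put t_value= r_tail=;
run;
```

The log should show t_value=21617 and r_tail=100, which supports reading both fields as 4-byte integers.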
From: Mark D H Miller on 27 Jan 2010 20:06

Art,

I have some experience unscrambling issues like this, but here I believe we have insufficient information to get very far.

In your sample, some incipient patterns seem to emerge after splitting data lines at "54" (where possible) and inserting blanks in several lines (marked with %%):

    00 05 54 00 00 54 71
    00 06 53 23 76 FA C0 4F
    00 05 54 %% %% 00 00 58 BD
    00 12 52 22 18 07 00 41 20 20 20 20 20 54 20 00 00 00 64 4E
    00 12 52 22 18 07 00 41 20 20 20 20 20 54 20 00 00 00 64 4E
    00 12 52 22 27 49 40 41 41 20 20 20 20 54 20 00 00 00 64 4E
    00 11 48 22 36 8B 80 41 20 20 20 20 20 54 %% 20 20 20 20 20
    00 12 52 22 36 8B 80 41 41 2D 20 20 20 54 20 00 00 00 64 4E
    00 11 48 22 36 8B 80 41 20 20 20 20 20 54 %% 20 20 20 20 20
    00 12 52 22 45 CD C0 41 41 43 20 20 20 54 20 00 00 00 64 4E
    00 11 48 22 45 CD C0 41 41 20 20 20 20 54 %% 20 20 20 20 20
    00 12 52 22 45 CD C0 41 41 43 43 20 20 47 20 00 00 00 64 4E
    00 11 48 22 45 CD C0 41 41 2D 20 20 20 54 %% 20 20 20 20 20
    00 12 52 22 45 CD C0 41 41 49 20 20 20 54 20 00 00 00 64 4E
    00 11 48 22 45 CD C0 41 41 43 20 20 20 54 %% 20 20 20 20 20
    00 12 52 22 45 CD C0 41 41 4D 45 20 20 47 20

Without access to the complete file (no, I don't want that) it would be very useful to have data chunks from at least the beginning and the end -- such that each chunk is larger than any "blocksize" we might expect to find. Typically this means chunks between 32k-128k (32756--131024), which is enough data to discern and verify patterns.

Do we know
? file -- exact size in bytes
    (so maybe we could find the record size)
? file -- record/line count
? source -- which platform
    because this may help us to know
    ? whether we should be looking for 2/4/8 byte ints
    ? whether we might find embedded lrecl/recfm data (mainframe) or not
....

Mark Miller

On 1/27/2010 10:37 AM, Arthur Tabachneck wrote:
> One of our posters sent me their data file (all 8 gig of it) and, frankly,
> I'm totally lost trying to figure out how to read it. The following is a
> dump of the first 272 characters in the hope that it might trigger
> someone's stroke of brilliance:
>
> 00000000 00 05 54 00 00 54 71 00 06 53 23 76 FA C0 4F 00
> 00000010 05 54 00 00 58 BD 00 12 52 22 18 07 00 41 20 20
> 00000020 20 20 20 54 20 00 00 00 64 4E 00 12 52 22 18 07
> 00000030 00 41 20 20 20 20 20 54 20 00 00 00 64 4E 00 12
> 00000040 52 22 27 49 40 41 41 20 20 20 20 54 20 00 00 00
> 00000050 64 4E 00 11 48 22 36 8B 80 41 20 20 20 20 20 54
> 00000060 20 20 20 20 20 00 12 52 22 36 8B 80 41 41 2D 20
> 00000070 20 20 54 20 00 00 00 64 4E 00 11 48 22 36 8B 80
> 00000080 41 20 20 20 20 20 54 20 20 20 20 20 00 12 52 22
> 00000090 45 CD C0 41 41 43 20 20 20 54 20 00 00 00 64 4E
> 000000A0 00 11 48 22 45 CD C0 41 41 20 20 20 20 54 20 20
> 000000B0 20 20 20 00 12 52 22 45 CD C0 41 41 43 43 20 20
> 000000C0 47 20 00 00 00 64 4E 00 11 48 22 45 CD C0 41 41
> 000000D0 2D 20 20 20 54 20 20 20 20 20 00 12 52 22 45 CD
> 000000E0 C0 41 41 49 20 20 20 54 20 00 00 00 64 4E 00 11
> 000000F0 48 22 45 CD C0 41 41 43 20 20 20 54 20 20 20 20
> 00000100 20 00 12 52 22 45 CD C0 41 41 4D 45 20 20 47 20
>
> Art
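One way to test a length-prefix reading of the dump is to walk the byte offsets it implies and compare them with the hex viewer addresses. This is only a sketch: the lengths are the 16 values reported earlier in the thread, and the variable names `offset` and `next_offset` are illustrative.

```sas
data offsets;
  retain next_offset 0;
  input len @@;
  offset = next_offset;             /* where this record's 2-byte length prefix starts */
  next_offset = offset + 2 + len;   /* 2 prefix bytes plus len bytes (type byte + payload) */
  put offset=hex8. len= next_offset=;
  datalines;
5 6 5 18 18 18 17 18 17 18 17 18 17 18 17 18
;
run;
```

The second and third records start at offsets 7 and 15 ('0F'x), which is where the 00 06 and 00 05 pairs sit in the dump. The walk ends at byte 277, five bytes past the 272-byte excerpt, which is consistent with the last R record simply being cut off by the dump.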
From: NordlDJ on 28 Jan 2010 03:56

> -----Original Message-----
> From: Arthur Tabachneck [mailto:art297(a)NETSCAPE.NET]
> Sent: Wednesday, January 27, 2010 10:42 AM
> To: SAS-L(a)LISTSERV.UGA.EDU; Nordlund, Dan (DSHS/RDA)
> Subject: Re: How to read a large odd looking file
>
> Dan,
>
> Yes it is that same file.
>
> Art
> ---------
> On Wed, 27 Jan 2010 10:32:43 -0800, Nordlund, Dan (DSHS/RDA)
> <NordlDJ(a)DSHS.WA.GOV> wrote:
>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of
> >> Arthur Tabachneck
> >> Sent: Wednesday, January 27, 2010 10:26 AM
> >> To: SAS-L(a)LISTSERV.UGA.EDU
> >> Subject: Re: How to read a large odd looking file
> >>
> >> What I posted was indeed a hex viewer filedump as I didn't think the list
> >> would appreciate my posting the actual 8gig of characters. These were
> >> simply the first 272 characters.
> >> On Wed, 27 Jan 2010 13:12:48 -0500, Proc Me <procme(a)CONCEPT-DELIVERY.COM>
> >> wrote:
> >
> >Art,
> >
> >Is this the "binary" data file that we were discussing a few days ago on
> >the list? If not, do you have any additional information about the
> >supposed structure of the dataset?
> >
> >Dan
> >
> >Daniel J. Nordlund
> >Washington State Department of Social and Health Services
> >Planning, Performance, and Accountability
> >Research and Data Analysis Division
> >Olympia, WA 98504-5204

Art,

Based on the OP's original description and looking at the file, here is a possible program for reading the data. Here is the original description.

>Dear all,
>
>I have been given a binary file with two fields, integer fields (big endian
>binary encoded numbers) and alpha fields (left justified and padded on the
>right with spaces). The example of the equivalent file in ASCII format
>(which is not available to me) is:
>
> T23584 6
> M148 4
> SO 2
> T24110 6
> M283 4
> RA T 100N 16
> M284 4
> RAA T 100N 16
> RAA- T 100N 16
> M285 4
> RAAC T 100N 16
> M286 4
>
>The first alphabet is always a "message" type which can be "T", "M", "S" and
>"R".
>
>If the message is "T" (length 1), a numeric will be given (23584 with length
>of 5). The length of the message will be given last (6). There are three set
>of information associated with message "T"
>If the message is "M" (length 1), a numeric will be given (148 with length
>of 3). The length of the message will be given last (1+3=4). There are three
>set of information associated with message "T"
>If the message is "S" (length 1), an alphanumeric will be given ("O" with
>length of 1). The length of the message will be given last (1+1=2). There
>are three set of information associated with message "T"
>If the message is "R" (length 1),
>
>1) An alphabetic will be given (A (with spaces) with length of 6).
>2) An alphanumeric will be given ("T" with length of 1)
>3) An alphanumeric will be given ("empty Space" with length of 1)
>4) A numeric will be given (100 with length of 6)
>5) An alphabetic will be given (N with length of 1)
>6) total length (1+6+1+1+6+1=16)

The description is not exact, and message type "H" was not mentioned, but with the extract that you posted it was enough to at least get a first pass at reading the file. I discarded the first two bytes of the file, and there were bytes that don't appear in the description for messages R and S. I couldn't verify the message type M because it didn't appear in your excerpt.

    data want(drop=dummy);
      infile "c:\sas_examples\binary.bin" recfm=N ;
      if _n_=1 then input dummy $2.;
      input msg $1. ;
      select(msg) ;
        when('R') do;
          input n1 s370fpib4. a1 $6. a2 $1. dummy $1. n2 s370fpib4. a3 $1. msglen s370fpib2. ;
        end;
        when('T') do;
          input n1 s370fpib4. msglen s370fpib2. ;
        end;
        when('M') do;
          input n1 s370fpib2. msglen s370fpib2. ;
        end;
        when('S') do;
          input n1 s370fpib4. a1 $1. msglen s370fpib2. ;
        end;
        when('H') do;
          input n1 s370fpib4. a1 $6. a2 $1. a3 $5. msglen s370fpib2. ;
        end;
        otherwise put "WARNING: something went wrong";
      end;
    run ;

    proc print data=want;
    run;

And here is the output:

    Obs  msg         n1  a1    a2   n2  a3  msglen
      1   T       21617              .           6
      2   S   595000000  O           .           5
      3   T       22717              .          18
      4   R   572000000  A     T   100  N       18
      5   R   572000000  A     T   100  N       18
      6   R   573000000  AA    T   100  N       17
      7   H   574000000  A     T     .          18
      8   R   574000000  AA-   T   100  N       17
      9   H   574000000  A     T     .          18
     10   R   575000000  AAC   T   100  N       17
     11   H   575000000  AA    T     .          18
     12   R   575000000  AACC  G   100  N       17
     13   H   575000000  AA-   T     .          18
     14   R   575000000  AAI   T   100  N       17
     15   H   575000000  AAC   T     .          18

This should at least get the OP started with reading the file.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
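The program above reads its integers with the S370FPIB informat, which interprets the bytes as an unsigned big-endian integer on any host, whereas IB and PIB follow the byte order of the machine SAS is running on. A small sketch of the difference, using the two-byte prefix 00 12 from the dump (the variable names are illustrative):

```sas
data _null_;
  prefix   = '0012'x;                     /* two bytes as they appear in the dump */
  portable = input(prefix, s370fpib2.);   /* big-endian on every host */
  native   = input(prefix, pib2.);        /* follows the host's byte order */
  put portable= native=;
run;
```

Here portable should be 18 everywhere, while native should be 18 only on a big-endian host and 4608 ('1200'x) on a little-endian one such as Windows. That is a reason to prefer the S370F informats when the file's byte order is known to be big-endian.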
From: NordlDJ on 28 Jan 2010 14:19

> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of
> Tom Abernathy
> Sent: Wednesday, January 27, 2010 4:42 PM
> To: SAS-L(a)LISTSERV.UGA.EDU
> Subject: Re: How to read a large odd looking file
>
> Art -
> I looked at this a little for Elvira earlier this week. It looks
> like the first 2 bytes are the record length and the third is a record
> type. Try running this program against the whole file.

Tom,

Nice catch on the message length coming first (as opposed to the description). My code below is modified appropriately. A decision still needs to be made about message types H, R, and S, i.e., how to interpret the 4 bytes following the message type. Also, the message type M needs to be verified.

    data want(drop=dummy);
      infile "c:\sas_examples\binary.bin" recfm=N ;
      input msglen s370fpib2. msg $1. ;
      select(msg) ;
        when('R') do;
          input n1 s370fpib4. a1 $6. a2 $1. dummy $1. n2 s370fpib4. a3 $1. ;
        end;
        when('T') do;
          input n1 s370fpib4. ;
        end;
        when('M') do;
          input n1 s370fpib2. ;
        end;
        when('S') do;
          input n1 s370fpib4. a1 $1. ;
        end;
        when('H') do;
          input n1 s370fpib4. a1 $6. a2 $1. a3 $5. ;
        end;
        otherwise put "WARNING: Unknown message";
      end;
    run ;

Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
From: Arthur Tabachneck on 30 Jan 2010 18:04
First, while this thread was in response to someone else's post, many, many thanks to everyone who offered suggestions (both on and off-line). I'd name everyone, but the list is simply too big!

Yes, there were definitely a lot of essentials we didn't know regarding the data structure the OP actually confronted. I'm posting my code here in case anyone can catch a glaring error, or later searches SAS-L for the same task.

The data, as it turns out, was a NASDAQ product called TotalView-ITCH 4. It's difficult to describe, as the data is a mixture of ASCII characters and 2-, 4- and 8-byte big-endian values in a file that has one long line of characters. To confuse matters, many of the binary values turned out to include null values, carriage returns, line feeds, EOFs and other assorted control characters.

For simplicity, in my example, I'll show a 2-byte value as ASCII 01, a 4-byte value as 0001, etc. Thus, given the following file (c:\have.txt) that contains what represents 3 records (in reality the file had over 9gb of such data):

    05T000106S0001Q05T0001

I was able to parse it, I think correctly, with the following code:

    data want;
      infile "c:\have.txt" RECFM=N;
      input width s370fpib2. msg_type $1.;
      select(msg_type);
        /* Time Stamp - Seconds */
        when('T') do;
          input seconds s370fpib4.;
        end;
        /* System Event */
        when('S') do;
          input nanoseconds s370fpib4. event $1.;
        end;
        /* Stock Related Messages */
        when('R') do;
          input nanoseconds s370fpib4. stock $6. market_category $1.
                financial_status $1. round_lot_size s370fpib4. round_lots_only $1.;
        end;
        /* Stock Trading Action */
        when('H') do;
          input nanoseconds s370fpib4. stock $6. trading_state $1. reserved $1. reason $4.;
        end;
        /* Market Participant Position */
        when('L') do;
          input nanoseconds s370fpib4. mpid $4. stock $6. primary_market_maker $1.
                market_maker_mode $1. market_participant_state $1.;
        end;
        /* Add Order No MPID Attribution */
        when('A') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.
                buy_sell_indicator $1. shares s370fpib4. stock $6. price s370fpib4. display $1.;
        end;
        /* Add Order with MPID Attribution */
        when('F') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.
                buy_sell_indicator $1. shares s370fpib4. stock $6. price s370fpib4. attribution $4.;
        end;
        /* Order Executed */
        when('E') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.
                executed_shares s370fpib4. match_number s370fpib8.;
        end;
        /* Order Executed with Price */
        when('C') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.
                executed_shares s370fpib4. match_number s370fpib8. printable $1.
                execution_price s370fpib4.;
        end;
        /* Order Cancel */
        when('X') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8. canceled_shares s370fpib4.;
        end;
        /* Order Delete */
        when('D') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.;
        end;
        /* Order Replace */
        when('U') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.
                new_order_reference_number s370fpib8. shares s370fpib4. price s370fpib4. display $1.;
        end;
        /* Order Display */
        when('V') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.;
        end;
        /* Trade Message Non-Cross */
        when('P') do;
          input nanoseconds s370fpib4. order_reference_number s370fpib8.
                buy_sell_indicator $1. shares s370fpib4. stock $6. price s370fpib4.
                match_number s370fpib8.;
        end;
        /* Cross Trade Message */
        when('Q') do;
          input nanoseconds s370fpib4. shares_big s370fpib8. stock $6.
                cross_price s370fpib4. match_number s370fpib8. cross_type $1.;
        end;
        /* Broken Trade Order Execution */
        when('B') do;
          input nanoseconds s370fpib4. match_number s370fpib8.;
        end;
        /* Net Order Imbalance msg_type = 'I' */
        otherwise do;
          input nanoseconds s370fpib4. paired_shares s370fpib8. imbalance_shares s370fpib8.
                imbalance_direction $1. stock $6. fair_price s370fpib4. near_price s370fpib4.
                current_reference_price s370fpib4. cross_type $1. price_variation_indicator $1.;
        end;
      end;
    run ;

Art
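Because every record carries its own length, a reader does not have to treat every unrecognized type as a Net Order Imbalance message. The following is only a sketch of an alternative OTHERWISE branch, not Art's method: it uses the `width` value to consume the rest of an unhandled record so the stream stays aligned. The file path and the names `payload` and `paylen` are hypothetical, and it assumes no record is longer than 201 bytes.

```sas
data want;
  infile "c:\have.bin" recfm=n;             /* hypothetical path */
  length payload $200;
  input width s370fpib2. msg_type $1.;
  select (msg_type);
    when ('T') input seconds s370fpib4.;
    /* ... remaining WHEN clauses as in the program above ... */
    otherwise do;
      paylen = width - 1;                   /* bytes left after the type byte */
      input payload $varying200. paylen;    /* consume them to stay aligned */
      put "WARNING: unhandled message type " msg_type= width=;
    end;
  end;
run;
```

The `$varying200.` read with a length variable is the same technique Tom's diagnostic program uses. With this branch, a message type missing from the SELECT block is reported in the log and skipped instead of being read with the wrong layout.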