From: Arthur Tabachneck on
What I posted was indeed a hex-viewer file dump, as I didn't think the list
would appreciate my posting the actual 8 gig of characters. These were
simply the first 272 bytes.
From: Proc Me on
Art,

I've not got very far, but the following may help:

data have;
input tid $ (t1-t16) ($) ;
datalines;
00000000 00 05 54 00 00 54 71 00 06 53 23 76 FA C0 4F 00
00000010 05 54 00 00 58 BD 00 12 52 22 18 07 00 41 20 20
00000020 20 20 20 54 20 00 00 00 64 4E 00 12 52 22 18 07
00000030 00 41 20 20 20 20 20 54 20 00 00 00 64 4E 00 12
00000040 52 22 27 49 40 41 41 20 20 20 20 54 20 00 00 00
00000050 64 4E 00 11 48 22 36 8B 80 41 20 20 20 20 20 54
00000060 20 20 20 20 20 00 12 52 22 36 8B 80 41 41 2D 20
00000070 20 20 54 20 00 00 00 64 4E 00 11 48 22 36 8B 80
00000080 41 20 20 20 20 20 54 20 20 20 20 20 00 12 52 22
00000090 45 CD C0 41 41 43 20 20 20 54 20 00 00 00 64 4E
000000A0 00 11 48 22 45 CD C0 41 41 20 20 20 20 54 20 20
000000B0 20 20 20 00 12 52 22 45 CD C0 41 41 43 43 20 20
000000C0 47 20 00 00 00 64 4E 00 11 48 22 45 CD C0 41 41
000000D0 2D 20 20 20 54 20 20 20 20 20 00 12 52 22 45 CD
000000E0 C0 41 41 49 20 20 20 54 20 00 00 00 64 4E 00 11
000000F0 48 22 45 CD C0 41 41 43 20 20 20 54 20 20 20 20
00000100 20 00 12 52 22 45 CD C0 41 41 4D 45 20 20 47 20
;
run;

This looks like a file dump, with 16 bytes per line and an offset counter at
the start of each line. I think this structure has been superimposed on the
data rather than being intrinsic: it looks like output from a hex viewer.

To break out of this structure, I would write it to a dataset with one
observation per byte, and then look for patterns that would allow the bytes
to be put back together more meaningfully.

data want(keep=line i h n b);
  set have;
  format line comma9.;
  array t(16) t1-t16;
  line = input(substr(tid, 1, 7), hex7.);  /* offset / 16 = row number     */
  do i = 1 to 16;
    h = t(i);                              /* the hex pair, as text        */
    n = input(t(i), hex2.);                /* its numeric value, 0-255     */
    b = byte(n);                           /* the corresponding character  */
    output;                                /* one observation per byte     */
  end;
run;

I'm not sure how to put this back together, but I would hazard a guess that
sequences of bytes go together, either delimited by the 00 bytes or based on
position. I would also guess that some of the data is numeric and some is
character: obs 4, for example, has a run of 20's, the hex for an ASCII space
(you might notice %20 in URLs where spaces are encoded).
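
A quick way to test that hunch (just a sketch, building on the want dataset
above) is to classify each byte and tabulate the classes, so that runs of
printable ASCII stand out from the binary filler:

data classified;
  set want;
  length class $9;
  if n = 0 then class = 'null';
  else if 32 <= n <= 126 then class = 'printable';
  else class = 'binary';
run;

proc freq data=classified order=freq;
  tables class n;                  /* class mix and byte-value frequencies */
run;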

There appears to be some structure to the data: witness the repeating
sequence 64 4E, which read as an integer would be 25678 (or 20068 with the
bytes swapped), values at least in the broad range of a SAS date, though it
could just as easily be a counter. The sequence 41 20 20 20 20 54 20 also
appears to recur; this could be the text "A    T "? 12 52 22 18 07 also
recurs, as do sequences of a similar structure, possibly an ID? The twenty
bytes starting in position 7 of row 1 repeat themselves, although the
twenty-byte pattern then appears to break down.
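
One way to probe the date idea (again only a sketch, reusing the want
dataset; the window of "plausible" dates is my own assumption) is to pair
adjacent bytes as big-endian 16-bit integers and keep any that would make a
believable SAS date:

data datecheck(keep=line i val sasdate);
  set want;
  prev = lag(n);                   /* the previous byte in the stream      */
  if not missing(prev);
  val = prev*256 + n;              /* two adjacent bytes as one integer    */
  if 13000 <= val <= 26000;        /* roughly 1995 to 2031 as SAS dates    */
  sasdate = val;
  format sasdate date9.;
run;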

A visual way of looking for structure in this data might be to export the
dataset into Excel, colour-code "runs" of structure, and then zoom out to
look for patterns.
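
For the Excel route, a minimal sketch (the output path is only an example)
would be:

proc export data=want outfile='bytes.csv' dbms=csv replace;
run;

From there, conditional formatting on the n column makes the runs easy to
spot.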

Can we know a little more about the provenance of this blob of data? Could
the poster get the original source data in another format? I'm curious as to
why they are having to reverse engineer it.

I doubt I've got much further than you had, but I hope this helps,

Proc Me
From: NordlDJ on
Art,

Is this the "binary" data file that we were discussing a few days ago on the list? If not, do you have any additional information about the supposed structure of the dataset?

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
From: Arthur Tabachneck on
Dan,

Yes it is that same file.

Art
From: Savian on
On Jan 27, 8:37 am, art...(a)NETSCAPE.NET (Arthur Tabachneck) wrote:
> One of our posters sent me their data file (all 8 gig of it) and, frankly,
> I'm totally lost trying to figure out how to read it.  The following is a
> dump of the first 272 characters in the hope that it might trigger
> someone's stroke of brilliance:
>
> 00000000    00 05 54 00 00 54 71 00 06 53 23 76 FA C0 4F 00
> 00000010    05 54 00 00 58 BD 00 12 52 22 18 07 00 41 20 20
> 00000020    20 20 20 54 20 00 00 00 64 4E 00 12 52 22 18 07
> 00000030    00 41 20 20 20 20 20 54 20 00 00 00 64 4E 00 12
> 00000040    52 22 27 49 40 41 41 20 20 20 20 54 20 00 00 00
> 00000050    64 4E 00 11 48 22 36 8B 80 41 20 20 20 20 20 54
> 00000060    20 20 20 20 20 00 12 52 22 36 8B 80 41 41 2D 20
> 00000070    20 20 54 20 00 00 00 64 4E 00 11 48 22 36 8B 80
> 00000080    41 20 20 20 20 20 54 20 20 20 20 20 00 12 52 22
> 00000090    45 CD C0 41 41 43 20 20 20 54 20 00 00 00 64 4E
> 000000A0    00 11 48 22 45 CD C0 41 41 20 20 20 20 54 20 20
> 000000B0    20 20 20 00 12 52 22 45 CD C0 41 41 43 43 20 20
> 000000C0    47 20 00 00 00 64 4E 00 11 48 22 45 CD C0 41 41
> 000000D0    2D 20 20 20 54 20 20 20 20 20 00 12 52 22 45 CD
> 000000E0    C0 41 41 49 20 20 20 54 20 00 00 00 64 4E 00 11
> 000000F0    48 22 45 CD C0 41 41 43 20 20 20 54 20 20 20 20
> 00000100    20 00 12 52 22 45 CD C0 41 41 4D 45 20 20 47 20
>
> Art

Art,

I pasted your dump into UltraEdit, column-ripped the line numbers, then
pasted the result into a hex editor as hex text to see what you have. My
guess is that the first 272 bytes are merely part of a file's metadata and
do not contain any record values. If they do contain records, the values are
all numeric (most likely a series of short ints); I see no meaningful text.
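
The same column-ripping idea could be scripted in SAS, something like this
sketch (file names are hypothetical): read the dump text, discard the offset
column, and write the bytes back out as a raw binary file:

data _null_;
  infile 'dump.txt' truncover;
  file 'raw.bin' recfm=n;          /* write an unstructured byte stream    */
  length tok $2 b $1;
  input offset :$8. @;             /* discard the offset column            */
  do i = 1 to 16;
    input tok :$2. @;
    if tok = ' ' then leave;       /* allow a short last line              */
    b = byte(input(tok, hex2.));   /* hex pair -> one raw byte             */
    put b $char1.;
  end;
run;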

Without a guidepost for what the data could be, it is almost impossible to
figure out. You need data from deeper in the file (try to get past the
metadata; metadata should sit at the megabyte marker locations, hence 1024,
2048, etc.), or you would need a reference for where the file was generated,
so as to hazard a guess at the meaning of this data.
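
If "deeper in" means pulling a dump from further into the file, something
like this would do it in SAS (path and offsets are only placeholders):
treat the file as fixed 16-byte records and print a block of them from past
the first megabyte, in the same layout as the original dump:

data _null_;
  infile 'bigfile.dat' recfm=f lrecl=16 firstobs=65537 obs=65552;
  input row $char16.;              /* one fixed 16-byte record             */
  put row $hex32.;                 /* shown as 32 hex digits               */
run;

(65536 records of 16 bytes each is exactly 1 MB, so firstobs=65537 starts
just past it.)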

HTH,

Alan
http://www.savian.net