Search through a (large) binary file. [CSharp]

From: Michelle on 11 Sep 2009 08:21

This is what I've done so far:

-<code>-

void ReadFile()
{
    string pattern = "FF56131A1B087B15610800151E";
    string inFile = @"D:\myfile.bin";
    string outFile = @"D:\myfile.txt";
    long blockSize = 512000;
    int prevBytes = 4;

    FileStream stream = File.OpenRead(inFile);
    if (stream.Length < blockSize)
    {
        blockSize = stream.Length;
    }
    long lastBlock = stream.Length % blockSize;
    byte[] buffer = new byte[blockSize];
    int extraBytes = pattern.Length / 2 + prevBytes - 1;

    // ERROR
    // long blocks = (long)Math.Ceiling(stream.Length / blockSize);
    // The call is ambiguous between the following methods or
    // properties: 'System.Math.Ceiling(decimal)' and 'System.Math.Ceiling(double)'

    int block = 1;

    while (block <= blocks)
    {
        if (block==blocks && block==lastBlock)
        {
            blockSize = lastBlock;

            // ERROR
            // byte[] buffer = new byte[blockSize];
            // A local variable named 'buffer' cannot be declared in
            // this scope because it would give a different meaning to 'buffer',
            // which is already used in a 'parent or current' scope
            // to denote something else

        }
    }

    stream.Read(buffer, 0, blockSize);
    string hexStr = String.Join("", (rimBytes + buffer));
    String.Format("{0:X2}",hexStr);
    // more to go
}

-<code>-
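
For reference, a minimal sketch of how the two compiler errors marked in the listing above might be resolved. The first arises because `stream.Length / blockSize` is a `long`, which converts implicitly to both `double` and `decimal`; integer ceiling division sidesteps `Math.Ceiling` altogether. The second goes away when the existing `buffer` variable is reassigned instead of declared again. The names `BlockReaderSketch`, `CountBlocks` and `ReadBlocks` are hypothetical, and the block size is taken as an `int` here because `Stream.Read` expects an `int` count. The actual pattern search is left out.

```csharp
using System;
using System.IO;

static class BlockReaderSketch
{
    // Integer ceiling division: no Math.Ceiling call, so no ambiguity
    // between the double and decimal overloads.
    public static long CountBlocks(long length, long blockSize)
    {
        return (length + blockSize - 1) / blockSize;
    }

    // Reads the stream block by block and returns the total bytes read.
    public static long ReadBlocks(Stream stream, int blockSize)
    {
        byte[] buffer = new byte[blockSize];          // declared once
        long blocks = CountBlocks(stream.Length, blockSize);
        long lastBlock = stream.Length % blockSize;
        long total = 0;

        for (long block = 1; block <= blocks; block++)
        {
            if (block == blocks && lastBlock != 0)
            {
                // Reassign the existing variable (no 'byte[]' prefix);
                // redeclaring it is what triggers the scope error.
                buffer = new byte[lastBlock];
            }

            // Note: Stream.Read may return fewer bytes than requested.
            int read = stream.Read(buffer, 0, buffer.Length);
            total += read;
            // ... search the first 'read' bytes of 'buffer' here ...
        }
        return total;
    }

    static void Main()
    {
        using (var ms = new MemoryStream(new byte[10]))
        {
            Console.WriteLine(ReadBlocks(ms, 4)); // blocks of 4, 4 and 2 bytes
        }
    }
}
```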

It's based on a PowerShell script I found with Google.

-<script.ps1>-

param (
    [string]$pattern = $(throw '-pattern can''t be $null'),
    [string]$file = $(throw '-file can''t be $null'),
    [long]$blockSize = 50kb,
    [int]$prevBytes = 8
)
if ($blockSize -lt 1kb) {throw '-blockSize must be at least 1kb'}
$file = $(if (test-path $file) {(rvpa $file).path} else {
    throw "Cannot find $file"})
$stream = [io.file]::OpenRead($file)
if ($stream.length -lt $blockSize) {$blockSize = $stream.length}
$lastBlock = $stream.length % $blockSize
$buffer = new-object byte[] $blockSize
$extraBytes = $pattern.length / 2 + $prevBytes - 1
$blocks = [math]::ceiling($stream.length / $blockSize)
$block = 1
while ($block -le $blocks) {
    if ($block -eq $blocks -and $lastBlock) {
        $blockSize = $lastBlock
        $buffer = new-object byte[] $blockSize
    }

    [void]$stream.read($buffer, 0, $blockSize)
    $hexStr = [string]::join('', ($rimBytes + $buffer | % {'{0:X2}' -f $_}))
    $rimBytes = $buffer[-$extraBytes..-1]
    [regex]::matches($hexStr, $pattern, 'ignoreCase') |
        % {$i = $_.index - $prevBytes * 2
            [string]::join('', $hexStr[$i..($i + $prevBytes * 2 - 1)]) |
                % {$target = $_
                    $hexBytes = 0..($_.length - 1) | ? {!($_ -band 1)} |
                        % {"0x$($target.subString($_,2))"}
                }
        }
    $block++
}
# Parentheses are required to actually invoke the methods
$stream.close()
$stream.dispose()

-<script.ps1>-

From: Tom Spink on 11 Sep 2009 10:59

Michelle wrote:

> Hi Tom,
>
>> Do you know if the hex value you are searching for is aligned at all?
>
> I don't exactly understand what you mean with "aligned" (maybe it's because
> English is not my native language)
> I need to search in all the rubbish for the pattern
> (FF56131A1B087B15610800151E)
> and then read the previous 4 bytes. (For example:
> 0923080709C0224FFF56131A1B087B15610800151E -> 09C0224F)
> The pattern is a kind of 'record footer', but the 'records' don't have a
> fixed length.
>
> Is this the information you need?
>
> Michelle

Hi Michelle,

I posted a possible solution in reply to another message - I was waiting
for community approval, however!

--
Tom

From: Peter Duniho on 11 Sep 2009 13:05

On Fri, 11 Sep 2009 02:10:36 -0700, Tom Spink <tspink(a)gmail.com> wrote:

> [...]
> public static long FindArrayOffset(Stream s, byte[] arr)
> {
> int testByte, testIndex = 0;
> long startingOffset = 0;
>
> for (testByte = s.ReadByte(); testByte >= 0; testByte =
> s.ReadByte()) {
> if (arr[testIndex++] != (byte)testByte) {
> testIndex = 0;
> startingOffset = s.Position;
> }
>
> if (testIndex == arr.Length)
> return startingOffset;
> }
>
> return -1;
> }
> ///
>
> I haven't thought through all the code paths, but I've got a brain cell
> nagging me about off-by-one errors, so pay particular attention to the
> post-increment operator, as I may have cocked that up.

I would say that the biggest issue is that if the real match is found
starting within a partial match, that won't be found. For example,
searching for "ABC" within the string "AABC". By the time the code
"knows" that the string "AA" doesn't match with "AB", it's passed where
the actual location of the searched-for string starts (the next character
it's going to examine for a match with the beginning of the search string
is 'B').
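
To make the failure concrete, here is a small sketch that runs the quoted `FindArrayOffset` (logic unchanged) over an in-memory stream. The wrapper class `PartialMatchDemo` and the helper `Search` are hypothetical names added only for this illustration, and ASCII text stands in for arbitrary bytes.

```csharp
using System;
using System.IO;
using System.Text;

static class PartialMatchDemo
{
    // Same logic as the quoted method: on a mismatch the search restarts
    // at the NEXT byte, so the byte that caused the mismatch is never
    // reconsidered as a possible start of the pattern.
    public static long FindArrayOffset(Stream s, byte[] arr)
    {
        int testByte, testIndex = 0;
        long startingOffset = 0;

        for (testByte = s.ReadByte(); testByte >= 0; testByte = s.ReadByte())
        {
            if (arr[testIndex++] != (byte)testByte)
            {
                testIndex = 0;
                startingOffset = s.Position;
            }

            if (testIndex == arr.Length)
                return startingOffset;
        }

        return -1;
    }

    // Hypothetical helper: search one ASCII string inside another.
    public static long Search(string haystack, string needle)
    {
        using (var ms = new MemoryStream(Encoding.ASCII.GetBytes(haystack)))
        {
            return FindArrayOffset(ms, Encoding.ASCII.GetBytes(needle));
        }
    }

    static void Main()
    {
        Console.WriteLine(Search("XABC", "ABC")); // 1: no false start, found
        Console.WriteLine(Search("AABC", "ABC")); // -1: the match at offset 1 is missed
    }
}
```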

The problem is fixable, but would result in a more elaborate FSM
implementation (it's not as simple as just backing up one byte, because
the problem is more generalized than that...e.g. search for "AABC" within
"AAABC", etc.). It's not clear that the OP really needs this level of
complexity; managing cross-block searches isn't really _that_ hard, while
implementing a correct FSM can actually be a little tricky. And assuming
false starts during the search are few and very short, repeatedly
examining a given byte after a false start shouldn't hurt performance too
much.

All that said, it's true that one significant advantage of an FSM approach
is the ability to scan strictly sequentially, one byte at a time, through
the file, no cross-block issues to worry about. If that's an appealing
feature to the OP, in spite of the added complexity the FSM will bring,
there have been several good solutions suggested along these lines in past
threads in this newsgroup that essentially address the "search for a
string in a stream of characters" problem. Applying the same solution to
bytes instead of characters is trivial.
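
As one possible shape for such a sequential scan, here is a minimal sketch using the Knuth-Morris-Pratt failure table, which is a swapped-in, well-known technique and not necessarily what the linked thread proposes. The names `KmpStreamSearch`, `BuildFailureTable` and `IndexOf` are hypothetical. Each byte is read exactly once, so block boundaries never come into play.

```csharp
using System;
using System.IO;
using System.Text;

static class KmpStreamSearch
{
    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] BuildFailureTable(byte[] pattern)
    {
        int[] failure = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
        return failure;
    }

    // Returns the offset of the first match, counted from the stream
    // position at the time of the call, or -1 if there is none.
    public static long IndexOf(Stream s, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;

        int[] failure = BuildFailureTable(pattern);
        int matched = 0;     // how many pattern bytes currently match
        long consumed = 0;   // how many stream bytes have been read
        int b;

        while ((b = s.ReadByte()) >= 0)
        {
            consumed++;
            // On a mismatch, fall back to the longest shorter partial
            // match instead of discarding everything seen so far.
            while (matched > 0 && pattern[matched] != (byte)b)
                matched = failure[matched - 1];
            if (pattern[matched] == (byte)b) matched++;
            if (matched == pattern.Length) return consumed - pattern.Length;
        }
        return -1;
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("AAABC");
        byte[] pattern = Encoding.ASCII.GetBytes("AABC");
        using (var ms = new MemoryStream(data))
        {
            Console.WriteLine(IndexOf(ms, pattern)); // 1
        }
    }
}
```

On a seekable stream, the bytes preceding a match could then be fetched by setting `Position` a few bytes before the returned offset, provided the offset is large enough.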

Here's a link to one such message thread:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/462037d91c08e693/

It includes a bit of a discussion on the topic, includes suggestions for
at least two different viable approaches, and one of my posts in the
thread links to a couple of posts in yet another previous thread where (in
the first post) I posted a non-tricky, basic, generic state graph
implementation, and then (in the follow-up) a little performance-fix to
that original code.

Pete

From: Peter Duniho on 11 Sep 2009 13:08

On Fri, 11 Sep 2009 02:59:42 -0700, Michelle <michelle(a)notvalid.nomail>
wrote:

> Hi Tom,
>
>> Do you know if the hex value you are searching for is aligned at all?
>
> I don't exactly understand what you mean with "aligned" (maybe it's
> because
> English is not my native language)

I believe that Tom's asking if the pattern you're looking for will always
begin at a byte the offset of which is exactly some multiple of some
number larger than 1.

For example, a 32-bit-aligned pattern could be found at byte offset 0, 4,
8, etc. but never at 1, 2, 3, 5, 6, 7, etc.

If it's aligned, then that obviously can reduce somewhat the overhead of
non-matching characters, because you have to examine on average 1/cb the
number of bytes, where "cb" is the byte length of the alignment (e.g. for
32-bit alignment, "cb" is "4").
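
A rough sketch of what an aligned search could look like, assuming the data is already in a byte array. The names `AlignedSearch` and `IndexOfAligned` are hypothetical; only start offsets that are multiples of `alignment` are tested, which is where the 1/cb saving comes from.

```csharp
using System;

static class AlignedSearch
{
    // Returns the first offset that is a multiple of 'alignment' at which
    // 'pattern' occurs in 'data', or -1 if there is no such offset.
    public static int IndexOfAligned(byte[] data, byte[] pattern, int alignment)
    {
        if (alignment < 1) throw new ArgumentOutOfRangeException("alignment");

        for (int start = 0; start + pattern.Length <= data.Length; start += alignment)
        {
            int i = 0;
            while (i < pattern.Length && data[start + i] == pattern[i]) i++;
            if (i == pattern.Length) return start;
        }
        return -1;
    }

    static void Main()
    {
        byte[] data = { 1, 2, 3, 4, 5, 6, 7, 8 };

        // Pattern starts at offset 4, a multiple of 4: found.
        Console.WriteLine(IndexOfAligned(data, new byte[] { 5, 6 }, 4)); // 4

        // Pattern starts at offset 1: skipped under 4-byte alignment.
        Console.WriteLine(IndexOfAligned(data, new byte[] { 2, 3 }, 4)); // -1

        // With alignment 1 every offset is tested.
        Console.WriteLine(IndexOfAligned(data, new byte[] { 2, 3 }, 1)); // 1
    }
}
```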

Pete

From: Michelle on 11 Sep 2009 13:18

It's not aligned.
