Prev: 443413 M3i Zero , Ezflash Dsi , R4i Dsi 43531
Next: Can I get the Mime Content Type from a byte array?
From: Michelle on 13 Sep 2009 06:15 Peter, > Try something like this (warning: uncompiled, untested, unoptimized, no > error-checking, doesn't handle files > 2GB, and assumes that the search > pattern is always no longer than the length of a single block that's > read...proof of concept only): With the help of Tom I've solved some errors. So it works :-)) I want to first understand and optimize this part, before I get busy with reading the previous four bytes. Why doesn't it handle files > 2GB ? (this could be an issue for me. The files are sometimes even larger then 2 Gb) Is it correct that it finds only the first occurence of the pattern. (if so, how to find all occurences ?) About the error-checking (and as part of that) compare the search pattern length and the length of a single block, I think I can solve that. But can you give me some hits about the optimalisation (and what about the "Boyer-Moore Algorithm") Thanks in advance, Michelle
From: Tom Spink on 13 Sep 2009 07:03 Hello, Michelle. Michelle wrote: > Why doesn't it handle files > 2GB ? (this could be an issue for me. The > files are sometimes even larger then 2 Gb) It's the old, "There's not enough atoms in the Universe" problem - but in this case, there's not enough bits in an integer. An 'int' is 32 bits. 2^32 = 4294967296 This is 4 GiB, I hear you think. But, an int is signed, so in actual fact, the largest value for an int is: int.MaxValue = (2^31 - 1) = 2147483647 = 2GiB (it's "minus one" because the maximum value is one less than the number of possible values) This is because of the way negative numbers are represented, half the possible values represent a positive value (0 to 2147483647) and half the possible values represent a negative value (-2,147,483,648 to -1). > Michelle -- Tom
From: Peter Duniho on 13 Sep 2009 16:01 On Sun, 13 Sep 2009 03:15:36 -0700, Michelle <michelle(a)notvalid.nomail> wrote: > Peter, > >> Try something like this (warning: uncompiled, untested, unoptimized, no >> error-checking, doesn't handle files > 2GB, and assumes that the search >> pattern is always no longer than the length of a single block that's >> read...proof of concept only): > > With the help of Tom I've solved some errors. So it works :-)) > I want to first understand and optimize this part, before I get busy with > reading the previous four bytes. > > Why doesn't it handle files > 2GB ? (this could be an issue for me. The > files are sometimes even larger then 2 Gb) Tom's answered that. If you want larger files to be handled, you need to use 64-bit integers instead of 32-bit integers anywhere the code has an int used as an offset within the file. > Is it correct that it finds only the first occurence of the pattern. (if > so, > how to find all occurences ?) You are correct...it returns only the first occurrence. To find all occurrences, instead of returning when a match is found, you'd simply do some appropriate action there and then continue processing as if you hadn't found a match. That appropriate action could be to call a delegate (e.g. Action<int>, where the delegate takes the offset into the file and does something), or perhaps just add the offset to a List<int> and then return the List<int>.ToArray() result of that at the end. > About the error-checking (and as part of that) compare the search pattern > length and the length of a single block, I think I can solve that. > But can you give me some hits about the optimalisation (and what about > the > "Boyer-Moore Algorithm") Boyer-Moore is a special-case of the finite-state-machine solution I alluded to earlier. For a search for a single string, the general-purpose FSM solution has a small potential to be slightly faster than the current brute-force approach, and the Boyer-Moore the potential to be slightly faster than that. Any speed improvement when searching for a single string would be highly dependent on the input data. The reason those might be a little faster is that they both guarantee that you only ever have to examine a given byte from the file once. The brute-force solution I posted has the potential for examining a given byte multiple times, depending on how many times it follows something that looks like a potential match. In practice, both the generalized FSM and the Boyer-Moore solution involve a little bit of overhead themselves. At the same time, assuming random data, on average the brute-force algorithm will only have a 2-byte false start (the smallest false start possible) only 1 out of 256 iterations, and only longer than that 1 out of 65536 iterations. I.e. most of the time even the brute-force algorithm only examines each byte once, if you only searching for one string. Where Boyer-Moore and the generalized FSM shine is when you have multiple strings to compare against, because they continue to allow you to only ever examine each byte of the file once, whereas the brute-force solution increases exponentially in cost as your list of search strings increases in size. If you have only a single string at a time you're ever looking for, I wouldn't bother with the FSM or Boyer-Moore (or similar). As far as optimizing the code I posted goes, I haven't really thought about it much. I would say that if it performs acceptably well for you, you might as well spend your time on other things. Don't bother trying to make it faster until you know for sure that making it faster will make a difference to your user, and then make it faster by measuring the performance of the code with a profiler so you know what parts are slower and should be improved. Pete
From: rossum on 13 Sep 2009 18:42 On Sun, 13 Sep 2009 13:01:40 -0700, "Peter Duniho" <no.peted.spam(a)no.nwlink.spam.com> wrote: >Where Boyer-Moore and the generalized FSM shine is when you have multiple >strings to compare against In that case Rabin-Karp is also a possibility. rossum
From: Michelle on 14 Sep 2009 03:38
Peter, > Tom's answered that. If you want larger files to be handled, you need to > use 64-bit integers instead of 32-bit integers anywhere the code has an > int used as an offset within the file. Solved that already. If you read the explanation is actually very logical and simple. > . . To find all occurrences, instead of returning when a match is found, > you'd simply do some appropriate action there and then continue > processing as if you hadn't found a match. Solved that already. Rest and take away what, works enlightening, > Boyer-Moore is a special-case of the finite-state-machine solution I > alluded to earlier. For a search for a single string, the general-purpose Thanks for the clear explanation. > .... I would say that if it performs acceptably well for you, you might > as well spend your time on other things. It performs much faster then the Powershell script. For now it is more than enough. I am now involved in reading the previous four bytes. Michelle |