From: Joseph Ashwood on 5 Dec 2006 18:26

"Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
> Hi *
>
> I have a question. There are a few dozens of __terabytes__ of data. I need
> to split this data into blocks of fixed size. Then I need to calculate a
> sha1 hash for each block and store it. The question is which minimal
> size of block I have to choose to minimize the probability of existence
> of two equal hashes? What probability of such a coincidence would be if
> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
> md5 and crc32)?

Since you're talking about 2 generated values colliding, for a cryptographic
hash it takes on average about sqrt(2^output size) hashes; for MD5 you'll need
about 2^64, and for SHA-1 about 2^80 hashes generated before you collide.
Considering the size of your data pile, you'll probably get collisions sooner
because there is almost certainly duplicate data.

Here's how I'd do it, and I'd include encryption as well:
Encrypt everything with AES in CTR mode; the odds of a collision in the data
are now very small, about 1/16777216 for a petabyte of data. Then I'd use
SHA-512 with 512KB chunks and build up a Merkle tree; this will save you
enormous time later on in figuring out what was changed or damaged. Assuming
1PB this will give you 2*10^9 leaf blocks; then, holding to the 512KB chunk,
you can store ~8000 of the hashes per chunk with proper indexing (this will be
shy of the 8192 apparent maximum if you implement the indexing that is
necessary for security). This gives 268435 second-level chunks, a third level
of 34 chunks, and a root/top/fourth level that is tiny. For a total of about
2*10^9 chunks, and 2*10^9 << 2^256 (~10^77), which is the limit for SHA-512.
Each of these chunks will fit conveniently on a floppy disk, 1200 of them to a
CD, and I don't even want to do the math on the other options. On a system
with a modern CPU you should be able to achieve this in about a month of
compute time (assuming a quad core, scale appropriately); Tom's benchmark
website http://libtomcrypt.com:80/ltc113.html has more information to help you
run the numbers. The overall odds of a collision in this are in the
neighborhood of 1/10^68, which quite frankly is not going to happen.

It will be reasonably secure though, so don't lose the AES key. Without the
AES encryption the odds of collision rise sharply, but assuming your data
cannot be compressed beyond a reasonable amount they are still ignorable, and
if two blocks do collide you don't have to ship one of them because they are
identical.
Joe
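[A minimal sketch of the chunked SHA-512 hashing and Merkle-tree construction
described above, assuming Python and its hashlib module. The 512KB chunk size
follows the post, but the simple pairwise tree and the function names are only
illustrative; this is not Joe's exact ~8000-digests-per-chunk indexed layout.]

import hashlib

CHUNK_SIZE = 512 * 1024  # 512KB leaf chunks, as suggested above

def leaf_hashes(path):
    """Read a file in 512KB chunks and return the SHA-512 digest of each chunk."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.sha512(chunk).digest())
    return hashes

def merkle_root(hashes):
    """Build a simple pairwise Merkle tree over the leaf digests and return the
    root digest. A binary tree is used only to keep the sketch short."""
    if not hashes:
        return hashlib.sha512(b"").digest()
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last digest on levels with an odd count
            level.append(level[-1])
        level = [hashlib.sha512(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Usage (hypothetical file name):
#   root = merkle_root(leaf_hashes("backup.bin"))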
From: Mike Amling on 5 Dec 2006 10:12

Unruh wrote:
> Mike Amling <spamonly(a)allspam.com> writes:
>
>> Dmitry Chumack wrote:
>>> Hi *
>>>
>>> I have a question. There are a few dozens of __terabytes__ of data. I need
>>> to split this data into blocks of fixed size. Then I need to calculate a
>>> sha1 hash for each block and store it. The question is which minimal
>>> size of block I have to choose to minimize the probability of existence
>>> of two equal hashes? What probability of such a coincidence would be if
>>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>>> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
>>> md5 and crc32)?
>
> Unless the block size is less than 16 bytes do not worry about collisions.
> (If it is less than 16 bytes you will get collisions because the two blocks
> are identical)

?? Two identical inputs producing identical outputs is not regarded as a
collision. Collisions are harder to find than that.

And identical inputs can occur regardless of whether the block size is less
than 16 bytes.

--Mike Amling
From: Luc The Perverse on 5 Dec 2006 21:14

"Mike Amling" <spamonly(a)allspam.com> wrote in message
news:el55do$3rn(a)dispatch.concentric.net...
> Unruh wrote:
>> Mike Amling <spamonly(a)allspam.com> writes:
>>
>>> Dmitry Chumack wrote:
>>>> Hi *
>>>>
>>>> I have a question. There are a few dozens of __terabytes__ of data. I need
>>>> to split this data into blocks of fixed size. Then I need to calculate a
>>>> sha1 hash for each block and store it. The question is which minimal
>>>> size of block I have to choose to minimize the probability of existence
>>>> of two equal hashes? What probability of such a coincidence would be if
>>>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>>>> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
>>>> md5 and crc32)?
>>
>> Unless the block size is less than 16 bytes do not worry about
>> collisions. (If it is less than 16 bytes you will get collisions because
>> the two blocks are identical)
>
> ?? Two identical inputs producing identical outputs is not regarded as a
> collision. Collisions are harder to find than that.
>
> And identical inputs can occur regardless of whether the block size is
> less than 16 bytes.

I interpreted it as: he didn't want any non-equal blocks to produce
collisions. Obviously, if you have a stretch of all zeros there is going to
be a collision (or rather, two equal blocks).

--
LTP :)
From: Sergei on 6 Dec 2006 12:20

Joseph Ashwood wrote:
> "Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
> news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
> > Hi *
> >
> > I have a question. There are a few dozens of __terabytes__ of data. I need
> > to split this data into blocks of fixed size. Then I need to calculate a
> > sha1 hash for each block and store it. The question is which minimal
> > size of block I have to choose to minimize the probability of existence
> > of two equal hashes? What probability of such a coincidence would be if
> > I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
> > use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
> > md5 and crc32)?
>
> Since you're talking about 2 generated values colliding, for a cryptographic
> hash it takes on average about sqrt(2^output size) hashes; for MD5 you'll
> need about 2^64, and for SHA-1 about 2^80 hashes generated before you
> collide. Considering the size of your data pile, you'll probably get
> collisions sooner because there is almost certainly duplicate data.
>
> Here's how I'd do it, and I'd include encryption as well:
> Encrypt everything with AES in CTR mode; the odds of a collision in the data
> are now very small, about 1/16777216 for a petabyte of data. Then I'd use
> SHA-512 with 512KB chunks and build up a Merkle tree; this will save you
> enormous time later on in figuring out what was changed or damaged. Assuming
> 1PB this will give you 2*10^9 leaf blocks; then, holding to the 512KB chunk,
> you can store ~8000 of the hashes per chunk with proper indexing (this will
> be shy of the 8192 apparent maximum if you implement the indexing that is
> necessary for security). This gives 268435 second-level chunks, a third
> level of 34 chunks, and a root/top/fourth level that is tiny. For a total of
> about 2*10^9 chunks, and 2*10^9 << 2^256 (~10^77), which is the limit for
> SHA-512. Each of these chunks will fit conveniently on a floppy disk, 1200
> of them to a CD, and I don't even want to do the math on the other options.
> On a system with a modern CPU you should be able to achieve this in about a
> month of compute time (assuming a quad core, scale appropriately); Tom's
> benchmark website http://libtomcrypt.com:80/ltc113.html has more information
> to help you run the numbers. The overall odds of a collision in this are in
> the neighborhood of 1/10^68, which quite frankly is not going to happen.
>
> It will be reasonably secure though, so don't lose the AES key. Without the
> AES encryption the odds of collision rise sharply, but assuming your data
> cannot be compressed beyond a reasonable amount they are still ignorable,
> and if two blocks do collide you don't have to ship one of them because
> they are identical.
> Joe

How do you get such estimations for MD5 (2^64) and SHA-1 (2^80)? As I
understand it, such numbers can be given for truly random hash functions, and
MD5 and SHA are just algorithms that try to emulate such functions.

Sergei
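[For reference, the 2^64 and 2^80 figures come from the birthday bound: for an
ideal n-bit hash, a collision becomes likely after roughly sqrt(2^n) = 2^(n/2)
outputs, and for k outputs the collision probability is approximately
1 - exp(-k*(k-1)/2^(n+1)). A small sketch, assuming Python, that evaluates this
for the block counts discussed above:]

from math import expm1

def collision_probability(num_hashes, output_bits):
    """Birthday-bound approximation p ~= 1 - exp(-k*(k-1) / 2^(n+1))
    for k values drawn uniformly at random from an n-bit space."""
    k = float(num_hashes)
    n = output_bits
    return -expm1(-k * (k - 1) / 2.0 ** (n + 1))

# 2*10^9 blocks (1PB in 512KB chunks), assuming the hash behaves ideally:
print(collision_probability(2e9, 128))  # MD5:     ~5.9e-21
print(collision_probability(2e9, 160))  # SHA-1:   ~1.4e-30
print(collision_probability(2e9, 512))  # SHA-512: ~1.5e-136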
From: Volker Hetzer on 6 Dec 2006 13:20
Sergei schrieb:
> Joseph Ashwood wrote:
>> "Dmitry Chumack" <saint.d.a(a)gmail.com> wrote in message
>> news:1165339092.911172.123630(a)79g2000cws.googlegroups.com...
>>> Hi *
>>>
>>> I have a question. There are a few dozens of __terabytes__ of data. I need
>>> to split this data into blocks of fixed size.

How big are the blocks?

>>> Then I need to calculate a
>>> sha1 hash for each block and store it. The question is which minimal
>>> size of block I have to choose to minimize the probability of existence
>>> of two equal hashes?

The probability is minimal when the hash is equal (in value) to the block.
Other than that, if the hash is statistically good, the longer the hash, the
lower the collision probability. Of course, CPU time for the search increases
linearly with the hash length.
Btw, what do you want to do with the whole thing? Fast matches? Fast searches?
Anything cryptographic?
If it's just for matches and the blocks aren't very large, it's probably
cheaper in terms of CPU time to do a crc32 and compare the data in the few
cases where the hashes match.

>>> What probability of such a coincidence would be if
>>> I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
>>> use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
>>> md5 and crc32)?
>>
>> Since you're talking about 2 generated values colliding, for a
>> cryptographic hash it takes on average about sqrt(2^output size) hashes;
>> for MD5 you'll need about 2^64, and for SHA-1 about 2^80 hashes generated
>> before you collide. Considering the size of your data pile, you'll
>> probably get collisions sooner because there is almost certainly
>> duplicate data.
[...]
> How do you get such estimations for MD5 (2^64) and SHA-1 (2^80)? As I
> understand it, such numbers can be given for truly random hash functions,
> and MD5 and SHA are just algorithms that try to emulate such functions.

If we knew that they behaved differently, we'd be designing different hashes.
Consider MD5 and SHA-1 close enough to the ideal that such subtleties don't
matter.
Btw, MD5 is "cryptographically broken" in the sense that you can construct
two blocks that have equal hashes, but it doesn't change the statistics very
much. However, since you seriously consider crc32, the "broken" state of MD5
is of no concern to you.

The things that matter for you (IMHO) are:
- How much (in CPU cycles) does it cost to compute the hash?
- How much does it cost to compare two hashes?
- How much does it cost to compare two blocks? (Is I/O time relevant too?)
- How often do two blocks have to be compared?
- How often do two hashes have to be compared?
- How often do you have to compute hashes (i.e. how often does your data
  change)?
- Where do you get the hash from that you use to match the data blocks
  against? Does it cost extra?
With a couple of days collecting the performance data and playing with Excel
you should be able to choose the optimum hash.

By the way, it's also possible to calculate the full MD5 hash and then store
only a few bits of it, if that speeds up your searches. That makes sense only
for hash sizes larger than the crc hashes.

Lots of Greetings!
Volker

--
For email replies, please substitute the obvious.
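[A minimal sketch of the crc32-then-compare matching Volker describes,
assuming Python's zlib module; in a real index the CRC values would be stored
once per block rather than recomputed on every comparison:]

import zlib

def blocks_match(block_a, block_b):
    """Cheap pre-filter: compare CRC32 values first, and fall back to a full
    byte-by-byte comparison only when the CRCs agree."""
    if zlib.crc32(block_a) != zlib.crc32(block_b):
        return False           # CRCs differ, so the blocks certainly differ
    return block_a == block_b  # rare case: confirm by comparing the data itself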