Prev: cracking DES ciphertext with known plaintext using custom character set.
Next: Large prime for easily finding discrete logs
From: Dmitry Chumack on 5 Dec 2006 12:18

Hi *

I have a question. There are a few dozen __terabytes__ of data. I need to split this data into blocks of fixed size. Then I need to calculate a sha1 hash for each block and store it. The question is: what minimal block size do I have to choose to minimize the probability that two equal hashes exist? What would the probability of such a coincidence be if I used md5, crc32 or sha2 instead of sha1? Or a combination of these algorithms (e.g. sha1 and crc32, or sha2 and md5 and crc32)?

Thanks in advance.
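The splitting-and-hashing step the question describes can be sketched as follows; the 1 MiB block size and the stream interface are placeholders for illustration, not recommendations:

```python
import hashlib

def hash_blocks(stream, block_size=1 << 20):
    """Read a binary stream in fixed-size blocks and yield the SHA-1
    hex digest of each block. The 1 MiB default is an arbitrary
    placeholder; how this choice affects the chance of two equal
    digests is exactly what the thread discusses."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield hashlib.sha1(block).hexdigest()
```

Usage would be along the lines of `with open(path, "rb") as f: digests = list(hash_blocks(f))`; swapping `hashlib.sha1` for `hashlib.md5` or `hashlib.sha256` covers the other algorithms mentioned.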
From: Christian Siebert on 5 Dec 2006 13:08

> I have a question.

Well, you've given three questions:

1)
> There are a few dozen __terabytes__ of data. I need to split this data
> into blocks of fixed size. Then I need to calculate a sha1 hash for each
> block and store it. The question is: what minimal block size do I have
> to choose to minimize the probability that two equal hashes exist?

A single block gives you a 0% collision probability (anything else will give you more than 0%).

2)
> What would the probability of such a coincidence be if I used md5,
> crc32 or sha2 instead of sha1?

If we assume that each hash value is equally likely, this probability depends only on the length of your hash. The md5 hash, for example, has a length of 128 bits, so the probability that your data produces one particular hash value is 1/(2^128). It gets a bit more complicated once you have several blocks => read about the birthday paradox.

3)
> Or a combination of these algorithms (e.g. sha1 and crc32, or sha2 and
> md5 and crc32)?

Cryptographic hashes are constructed to avoid collisions, which is why they are much longer than checksums like CRC-32; collisions are easy to find for such checksums because they are not meant to avoid them.

Given the above hints, you should try to estimate the collision probability yourself and then think again about your questions. ;-)

Christian
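Christian's hints can be turned into numbers. A minimal sketch, using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/2N) rather than the exact factorial formula, which is numerically hopeless for N = 2^128; the function name and interface are illustrative:

```python
import math

def collision_probability(num_blocks, hash_bits):
    """Approximate probability that at least two of num_blocks hashes
    collide, assuming each hash_bits-bit value is uniform and
    independent (birthday approximation: 1 - exp(-n(n-1)/2N)).
    expm1 keeps the result accurate when the probability is tiny."""
    n = float(num_blocks)
    N = 2.0 ** hash_bits
    return -math.expm1(-n * (n - 1.0) / (2.0 * N))
```

For example, tens of terabytes split into 1 MiB blocks is on the order of 10^7 to 10^8 blocks; with a 128-bit hash the collision probability comes out around 10^-23, while with a 32-bit checksum it is essentially 1.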
From: Mike Amling on 5 Dec 2006 06:12

Dmitry Chumack wrote:
> Hi *
>
> I have a question. There are a few dozen __terabytes__ of data. I need
> to split this data into blocks of fixed size. Then I need to calculate a
> sha1 hash for each block and store it. The question is: what minimal
> block size do I have to choose to minimize the probability that two
> equal hashes exist? What would the probability of such a coincidence be
> if I used md5, crc32 or sha2 instead of sha1? Or a combination of these
> algorithms (e.g. sha1 and crc32, or sha2 and md5 and crc32)?

If you use SHA-1, you won't have any collisions. If you use MD5, you won't have any collisions unless the data has been specially constructed to produce a collision.

--Mike Amling
From: Unruh on 5 Dec 2006 16:23

Mike Amling <spamonly(a)allspam.com> writes:
>Dmitry Chumack wrote:
>> I have a question. There are a few dozen __terabytes__ of data. I need
>> to split this data into blocks of fixed size. Then I need to calculate
>> a sha1 hash for each block and store it. The question is: what minimal
>> block size do I have to choose to minimize the probability that two
>> equal hashes exist? What would the probability of such a coincidence be
>> if I used md5, crc32 or sha2 instead of sha1? Or a combination of these
>> algorithms (e.g. sha1 and crc32, or sha2 and md5 and crc32)?

Unless the block size is less than 16 bytes, do not worry about collisions. (If it is less than 16 bytes you will get collisions, but only because the two blocks are identical.)

> If you use SHA-1, you won't have any collisions. If you use MD5, you
> won't have any collisions unless the data has been specially constructed
> to produce a collision.
> --Mike Amling
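Unruh's parenthetical is the pigeonhole case: with very small blocks, the blocks themselves must repeat, and identical inputs always produce identical digests regardless of the hash function. A trivial illustration:

```python
import hashlib

# There are only 256 distinct 1-byte blocks, so any 257 one-byte blocks
# must contain a repeat; a repeated block trivially yields a repeated
# hash, whatever the hash function. Equal inputs give equal digests:
digest_a = hashlib.sha1(b"\x00").hexdigest()
digest_b = hashlib.sha1(b"\x00").hexdigest()
assert digest_a == digest_b
```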
From: James Stroud on 5 Dec 2006 16:42

Dmitry Chumack wrote:
> Hi *
>
> I have a question. There are a few dozen __terabytes__ of data. I need
> to split this data into blocks of fixed size. Then I need to calculate a
> sha1 hash for each block and store it. The question is: what minimal
> block size do I have to choose to minimize the probability that two
> equal hashes exist? What would the probability of such a coincidence be
> if I used md5, crc32 or sha2 instead of sha1? Or a combination of these
> algorithms (e.g. sha1 and crc32, or sha2 and md5 and crc32)?
>
> Thanks in advance.

Christian Siebert alludes to this probability:

                  N!
p(n) = 1 - --------------
           (N**n)(N-n)!

where n is the number of blocks and N is the number of possible hashes of the hash algorithm, i.e. 2**L, where L is the length, in bits, of the hash function. A user-friendly derivation is here: http://en.wikipedia.org/wiki/Birthday_paradox

It has been suggested that one way to double the length of your hash, if your blocks get too large, is

f(f(block) + block) + f(block)

where f is a hashing function--but of course the computational cost of hashing increases by a factor of >3.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
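The construction James mentions can be sketched with SHA-1 standing in for f and "+" read as byte concatenation; this is an illustration of the idea from the post, not a vetted design:

```python
import hashlib

def f(data):
    """The underlying hash, SHA-1 here (20-byte digest)."""
    return hashlib.sha1(data).digest()

def double_length_hash(block):
    """f(f(block) + block) + f(block): concatenate an outer digest
    (computed over the inner digest prepended to the block) with the
    inner digest, giving a 40-byte tag from a 20-byte hash function."""
    inner = f(block)                  # f(block)
    return f(inner + block) + inner   # f(f(block) + block) + f(block)
```

Note that f is invoked twice over block-sized input, so the per-block hashing cost grows accordingly, as the post warns.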