From: Paavo Helde on 26 Jan 2010 13:25

I am looking for a decent algorithm for calculating some kind of hash of a
container, making use of the hashes of its components, i.e. calculating a
composite hash from the hashes of sub-parts. We are using it just for
detecting changed data parts, so crypto strength is not actually needed. So
far I have thought of the following approaches:

* CRC32. There is a crc32_combine() function helpfully exposed from the zlib
library. However, I am a bit worried about collisions. The CRC32 algorithm
was designed basically for detecting rare bit changes AFAIK, but in our case
most of the data content changes each time, and the data sets are quite
large (up to hundreds of megabytes), meaning that the "information
compression rate" of reducing it to 32 bits is also quite large. Am I wrong
to worry about collisions?

* MD5, with re-hashing the MD5 hashes of components to obtain the hash of
the container. If the MD5 hashes are uniformly distributed, the outcome
should be no worse than calculating the whole MD5 in one go, right? Aside
from that, maybe MD5 is overkill for our purposes (we are doing all this to
increase performance in the first place).

Are there any alternatives? What would you suggest?

Thanks,
Paavo
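A minimal sketch of the crc32_combine() approach from the first bullet,
assuming C++ and zlib; the Part structure and container layout are
illustrative, not from the original post. Each sub-part keeps a cached CRC,
and only the cheap combine step is redone when the container hash is needed:

    #include <zlib.h>
    #include <string>
    #include <vector>

    struct Part {
        std::string data;   // raw bytes of the sub-part
        uLong       crc;    // cached CRC-32 of this sub-part
    };

    // CRC-32 of a single sub-part (recompute only when the part changes).
    uLong part_crc(const std::string& data) {
        uLong crc = crc32(0L, Z_NULL, 0);   // standard zlib CRC-32 seed
        return crc32(crc, reinterpret_cast<const Bytef*>(data.data()),
                     static_cast<uInt>(data.size()));
    }

    // Combine the cached per-part CRCs into one container CRC without
    // re-reading the part payloads; crc32_combine() needs the length of
    // the block whose CRC is being appended.
    uLong container_crc(const std::vector<Part>& parts) {
        uLong combined = crc32(0L, Z_NULL, 0);
        for (const Part& p : parts)
            combined = crc32_combine(combined, p.crc,
                                     static_cast<z_off_t>(p.data.size()));
        return combined;
    }

The result equals the CRC-32 of the concatenated payloads, so the collision
behaviour is exactly that of a plain 32-bit CRC over the whole container.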
From: David Eather on 26 Jan 2010 13:56

Paavo Helde wrote:
> I am looking for a decent algorithm for calculating some kind of hash of a
> container, making use of the hashes of its components, i.e. calculating a
> composite hash from the hashes of sub-parts. We are using it just for
> detecting changed data parts, so crypto strength is not actually needed. So
> far I have thought of the following approaches:
>
> * CRC32. There is a crc32_combine() function helpfully exposed from the
> zlib library. However, I am a bit worried about collisions. The CRC32
> algorithm was designed basically for detecting rare bit changes AFAIK, but
> in our case most of the data content changes each time, and the data sets
> are quite large (up to hundreds of megabytes), meaning that the
> "information compression rate" of reducing it to 32 bits is also quite
> large. Am I wrong to worry about collisions?

With a 32-bit CRC you would expect the first collision after roughly 64
thousand (about 2^16) hashed values, by the birthday bound.

> * MD5, with re-hashing the MD5 hashes of components to obtain the hash of
> the container. If the MD5 hashes are uniformly distributed, the outcome
> should be no worse than calculating the whole MD5 in one go, right?

Right.

> Aside from that, maybe MD5 is overkill for our purposes (we are doing all
> this to increase performance in the first place).
>
> Are there any alternatives? What would you suggest?
>
> Thanks,
> Paavo

Use MD4 - it is faster than MD5 and in some cases faster than CRC-32. The
128-bit output is much more collision resistant than the 32 bits of CRC-32.
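A minimal sketch of the suggested "hash of hashes" construction with MD4,
assuming C++ and OpenSSL's one-shot MD4() function (deprecated in OpenSSL
3.x, so treat this purely as an illustration of the composite-hash idea, not
a recommended API):

    #include <openssl/md4.h>
    #include <array>
    #include <cstddef>
    #include <vector>

    using Digest = std::array<unsigned char, MD4_DIGEST_LENGTH>;  // 16 bytes

    // MD4 over an arbitrary byte buffer.
    Digest md4_of(const unsigned char* data, std::size_t len) {
        Digest d{};
        MD4(data, len, d.data());
        return d;
    }

    // Container hash = MD4 over the concatenation of the sub-part digests.
    // Only changed parts need their digest recomputed; the final combine
    // step touches 16 bytes per part instead of the part contents.
    Digest container_digest(const std::vector<Digest>& part_digests) {
        std::vector<unsigned char> buf;
        buf.reserve(part_digests.size() * MD4_DIGEST_LENGTH);
        for (const Digest& d : part_digests)
            buf.insert(buf.end(), d.begin(), d.end());
        return md4_of(buf.data(), buf.size());
    }

As long as the per-part digests behave like uniform 128-bit values, a change
in any part changes the container digest except with negligible probability.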
From: unruh on 26 Jan 2010 14:30

On 2010-01-26, Paavo Helde <myfirstname(a)osa.pri.ee> wrote:
>
> I am looking for a decent algorithm for calculating some kind of hash of a
> container, making use of the hashes of its components, i.e. calculating a
> composite hash from the hashes of sub-parts. We are using it just for
> detecting changed data parts, so crypto strength is not actually needed. So
> far I have thought of the following approaches:
>
> * CRC32. There is a crc32_combine() function helpfully exposed from the
> zlib library. However, I am a bit worried about collisions. The CRC32
> algorithm was designed basically for detecting rare bit changes AFAIK, but
> in our case most of the data content changes each time, and the data sets
> are quite large (up to hundreds of megabytes), meaning that the
> "information compression rate" of reducing it to 32 bits is also quite
> large. Am I wrong to worry about collisions?
>
> * MD5, with re-hashing the MD5 hashes of components to obtain the hash of
> the container. If the MD5 hashes are uniformly distributed, the outcome
> should be no worse than calculating the whole MD5 in one go, right? Aside
> from that, maybe MD5 is overkill for our purposes (we are doing all this to
> increase performance in the first place).

MD4 is apparently faster, but is cryptographically weak (stronger than CRC,
of course). Look at rsync and bittorrent, which do what you want to do; they
have spent lots of time thinking about the issue.

> Are there any alternatives? What would you suggest?
>
> Thanks,
> Paavo
From: Joseph Ashwood on 26 Jan 2010 18:10

"Paavo Helde" <myfirstname(a)osa.pri.ee> wrote in message
news:Xns9D0CCFDC13C1Epaavo256(a)216.196.109.131...
> I am looking for a decent algorithm for calculating some kind of hash of a
> container, making use of the hashes of its components, i.e. calculating a
> composite hash from the hashes of sub-parts. We are using it just for
> detecting changed data parts, so crypto strength is not actually needed.
> So far I have thought of the following approaches:
>
> Are there any alternatives? What would you suggest?

The structure you are looking for is called a Merkle tree; using it will
save you enormous amounts of time. It is also known as library signing, and
the proof behind it is quite solid. As for the speed, don't worry about it:
the Merkle tree will increase speeds significantly for updates. Instead
worry about the fact that someone will eventually attack the system, and
when that happens you'll have to hire someone like me to dig through all the
data, sort out the problems, find solutions, and send you a very large bill
for my time.

So my first recommendation is: if you are targeting only 32-bit systems, use
SHA-256; otherwise use SHA-512. Strongly formatted systems are one of the
most important things; if in doubt, include the data. Assuming you have
multiple files in the system, include the file name and path in the metadata
that gets hashed, include the file length, include update date and time,
include who updated it and who approved the update, include everything.

Building a Merkle tree is actually quite easy. I'll be using the standard
library metaphor; obviously substitute in your situation.

Hash all the pages individually, giving the pages[] table of hashes.
Hash all the pages in each book into the books[] table of hashes.
Hash all the books on each shelf into the shelves[] table of hashes.
Hash all the shelves in each bookcase into the cases[] table of hashes.
Hash all the cases on each aisle into the aisles[] table of hashes.
Hash all the aisles in a room into the rooms[] table of hashes.
Hash all the rooms in a library into the final library hash.
Store all the hashes for later use.

When a page is replaced:
Generate the new pages[] entry by hashing the new page, replacing the old.
Generate the new books[] entry by hashing only the pages in the updated
book, replacing the old.
Generate the new shelves[] entry by hashing only the books on the updated
shelf, replacing the old.
Generate the new cases[] entry by hashing only the shelves in the updated
case, replacing the old.
Generate the new aisles[] entry by hashing only the cases on the updated
aisle, replacing the old.
Generate the new rooms[] entry by hashing only the aisles in the updated
room, replacing the old.
Hash all the rooms in the library into the final library hash.

The important change is that since only a small portion of the hashes have
to be recomputed, instead of hashing GBs of data the system can be designed
to hash only a few KB.

To add data: hash the new page, add it to the list of hashes for the updated
book, and follow the update method from there.

You'll want to actually implement this as a tree structure in memory, and
use recursion in the implementation; it will save you a lot of time and
effort. Store the entire Merkle tree, and just extract what you need
afterward. Walking down the tree, the system will easily be able to
determine exactly what needs to be downloaded.

Joe
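A minimal sketch of the recursive Merkle-tree idea described above, assuming
C++ and OpenSSL's SHA-256 one-shot (matching the SHA-256 recommendation);
the Node layout and helper names are illustrative. Leaves hash their own
data, interior nodes hash the concatenation of their children's digests, and
an update only re-hashes the path from the changed leaf up to the root:

    #include <openssl/sha.h>
    #include <array>
    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    using Digest = std::array<unsigned char, SHA256_DIGEST_LENGTH>;

    struct Node {
        std::string data;                            // leaf payload ("page"); empty for interior nodes
        std::vector<std::unique_ptr<Node>> children; // book/shelf/case/aisle/room levels
        Digest hash{};                               // cached digest of this subtree
    };

    Digest sha256_of(const unsigned char* p, std::size_t n) {
        Digest d{};
        SHA256(p, n, d.data());
        return d;
    }

    // Recompute the digest of one node from its children (or its data, if a leaf).
    void rehash(Node& node) {
        if (node.children.empty()) {                 // leaf: hash the raw page data
            node.hash = sha256_of(
                reinterpret_cast<const unsigned char*>(node.data.data()),
                node.data.size());
            return;
        }
        std::vector<unsigned char> buf;              // interior: hash the child digests
        for (const auto& c : node.children)
            buf.insert(buf.end(), c->hash.begin(), c->hash.end());
        node.hash = sha256_of(buf.data(), buf.size());
    }

    // Full (re)build: hash bottom-up over the whole tree.
    void rehash_all(Node& node) {
        for (auto& c : node.children)
            rehash_all(*c);
        rehash(node);
    }

    // Incremental update: after replacing one leaf's data, re-hash only the
    // nodes on the path from that leaf back up to the root.
    void update_path(const std::vector<Node*>& path_from_leaf_to_root) {
        for (Node* n : path_from_leaf_to_root)
            rehash(*n);
    }

Comparing two such trees is then a matter of walking both from the root and
descending only where the digests differ.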
From: Paavo Helde on 27 Jan 2010 13:40

"Joseph Ashwood" <ashwood(a)msn.com> wrote in
news:vbL7n.20613$mT7.4828(a)newsfe05.iad:

> "Paavo Helde" <myfirstname(a)osa.pri.ee> wrote in message
> news:Xns9D0CCFDC13C1Epaavo256(a)216.196.109.131...
>> I am looking for a decent algorithm for calculating some kind of hash of
>> a container, making use of the hashes of its components, i.e. calculating
>> a composite hash from the hashes of sub-parts. We are using it just for
>> detecting changed data parts, so crypto strength is not actually needed.
>> So far I have thought of the following approaches:
>>
>> Are there any alternatives? What would you suggest?
>
> The structure you are looking for is called a Merkle tree; using it will
> save you enormous amounts of time. It is also known as library signing,

It seems a Merkle tree is a kind of hierarchical structure, which we already
have, so the only missing part is choosing the hash type and how to combine
the hashes. As already said, cryptographic resistance is not required, so we
will probably go with MD4, as suggested by other respondents.

> and the proof behind it is quite solid. As for the speed, don't worry
> about it: the Merkle tree will increase speeds significantly for updates.
> Instead worry about the fact that someone will eventually attack the
> system,

The most a would-be attacker could achieve would be to damage his own data
analysis results, so I would not worry about that too much.

[...]

> Walking down the tree, the system will easily be able to determine
> exactly what needs to be downloaded.

No download is involved here; you are assuming too much.

Paavo