Prev: C++ Concepts
Next: Portable way to write binary data
From: Le Chaud Lapin on 21 Nov 2005 20:28 kanze wrote: > Le Chaud Lapin wrote: > > In any case, again, you have to be fastidious in watching the > > spec in this section. In general, you cannot slap on an > > entire 64-bit chunk onto the message. For example, if the > > message is 440 bits long, you're only allowed to add 1 byte > > for padding: the bit-pattern of a 1 followed by 7 zeros. This > > will bring you to congruency mod 448, at which point you can > > tack on the 64-bit length to get to a full 512 bits. > > His interface forced the user to provide the 32 bit chunks. Not sure if that is the best way to approach this. To view the input data to the hasher as a stream of bytes is far more general than forcing the view as vector of 32 bit chunks, and in most cases eliminates unnecessary copying. > Technically, of course, you might have to pad as little as a > single bit. In practice, you can pad using the same multiple of > bits as the user is required to provide (supposing that is a > power of 2). Of course, for most uses, you'll want to support > at least as small as bytes. This is true, and it also hints at a fact that many of us overlook in certain kinds of programming (embbeded, communications, etc): when you send a byte from one machine to another, there is already a guarantee that the bit order will be preserved for you. Imagine a world where you had to worry about bit-order too. > > Technically, you have to be prepared for crunching a digest > > whose size is greater than (2^32). Here you're only taking > > care of lengths up to 2^32. > > Depends on the application. My own implementation of MD5 is > limited to the SIZE_T_MAX * 8 bits. Perhaps internally, there > are other limits which means that the messages will be small. Hmm...not sure you can call this an implementation of MD5. I would make a concession that it is ok to view the data as a string of bytes instead of a string of bits, which, as you pointed out, is what the spec calls for, for true compliance. But I think tossing out the upper-half of the 64-bit word for the length and calling it an implementation of the algorithm is stretching it. This presumes of course that you are making a generally resuable component, which you intend to polish off and make available for others to use. > > 2. You want the output of the hashing to be an object itself > > so that you can let it construct, assign it, compare it, store > > it in containers, toss it around, etc., let it destruct. > > Maybe. In my case, my MD5 calculator is an accumulator. I > suppose that there is some argument for defining a result type, > but in practice, I've almost always read the results using the > << operator, e.g.: > > file << myText > << std::accumulate( myText.begin(), myText.end(), MD5() ) ; > > Internally, the << operator uses a member function: > template< typename OutIter > > void resultsTo( OutIter dest ) const ; > which I made public, more because there wasn't any reason not > too, and I thought it might be useful to someone. Where is the digest? Certainly it would be nice to write map<Digest,Foo>, for example. This again assumes that you are making a class that someone else would want to use. I can hardly think of a more opportune moment to let the object-oriented features of C++ bear down on the design of a component. > > 4. You want the hash algorithm itself to be fast. > > You want it to be correct first. It isn't, so it's not yet time > to worry about speed. Well, when I do my designs, I like to take everything into account all at once, including performance (macro, not micro). I spend most of my time mulling. Then I fidget to verify propositions, then back off, then walk around in woods some, then eat a snack, then tinker, then only when I think I have a visceral feel for what's going on do I begin to bear down on the implementation. For me the code always comes out cleaner than it would have if I had taken the incremental approach, and takes less time to get a finished product (contratry to what others claim). You're going to have to think about these things at some point anyway, so you might as well mull it over as much as possible before you start typing, because the mind is faster than the hand (though both work better with the help of the other). > > I would define the algorithm itself as a namespace. > > Why? Well, to give you a real example, lets say you had a massive software probject that depended on primitives involved in security, including hashing. One master #include file contained: namespace Security_Set { using MD5::Hasher; using MD5::Digest; using namespace RSA; typedef RC6::Cipher<32,20,16> Symmetric_Cipher; typedef RC6::Key<16> Symmetric_Key; } Now let's say two years later, someone found a flaw in your chosen Hasher! Good grief!! No worries, you could simply modify your namespace: namespace Security_Set { using SHA_256::Hasher; using SHA_256::Digest; using namespace RSA; typedef RC6::Cipher<32,20,16> Symmetric_Cipher; typedef RC6::Key<16> Symmetric_Key; } If you have been using Hasher and Digest objects throughout your system, all you would have to do is recompile, and everything should work, data stored on disk notwithstanding. As a matter of fact, this is precisely what I intend to do, selecting from multiple Hashers, multiple symmetric cipher systems, and multiple asymmetric cipher systems, and multiple nonce sizes (128-bit versus 192-bit versus 256-bit) to test performances, and also verify that it is possible to do a simple recompile of a very large system and have it work with the new cipher primitives without incident. I cannot express in words how convenient this is. With this technique, if someone were to find a flaw in Rijndael 2 days before we shipped our product, we could actually make the change I just mentioned, recompile, and ship using an alternative symmetric cipher. The marketing people would have to update their collateral. > After a quick glance at SHA-1, I suspect that you could use some > sort of common machine for the entire family, including SHA-1 > and MD-5, with a policy class to customize, much as is standard > practice for CRC's. I'd certainly evaluate this route before > implementing my second digest in the family. (As mentionned > above, I've already implemented MD5.) Yikes. This is good example, where the whole resusability hunt is too agressive IMO. Hashers are very well defined. You know exactly what they are, the sizes of their digests, the expectations on input. They are small in terms of the code size. Trying to force them to be common just so you can say, "it took me a week but I finally tweaked these things enough to use common code" is a bit much, don't you think? > This is IMHO the big question with regards to the interface. In > my current implementation, extracting the value would modify the > internal state, so that the accumulator itself could not be > reused. [snipped] This is the fun part of software design - imagining all the situations in which the primitive might be used, and then designing it so that it will feel good for the user once the primitive is wielded in the construction of a system. Note that "feel good" comes from eliminating as much uncertainty surrounding the primitive as possible. Said in a different way, "feel good" comes from a reduction on requisite mental burden. 1. Hash. 2. Hash. 3. Hash. 4. Hash. 5. Hash, but get the digest and reset the hasher. This feels right, and it will work in almost any context, including hashing a sequences of 32-bit words stored in a vector<>. The programmer does not have to worry about inefficiencies from copying. > However, all of the functions which one would normally > use to do this work on a copy, and not the original object, so > the orginal object remains unmodified. This copy isn't free, of > course, but the object was designed to be as cheap to copy as > possible -- all data members are arrays of integral types, so > basically, the copy constructor just does a memcpy. And given > the cost of calculating an MD5 digest anyway, I suspect that it > doesn't matter. Hmm..I still prefer the idea of letting the programmer supply a const void * to the data, its length (unsigned of course), then doing a hash, and asking for the digest if it is time to get the digest, and if not, letting the hasher maintain internal state until such digest is wanted. Also, if you allow explicit contrsuction of the Digest from a std::string, and allow it to yield itself as a std::string, you've pretty much finsihed off the Digest as a class (unless you are doing crypto, in which case allowing the Digest to yield itself as a big integer can be very nice.) -Le Chaud Lapin [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: Jakob Bieling on 21 Nov 2005 20:24 Le Chaud Lapin <unoriginal_username(a)yahoo.com> wrote: > gkoreman(a)gmail.com wrote: >> // Step 3 >> message.push_back(messageLength); > Technically, you have to be prepared for crunching a digest whose size > is greater than (2^32). Here you're only taking care of lengths up to > 2^32. I might be missing something .. but why 2^32? regards -- jb (reply address in rot13, unscramble first) [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: gkoreman on 21 Nov 2005 20:22 I know it is shorter, I was just mentally adding 0's when needed. The digest isn't really going to be printed out. I am not running this on an x86. I am running it on a big endian machine to start out, then once I get it working on big-endian, I will make it work on little-endian. [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: Thomas J. Gritzan on 21 Nov 2005 20:44 kanze schrieb: > Thomas J. Gritzan wrote: >>Works for me on a little-endian machine (only for zero-length messages). > > Only for zero-length messages. Of course, his interface only > allows for message lengths which are a multiple of 32 bits, so > this might be OK. > > And how does the endian-ness make a difference. I implemented > MD5 on a Sparc (big-endian) -- I've since compiled the code on a > Linux based PC (little-endian), and it works perfectly well. Is > there something fundamentally different between MD5 and SHA-1 > which means that SHA-1 is endian dependant? They look very > similar in priciples to me. I changed the interface, adding an length-parameter to the function. Now it works for the text examples on wikipedia as well. The endian-ness makes no difference except for the order of the high-part and the low-part of the 64-bit "length of the message". Since the OP did it as 32-bit, he bypasses this problem. So the code works without change on little & big endian machines. Thomas [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: Le Chaud Lapin on 21 Nov 2005 21:10 kanze wrote: > And how does the endian-ness make a difference. I implemented > MD5 on a Sparc (big-endian) -- I've since compiled the code on a > Linux based PC (little-endian), and it works perfectly well. Is > there something fundamentally different between MD5 and SHA-1 > which means that SHA-1 is endian dependant? They look very > similar in priciples to me. It definitely makes a difference, but first, let me rant: Big-endian is silly, IMO. It was created as a substitute for little-endian to help those who, for whatever reason, find little-endian confusing when it's printed out. Ironically, it doesn't help, because it is precisely these people are most likely to get confused no matter what, so what you end up with is two models, one for people who not confused in the least but have to acknowledge the nuisance that is big-endian, and one for the people who are still confused and upset that big-endian didn't rid them of the confusion: http://groups.google.com/group/comp.protocols.tcp-ip/browse_frm/thread/3f89dd1fb5a3fa66 Getting of my rant box, if the input data is 0x01234567, that's 4 bytes. In a buffer, begining at offset 0, either the 0x01 will be there or the 0x67 will be there. When the programmer goes to form a word from this buffer to be used in the plethora of Boolean operations of the hash algorithm, s/he will have to choose how to extract the bytes from this stream to form the word. Note that implicit in all of this is that when the stream of bytes move from one machine to another, the bit order is preserved for individual bytes, and the byte order is preserved for a stream. So if 0x01 is the byte at offset 0 (big-endian), it will be the byte at offset 0 when it is sent as a sequence of bytes. But if a little-endian receiving machine gets these 4-bytes, and puts in them in increasing memory locations starting at 0, and then extracts the 4 bytes as a word, the word will become 0x76543210. So there are two subtleties here. The first is that bit-order is guranteed no matter what from machine to machine. We take that for granted, because it is. The second is that no one ever said that we were allowed to treat the input stream as a sequence of words. The most that we are allowed is bytes. So during both extraction and implantation of words, we have to be careful about byte-sequences. -Le Chaud Lapin- [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
First
|
Prev
|
Next
|
Last
Pages: 1 2 3 4 5 Prev: C++ Concepts Next: Portable way to write binary data |