From: Le Chaud Lapin on 22 Nov 2005 06:25

Jakob Bieling wrote:
> Le Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:
> > Technically, you have to be prepared for crunching a digest whose
> > size is greater than 2^32. Here you're only taking care of lengths
> > up to 2^32.
> I might be missing something .. but why 2^32?

Well, SHA-1 calls for taking a message of N bits and padding it with a 1
bit followed by however many zero bits are necessary to reach congruence
to 448 mod 512, which is 512 - 64 bits, where the 64 bits are reserved
to store the original length of the message.

You can see that the OP is padding the message with a 1 bit, followed by
however many zero bits are necessary to get congruence to 512 - 32 = 480
bits (he wrote 490 but meant 480). This is why the result will come out
right for a zero-length string, but not necessarily right for other
strings.

OTOH, he mentions in a follow-up post that he knew this and was just
trying to get the thing working first.

-Le Chaud Lapin-

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
From: Carl Barron on 22 Nov 2005 06:29

In article <1132616256.047169.129400(a)g44g2000cwa.googlegroups.com>,
Le Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:
> Big-endian is silly, IMO. It was created as a substitute for
> little-endian to help those who, for whatever reason, find
> little-endian confusing when it's printed out.

I am not sure there is any factual basis for this, only that you are
apparently used to assuming the world should be little endian. I think
historically big endian was around long before little endian. I don't
recall any endian conversations before the Z80/6502 era.
From: gkoreman on 22 Nov 2005 06:30

> But if a little-endian receiving machine gets these 4 bytes, and puts
> them in increasing memory locations starting at 0, and then extracts
> the 4 bytes as a word, the word will become 0x76543210.

Close. The word will actually become 0x67452301. I am sure that is what
you meant, though.
From: kanze on 22 Nov 2005 14:39

Le Chaud Lapin wrote:
> kanze wrote:
> > Le Chaud Lapin wrote:
> > > In any case, again, you have to be fastidious in watching the
> > > spec in this section. In general, you cannot slap an entire
> > > 64-bit chunk onto the message. For example, if the message is
> > > 440 bits long, you're only allowed to add 1 byte for padding:
> > > the bit pattern of a 1 followed by 7 zeros. This will bring you
> > > to congruency mod 448, at which point you can tack on the
> > > 64-bit length to get to a full 512 bits.
> > His interface forced the user to provide the 32 bit chunks.
> Not sure if that is the best way to approach this.

I'm pretty sure it isn't:-).

> To view the input data to the hasher as a stream of bytes is far
> more general than forcing the view as a vector of 32 bit chunks, and
> in most cases eliminates unnecessary copying.

I'm not sure about the "unnecessary copying" bit. At some point, you do
have to put the data in blocks of 32 bit chunks to process it, and that
involves some copying. The only difference is whether the user has to
handle the problem, or whether it is handled in the Digest code.

Other than that, technically, the basis of the Digest is a stream of
bits, and you should provide an interface which allows the user to pass
in only a certain number of bits, at least for the last byte.
Practically, I've never found any use for this, and haven't bothered to
implement it. My MD5 code restricts the user to a multiple of 8 bytes.
It's a restriction, but in my experience, it's not a bothersome one.

> > Technically, of course, you might have to pad as little as a
> > single bit. In practice, you can pad using the same multiple of
> > bits as the user is required to provide (supposing that is a power
> > of 2). Of course, for most uses, you'll want to support at least
> > as small as bytes.
> This is true, and it also hints at a fact that many of us overlook
> in certain kinds of programming (embedded, communications, etc.):
> when you send a byte from one machine to another, there is already a
> guarantee that the bit order will be preserved for you. Imagine a
> world where you had to worry about bit order too.

Been there. Done that. There is universal agreement that bit order is
little endian. Except in certain IBM documentation, at least.

> > > Technically, you have to be prepared for crunching a digest
> > > whose size is greater than 2^32. Here you're only taking care
> > > of lengths up to 2^32.
> > Depends on the application. My own implementation of MD5 is
> > limited to SIZE_T_MAX * 8 bits. Perhaps internally, there are
> > other limits which mean that the messages will be small.
> Hmm... not sure you can call this an implementation of MD5. I would
> make a concession that it is ok to view the data as a string of
> bytes instead of a string of bits, which, as you pointed out, is
> what the spec calls for, for true compliance. But I think tossing
> out the upper half of the 64-bit word for the length and calling it
> an implementation of the algorithm is stretching it. This presumes
> of course that you are making a generally reusable component, which
> you intend to polish off and make available for others to use.

Whoa. I don't toss out anything. I just require that the data set over
which I calculate the Digest is not larger than 2^32 bytes. In
practice, it has to be smaller than that if it is to fit into memory.
(On the other hand, when I implemented it, it was for a particular
field in a message of the Radius protocol. 32 bytes maximum, if I
remember correctly.)

> > > 2. You want the output of the hashing to be an object itself so
> > > that you can let it construct, assign it, compare it, store it
> > > in containers, toss it around, etc., let it destruct.
> > Maybe.
> > In my case, my MD5 calculator is an accumulator. I suppose that
> > there is some argument for defining a result type, but in
> > practice, I've almost always read the results using the <<
> > operator, e.g.:
> >     file << myText
> >          << std::accumulate( myText.begin(), myText.end(), MD5() ) ;
> > Internally, the << operator uses a member function:
> >     template< typename OutIter >
> >     void resultsTo( OutIter dest ) const ;
> > which I made public, more because there wasn't any reason not to,
> > and I thought it might be useful to someone.
> Where is the digest? Certainly it would be nice to write
> map<Digest, Foo>, for example.

You can extract it into a string (as hex) if you want. But you're right
-- some sort of Digest type would be cleaner. I just didn't need it.

> This again assumes that you are making a class that someone else
> would want to use. I can hardly think of a more opportune moment to
> let the object-oriented features of C++ bear down on the design of a
> component.

I based my own class more on the STL -- it's an accumulator for
std::accumulate, and it supports operator << for extracting the current
value. That corresponded to my needs at the time. Your example with map
is a very good one, however, and as soon as I get time...

> > > 4. You want the hash algorithm itself to be fast.
> > You want it to be correct first. It isn't, so it's not yet time
> > to worry about speed.
> Well, when I do my designs, I like to take everything into account
> all at once, including performance (macro, not micro).

If by macro, you mean big-O complexity, perhaps. Otherwise... I also
keep context in mind. The poster is having problems getting code to
work. No point in worrying about speed yet.

> I spend most of my time mulling.
> Then I fidget to verify propositions, then back off, then walk
> around in the woods some, then eat a snack, then tinker, then only
> when I think I have a visceral feel for what's going on do I begin
> to bear down on the implementation. For me the code always comes out
> cleaner than it would have if I had taken the incremental approach,
> and takes less time to get to a finished product (contrary to what
> others claim).

You won't find me arguing against design first. You should be able to
do incremental development, for the all too frequent times when you
have to, because the customer doesn't know what he wants until he
starts seeing things. But even then, I'll often knock out a throwaway
prototype, using what I learn from it in order to make a clean design
for the final product.

> You're going to have to think about these things at some point
> anyway, so you might as well mull it over as much as possible before
> you start typing, because the mind is faster than the hand (though
> both work better with the help of the other).

Well, I don't like to go looking for problems until they occur. Some
things are obvious -- if I have to process millions of data elements,
an O(n^2) algorithm isn't going to cut it. On the other hand, I don't
waste my time worrying about constant factors until the profiler says
I have to.

> > > I would define the algorithm itself as a namespace.
> > Why?
> Well, to give you a real example, let's say you had a massive
> software project that depended on primitives involved in security,
> including hashing. One master #include file contained:
>     namespace Security_Set
>     {
>         using MD5::Hasher;
>         using MD5::Digest;
>         using namespace RSA;
>         typedef RC6::Cipher<32,20,16> Symmetric_Cipher;
>         typedef RC6::Key<16> Symmetric_Key;
>     }
> Now let's say two years later, someone found a flaw in your chosen
> Hasher! Good grief!!
> No worries, you could simply modify your namespace:
>     namespace Security_Set
>     {
>         using SHA_256::Hasher;
>         using SHA_256::Digest;
>         using namespace RSA;
>         typedef RC6::Cipher<32,20,16> Symmetric_Cipher;
>         typedef RC6::Key<16> Symmetric_Key;
>     }
> If you have been using Hasher and Digest objects throughout your
> system, all you would have to do is recompile, and everything should
> work, data stored on disk notwithstanding.

The same thing is true if you use typedefs. And if you can implement
the different Digest algorithms using a Policy template parameter, then
you're going to be using templates. No one wants to write
Gabi::Hasher< MD5 > all over the place.

[...]

> > After a quick glance at SHA-1, I suspect that you could use some
> > sort of common machine for the entire family, including SHA-1 and
> > MD-5, with a policy class to customize, much as is standard
> > practice for CRCs. I'd certainly evaluate this route before
> > implementing my second digest in the family. (As mentioned above,
> > I've already implemented MD5.)
> Yikes. This is a good example where the whole reusability hunt is
> too aggressive IMO.

I can sort of understand your reaction. There's no point in making code
completely illegible, just so you can gain two lines when you use it
twice. On the other hand, CRC lent itself quite well to this sort of
thing -- the policy class is relatively simple, and only contains a few
constants and tables; all of the actual code is in the template itself.

Before this thread, I'd not looked at any digest but MD5, because that
was the only one I needed professionally until now. A quick glance at
SHA gave me the impression that it was pretty much the same, modulo a
few parameters, and maybe a primitive function or so. So the
possibility is certainly in my head. Especially as there is a good deal
of convenience boilerplate in my MD5 class, which would definitely be
the same in a SHA class.

> Hashers are very well defined.
> You know exactly what they are, the sizes of their digests, the
> expectations on input. They are small in terms of the code size.
> Trying to force them to be common just so you can say, "it took me a
> week but I finally tweaked these things enough to use common code"
> is a bit much, don't you think?

The final decision will be based simply on: will it cost me less time
to make my current implementation generic than to write a new one? My
first quick glance at SHA gave me the impression that most of my MD5
code would be applicable. This is true precisely because, as you say,
the hashers are small in terms of code size; a significant part of my
implementation is involved in providing a comfortable interface (e.g.
supporting insertion of char, unsigned char, and int for the return
values of functions like istream::get(), or several different formats
of output).

> > This is IMHO the big question with regards to the interface. In
> > my current implementation, extracting the value would modify the
> > internal state, so that the accumulator itself could not be
> > reused.
[snipped]
> This is the fun part of software design - imagining all the
> situations in which the primitive might be used, and then designing
> it so that it will feel good for the user once the primitive is
> wielded in the construction of a system. Note that "feel good" comes
> from eliminating as much uncertainty surrounding the primitive as
> possible. Said in a different way, "feel good" comes from a
> reduction in requisite mental burden.

Exactly. Generally speaking, I doubt that it is that useful to be able
to continue hashing after having extracted the digest. But you never
know, and it was simple enough to avoid the problem.

> 1. Hash. 2. Hash. 3. Hash. 4. Hash. 5. Hash, but get the digest and
> reset the hasher.
> This feels right, and it will work in almost any context, including
> hashing sequences of 32-bit words stored in a vector<>.
> The programmer does not have to worry about inefficiencies from
> copying.

I've separated getting the digest and resetting. A user can get the
digest, and continue hashing. I don't know if this is practically
useful, but why not?

> > However, all of the functions which one would normally use to do
> > this work on a copy, and not the original object, so the original
> > object remains unmodified. This copy isn't free, of course, but
> > the object was designed to be as cheap to copy as possible -- all
> > data members are arrays of integral types, so basically, the copy
> > constructor just does a memcpy. And given the cost of calculating
> > an MD5 digest anyway, I suspect that it doesn't matter.
> Hmm.. I still prefer the idea of letting the programmer supply a
> const void * to the data, its length (unsigned of course), then
> doing a hash, and asking for the digest if it is time to get the
> digest, and if not, letting the hasher maintain internal state until
> such digest is wanted.

That's more or less what I do. There are functions to add data to the
hash, including a global operator+, so that the object can be used with
std::accumulate, and functions which extract the digest. The functions
which extract the digest are all const, and do not modify the state in
any way.

> Also, if you allow explicit construction of the Digest from a
> std::string, and allow it to yield itself as a std::string, you've
> pretty much finished off the Digest as a class (unless you are doing
> crypto, in which case allowing the Digest to yield itself as a big
> integer can be very nice.)

I like your idea of a Digest type, especially with the explicit
constructor from string and the ability to yield itself as a string. A
separate digest is missing in my current implementation. On the other
hand, I do have the possibility of extracting the digest as raw binary
-- all the user has to do is provide an output iterator to which I can
write a uint8_t.
--
James Kanze                                    GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
From: kanze on 23 Nov 2005 08:44

Carl Barron wrote:
> In article <1132616256.047169.129400(a)g44g2000cwa.googlegroups.com>,
> Le Chaud Lapin <unoriginal_username(a)yahoo.com> wrote:
> > Big-endian is silly, IMO. It was created as a substitute for
> > little-endian to help those who, for whatever reason, find
> > little-endian confusing when it's printed out.
> I am not so sure about any factuality in this, only that apparently
> you are used to assuming the world should be little endian. I think
> historically big endian was around long before little endian.

Historically, machines were word addressed, and there wasn't any
Internet, so there wasn't any endianness. The IBM 360 was big-endian,
even to the numbering of the bits in its words (so the low order bit of
a word was bit 32, whereas the low order bit of a byte was 8. And yes,
they were also numbered from 1 to n, rather than from 0 to n-1.) The
PDP-11 was little-endian.

It's curious that the network should have standardized on big-endian
byte order. Bit order is always little endian, and most of the
protocols were originally developed on VAXen, which are little endian.

> I don't recall any endian conversations before the Z80/6502 era.

That's because machines didn't communicate with machines of other types
before then.

--
James Kanze                                    GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34