From: Holger Sebert on 21 Nov 2005 04:48

Hi all,

I was shocked when I read the thread "For binary files use only read() and write()??" above, in which it was stated that using read()/write() for binary data is unportable and may lead to undefined behaviour (!!).

I always thought myself to be on the safe side by doing things the following way:

- Use std::ofstream/std::ifstream together with read()/write()
- Only use types of standardized size, i.e. float, double, long, ... (they _are_ standardized, aren't they?? I'm slowly becoming unsure of almost everything concerning portable C++ *sigh*)
- Store information about endianness elsewhere and, when reading binary data, flip the bytes if necessary.

Where are the pitfalls in following this procedure? How should I do binary I/O instead to achieve portability?

Note: Unfortunately I cannot use the portable Boost libraries ... (because they don't compile on one of my target architectures, what a funny world)

Many thanks in advance,
Holger

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
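The byte-flipping idea in the last bullet can be sketched as below. This is a minimal illustration, not code from the thread; the helper names `is_little_endian` and `flip4` are hypothetical:

```cpp
#include <cstring>
#include <utility>

// Hypothetical helper: detect whether this host stores the
// low-order byte of an integer first (little-endian).
bool is_little_endian() {
    unsigned int probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);   // inspect the first byte in memory
    return first == 1;
}

// Reverse the byte order of a 4-byte datum in place. A reader would
// call this only when the endianness recorded in the file differs
// from the host's.
void flip4(unsigned char* p) {
    std::swap(p[0], p[3]);
    std::swap(p[1], p[2]);
}
```

As the replies below point out, this covers only the common byte orders, not the more perverse ones.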
From: Ulrich Eckhardt on 21 Nov 2005 07:41

Holger Sebert wrote:
> I was shocked when I read the thread "For binary files use only read() and
> write()??" above, in which it was stated that using read()/write() for
> binary data is unportable and may lead to undefined behaviour (!!).
>
> I always thought myself to be on the safe side by doing things the
> following way:
>
> - Use std::ofstream/std::ifstream together with read()/write()

You need the appropriate codecvt facet (from std::locale::classic()) and the ios_base::binary flag, too.

> - Only use types of standardized size, i.e. float, double, long, ...
>   (they _are_ standardized, aren't they?? I'm slowly becoming unsure of
>   almost everything concerning portable C++ *sigh*)

No. Neither their size nor their layout is standardized. There are a few minimum requirements, but that's all. Of course, there is also the invalid assumption that CHAR_BIT == 8; while I have seen such a beast (a DSP from Texas Instruments), I haven't seen the need to write portable software for it.

Uli
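The two ingredients Uli names can be combined as in the sketch below. The helper name and the filename are illustrative only; the point is imbuing the classic locale before opening, and passing ios_base::binary:

```cpp
#include <fstream>
#include <locale>

// Prepare an ofstream for raw binary output: imbue the classic ("C")
// locale so the codecvt facet performs no character translation, and
// open with ios_base::binary to suppress newline conversion.
// Imbue BEFORE open, since imbuing an open file stream may be ignored.
void open_for_binary_write(std::ofstream& out, const char* name) {
    out.imbue(std::locale::classic());
    out.open(name, std::ios_base::out | std::ios_base::binary);
}
```

After this, `out.write(reinterpret_cast<const char*>(&d), sizeof d)` emits the object's bytes untranslated; portability of those bytes across platforms is a separate question, as the rest of the thread discusses.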
From: Simon Bone on 21 Nov 2005 20:42

On Mon, 21 Nov 2005 04:48:03 -0500, Holger Sebert wrote:
> Hi all,
>
> I was shocked when I read the thread "For binary files use only read() and
> write()??" above, in which it was stated that using read()/write() for
> binary data is unportable and may lead to undefined behaviour (!!).
>
> I always thought myself to be on the safe side by doing things the
> following way:
>
> - Use std::ofstream/std::ifstream together with read()/write()

The stream classes do formatting. You would use the streambuf classes if you don't need that.

> - Only use types of standardized size, i.e. float, double, long, ...
>   (they _are_ standardized, aren't they?? I'm slowly becoming unsure of
>   almost everything concerning portable C++ *sigh*)

C++ standardizes minimum sizes for the fundamental types. Implementations are always free to use larger types if they think it makes sense for their customers. For example, there is currently some variation in whether long is 32 bits (the minimum allowed) or 64 bits (the widest native integral type on many common processors).

In addition to this, there is some variation allowed in the format of the types. E.g. integral types can be two's complement, ones' complement or signed magnitude. You certainly do not want bit-for-bit copying of one of these to another, since that would change the value and might even lead to a trap representation. On the bright side, two's complement is so common you can probably rely on it.

> - Store information about endianness elsewhere and, when reading binary
>   data, flip the bytes if necessary.

Bear in mind that there are some perverse choices possible. A 4-byte datum could be written 1234 or 4321 or 2143... And what about an 8-byte datum?

> Where are the pitfalls in following this procedure?

You might cover enough for all the platforms you develop and test on, and then find yourself asked to support a platform where all your assumptions break down. How likely that is depends on your application. If or when it happens, you can possibly write a special program to convert the data files you have already created to the new platform's expectations. This is often hard, and with a legacy application where the original source has become convoluted through long, haphazard maintenance (or just been lost), it is darn-near impossible. Most of us curse applications that put us through this, so consider whether it is likely for your applications.

> How should I do binary i/o instead to achieve portability?

At the least, use typedef names for the types you write/read, such as int8_t, int32_t etc. from the C99 <stdint.h>. This encapsulates your assumptions about the sizes of the types.

Your approach of including information about endianness in the file is OK, but usually you can define a fixed format for the file. The time spent waiting for I/O to complete is likely to dwarf any time spent marshalling the data to or from this format. If you are doing that limited formatting anyway, you might consider going one step further and ditching binary I/O altogether. The advantage of a file format that can be used on any platform is a big one.

> Note: Unfortunately I cannot use the portable Boost libraries ...
> (because they don't compile on one of my target architectures, what a
> funny world)

There are many others out there. The Boost library is worth looking at to see how this can be done well. But also look at the serialization section in the FAQ at http://www.parashift.com/c++-faq-lite/ for more ideas.

HTH

Simon Bone
From: Le Chaud Lapin on 21 Nov 2005 21:09

Holger Sebert wrote:
> Where are the pitfalls in following this procedure?
>
> How should I do binary i/o instead to achieve portability?

Your views seem good to me. I implemented a serialization package (which turned out to be oddly similar to the one in Boost) that basically defined Source and Target repositories for serializing the 13 scalar types in C++ and the 13 vector types. Source and Target have virtual functions that can be overridden by any derived I/O class. I use this model extensively for my inter-process distributed communication.

With regard to data format, you're right. It's better to follow the receiver-makes-right rule, because in the vast majority of distributed data sharing, the source and target architectures are identical (PC-to-PC, SPARC-to-SPARC, etc.). For cases where they are not, I include at the beginning of the transmission stream an object that completely characterizes the format of the fundamental C++ types on the source machine, so that any target machine can do a conversion if necessary. One would be surprised at how compact this object can be made for the 13 fundamental C++ scalar types. To do the same for files, I would simply put this descriptor object at the beginning of the file, but I am not doing that yet.

Finally, since any aggregate can be recursively and ultimately decomposed into scalar objects, it is trivial to serialize complex types.

Caveats, which you are certainly aware of:

1. Polymorphic objects are intractable.
2. If the structure of an object changes, you're in big trouble with all that old-format data everywhere. Boost gets around this with embedded versioning. I decided not to take this route, as I felt it would be pushing the limit on what makes one type distinct from another. It also raises the standard for defining nice, clean data types. I hear a little voice in my head as I write the serialization code... "You sure you got the structure of this class right? Huh..huh...huh? You'll suffer if you didn't."

-Le Chaud Lapin-
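The Source/Target design described above might look roughly like the sketch below. All names here are hypothetical reconstructions, not the poster's actual code; they illustrate the two ideas: virtual hooks overridden by derived I/O classes, and aggregates decomposing recursively into scalars:

```cpp
#include <cstddef>
#include <vector>

// Abstract Target: receives serialized bytes. Concrete back-ends
// (file, socket, memory buffer, ...) override the virtual hook.
struct Target {
    virtual ~Target() {}
    virtual void put_bytes(const void* p, std::size_t n) = 0;
};

// One possible back-end: accumulate into a memory buffer.
struct MemoryTarget : Target {
    std::vector<unsigned char> buf;
    void put_bytes(const void* p, std::size_t n) override {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        buf.insert(buf.end(), b, b + n);
    }
};

// Scalars serialize directly; aggregates decompose into scalars.
struct Point { double x, y; };

void serialize(Target& t, double d)       { t.put_bytes(&d, sizeof d); }
void serialize(Target& t, const Point& p) { serialize(t, p.x); serialize(t, p.y); }
```

Note that this sketch writes native-format bytes, so it relies on the receiver-makes-right rule (plus a format descriptor) exactly as the post describes.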
From: kanze on 22 Nov 2005 11:34
Simon Bone wrote:
> On Mon, 21 Nov 2005 04:48:03 -0500, Holger Sebert wrote:
>> I was shocked when I read the thread "For binary files use only
>> read() and write()??" above, in which it was stated that using
>> read()/write() for binary data is unportable and may lead to
>> undefined behaviour (!!).
>> I always thought myself to be on the safe side by doing things the
>> following way:
>> - Use std::ofstream/std::ifstream together with read()/write()
>
> The stream classes do formatting. You would use the streambuf classes
> if you don't need that.

basic_ios also does error handling. What there is of it, anyway. Use the streambuf if you don't need that. The streambuf does character code translation. Don't use streambuf if you don't want that. In fact, it's a trade-off, which has to be evaluated each time.

>> - Only use types of standardized size, i.e. float, double, long, ...
>>   (they _are_ standardized, aren't they?? I'm slowly becoming unsure
>>   of almost everything concerning portable C++ *sigh*)
>
> C++ standardizes minimum sizes for fundamental types. Implementations
> are always free to use larger types if they think it makes sense for
> their customers. For example, there is currently some variation in
> whether long is 32 bits (the minimum allowed) or 64 bits (the widest
> native integral type on many common processors).

There are also machines with 32-bit chars, and at least one with 9-bit chars and 36-bit ones'-complement ints. Not everybody has to deal with them, of course.

> In addition to this, there is some variation allowed in the format of
> the types. E.g. integral types can be two's complement, ones'
> complement or signed magnitude. You certainly do not want bit-for-bit
> copying of one of these to another, since that would change the value
> and might even lead to a trap representation. On the bright side,
> two's complement is so common you can probably rely on it.

Probably. There's always the Unisys 2200s, but that's a pretty small market. Floating point is trickier, since the mainframe IBMs also have a different format (and I've been told that IEEE isn't always compatible between vendors, at least where NaNs are concerned).

>> - Store information about endianness elsewhere and, when reading
>>   binary data, flip the bytes if necessary.
>
> Bear in mind that there are some perverse choices possible. A 4-byte
> datum could be written 1234 or 4321 or 2143... And what about an
> 8-byte datum?

I've actually used systems where longs were 3412. The processor was Intel and the compiler Microsoft, so I don't think we can speak of obscure niche players, either.

>> Where are the pitfalls in following this procedure?
>
> You might cover enough for all the platforms you develop and test on,
> and then find yourself asked to support a platform where all your
> assumptions break down. How likely that is depends on your
> application. If or when it happens, you can possibly write a special
> program to convert the data files you have already created to the new
> platform's expectations. This is often hard, and with a legacy
> application where the original source has become convoluted through
> long, haphazard maintenance (or just been lost), it is darn-near
> impossible. Most of us curse applications that put us through this,
> so consider whether it is likely for your applications.

The problem isn't so much writing the code to read the format, once you know it. The problem is finding out what the format was to begin with. Especially if the data written contained structs -- who knows where the original compiler inserted padding?

>> How should I do binary i/o instead to achieve portability?
>
> At the least, use typedef names for the types you write/read, such as
> int8_t, int32_t etc. from the C99 <stdint.h>. This encapsulates your
> assumptions about the sizes of the types.
>
> Your approach of including information about endianness in the file
> is OK, but usually you can define a fixed format for the file.

I'd say that you have to do it anyway. You have to document the exact format on disk; otherwise, sooner or later, it will be unreadable. Given that, you might as well document endianness, and stick to it. (And it is easy to write portably to a given endianness.)

> The time spent waiting for IO to complete is likely to dwarf any time
> spent marshalling the data to or from this format. If you are doing
> that limited formatting anyway, you might consider going one step
> further and ditching binary IO altogether. The advantage of a file
> format that can be used on any platform is a big one.

In theory at least, any file format can be used on any platform. I'll admit that I've never tested the extreme cases -- writing a file on a machine with 9-bit chars, then trying to read it on one with 8-bit chars, for example. But I regularly read and write binary files which are shared between Sparcs (in both 32-bit and 64-bit modes) and PCs under Linux and Windows, using the exact same code on every platform (no conditional byte swapping).

Note that while globally I agree with your recommendation for using text whenever possible (it sure makes debugging easier), it's worth pointing out that you need to define a few details of the format there as well -- Unix and Windows typically expect different line separators, and mainframe IBMs still use EBCDIC.

>> Note: Unfortunately I cannot use the portable Boost libraries ...
>> (because they don't compile on one of my target architectures, what
>> a funny world)

Join the club :-(.

> There are many others out there. The Boost library is worth looking
> at to see how this can be done well.

Sort of. The Boost libraries have different goals than normal production code, and I would certainly never introduce so much genericity in something that I knew would only be used for a short time in one project.

> But also look at the serialization section in the FAQ at
> http://www.parashift.com/c++-faq-lite/ for more ideas.

--
James Kanze                                      GABI Software
Conseils en informatique orientée objet /
                     Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34