Parsing binary wfstream file [C++]


From: matt on 31 Mar 2010 23:16

Hi all,

I was working on parsing a file created by Fortran which is fixed
format. It is a wide stream binary file.
I know that there are 6 integers which are made up of 4 characters
each, followed by 2 doubles of 8 characters each.

I am wondering what is the best way to convert the raw characters to
integer / double values. I believe it is possible via a stream facet
and/or codecvt, but not sure.

Below is my code, with a hand-coded function to convert from 4 wchars
to int. Certainly there must be a more elegant way. Any help would
be appreciated.

I'm using gcc 4.4 on a 64-bit x86 Linux machine.

Thanks,
Matt.

[code]

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <cmath>

using namespace std;

//Define the size for an int and a double
const int SIZE_INT(4);
const int SIZE_DBL(8);

//Function to read an int from an input stream and convert it to an int type
int readInt(wistream& in)
{
    int returnArg(0);
    for(int i=0; i<(SIZE_INT); ++i) {
        wchar_t c;
        in.get(c);

        //Convert this char to an int and accumulate
        stringstream ss;
        ss << c;
        int j;
        ss >> j;
        returnArg += j * pow(128, i);
    }

    return returnArg;
}

/////

int main()
{
    //Open the input file stream in binary mode. NOTE: wide character stream.
    wifstream in("File.in", ios::binary);
    if(!in) {
        cout << "Opening of input file failed. Exiting.";
        return EXIT_FAILURE;
    }

    //Read 6 ints and print them to standard output
    for(int i=0; i<6; ++i) {
        int theInt = readInt(in);
        cout << theInt << " ";
    }

    //Do the same for doubles .....

    cout << endl;
    return EXIT_SUCCESS;
}

[/code]

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Mickey on 1 Apr 2010 20:44

{ edits: quoted banner removed. please keep readers in mind when you quote. -mod }

This looks like a neat enough solution to me, if it is working for you.

Another way could be to define an appropriate structure and read the data
into it directly. But don't expect portability that easily. If you are
reading the file with the same platform, processor and compiler conventions
that wrote it, this is all right; otherwise things like endianness and
integer/double representation come into the picture. Personally I would
have tried the structure approach first.
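
A minimal sketch of the structure approach, assuming the layout from the
question (six 4-byte integers followed by two 8-byte doubles) and a file
written with the same byte order and floating-point format as the reading
machine. The names Record, readRecord and demoReadRecord are made up for
illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical record layout: six 4-byte integers, then two 8-byte doubles.
struct Record {
    std::int32_t ints[6];
    double       reals[2];
};

// Fail at compile time if the compiler pads the struct or the type sizes
// differ from what the file layout assumes.
static_assert(sizeof(Record) == 40, "unexpected padding or type sizes");
static_assert(std::is_trivially_copyable<Record>::value,
              "Record must be safe to fill byte by byte");

// Reads one record straight into the struct. Returns false on a short read.
// Assumption: the file uses the same endianness and double format as this
// machine; nothing is converted.
bool readRecord(std::istream& in, Record& rec)
{
    return static_cast<bool>(
        in.read(reinterpret_cast<char*>(&rec), sizeof rec));
}

// Usage illustration with in-memory bytes instead of a real file.
void demoReadRecord()
{
    const Record src = {{1, -2, 3, 4, 5, 6}, {1.5, -2.25}};
    std::istringstream in(
        std::string(reinterpret_cast<const char*>(&src), sizeof src));

    Record dst = Record();
    const bool ok = readRecord(in, dst);
    assert(ok && dst.ints[1] == -2 && dst.reals[1] == -2.25);
    (void)ok; // avoids an unused-variable warning when NDEBUG is defined
}
```

With a real file, the stream would be a std::ifstream opened with
std::ios_base::binary. The static_assert lines are there so that a padding
or size mismatch shows up when compiling rather than as garbage values.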

Regards,
Jyoti


From: Ulrich Eckhardt on 1 Apr 2010 20:46

matt wrote:
> I was working on parsing a file created by Fortran which is fixed
> format. It is a wide stream binary file.

"wide stream" is not a file format. wchar_t is only an internal (in memory)
representation, the external (on disk) representation on any machine is
still bytes.

Now, what about C++ streams? What these do is convert the external
representation in bytes (char) to the internal representation (char or
wchar_t). Note that even the byte-to-char mapping can be a real, non-trivial
conversion, and that a char-to-wchar_t mapping can simply map one external
byte to one internal wchar_t. If you want to read a file format, you must
configure the stream accordingly and also know the external encoding (e.g.
UTF-8 or one of the codepages).

> I know that there are 6 integers which are made up of 4 characters
> each, followed by 2 doubles of 8 characters each.
>
> I am wondering what is the best way to convert the raw characters to
> integer / double values.

Firstly, this is not a textual file format. C++ streams are primarily tools
for text files, not so much for packed binary formats.

> I believe it is possible via a stream facet and/or codecvt, but not sure.

The codecvt facets are exactly what govern the conversion between external
bytes and the internal character type. However, you don't even have a text
file here, so this isn't really useful. What you want is to retrieve single
bytes and assemble your values from them. For that, there are three things
to do:
1. Use a char-stream. This allows you to retrieve bytes (chars) directly.
2. Turn off end-of-line conversion with ios_base::binary. I see below that
you do that already.
3. Turn off any conversion between external bytes and internal chars, you
want the raw values. For that, you can use the classic or C locale.

The code for that is then

std::ifstream in(filename, std::ios_base::binary);
in.imbue(std::locale::classic());

> int readInt(wistream& in)
> {
> int returnArg(0);
> for(int i=0; i<(SIZE_INT); ++i) {
> wchar_t c;
> in.get(c);
>
> //Convert this char to an int and accumulate
> stringstream ss;
> ss << c;
> int j;
> ss >> j;
> returnArg += j * pow(128, i);
> }
>
> return returnArg;
> }

Several notes here:
1. 'in.get(c)' tells you if it succeeded, you should test that.
2. Writing a wchar_t to a char stream treats the wchar_t as an integer, so
it is written in its textual integer representation, which you then read
back into an integer. This is not wrong (though even there the error
checking is missing) but way too complicated. All character types are in
fact integers, so you can use the value directly as it is.
3. pow(128,i) is a floating-point function; to assemble integers you can use
shift operations instead. I'm also wondering about the 128: I would have
thought that you would need 256 here.
4. I don't see you ever getting a negative value out of this. You will have
to test this and adapt it accordingly. Make sure that you have a few test
files and that you also learn about the "two's complement" representation
for integers in memory.
5. All you do here could be achieved using char streams, too. The only
caveat is that plain char might be signed or unsigned, this is
implementation-defined. You should therefore cast the char to an unsigned
char, which then gives you more precise control over what you are doing.
6. SIZE_INT, no need to put that in brackets. Further, I wouldn't use
ALL_UPPERCASE, leave that exclusively for macros.
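
The points above can be put together roughly as follows. This is only a
sketch: it assumes the integers on disk are 32-bit, little-endian, two's
complement values (plausible for a file written on x86, but not confirmed in
the thread), and the names readInt32LE and demoReadInt are invented here.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Reads a 32-bit little-endian two's complement integer from a narrow
// (char) stream. Returns false if four bytes could not be read.
bool readInt32LE(std::istream& in, std::int32_t& out)
{
    char raw[4];
    if (!in.read(raw, sizeof raw))
        return false;

    std::uint32_t bits = 0;
    for (int i = 0; i < 4; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const std::uint32_t byte = static_cast<unsigned char>(raw[i]);
        bits |= byte << (8 * i); // byte i carries weight 256^i
    }

    // Map the bit pattern to a signed value without relying on
    // implementation-defined unsigned-to-signed conversion.
    out = (bits <= 0x7FFFFFFFu)
        ? static_cast<std::int32_t>(bits)
        : -static_cast<std::int32_t>(~bits) - 1;
    return true;
}

// Usage illustration with in-memory bytes instead of a real file.
void demoReadInt()
{
    const unsigned char bytes[4] = {0xFE, 0xFF, 0xFF, 0xFF};
    std::istringstream in(
        std::string(reinterpret_cast<const char*>(bytes), sizeof bytes));

    std::int32_t value = 0;
    const bool ok = readInt32LE(in, value);
    assert(ok && value == -2); // FE FF FF FF is -2 in two's complement
    (void)ok; // avoids an unused-variable warning when NDEBUG is defined
}
```

For a big-endian file the shift would become 8 * (3 - i); everything else
stays the same.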

Note that if you follow these points, you will be able to use your code on
more than 99% of all machines. For the rest, you would have to change the
code to adapt to a sign-magnitude representation instead of the two's
complement now used by most CPUs. Further, assuming a byte always has 8 bits
is also not portable, even though it's true for most machines, too.

Now, for floating-point numbers, I'm afraid you will not get away as easily.
For that, you will have to know both the representation in the file and the
one in memory. Try reading up on IEEE floats, which are probably what is on
disk.
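
For the common case, a sketch might look like this. It assumes the doubles
on disk are IEEE 754 binary64 stored little-endian, that the reading machine
also uses IEEE 754 doubles, and that it stores integers and doubles in the
same byte order; none of this is confirmed for the file in question. The
names readDoubleLE and demoReadDouble are invented here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <limits>
#include <sstream>
#include <string>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch needs an in-memory IEEE 754 binary64 double");

// Reads an 8-byte little-endian IEEE 754 double from a narrow (char)
// stream. Returns false if eight bytes could not be read.
bool readDoubleLE(std::istream& in, double& out)
{
    char raw[8];
    if (!in.read(raw, sizeof raw))
        return false;

    // Assemble the 64-bit pattern independently of the host byte order.
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        const std::uint64_t byte = static_cast<unsigned char>(raw[i]);
        bits |= byte << (8 * i);
    }

    // Copy the bit pattern into the double; memcpy avoids aliasing problems.
    std::memcpy(&out, &bits, sizeof out);
    return true;
}

// Usage illustration with in-memory bytes instead of a real file.
void demoReadDouble()
{
    // 1.0 has the bit pattern 0x3FF0000000000000.
    const unsigned char bytes[8] = {0, 0, 0, 0, 0, 0, 0xF0, 0x3F};
    std::istringstream in(
        std::string(reinterpret_cast<const char*>(bytes), sizeof bytes));

    double value = 0.0;
    const bool ok = readDoubleLE(in, value);
    assert(ok && value == 1.0);
    (void)ok; // avoids an unused-variable warning when NDEBUG is defined
}
```

If the file turns out to use a different floating-point format, the bit
pattern would have to be decoded field by field (sign, exponent, mantissa)
instead of copied.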

Good luck!

Uli
