unformatted files (again) [Fortran]

From: Gideon on 6 Aug 2010 11:28

From a cursory search, I know questions like this have been asked
before, but I wanted to ask something more pointed. I know that when
you write an unformatted file in fortran, it sticks a "header" before
your data that indicates how many bytes of data your real information
is. On my Intel Core 2 OS X machine, I discovered a while ago (from
the intel ifort documentation) that it was sticking this information
in as a 4 byte integer at the beginning of the file. Thus, I was free
to either skip the first 4 bytes if I knew how large my data structure
was, or read in this integer to figure out how large it was. I should
have prefaced this by saying that I'm mostly writing arrays of double
precision numbers and then reading them into MATLAB.

Anyways, I recently had a colleague experiment with this on an intel
machine running some flavor of linux. On this setup, it turned out to
use an 8 byte integer which took us some time to discover.

So here's my question: is there an easy, robust, way to discover what
size the header of a fortran unformatted file is on a given
architecture/OS?

From: Dave Allured on 6 Aug 2010 13:05

Gideon wrote:
>
> From a cursory search, I know questions like this have been asked
> before, but I wanted to ask something more pointed. I know that when
> you write an unformatted file in fortran, it sticks a "header" before
> your data that indicates how many bytes of data your real information
> is. On my Intel Core 2 OS X machine, I discovered a while ago (from
> the intel ifort documentation) that it was sticking this information
> in as a 4 byte integer at the beginning of the file. Thus, I was free
> to either skip the first 4 bytes if I knew how large my data structure
> was, or read in this integer to figure out how large it was. I should
> have prefaced this by saying that I'm mostly writing arrays of double
> precision numbers and then reading them into MATLAB.
>
> Anyways, I recently had a colleague experiment with this on an intel
> machine running some flavor of linux. On this setup, it turned out to
> use an 8 byte integer which took us some time to discover.
>
> So here's my question: is there an easy, robust, way to discover what
> size the header of a fortran unformatted file is on a given
> architecture/OS?

This is a tricky question because the internal structure of fortran
unformatted sequential files was never standardized. The record length
integers were never intended to be seen by normal users, putting the
whole topic outside fortran standards.

For the compilers and unix and linux platforms within my experience, I
can count on the following structure of each unformatted record:

[length] [data block] [length]

Where [length] is a 4 or 8 byte integer, the byte count of the data
block; and [data block] is the user data from a single unformatted write
statement. The leading and trailing length integers for each record are
identical. I believe the original purpose of the trailing length was to
support reverse reading such as the backspace statement.
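
To make that layout concrete, here is a minimal Python sketch that writes and reads records of the form [length] [data block] [length]. It is only an illustration under stated assumptions: the helper names write_record and read_records are made up for this example, the default marker is a 4 byte little endian integer (pass "<q" for the 8 byte case), and no compiler-specific extension to this simple layout is handled.

```python
import io
import struct

def write_record(stream, payload, marker="<i"):
    """Append one record laid out as [length][payload][length]."""
    length = struct.pack(marker, len(payload))
    stream.write(length + payload + length)

def read_records(stream, marker="<i"):
    """Yield each record's payload, checking that both length fields agree."""
    size = struct.calcsize(marker)
    while True:
        head = stream.read(size)
        if not head:
            return  # clean end of file
        if len(head) != size:
            raise ValueError("truncated leading length")
        (n,) = struct.unpack(marker, head)
        if n < 0:
            raise ValueError("negative record length")
        payload = stream.read(n)
        tail = stream.read(size)
        if len(payload) != n or tail != head:
            raise ValueError("record truncated or length fields disagree")
        yield payload

# Two records of double precision values, written back to back with no gaps.
buf = io.BytesIO()
write_record(buf, struct.pack("<3d", 1.0, 2.0, 3.0))
write_record(buf, struct.pack("<2d", 4.0, 5.0))
buf.seek(0)
records = [struct.unpack("<%dd" % (len(p) // 8), p) for p in read_records(buf)]
```

The same reader works on a real file opened in binary mode, provided the marker format matches what the compiler actually wrote.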

The other important aspect of the file structure is that the unformatted
records are then written or mapped contiguously onto an ordinary file,
with no gaps between records. The unix-like data model for ordinary
files is that they are a single unbroken stream of bytes with a given
total length. Certain older platforms used distinctly different
mappings, so let's just stick with the unix-like assumption from now on.

So you should be able to get a fairly robust determination by using the
redundant information in the trailing length byte of the first record.
Using either direct or stream access, read the first 8 bytes of the
file. Then test several interpretations for the leading length
integer. Counting endianness if you need to, there are four possibilities:

4 bytes, little endian
8 bytes, little endian
4 bytes, big endian
8 bytes, big endian

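For each possibility, with N the interpreted length, you then skip ahead
to the end of the data block. Having already read 8 bytes, that means
skipping N bytes for an 8 byte length, or N-4 bytes for a 4 byte length
(the 8 bytes read already include the first 4 data bytes). Then attempt
to read the first record's trailing length integer in the same format.
Test for I/O error each time, because misinterpreted lengths will often
run off the end of the file.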

Depending on your application, you may also be able to pre-screen for
minimum and maximum reasonable record lengths, before attempting wild
file seeks. If your fortran supports inquiring the file length (F2003 I
think), this is also a good prequalification for interpreted lengths.

Ideally the tests will yield one success and three failures, which means
a complete determination.
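
The test described above can be sketched in Python roughly as follows. This is a sketch under assumptions, not a definitive implementation: the function name plausible_markers and its min_len/max_len parameters are invented for the example, it seeks to absolute offsets rather than skipping N or N-4 bytes after an 8 byte read (the effect is the same), and it uses the file size as the prequalification step.

```python
import os
import struct
import tempfile

# The four interpretations, as struct format codes.
CANDIDATES = {
    "4 bytes, little endian": "<i",
    "8 bytes, little endian": "<q",
    "4 bytes, big endian": ">i",
    "8 bytes, big endian": ">q",
}

def plausible_markers(path, min_len=0, max_len=None):
    """Return the interpretations whose trailing length matches the leading one."""
    file_size = os.path.getsize(path)
    matches = []
    with open(path, "rb") as f:
        for name, fmt in CANDIDATES.items():
            size = struct.calcsize(fmt)
            f.seek(0)
            head = f.read(size)
            if len(head) != size:
                continue  # file too short for this interpretation
            (n,) = struct.unpack(fmt, head)
            # Pre-screen: reject unreasonable lengths before seeking anywhere.
            if n < min_len or (max_len is not None and n > max_len):
                continue
            if 2 * size + n > file_size:
                continue  # first record would run off the end of the file
            f.seek(size + n)
            if f.read(size) == head:
                matches.append(name)
    return matches

def _demo(fmt):
    """Write one record of three doubles with the given marker, then test it."""
    payload = struct.pack("<3d", 1.5, 2.5, 3.5)
    marker = struct.pack(fmt, len(payload))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(marker + payload + marker)
    try:
        return plausible_markers(tmp.name)
    finally:
        os.remove(tmp.name)

demo_4 = _demo("<i")
demo_8 = _demo("<q")
```

A result list with exactly one entry corresponds to the "one success and three failures" outcome; an empty list or several entries means the first record alone did not settle it.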

It is conceivable that a file may have a data pattern that exactly
matches the expected trailing length integer. Then you might consider
testing the second record as well. But that seems like a lot of work to
me. My work in this area so far has been confined to file sets with
severe constraints on the minimum and maximum record size, which makes
discrimination much easier.
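
If a single record leaves doubt, one possible extension (my assumption here, going a step beyond testing only the second record) is to follow the whole chain of records and require that it ends exactly at the end of the file. The sketch below does that on a bytes object held in memory; the name walks_cleanly is hypothetical, and a large file would call for seeking instead of slicing.

```python
import struct

def walks_cleanly(data, fmt):
    """Return the record count if every record's two length fields agree and
    the chain of records ends exactly at the end of data; otherwise None."""
    size = struct.calcsize(fmt)
    pos = 0
    count = 0
    while pos < len(data):
        head = data[pos:pos + size]
        if len(head) != size:
            return None  # leftover bytes too short to be a length field
        (n,) = struct.unpack(fmt, head)
        end = pos + size + n  # offset of the trailing length field
        if n < 0 or end + size > len(data):
            return None  # record would run off the end of the data
        if data[end:end + size] != head:
            return None  # trailing length does not match leading length
        pos = end + size
        count += 1
    return count

def _record(payload, fmt):
    marker = struct.pack(fmt, len(payload))
    return marker + payload + marker

# Two records written with 4 byte little endian length fields.
sample = (_record(struct.pack("<2d", 1.5, 2.5), "<i")
          + _record(struct.pack("<d", 3.5), "<i"))

results = {fmt: walks_cleanly(sample, fmt) for fmt in ("<i", "<q", ">i", ">q")}
```

Only the interpretation that matches how the file was written should return a count; for this sample the other three return None.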

--Dave

From: Dave Allured on 6 Aug 2010 13:19

Correction. In the 5th paragraph, replace "byte" with "integer":

> redundant information in the trailing length *integer* ...

--Dave

From: Nick Maclaren on 6 Aug 2010 13:28

In article <4C5C40E1.2823(a)nospom.com>, Dave Allured <nospom(a)nospom.com> wrote:
>Gideon wrote:
>>
>> So here's my question: is there an easy, robust, way to discover what
>> size the header of a fortran unformatted file is on a given
>> architecture/OS?
>
>This is a tricky question because the internal structure of fortran
>unformatted sequential files was never standardized. The record length
>integers were never intended to be seen by normal users, putting the
>whole topic outside fortran standards.

That's understating the issue :-)

What record-length integers? Some systems didn't have them, and
that includes some types of file under Unices :-) Magnetic tapes
of types that allow variable-length blocks, run-time systems that
allow the direct use of sockets and so on.

>For the compilers and unix and linux platforms within my experience, I
>can count on the following structure of each unformatted record:
>
> [length] [data block] [length]
>
>Where [length] is a 4 or 8 byte integer, the byte count of the data
>block; and [data block] is the user data from a single unformatted write
>statement. The leading and trailing length integers for each record are
>identical. I believe the original purpose of the trailing length was to
>support reverse reading such as the backspace statement.

That is correct, and that is the usual format. HOWEVER, I have also
seen the following:

1) As above, but with 2 byte integers.

2) With only a preceding length (4 byte, if I recall).

3) With a header before the first record.

4) With the [length] field actually being a [junk,length] field.

My guess is that all of those are now dead and buried, though.

Regards,
Nick Maclaren.

From: Dave Allured on 6 Aug 2010 14:04

Nick Maclaren wrote:
>
> In article <4C5C40E1.2823(a)nospom.com>, Dave Allured <nospom(a)nospom.com> wrote:
> >Gideon wrote:
> >>
> >> So here's my question: is there an easy, robust, way to discover what
> >> size the header of a fortran unformatted file is on a given
> >> architecture/OS?
> >
> >This is a tricky question because the internal structure of fortran
> >unformatted sequential files was never standardized. The record length
> >integers were never intended to be seen by normal users, putting the
> >whole topic outside fortran standards.
>
> That's understating the issue :-)
>
> What record-length integers? Some systems didn't have them, and
> that includes some types of file under Unices :-) Magnetic tapes
> of types that allow variable-length blocks, run-time systems that
> allow the direct use of sockets and so on.

Point taken! However, today Gideon and I are trying to discuss ordinary disk
files only, for practical purposes. If for whatever reason you try my
file sexing methods on a weird I/O device, you will get what you
deserve! ;-)

> >For the compilers and unix and linux platforms within my experience, I
> >can count on the following structure of each unformatted record:
> >
> > [length] [data block] [length]
> >
> >Where [length] is a 4 or 8 byte integer, the byte count of the data
> >block; and [data block] is the user data from a single unformatted write
> >statement. The leading and trailing length integers for each record are
> >identical. I believe the original purpose of the trailing length was to
> >support reverse reading such as the backspace statement.
>
> That is correct, and that is the usual format. HOWEVER, I have also
> seen the following:
>
> 1) As above, but with 2 byte integers.
>
> 2) With only a preceding length (4 byte, if I recall).
>
> 3) With a header before the first record.
>
> 4) With the [length] field actually being a [junk,length] field.
>
> My guess is that all of those are now dead and buried, though.

Very interesting! My algorithm could deal with (1) with an increase in
uncertainty or supplemental testing, and (3) and (4) with more knowledge
of the particular details. But the important part is where you said
dead and buried! ;-)

--Dave
