From: Oscar Hartogensis on 24 Mar 2010 10:55

While writing multiple 1-dimensional variables (a time series) to a netCDF file of nc_type "short", I noticed that the file becomes twice as large with an unlimited dimension than with a fixed dimension. However:

1. Writing only one variable of nc_type 'short', the fixed- and unlimited-dimension files are the same size.
2. Writing all data as floats, the fixed- and unlimited-dimension files are also of equal size (double the size of the fixed-dimension file of type short, as expected).

It seems that using multiple variables with an unlimited dimension means the data is always written as if it were a float. Or am I doing something wrong?

The files I write are quite large, and I need an unlimited dimension because I don't know the record length in advance (I join multiple files into one netCDF file), but I don't want to waste double the disk space on my nc-files. I tried the MATLAB-native netcdf commands (example below), but also the mexcdf toolbox and snctools; all give the same result. So this seems to be a netCDF rather than a MATLAB issue. Any help is much appreciated, though.
An example to illustrate the issue:

%%%%%%%%%%%%%%%%%%%%%%%%
N = 80000;

% FIXED (limited) dimension
% create a netCDF file
nc = netcdf.create('testfile_lim.nc', 'NC_CLOBBER');
% define dimension
time_dim = netcdf.defDim(nc, 'time', N);
% define variables
var1_id = netcdf.defVar(nc, 'var1', 'short', time_dim);
var2_id = netcdf.defVar(nc, 'var2', 'short', time_dim);
netcdf.endDef(nc);
% write data
netcdf.putVar(nc, var1_id, int16(1:N));
netcdf.putVar(nc, var2_id, int16(1:N));
% close nc-file
netcdf.close(nc)

% UNLIMITED dimension
% create a netCDF file
nc = netcdf.create('testfile_unlim.nc', 'NC_CLOBBER');
% define dimension
time_dim = netcdf.defDim(nc, 'time', netcdf.getConstant('NC_UNLIMITED'));
% define variables
var1_id = netcdf.defVar(nc, 'var1', 'short', time_dim);
var2_id = netcdf.defVar(nc, 'var2', 'short', time_dim);
netcdf.endDef(nc);
% write data
netcdf.putVar(nc, var1_id, 0, N, int16(1:N));
netcdf.putVar(nc, var2_id, 0, N, int16(1:N));
% close nc-file
netcdf.close(nc)
%%%%%%%%%%%%%%%%%%%%%%%%

testfile_lim.nc => 312 kB
testfile_unlim.nc => 625 kB
From: Oscar Hartogensis on 24 Mar 2010 16:11

A reply to my own message, as I received the answer on this issue from Russ Rew of the Unidata support team (http://www.unidata.ucar.edu/software/netcdf/):

"
You need to know something about the underlying netCDF classic format to explain this. The reason is that the space for each variable's data in a record is padded to the nearest multiple of 4 bytes. This makes sure each variable's data starts on a 4-byte boundary, which is an optimization for disk seeks on some platforms.

There is a special case if there is only one record variable, in which case no padding is used for byte or short variables. These padding rules are documented in the format specification:

http://www.unidata.ucar.edu/netcdf/docs/netcdf.html#NetCDF-Classic-Format

and specifically in the description of the "varslab", which is a record's worth of data for a single variable, along with the special note at the end of the specification on padding:

Note on padding: In the special case of only a single record variable of character, byte, or short type, no padding is used between data values.
"

Also, I found a practical solution: I don't know the exact dimension of the nc-file beforehand (which is why I used an unlimited dimension in the first place), but I can make a reasonable estimate of the maximum possible length of the file and add some 10% to that just to be sure. Using a fixed dimension that is always larger than the maximum record length, I get nc-files of the same size as with exact dimension information. My files are now half the size they were before; in addition, processing is also a bit faster than with an unlimited dimension.
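[Editor's note: the padding rule accounts exactly for the two file sizes reported in the first post. A quick sanity check, with the numbers taken from that example (two 'short' record variables, N = 80000):]

```matlab
% Sanity check on the 4-byte padding rule for the example in the first post.
N = 80000;                        % number of records
nvars = 2;                        % two 'short' record variables
bytes_lim   = nvars * N * 2;      % fixed dimension: 2 bytes per short, no padding
bytes_unlim = nvars * N * 4;      % unlimited: each 2-byte slab padded to 4 bytes
fprintf('limited:   %d bytes (~%.1f kB)\n', bytes_lim,   bytes_lim/1024);   % 312.5 kB
fprintf('unlimited: %d bytes (~%.1f kB)\n', bytes_unlim, bytes_unlim/1024); % 625.0 kB
```

This reproduces the observed 312 kB vs 625 kB (data payload only; the file header adds a few more bytes).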
From: TideMan on 24 Mar 2010 16:32

On Mar 25, 9:11 am, "Oscar Hartogensis" <oscar.hartogen...(a)wur.nl> wrote:
> [quoted text snipped]

Thank you for posting this. I use .nc files with an unlimited dimension (for time) all the time and hadn't noticed the size issue.
However, there is a flaw in your work-around that would make it impractical for me. When I retrieve data, I often want the latest week or month of data, i.e., up to the end of the unlimited dimension. But if I were to pre-allocate the size, the returned array would include a whole bunch of useless data at the end. And how would I know where the good data finished? That would involve a work-around that is as messy as your one.
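[Editor's note: one way to reconcile the over-allocated fixed dimension with this use case would be to record the number of valid records in the file itself, e.g. as a global attribute. The attribute name 'actual_length' below is an arbitrary choice for illustration, not a netCDF convention. A self-contained sketch using MATLAB's netcdf package:]

```matlab
% Writer side: over-allocate a fixed dimension, but store the number of
% valid records as a global attribute ('actual_length' is an arbitrary name).
Nmax = 100;                        % over-allocated fixed dimension
n    = 75;                         % records actually written
nc = netcdf.create('testfile_attr.nc', 'NC_CLOBBER');
time_dim = netcdf.defDim(nc, 'time', Nmax);
var1_id  = netcdf.defVar(nc, 'var1', 'short', time_dim);
netcdf.putAtt(nc, netcdf.getConstant('NC_GLOBAL'), 'actual_length', int32(n));
netcdf.endDef(nc);
netcdf.putVar(nc, var1_id, 0, n, int16(1:n));
netcdf.close(nc)

% Reader side: fetch the attribute, then read only the valid records,
% ignoring the unused tail of the over-allocated dimension.
nc = netcdf.open('testfile_attr.nc', 'NC_NOWRITE');
n  = double(netcdf.getAtt(nc, netcdf.getConstant('NC_GLOBAL'), 'actual_length'));
varid = netcdf.inqVarID(nc, 'var1');
var1  = netcdf.getVar(nc, varid, 0, n, 'int16');
netcdf.close(nc)
```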
From: Oscar Hartogensis on 25 Mar 2010 05:14

TideMan <mulgor(a)gmail.com> wrote in message <f81a28f5-49b3-449a-ba2d-512c2e594460(a)u5g2000prd.googlegroups.com>...
> [quoted text snipped]

I am not sure we are talking about the same thing here: you retrieve data (from an nc-file, I assume), whereas I write nc-files. My issue is the on-disk size of the nc-file when writing ncshort data with an unlimited dimension (each number occupying 4 bytes instead of 2). I think you are referring to the memory load of a variable when reading nc-data?

When reading nc-files into MATLAB, all numbers end up as type "double" unless specified otherwise, irrespective of their format in the nc-file or the dimension type. Pre-allocating an array indeed doesn't help here. If your data is ncshort, you can read it as a 2-byte integer into MATLAB by specifying 'int16' in the netcdf.getVar command. Following the example above:

nc = netcdf.open('testfile_unlim.nc', 'NC_NOWRITE');
varid = netcdf.inqVarID(nc, 'var1');
var1 = netcdf.getVar(nc, varid, 'int16');
netcdf.close(nc)
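[Editor's note: for the "latest week or month" use case with a true unlimited dimension, the current record count can be queried from the file with netcdf.inqDim, so only the most recent k records need to be read. A self-contained sketch (a small 20-record file is created first just so the snippet runs on its own):]

```matlab
% Create a small unlimited-dimension example file (20 records).
nc = netcdf.create('testfile_tail.nc', 'NC_CLOBBER');
time_dim = netcdf.defDim(nc, 'time', netcdf.getConstant('NC_UNLIMITED'));
var1_id  = netcdf.defVar(nc, 'var1', 'short', time_dim);
netcdf.endDef(nc);
netcdf.putVar(nc, var1_id, 0, 20, int16(1:20));
netcdf.close(nc)

% Ask the file how many records it currently holds, then fetch the last k.
nc = netcdf.open('testfile_tail.nc', 'NC_NOWRITE');
[~, nrec] = netcdf.inqDim(nc, netcdf.inqDimID(nc, 'time'));  % current length of 'time'
k = min(5, nrec);                                            % e.g. the most recent 5 records
varid = netcdf.inqVarID(nc, 'var1');
tail  = netcdf.getVar(nc, varid, nrec - k, k, 'int16');      % last k records only
netcdf.close(nc)
```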