From: Oscar Hartogensis on 24 Mar 2010 10:55

While writing multiple 1-dimensional variables (a time series) to a netCDF file of nc_type "short", I noticed that the file becomes twice as large with an unlimited dimension than with a fixed dimension. However:

1. Writing only one variable of nc_type 'short', the fixed- and unlimited-dimension files are the same size.
2. Writing all data as floats, the fixed- and unlimited-dimension files are also of equal size (double the size of the fixed-dimension file of type short, as expected).

It seems that using multiple variables with an unlimited dimension means the data is always written as if it were a float. Or am I doing something wrong?

The files I write are quite large, and I need an unlimited dimension because I don't know the record length in advance (I join multiple files into one netCDF file), but I don't want to waste double the disk space on my nc-files. I tried the MATLAB-native netcdf commands (example below), but also the mexcdf toolbox and snctools; all give the same result. So this seems to be a netCDF rather than a MATLAB issue. Any help is much appreciated, though.
An example to illustrate the issue:

%%%%%%%%%%%%%%%%%%%%%%%%
N = 80000;

% FIXED (limited) dimension
% create a netCDF file
nc = netcdf.create('testfile_lim.nc', 'NC_CLOBBER');
% define dimension
time_dim = netcdf.defDim(nc, 'time', N);
% define variables
var1_id = netcdf.defVar(nc, 'var1', 'short', time_dim);
var2_id = netcdf.defVar(nc, 'var2', 'short', time_dim);
netcdf.endDef(nc);
% write data
netcdf.putVar(nc, var1_id, int16(1:N));
netcdf.putVar(nc, var2_id, int16(1:N));
% close nc-file
netcdf.close(nc)

% UNLIMITED dimension
% create a netCDF file
nc = netcdf.create('testfile_unlim.nc', 'NC_CLOBBER');
% define dimension
time_dim = netcdf.defDim(nc, 'time', netcdf.getConstant('NC_UNLIMITED'));
% define variables
var1_id = netcdf.defVar(nc, 'var1', 'short', time_dim);
var2_id = netcdf.defVar(nc, 'var2', 'short', time_dim);
netcdf.endDef(nc);
% write data
netcdf.putVar(nc, var1_id, 0, N, int16(1:N));
netcdf.putVar(nc, var2_id, 0, N, int16(1:N));
% close nc-file
netcdf.close(nc)
%%%%%%%%%%%%%%%%%%%%%%%%

testfile_lim.nc => 312 kB
testfile_unlim.nc => 625 kB
From: Oscar Hartogensis on 24 Mar 2010 16:11

A reply to my own message, as I received the answer on this issue from Russ Rew of the Unidata support team (http://www.unidata.ucar.edu/software/netcdf/):

"
You need to know something about the underlying netCDF classic format to explain this. The reason is that the space for each variable's data in a record is padded to the nearest multiple of 4 bytes. This makes sure each variable's data starts on a 4-byte boundary, which is an optimization for disk seeks on some platforms.

There is a special case if there is only one record variable, in which case no padding is used for byte or short variables. These padding rules are documented in the format specification:

http://www.unidata.ucar.edu/netcdf/docs/netcdf.html#NetCDF-Classic-Format

and specifically in the description of the "varslab", which is a record's worth of data for a single variable, along with the special note at the end of the specification on padding:

Note on padding: In the special case of only a single record variable of character, byte, or short type, no padding is used between data values.
"

Also, I found a practical solution: I don't know the exact dimension of the nc-file beforehand (which is why I used an unlimited dimension in the first place), but I can make a reasonable estimate of the maximum possible length of the file and add some 10% to that just to be sure. Using a fixed dimension that is always larger than the maximum record length, I get nc-files of the same size as with exact dimension information. My files are now half the size they were before; in addition, processing is also a bit faster than with an unlimited dimension.
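[Editor's note: the padding rule accounts exactly for the two file sizes reported in the first post. A quick sanity check, with the numbers taken from that example (two 'short' record variables, N = 80000):]

```matlab
% Sanity check on the 4-byte padding rule for the example in the first post.
N = 80000;                        % number of records
nvars = 2;                        % two 'short' record variables
bytes_lim   = nvars * N * 2;      % fixed dimension: 2 bytes per short, no padding
bytes_unlim = nvars * N * 4;      % unlimited: each 2-byte slab padded to 4 bytes
fprintf('limited:   %d bytes (~%.1f kB)\n', bytes_lim,   bytes_lim/1024);   % 312.5 kB
fprintf('unlimited: %d bytes (~%.1f kB)\n', bytes_unlim, bytes_unlim/1024); % 625.0 kB
```

This reproduces the observed 312 kB vs 625 kB (data payload only; the file header adds a few more bytes).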
From: TideMan on 24 Mar 2010 16:32

On Mar 25, 9:11 am, "Oscar Hartogensis" <oscar.hartogen...(a)wur.nl> wrote:
> [quoted text snipped]

Thank you for posting this. I use .nc files with an unlimited dimension (for time) all the time and hadn't noticed the size issue.
However, there is a flaw in your work-around that would make it impractical for me. When I retrieve data, I often want the latest week or month of data, i.e., up to the end of the unlimited dimension. But if I were to pre-allocate the size, the returned array would include a whole bunch of useless data at the end. And how would I know where the good data finished? That would involve a work-around that is as messy as your one.
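[Editor's note: one way to reconcile the over-allocated fixed dimension with this use case would be to record the number of valid records in the file itself, e.g. as a global attribute. The attribute name 'actual_length' below is an arbitrary choice for illustration, not a netCDF convention. A self-contained sketch using MATLAB's netcdf package:]

```matlab
% Writer side: over-allocate a fixed dimension, but store the number of
% valid records as a global attribute ('actual_length' is an arbitrary name).
Nmax = 100;                        % over-allocated fixed dimension
n    = 75;                         % records actually written
nc = netcdf.create('testfile_attr.nc', 'NC_CLOBBER');
time_dim = netcdf.defDim(nc, 'time', Nmax);
var1_id  = netcdf.defVar(nc, 'var1', 'short', time_dim);
netcdf.putAtt(nc, netcdf.getConstant('NC_GLOBAL'), 'actual_length', int32(n));
netcdf.endDef(nc);
netcdf.putVar(nc, var1_id, 0, n, int16(1:n));
netcdf.close(nc)

% Reader side: fetch the attribute, then read only the valid records,
% ignoring the unused tail of the over-allocated dimension.
nc = netcdf.open('testfile_attr.nc', 'NC_NOWRITE');
n  = double(netcdf.getAtt(nc, netcdf.getConstant('NC_GLOBAL'), 'actual_length'));
varid = netcdf.inqVarID(nc, 'var1');
var1  = netcdf.getVar(nc, varid, 0, n, 'int16');
netcdf.close(nc)
```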
From: Oscar Hartogensis on 25 Mar 2010 05:14

TideMan <mulgor(a)gmail.com> wrote in message <f81a28f5-49b3-449a-ba2d-512c2e594460(a)u5g2000prd.googlegroups.com>...
> [quoted text snipped]

I am not sure we are talking about the same thing here: you retrieve data (from an nc-file, I assume), whereas I write nc-files. My issue is the on-disk size of the nc-file when writing ncshort data with an unlimited dimension (each number occupying 4 bytes instead of 2). I think you are referring to the memory load of a variable when reading nc-data?

When reading nc-files into MATLAB, all numbers end up as type "double" unless specified otherwise, irrespective of their format in the nc-file or the dimension type. Pre-allocating an array indeed doesn't help here. If your data is ncshort, you can read it as a 2-byte integer into MATLAB by specifying 'int16' in the netcdf.getVar command. Following the example above:

nc = netcdf.open('testfile_unlim.nc', 'NC_NOWRITE');
varid = netcdf.inqVarID(nc, 'var1');
var1 = netcdf.getVar(nc, varid, 'int16');
netcdf.close(nc)
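[Editor's note: for the "latest week or month" use case with a true unlimited dimension, the current record count can be queried from the file with netcdf.inqDim, so only the most recent k records need to be read. A self-contained sketch (a small 20-record file is created first just so the snippet runs on its own):]

```matlab
% Create a small unlimited-dimension example file (20 records).
nc = netcdf.create('testfile_tail.nc', 'NC_CLOBBER');
time_dim = netcdf.defDim(nc, 'time', netcdf.getConstant('NC_UNLIMITED'));
var1_id  = netcdf.defVar(nc, 'var1', 'short', time_dim);
netcdf.endDef(nc);
netcdf.putVar(nc, var1_id, 0, 20, int16(1:20));
netcdf.close(nc)

% Ask the file how many records it currently holds, then fetch the last k.
nc = netcdf.open('testfile_tail.nc', 'NC_NOWRITE');
[~, nrec] = netcdf.inqDim(nc, netcdf.inqDimID(nc, 'time'));  % current length of 'time'
k = min(5, nrec);                                            % e.g. the most recent 5 records
varid = netcdf.inqVarID(nc, 'var1');
tail  = netcdf.getVar(nc, varid, nrec - k, k, 'int16');      % last k records only
netcdf.close(nc)
```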