From: matlaberboy on
When unzipping a large *.gz file containing ASCII data in CSV format, a large percentage of the time the output file (myFile.csv) is missing a large amount of data off the end of the file.

The files I am processing are quite large (eg > 50,000 lines).

I am using a Vista 32-bit machine with Matlab 2009a 32-bit and 4GB RAM. It's not a memory issue (the files are only ~3-5MB in size). I am using TextPad to view the CSV data. (Excel is not used at any stage.)

I have a collection of ~20,000 GZ files containing timeseries data. Some of the files are processed correctly; many of them are not.

The behaviour is quite unstable (it occurs on some files but not on others), though it is reproducible.
From: matlaberboy on
Submitted this as a bug to matlab. They came back with the below response. In summary yes it is a serious bug in the underlying Java and no they are not going to do anything about it.

###

there is no planning to do that right now.

regards

Rossana Pacchiodo

Engineering Services
MathWorks

Support Request: http://www.mathworks.com/support/service_requests/contact_support.do

[THREAD ID:1-COWBWO]

-----Original Message-----

From:
Sent: 2010-05-04 10:39:34 AM
To: <support(a)mathworks.it>
Subject: GUNZIP BUG

I agree that winrar, 7zip etc can unzip the file correctly. Thus the fault can not be with the file but must be with the code.

Hence I agree that this sounds like a java bug.

Surely in this case, Matlab can not now continue promoting the gunzip implementation and should either re-write it without java, in Matlab or withdraw the function?

-----Original Message-----
From: support(a)mathworks.it [mailto:support(a)mathworks.it]
Sent: 04 May 2010 09:09
To:
Subject: Re: GUNZIP BUG

Dear ,

I am writing in reference to your Service Request # 1-COWBWO regarding 'GUNZIP BUG'.

MATLAB uses Java underneath to zip and unzip files. In this case, there appears to be some sort of corruption in the ZIP file, which causes Java not to be able to read the whole file. When I extract the file using Winzip, and then rezip it, it can be unzipped without a problem.

Given that we have no control over the Java implementation, and that there appears to be something corrupted about this file, I don't think there is much we can do.

I would suggest to you that you rezip the file, perhaps using a different application, as a workaround.

Moreover:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425




Please preserve the THREAD ID below in any further correspondence on this query. This will allow our systems to automatically assign your reply to the appropriate Service Request. If you have a new technical support question, please submit a new request here:

http://www.mathworks.com/support/service_requests/contact_support.do

Sincerely,

Rossana Pacchiodo
Technical Support Engineer

MathWorks

[THREAD ID: 1-COWBWO]
From: Walter Roberson on
matlaberboy wrote:
> Submitted this as a bug to matlab. They came back with the below
> response. In summary yes it is a serious bug in the underlying Java and
> no they are not going to do anything about it.

> To: <support(a)mathworks.it>

> I agree that winrar, 7zip etc can unzip the file correctly. Thus the
> fault can not be with the file but must be with the code.
>
> Hence I agree that this sounds like a java bug.

I would have to disagree with your assessment of this matter.

When a file has been written incorrectly or has been corrupted after
being formed, then any given program reading it has a choice of what to
do when it encounters the problem. Some programs will choose to
terminate; others will try to make guesses about what was intended;
other programs might not even notice the problem because they are faulty
themselves.

If there is no official international standards committee that has
mandated a particular behaviour, and if the file format is not a
proprietary format that the manufacturer has mandated particular
behaviour for, then there is no way to decide that a particular
program has handled the situation "correctly" or "incorrectly".
Similarly if there is a Standard but it does not describe the action to
take in a particular error situation, then any particular treatment
cannot be considered correct or incorrect.

The GZIP file format was set out in the standard RFC 1952,
http://www.gzip.org/zlib/rfc-gzip.html
Examine that standard, and in particular examine the paragraphs entitled
'Compliance' at the end of section 2:

==== begin quotation ====
Compliance

A compliant compressor must produce files with correct ID1, ID2, CM,
CRC32, and ISIZE, but may set all the other fields in the fixed-length
part of the header to default values (255 for OS, 0 for all others). The
compressor must set all reserved bits to zero.

A compliant decompressor must check ID1, ID2, and CM, and provide an
error indication if any of these have incorrect values. It must examine
FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC at least so it can skip over the
optional fields if they are present. It need not examine any other part
of the header or trailer; in particular, a decompressor may ignore FTEXT
and OS and always produce binary output, and still be compliant. A
compliant decompressor must give an error indication if any reserved bit
is non-zero, since such a bit could indicate the presence of a new field
that would cause subsequent data to be interpreted incorrectly.
==== end quotation ====


What is of interest here is what the second paragraph does NOT say. In
particular, the second paragraph does NOT say anything about the
behaviour of a decompressor if it should detect that the CRC32 field
does not match the data stored in the file. Therefore a decompressor can
be compliant if it does not check the CRC32 and goes ahead and
decompresses the data it finds in the file; a decompressor can also be
compliant if it checks CRC32 and refuses to decompress because it knows
that something went wrong at some point.
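
To make the quoted compliance rules concrete, here is a minimal Python sketch (not from the original thread) of the checks the second paragraph does require: ID1, ID2, CM, and the reserved flag bits. The function name check_gzip_header is hypothetical; the byte offsets and the values 0x1F, 0x8B and CM = 8 are taken from RFC 1952. It deliberately does not look at CRC32 or ISIZE, which is exactly the latitude the standard leaves open.

```python
import gzip
import struct

# Bits 5-7 of the FLG byte are reserved in RFC 1952 and must be zero.
FRESERVED_MASK = 0xE0


def check_gzip_header(data):
    """Return a list of problems found in the fixed 10-byte gzip header."""
    if len(data) < 10:
        return ["shorter than the 10-byte fixed header"]
    problems = []
    id1, id2, cm, flg = struct.unpack_from("<BBBB", data, 0)
    if (id1, id2) != (0x1F, 0x8B):
        problems.append("bad magic bytes ID1/ID2")
    if cm != 8:  # 8 = deflate, the only method RFC 1952 defines
        problems.append("unknown compression method CM")
    if flg & FRESERVED_MASK:
        problems.append("reserved FLG bit set")
    return problems


# Usage: a freshly written file, then the same bytes with the magic replaced.
good = gzip.compress(b"t,value\n0,1.5\n")
print(check_gzip_header(good))
print(check_gzip_header(b"PK" + good[2:]))
```

A file can pass every one of these checks and still be damaged further in, which is why passing them says nothing about whether the output is complete.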


If I recall your situation correctly, you were given a partially
decompressed output. If that is the case, the problem would not be as
simple as the entire file CRC32, but it could be a problem with the
internal structure used by the particular compression method involved.
Unless we can find an official standard for that particular compression
method and it says that some particular behaviour "must" occur, we
cannot say that a particular program is right or wrong. If you examine
the international standard referenced above, RFC 1952, you will see that
it does not say anything about what "must" or "must not" happen if a
particular compression method finds an error that does not affect ID1,
ID2, CM, or the reserved bits of the header of the entire file.
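
One way to find out what kind of problem a given file has is to decompress it with a tool that reports where it stopped. The sketch below is an illustration only, using Python's zlib module rather than MATLAB or Java; inspect_gzip is a hypothetical helper name. It decompresses the first gzip member and reports whether the stream reached its proper end and how many bytes of the file were left over. Leftover bytes would suggest extra data or further gzip members after the first one; an unfinished stream would suggest truncation. Whether either case applies to the files in this thread is an assumption.

```python
import gzip
import zlib


def inspect_gzip(data):
    """Decompress the first gzip member and report what follows it.

    Returns (payload, ended_cleanly, leftover_byte_count).
    """
    d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    payload = d.decompress(data)
    # d.eof is True only if the end of the compressed stream was reached;
    # d.unused_data holds any bytes found after that end.
    return payload, d.eof, len(d.unused_data)


# Usage: a single member, two concatenated members, and a truncated file.
one = gzip.compress(b"row1\n")
two = one + gzip.compress(b"row2\n")
for name, blob in [("single", one), ("concatenated", two), ("truncated", one[:-4])]:
    payload, ended, leftover = inspect_gzip(blob)
    print(name, len(payload), ended, leftover)
```

Note that in the truncated case no exception is raised: the caller only learns that something is wrong by checking the end-of-stream flag, which is the same kind of silent partial output described above.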
From: Yair Altman on
"matlaberboy " <matlaberboy(a)gmail.NOSPAM.com> wrote in message <hror58$cgn$1(a)fred.mathworks.com>...
> Submitted this as a bug to matlab. They came back with the below response. In summary yes it is a serious bug in the underlying Java and no they are not going to do anything about it.

Like Walter, I also disagree with your assessment of this matter, but from an altogether different angle:

We all know how buggy Microsoft software is - we get updates every week fixing loads of such bugs, and that's only for the most critical security flaws. It therefore stands to reason that the core graphics libraries on which Matlab depends for plotting also have some bugs. Following your logic, upon discovering any such bug MathWorks will need to either (1) stop supporting Windows platforms or (2) implement its own graphic libraries (and do you think these will be bug-free?!). Obviously, both options are absurd.

The MathWorks CSR took the trouble to point you to a specific bug in the underlying Java system's bug-parade. It is entirely reasonable for MathWorks to wait for the next Java release that will hopefully fix this.

Lastly, Matlab relies on Java for a large part, but you always have the option of running Matlab with the -nojvm command-line option which will disable Java. Matlab will become extremely limited in its abilities but at least you'll be free from all those "unacceptable" Java bugs...

My personal advice: wake up to the real world...

Yair Altman
http://UndocumentedMatlab.com
From: matlaberboy on
This is very simple.

This bug has apparently been known about in Java for 8 years now.

If the decompression is going to fail, it would be nice if Java threw an exception, rather than giving the false impression that all data has been extracted. However, given we are in the real world, it would not be beyond the realm of reason for a paragraph to be added to the Matlab documentation for gunzip stating that there is this known issue with the underlying Java code.
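
Until something like that exists, a check can be bolted on after extraction. The following is a minimal sketch in Python, not MATLAB, and the names isize and check_extracted are hypothetical. It relies on the ISIZE field that RFC 1952 places in the last four bytes of a gzip file (the uncompressed size modulo 2^32, little-endian) and raises an error when the extracted data has a different length. It assumes each file holds a single gzip member; for a file with several concatenated members ISIZE describes only the last one, so the check would report a mismatch even for a complete extraction. Files of a few MB, as described at the top of the thread, are small enough to read whole.

```python
import gzip
import struct


def isize(gz_bytes):
    """Return ISIZE: the last four bytes of the file as a little-endian uint32."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


def check_extracted(gz_bytes, out_bytes):
    """Raise ValueError if the extracted data is not the size the trailer states."""
    want = isize(gz_bytes)
    got = len(out_bytes) % 2**32  # ISIZE is stored modulo 2**32
    if got != want:
        raise ValueError(f"extracted {got} bytes, gzip trailer says {want}")


# Usage: a complete extraction passes; one missing its tail raises.
data = b"t,value\n" + b"0,1.5\n" * 1000
packed = gzip.compress(data)
check_extracted(packed, data)
try:
    check_extracted(packed, data[:-100])
except ValueError as err:
    print("detected:", err)
```

A size match does not prove the contents are right (that is what CRC32 is for), but it is cheap enough to run over all ~20,000 files and would catch output that is missing data off the end.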