From: matlaberboy on 29 Apr 2010 15:25 When unzipping a large *.gz file that contains ASCII data in CSV format, a large percentage of the time the output file (myFile.csv) is missing a large amount of data from the end of the file. The files I am processing are quite large (e.g. > 50,000 lines). I am using a Vista 32-bit machine, Matlab 2009a 32-bit, 4GB RAM. It's not a memory issue (the files are only ~3-5MB in size). I am using TextPad to view the CSV data. (Excel is not used at any stage.) I have a collection of ~20,000 GZ files containing timeseries data. Some of the files it processes correctly; many of them it doesn't. The behaviour is inconsistent (it occurs on some files but not on others), though it is reproducible.
From: matlaberboy on 4 May 2010 05:59 Submitted this as a bug to matlab. They came back with the below response. In summary yes it is a serious bug in the underlying Java and no they are not going to do anything about it. ### there is no planning to do that right now. regards Rossana Pacchiodo *********************************************************************** Engineering Services MathWorks *********************************************************************** Support Request: http://www.mathworks.com/support/service_requests/contact_support.do *********************************************************************** [THREAD ID:1-COWBWO] -----Original Message----- From: Sent: 2010-05-04 10:39:34 AM To: <support(a)mathworks.it> Subject: GUNZIP BUG I agree that winrar, 7zip etc can unzip the file correctly. Thus the fault can not be with the file but must be with the code. Hence I agree that this sounds like a java bug. Surely in this case, Matlab can not now continue promoting the gunzip implementation and should either re-write it without java, in Matlab or withdraw the function? -----Original Message----- From: support(a)mathworks.it [mailto:support(a)mathworks.it] Sent: 04 May 2010 09:09 To: Subject: Re: GUNZIP BUG Dear , I am writing in reference to your Service Request # 1-COWBWO regarding 'GUNZIP BUG'. MATLAB uses Java underneath to zip and unzip files. In this case, there appears to be some sort of corruption in the ZIP file, which causes Java not to be able to read the whole file. When I extract the file using Winzip, and then rezip it, it can be unzipped without a problem. Given that we have no control over the Java implementation, and that there appears to be something corrupted about this file, I don't think there is much we can do. I would suggest to you that you rezip the file, perhaps using a different application, as a workaround. 
Moreover: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425 Please preserve the THREAD ID below in any further correspondence on this query. This will allow our systems to automatically assign your reply to the appropriate Service Request. If you have a new technical support question, please submit a new request here: http://www.mathworks.com/support/service_requests/contact_support.do Sincerely, Rossana Pacchiodo Technical Support Engineer MathWorks [THREAD ID: 1-COWBWO]
From: Walter Roberson on 4 May 2010 11:07 matlaberboy wrote: > Submitted this as a bug to matlab. They came back with the below > response. In summary yes it is a serious bug in the underlying Java and > no they are not going to do anything about it. > To: <support(a)mathworks.it> > I agree that winrar, 7zip etc can unzip the file correctly. Thus the > fault can not be with the file but must be with the code. > > Hence I agree that this sounds like a java bug. I would have to disagree with your assessment of this matter. When a file has been written incorrectly or has been corrupted after being formed, then any given program reading it has a choice of what to do when it encounters the problem. Some programs will choose to terminate; others will try to make guesses about what was intended; other programs might not even notice the problem because they are faulty themselves. If there is no official international standards committee that has mandated a particular behaviour, and if the file format is not a proprietary format that the manufacturer has mandated particular behaviour for, then there is no way to decide that a particular program has handled the situation "correctly" or "incorrectly". Similarly, if there is a Standard but it does not describe the action to take in a particular error situation, then any particular treatment cannot be considered correct or incorrect. The GZIP file format was set out in the standard RFC 1952, http://www.gzip.org/zlib/rfc-gzip.html Examine that standard, and in particular examine the paragraphs entitled 'Compliance' at the end of section 2: ==== begin quotation ==== Compliance A compliant compressor must produce files with correct ID1, ID2, CM, CRC32, and ISIZE, but may set all the other fields in the fixed-length part of the header to default values (255 for OS, 0 for all others). The compressor must set all reserved bits to zero.
A compliant decompressor must check ID1, ID2, and CM, and provide an error indication if any of these have incorrect values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC at least so it can skip over the optional fields if they are present. It need not examine any other part of the header or trailer; in particular, a decompressor may ignore FTEXT and OS and always produce binary output, and still be compliant. A compliant decompressor must give an error indication if any reserved bit is non-zero, since such a bit could indicate the presence of a new field that would cause subsequent data to be interpreted incorrectly. ==== end quotation ==== What is of interest here is what the second paragraph does NOT say. In particular, the second paragraph does NOT say anything about the behaviour of a decompressor if it should detect that the CRC32 field does not match the data stored in the file. Therefore a decompressor can be compliant if it does not check the CRC32 and goes ahead and decompresses the data it finds in the file; a decompressor can also be compliant if it checks CRC32 and refuses to decompress because it knows that something went wrong at some point. If I recall your situation correctly, you were given a partially decompressed output. If that is the case, the problem would not be as simple as the entire file CRC32, but it could be a problem with the internal structure used by the particular compression method involved. Unless we can find an official standard for that particular compression method and it says that some particular behaviour "must" occur, we cannot say that a particular program is right or wrong. If you examine the standard referenced above, RFC 1952, you will see that it does not say anything about what "must" or "must not" happen if a particular compression method finds an error that does not affect ID1, ID2, CM, or the reserved bits of the header of the entire file.
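To make the CRC32 point concrete: RFC 1952 stores CRC32 and ISIZE as two little-endian 32-bit values in the last 8 bytes of a gzip member, so a caller can run the check itself even when the decompressor does not. The following is only a minimal Python sketch, not anything taken from MATLAB or Java. The function name trailer_matches is hypothetical, and it assumes the file holds a single gzip member (with concatenated members the trailer describes only the last one).

```python
import gzip
import struct
import zlib


def trailer_matches(gz_bytes: bytes, output: bytes) -> bool:
    """Return True if the gzip trailer agrees with the extracted output.

    Assumption: gz_bytes holds exactly one gzip member.
    """
    # A gzip member has a 10-byte fixed header and an 8-byte trailer.
    if len(gz_bytes) < 18:
        return False
    # Trailer layout per RFC 1952: CRC32 then ISIZE, both little-endian.
    stored_crc, stored_isize = struct.unpack("<II", gz_bytes[-8:])
    actual_crc = zlib.crc32(output) & 0xFFFFFFFF
    actual_isize = len(output) & 0xFFFFFFFF  # ISIZE is the size modulo 2**32
    return stored_crc == actual_crc and stored_isize == actual_isize


if __name__ == "__main__":
    data = b"2010-04-29,1.0,2.0\n" * 50000
    gz = gzip.compress(data)
    print(trailer_matches(gz, data))         # complete output
    print(trailer_matches(gz, data[:1000]))  # output missing its tail
```

A check like this says nothing about who is at fault; it only tells the caller whether the output it received is the output the compressor recorded.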
From: Yair Altman on 4 May 2010 16:38 "matlaberboy " <matlaberboy(a)gmail.NOSPAM.com> wrote in message <hror58$cgn$1(a)fred.mathworks.com>... > Submitted this as a bug to matlab. They came back with the below response. In summary yes it is a serious bug in the underlying Java and no they are not going to do anything about it. Like Walter, I also disagree with your assessment of this matter, but from an altogether different angle: We all know how buggy Microsoft software is - we get updates every week fixing loads of such bugs, and that's only for the most critical security flaws. It therefore stands to reason that the core graphics libraries on which Matlab depends for plotting also have some bugs. Following your logic, upon discovering any such bug MathWorks will need to either (1) stop supporting Windows platforms or (2) implement its own graphics libraries (and do you think these will be bug-free?!). Obviously, both options are absurd. The MathWorks CSR took the trouble to point you to a specific bug in the underlying Java system's bug-parade. It is entirely reasonable for MathWorks to wait for the next Java release that will hopefully fix this. Lastly, Matlab relies on Java for a large part, but you always have the option of running Matlab with the -nojvm command-line option which will disable Java. Matlab will become extremely limited in its abilities but at least you'll be free from all those "unacceptable" Java bugs... My personal advice: wake up to the real world... Yair Altman http://UndocumentedMatlab.com
From: matlaberboy on 4 May 2010 16:55
This is very simple. This bug has apparently been known about in Java for 8 years now. If the decompression is going to fail, it would be nice if Java threw an exception, rather than giving the false impression that all data has been extracted. However, given we are in the real world, it would not be beyond the realm of reason for a paragraph to be added to the Matlab gunzip documentation stating that there is this known issue with the underlying Java code.
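As an illustration of the "throw an exception rather than return partial data" behaviour asked for here, the following is a rough Python sketch using the standard zlib module. It is not how MATLAB or Java implement gunzip. The name gunzip_strict is hypothetical, and the handling of several concatenated gzip members is an assumption about what such files might contain; the thread itself does not say.

```python
import gzip
import zlib


def gunzip_strict(gz_bytes: bytes) -> bytes:
    """Decompress every gzip member in gz_bytes, or raise.

    Raises ValueError if the input is empty or ends inside a member.
    zlib.error propagates for corrupt data or a CRC32/ISIZE mismatch.
    """
    if not gz_bytes:
        raise ValueError("empty input is not a gzip file")
    parts = []
    remaining = gz_bytes
    while remaining:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=31)
        parts.append(member.decompress(remaining))
        if not member.eof:
            # Input ran out before the member's trailer was reached.
            raise ValueError("gzip data ended before the end of a member")
        # Bytes after the trailer may be another member.
        remaining = member.unused_data
    return b"".join(parts)


if __name__ == "__main__":
    first = b"t,value\n" + b"1,0.5\n" * 1000
    second = b"2,0.7\n" * 1000
    two_members = gzip.compress(first) + gzip.compress(second)
    print(len(gunzip_strict(two_members)) == len(first) + len(second))
```

The design choice is that a short result is never returned silently: the caller either gets the whole payload or an error it has to handle.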