From: raj on 17 Jul 2010 12:21

Hi,

I am using 64 bit Python on an x86_64 platform (Fedora 13). I have some code that uses the python marshal module to serialize some objects to files. However, in moving the code to python 3 I have come across a situation where, if more than one object has been serialized to a file, then while trying to de-serialize only the first object is de-serialized. Trying to de-serialize the second object raises an EOFError. De-serialization of multiple objects works fine in Python 2.x. I tried going through the Python 3 documentation to see if marshal functionality has been changed, but haven't found anything to that effect. Does anyone else see this problem? Here is some example code:

    bash-4.1$ cat marshaltest.py
    import marshal

    numlines = 1
    numwords = 25

    stream = open('fails.mar','wb')
    marshal.dump(numlines, stream)
    marshal.dump(numwords, stream)
    stream.close()

    tmpstream = open('fails.mar', 'rb')
    value1 = marshal.load(tmpstream)
    value2 = marshal.load(tmpstream)

    print(value1 == numlines)
    print(value2 == numwords)

Here are the results of running this code:

    bash-4.1$ python2.7 marshaltest.py
    True
    True

    bash-4.1$ python3.1 marshaltest.py
    Traceback (most recent call last):
      File "marshaltest.py", line 13, in <module>
        value2 = marshal.load(tmpstream)
    EOFError: EOF read where object expected

Interestingly the file created by using Python 3.1 is readable by both Python 2.7 as well as Python 2.6 and both objects are successfully read.

Cheers,
raj

From: Thomas Jollans on 17 Jul 2010 13:11

On 07/17/2010 06:21 PM, raj wrote:
> Hi,
>
> I am using 64 bit Python on an x86_64 platform (Fedora 13). I have
> some code that uses the python marshal module to serialize some
> objects to files. However, in moving the code to python 3 I have come
> across a situation where, if more than one object has been serialized
> to a file, then while trying to de-serialize only the first object is
> de-serialized. Trying to de-serialize the second object raises an
> EOFError. De-serialization of multiple objects works fine in Python
> 2.x. I tried going through the Python 3 documentation to see if
> marshal functionality has been changed, but haven't found anything to
> that effect. Does anyone else see this problem? Here is some
> example code:

Interesting. I modified your script a bit:

    0:pts/2:/tmp% cat marshtest.py
    from __future__ import print_function
    import marshal
    import sys

    if sys.version_info[0] == 3:
        bytehex = lambda i: '%02X ' % i
    else:
        bytehex = lambda c: '%02X ' % ord(c)

    numlines = 1
    numwords = 25

    stream = open('fails.mar','wb')
    marshal.dump(numlines, stream)
    marshal.dump(numwords, stream)
    stream.close()

    tmpstream = open('fails.mar', 'rb')
    for byte in tmpstream.read():
        sys.stdout.write(bytehex(byte))
    sys.stdout.write('\n')
    tmpstream.seek(0)

    print('pos:', tmpstream.tell())
    value1 = marshal.load(tmpstream)
    print('val:', value1)
    print('pos:', tmpstream.tell())

    value2 = marshal.load(tmpstream)
    print('val:', value2)
    print('pos:', tmpstream.tell())

    print(value1 == numlines)
    print(value2 == numwords)

    0:pts/2:/tmp% python2.6 marshtest.py
    69 01 00 00 00 69 19 00 00 00
    pos: 0
    val: 1
    pos: 5
    val: 25
    pos: 10
    True
    True

    0:pts/2:/tmp% python3.1 marshtest.py
    69 01 00 00 00 69 19 00 00 00
    pos: 0
    val: 1
    pos: 10
    Traceback (most recent call last):
      File "marshtest.py", line 29, in <module>
        value2 = marshal.load(tmpstream)
    EOFError: EOF read where object expected
    1:pts/2:/tmp%

So, the contents of the file are identical, but Python 3 reads the whole file, while Python 2 reads only the data it uses. After the first load, the Python 3 file position is already at 10 (end of file), so the second load finds nothing left to read.

This looks like a simple optimisation: read the whole file at once, instead of byte-by-byte, to improve performance when reading large objects (such as Python modules...).

The question is: was storing multiple objects in sequence an intended use of the marshal module? I doubt it. You can always wrap your data in tuples or use pickle.
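
The tuple suggestion above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names dump_values and load_values and the file name works.mar are made up here. The idea is that a single marshal.dump call and a single marshal.load call are used, so it does not matter how far marshal.load reads ahead in the file.

```python
import marshal
import os
import tempfile


def dump_values(path, values):
    """Write all values as ONE marshalled tuple (hypothetical helper)."""
    with open(path, 'wb') as stream:
        marshal.dump(tuple(values), stream)


def load_values(path):
    """Read the single tuple back with one marshal.load call (hypothetical helper)."""
    with open(path, 'rb') as stream:
        return marshal.load(stream)


numlines = 1
numwords = 25

# 'works.mar' is an illustrative file name in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), 'works.mar')
dump_values(path, [numlines, numwords])

value1, value2 = load_values(path)
print(value1 == numlines)
print(value2 == numwords)
```

The trade-off is that all values have to be available before writing, so this does not suit code that appends objects to the file over time.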
From: raj on 18 Jul 2010 00:46
On Jul 17, 10:11 pm, Thomas Jollans <tho...(a)jollans.com> wrote:
[Snip]
> So, the contents of the file are identical, but Python 3 reads the whole
> file, Python 2 reads only the data it uses.
>
> This looks like a simple optimisation: read the whole file at once,
> instead of byte-by-byte, to improve performance when reading large
> objects. (such as Python modules...)
>

Good analysis and a nice catch. Thanks. It is likely that the intent is to optimize performance.

> The question is: was storing multiple objects in sequence an intended
> use of the marshal module?

The documentation (http://docs.python.org/py3k/library/marshal.html) for marshal itself states (emphasis added by me):

    marshal.load(file)
        Read *one value* from the open file and return it. If no valid value is read (e.g. because the data has a different Python version's incompatible marshal format), raise EOFError, ValueError or TypeError. The file must be an open file object opened in binary mode ('rb' or 'r+b').

If each call reads one value, then repeated calls on the same file should be able to read the values that follow, which suggests that support for reading multiple values is intended.

> I doubt it. You can always wrap your data in
> tuples or use pickle.
>

The code that I am moving to 3.x dates back to the Python 1.5 days, when marshal was significantly faster than pickle and Zope was evolutionarily at the Bobo stage :-). I have switched the current code to pickle, which makes more sense. The pickle files are a bit larger and loading them is a tad slower, but nothing that makes even a noticeable difference for my use case.

Thanks.
raj
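
For reference, a minimal sketch of what the switch to pickle might look like for the test case from the start of the thread. This is not raj's actual code, and the file name works.pkl is made up for the example. Each pickle.load call reads one pickled object and leaves the stream positioned at the next one, so objects dumped in sequence can be loaded in sequence.

```python
import os
import pickle
import tempfile

numlines = 1
numwords = 25

# 'works.pkl' is an illustrative file name in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), 'works.pkl')

# Dump two objects one after the other into the same binary file.
with open(path, 'wb') as stream:
    pickle.dump(numlines, stream)
    pickle.dump(numwords, stream)

# Load them back in the same order, one object per call.
with open(path, 'rb') as tmpstream:
    value1 = pickle.load(tmpstream)
    value2 = pickle.load(tmpstream)

print(value1 == numlines)
print(value2 == numwords)
```

Unlike marshal, pickle should only be used on files from a trusted source, since loading a pickle can execute arbitrary code.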