From: raj on
Hi,

I am using 64-bit Python on an x86_64 platform (Fedora 13). I have
some code that uses the Python marshal module to serialize some
objects to files. However, in moving the code to Python 3 I have come
across a situation where, if more than one object has been serialized
to a file, then while trying to de-serialize only the first object is
de-serialized. Trying to de-serialize the second object raises an
EOFError. De-serialization of multiple objects works fine in Python
2.x. I tried going through the Python 3 documentation to see if
marshal functionality has been changed, but haven't found anything to
that effect. Does anyone else see this problem? Here is some
example code:

bash-4.1$ cat marshaltest.py
import marshal

numlines = 1
numwords = 25

stream = open('fails.mar','wb')
marshal.dump(numlines, stream)
marshal.dump(numwords, stream)
stream.close()

tmpstream = open('fails.mar', 'rb')
value1 = marshal.load(tmpstream)
value2 = marshal.load(tmpstream)

print(value1 == numlines)
print(value2 == numwords)


Here are the results of running this code

bash-4.1$ python2.7 marshaltest.py
True
True

bash-4.1$ python3.1 marshaltest.py
Traceback (most recent call last):
  File "marshaltest.py", line 13, in <module>
    value2 = marshal.load(tmpstream)
EOFError: EOF read where object expected

Interestingly the file created by using Python 3.1 is readable by both
Python 2.7 as well as Python 2.6 and both objects are successfully
read.

Cheers,
raj
From: Thomas Jollans on
On 07/17/2010 06:21 PM, raj wrote:
> Hi,
>
> I am using 64 bit Python on an x86_64 platform (Fedora 13). I have
> some code that uses the python marshal module to serialize some
> objects to files. However, in moving the code to python 3 I have come
> across a situation where, if more than one object has been serialized
> to a file, then while trying to de-serialize only the first object is
> de-serialized. Trying to de-serialize the second object raises an
> EOFError. De-serialization of multiple objects works fine in Python
> 2.x. I tried going through the Python 3 documentation to see if
> marshal functionality has been changed, but haven't found anything to
> that effect. Does anyone else see this problem? Here is some
> example code:

Interesting. I modified your script a bit:

0:pts/2:/tmp% cat marshtest.py
from __future__ import print_function
import marshal
import sys
if sys.version_info[0] == 3:
    bytehex = lambda i: '%02X ' % i
else:
    bytehex = lambda c: '%02X ' % ord(c)

numlines = 1
numwords = 25

stream = open('fails.mar','wb')
marshal.dump(numlines, stream)
marshal.dump(numwords, stream)
stream.close()

tmpstream = open('fails.mar', 'rb')

for byte in tmpstream.read():
    sys.stdout.write(bytehex(byte))

sys.stdout.write('\n')
tmpstream.seek(0)

print('pos:', tmpstream.tell())
value1 = marshal.load(tmpstream)
print('val:', value1)
print('pos:', tmpstream.tell())
value2 = marshal.load(tmpstream)
print('val:', value2)
print('pos:', tmpstream.tell())

print(value1 == numlines)
print(value2 == numwords)
0:pts/2:/tmp% python2.6 marshtest.py
69 01 00 00 00 69 19 00 00 00
pos: 0
val: 1
pos: 5
val: 25
pos: 10
True
True
0:pts/2:/tmp% python3.1 marshtest.py
69 01 00 00 00 69 19 00 00 00
pos: 0
val: 1
pos: 10
Traceback (most recent call last):
  File "marshtest.py", line 29, in <module>
    value2 = marshal.load(tmpstream)
EOFError: EOF read where object expected
1:pts/2:/tmp%

So, the contents of the file are identical, but Python 3 reads the whole
file on the first load (the position jumps straight to 10), while
Python 2 reads only the 5 bytes it uses.

This looks like a simple optimisation: read the whole file at once,
instead of byte-by-byte, to improve performance when reading large
objects (such as Python modules).
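
If the read-ahead really is what breaks the second load, one way around
it is to keep marshal.load() away from the file altogether and frame
each value yourself. The following is only a sketch under that
assumption: dump_framed() and load_framed() are made-up helper names,
and the 4-byte little-endian length prefix is an arbitrary choice, not
anything the marshal module defines.

```python
import io
import marshal
import struct


def dump_framed(value, stream):
    """Write one marshalled value preceded by its 4-byte length (hypothetical helper)."""
    payload = marshal.dumps(value)
    stream.write(struct.pack('<I', len(payload)))
    stream.write(payload)


def load_framed(stream):
    """Read back one value written by dump_framed() (hypothetical helper)."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError('no more framed values')
    (size,) = struct.unpack('<I', header)
    # Only the bytes belonging to this value are read, so the stream
    # position always ends up at the start of the next record.
    return marshal.loads(stream.read(size))


# An in-memory buffer stands in for a file opened in binary mode.
buf = io.BytesIO()
dump_framed(1, buf)
dump_framed(25, buf)

buf.seek(0)
first = load_framed(buf)
second = load_framed(buf)
print(first, second)
```

Note that the resulting file is no longer a plain sequence of marshal
values, so older code that calls marshal.load() directly would not be
able to read it.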

The question is: was storing multiple objects in sequence an intended
use of the marshal module? I doubt it. You can always wrap your data in
tuples or use pickle.
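
A minimal sketch of the tuple approach, reusing the names from the
test script. The temporary directory and the file name works.mar are
just placeholders for this example:

```python
import marshal
import os
import tempfile

numlines = 1
numwords = 25

# Placeholder location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'works.mar')

# Write both values as a single marshalled tuple, so that only one
# marshal.load() call is needed when reading them back.
with open(path, 'wb') as stream:
    marshal.dump((numlines, numwords), stream)

with open(path, 'rb') as stream:
    value1, value2 = marshal.load(stream)

print(value1 == numlines)
print(value2 == numwords)
```

Since there is only one load per file, it no longer matters how much
of the file marshal reads ahead.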


From: raj on
On Jul 17, 10:11 pm, Thomas Jollans <tho...(a)jollans.com> wrote:
[Snip]
> So, the contents of the file are identical, but Python 3 reads the whole
> file, Python 2 reads only the data it uses.
>
> This looks like a simple optimisation: read the whole file at once,
> instead of byte-by-byte, to improve performance when reading large
> objects. (such as Python modules...)
>

Good analysis and a nice catch. Thanks. It is likely that the intent
is to optimize performance.

> The question is: was storing multiple objects in sequence an intended
> use of the marshal module?

The documentation (http://docs.python.org/py3k/library/marshal.html)
for marshal itself states (emphasis added by me),

marshal.load(file)

Read *one value* from the open file and return it. If no valid
value is read (e.g. because the data has a different Python version’s
incompatible marshal format), raise EOFError, ValueError or TypeError.
The file must be an open file object opened in binary mode ('rb' or
'r+b').

The "one value" wording suggests that a file may hold several values,
i.e. that support for reading multiple values in sequence is intended.

> I doubt it. You can always wrap your data in
> tuples or use pickle.
>

The code that I am moving to 3.x dates back to the python 1.5 days,
when marshal was significantly faster than pickle and Zope was
evolutionarily at the Bobo stage :-). I have switched the current code
to pickle, which makes more sense. The pickle files are a bit larger and
loading them is a tad slower, but nothing that makes even a
noticeable difference for my use case. Thanks.
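
For reference, a rough sketch of the same two-value test done with
pickle, which does support successive load() calls on one file. The
temporary directory and the file name works.pkl are placeholders, not
the names used in the real code:

```python
import os
import pickle
import tempfile

numlines = 1
numwords = 25

# Placeholder location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'works.pkl')

# Dump the two objects one after the other into the same file.
with open(path, 'wb') as stream:
    pickle.dump(numlines, stream)
    pickle.dump(numwords, stream)

# Each pickle.load() call returns the next object in the file.
with open(path, 'rb') as stream:
    value1 = pickle.load(stream)
    value2 = pickle.load(stream)

print(value1 == numlines)
print(value2 == numwords)
```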

raj